Module 3

Introduction to Large Language Models

Author

Yike Zhang

Published

June 1, 2026

Here, we turn text into numbers a model can learn from and introduce the machinery behind large language models. Week 8 then steps back to ask what these systems do to people, with a discussion of fairness, ethics, and accountability.

import numpy as np
import pandas as pd

pd.set_option("display.max_rows", 12)

Week 7: NLP and Feature Engineering

A model cannot read a sentence. It needs a fixed-length vector of numbers. The whole first half of Week 7 is about that translation, from raw text to a numeric matrix, and the second half is about cleaning up numeric features so they are ready for a model.

Tokenization

The first step in any text pipeline is tokenization: chopping a document into the units (usually words) the rest of the pipeline counts. A real tokenizer handles punctuation, casing, and edge cases, but the core idea is just “lowercase and split”, which you can write in one line.

import re

def tokenize(text):
    text = text.lower()
    return re.findall(r"[a-z]+", text)   # keep runs of letters, drop punctuation/numbers

tokenize("All my cats, in a row!")
['all', 'my', 'cats', 'in', 'a', 'row']

Bag of Words

The simplest text-to-vector scheme is Bag of Words. First learn a vocabulary of every word in the corpus, then represent each document by how many times each vocabulary word appears in it. Order is thrown away, which is why it is a “bag”. Rather than build it by hand, we use scikit-learn’s CountVectorizer, the tool you would actually reach for.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "All my cats in a row",
    "When my cat sits down she looks like a Furby toy",
    "The cat and the cats are friends",
]

vec = CountVectorizer()
counts = vec.fit_transform(corpus)   # learn the vocabulary AND encode in one call

bow = pd.DataFrame(counts.toarray(), columns=vec.get_feature_names_out())
bow
   all  and  are  cat  cats  down  friends  furby  in  like  looks  my  row  she  sits  the  toy  when
0    1    0    0    0     1     0        0      0   1     0      0   1    1    0     0    0    0     0
1    0    0    0    1     0     1        0      1   0     1      1   1    0    1     1    0    1     1
2    0    1    1    1     1     0        1      0   0     0      0   0    0    0     0    2    0     0

Each row is one document, each column is one vocabulary word, and the entries are counts. The one-letter word “a” has no column because CountVectorizer’s default token pattern keeps only tokens of two or more characters. Notice the matrix is mostly zeros: any one short document uses only a few of the corpus’s words. Real text vectors are extremely sparse, which is why scikit-learn stores them in a compressed form and only expands to a dense array when we ask.
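To make the "learn a vocabulary, then count" idea concrete, here is a minimal by-hand sketch using only the standard library. It is not how CountVectorizer is implemented. The helper names `simple_tokenize` and `bag_of_words` are hypothetical, and the tokenizer keeps only runs of two or more letters as a rough stand-in for CountVectorizer's default of ignoring one-character tokens.

```python
import re
from collections import Counter

corpus = [
    "All my cats in a row",
    "When my cat sits down she looks like a Furby toy",
    "The cat and the cats are friends",
]

def simple_tokenize(text):
    # Hypothetical helper: lowercase, keep runs of 2+ letters (so "a" is dropped).
    return re.findall(r"[a-z]{2,}", text.lower())

def bag_of_words(docs):
    tokenized = [simple_tokenize(doc) for doc in docs]
    # Step 1: the vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    # Step 2: each document becomes a row of counts, one per vocabulary word.
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(corpus)
print(vocabulary)
print(rows[2])   # counts for the third document
```

On this corpus the sketch should give the same 18-word vocabulary and the same counts as the table above. On real text the two can diverge, because the regex here ignores digits and non-ASCII letters.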

TF-IDF

Raw counts over-reward words that are simply common everywhere, like “the”. TF-IDF fixes that by multiplying each term frequency by an inverse document frequency: a word that appears in every document gets down-weighted, while a word that is rare across the corpus but frequent in one document gets boosted. The formula the lecture gives is \(\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\!\frac{n}{1 + \text{df}(t)}\). scikit-learn exposes it through the same interface as before, although its defaults use a smoothed variant of the idf term, \(\ln\frac{1 + n}{1 + \text{df}(t)} + 1\), and rescale each row to unit length, so the numbers below will not exactly match a hand calculation with the lecture formula.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)

pd.DataFrame(weights.toarray().round(2),
             columns=tfidf.get_feature_names_out())
    all   and   are   cat  cats  down  friends  furby    in  like  looks    my   row   she  sits  the   toy  when
0  0.49  0.00  0.00  0.00  0.37  0.00     0.00   0.00  0.49  0.00   0.00  0.37  0.49  0.00  0.00  0.0  0.00  0.00
1  0.00  0.00  0.00  0.25  0.00  0.33     0.00   0.33  0.00  0.33   0.33  0.25  0.00  0.33  0.33  0.0  0.33  0.33
2  0.00  0.35  0.35  0.27  0.27  0.00     0.35   0.00  0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.7  0.00  0.00

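Compare the “cat” and “the” columns against the plain counts above. “cat” appears in two of the three documents, so in row 2 it is weighted lower (0.27) than words unique to that document such as “friends” (0.35), even though each occurs once. “the” is a quirk of this tiny corpus: it occurs only in row 2, so it is not down-weighted here, but in a realistic corpus it would appear in nearly every document and be pushed toward the bottom. Words spread across many documents are pulled down and distinctive words stand out, and that reweighting is what often makes TF-IDF a stronger feature set than raw counts for tasks like spam detection or sentiment analysis.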
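The first row of the TF-IDF table can be reproduced by hand as a sanity check. This sketch assumes scikit-learn's documented TfidfVectorizer defaults (raw counts as term frequency, smoothed idf, L2 normalization of each row), not the lecture's formula. The document-frequency values are read off the Bag of Words table.

```python
import numpy as np

# Non-zero words of document 0 ("All my cats in a row"); "a" is dropped by the vectorizer.
words = ["all", "cats", "in", "my", "row"]
tf = np.array([1, 1, 1, 1, 1], dtype=float)   # each word appears once in document 0
df = np.array([1, 2, 1, 2, 1], dtype=float)   # number of documents containing each word
n_docs = 3

# Smoothed idf, as documented for the default smooth_idf=True setting.
idf = np.log((1 + n_docs) / (1 + df)) + 1

raw = tf * idf
weights = raw / np.linalg.norm(raw)   # L2-normalize; zero entries do not affect the norm

for word, w in zip(words, weights):
    print(f"{word:>5}: {w:.2f}")
```

The words unique to document 0 come out near 0.49 and the two shared words near 0.37, matching row 0 of the table.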

Encoding categorical features

Text is not the only thing that needs numbers. A categorical column like a color or a size has to be encoded too, and how you encode depends on whether the categories have an order. For nominal categories (no order, like color) the safe choice is one-hot encoding: one 0/1 column per category, so the model never invents a false “red < green” relationship.

colors = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})
pd.get_dummies(colors, columns=["color"])   # one 0/1 (boolean) column per category
   color_blue  color_green  color_red
0       False        False       True
1       False         True      False
2        True        False      False
3       False         True      False
4       False        False       True

For ordinal categories the order is real and worth keeping, so a single integer code that respects the ranking is appropriate.

order = pd.Categorical(["M", "S", "L", "XL", "M"],
                       categories=["S", "M", "L", "XL"], ordered=True)
pd.Series(order).cat.codes   # S=0, M=1, L=2, XL=3, the order is preserved
0    1
1    0
2    2
3    3
4    1
dtype: int8

Filling missing numeric values

Week 4 treated missing data as a cleaning problem. Here we handle it the way a modeling pipeline does, with scikit-learn’s SimpleImputer, because the same fitted imputer can later be applied to new data automatically. The cardinal rule from the lecture: learn the fill value on the training data only, then apply it to both training and test, so you never peek at data you are meant to predict. The toy table below has no train/test split, so it fits and fills the same five rows in one call.

from sklearn.impute import SimpleImputer

features = pd.DataFrame({
    "age": [25, np.nan, 32, 41, np.nan],
    "income": [50000, 62000, np.nan, 88000, 54000],
})

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(features)   # fit learns each column's mean, transform fills
pd.DataFrame(filled, columns=features.columns).round(0)
    age   income
0  25.0  50000.0
1  33.0  62000.0
2  32.0  63500.0
3  41.0  88000.0
4  33.0  54000.0
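The "fit on training data only" rule is easier to see with an explicit split. The sketch below is illustrative: the split of the five rows into three training rows and two test rows is a hypothetical choice made for this example, not something from the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical split: the first three rows are "training", the last two are "test".
train = pd.DataFrame({"age": [25, np.nan, 32], "income": [50000, 62000, np.nan]})
test = pd.DataFrame({"age": [41, np.nan], "income": [88000, 54000]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                        # learn column means from the training rows only

train_filled = imputer.transform(train)   # fill training gaps with the training means
test_filled = imputer.transform(test)     # reuse the SAME means on the test rows

print(imputer.statistics_)                # the learned fill values, one per column
print(test_filled)
```

The missing test age is filled with the training mean (28.5), not with a mean that includes the test row's neighbor of 41. Calling `fit_transform` on the test set instead would quietly relearn the fill values from data the model is supposed to be evaluated on.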

Scaling numeric features

Features often live on wildly different scales: an age near 30 and an income near 60,000. Some algorithms (k-nearest neighbors, linear models, neural nets) are thrown off when one feature dominates simply because its numbers are bigger. There are two standard fixes. Min-max scaling squeezes each column into the range 0 to 1. Standardization (the z-score) recenters each column to mean 0 and standard deviation 1, and it is usually the safer default because a single extreme value distorts it less than it distorts min-max scaling.

from sklearn.preprocessing import MinMaxScaler, StandardScaler

complete = pd.DataFrame(filled, columns=features.columns)

minmax = pd.DataFrame(MinMaxScaler().fit_transform(complete),
                      columns=["age_mm", "income_mm"])
standard = pd.DataFrame(StandardScaler().fit_transform(complete),
                        columns=["age_z", "income_z"])
pd.concat([minmax.round(2), standard.round(2)], axis=1)
   age_mm  income_mm  age_z  income_z
0    0.00       0.00  -1.51     -1.02
1    0.48       0.32   0.00     -0.11
2    0.44       0.36  -0.13      0.00
3    1.00       1.00   1.64      1.85
4    0.48       0.11   0.00     -0.72

Note: Which models even need this?

Tree-based models (decision trees, random forests) are scale-invariant: they split on thresholds, so multiplying a column by 1000 changes nothing. Distance-based models (KNN) and models fit with regularization or gradient descent (regularized linear and logistic regression, neural networks) are not, and they need scaling to behave. When in doubt, standardize; it rarely hurts and often helps.
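Both scalers reduce to one-line formulas, which can be checked by hand on the imputed age column. This is a sketch of the arithmetic rather than of scikit-learn's internals. It assumes the population standard deviation (`ddof=0`), which is NumPy's default and the convention StandardScaler is documented to use.

```python
import numpy as np

# The imputed age column: the two missing values were filled with the mean, 98/3.
age = np.array([25.0, 98 / 3, 32.0, 41.0, 98 / 3])

# Min-max: (x - min) / (max - min) maps the column onto [0, 1].
age_mm = (age - age.min()) / (age.max() - age.min())

# Z-score: (x - mean) / std, using the population standard deviation (ddof=0).
age_z = (age - age.mean()) / age.std()

print(age_mm.round(2))
print(age_z.round(2))
```

The rounded values should agree with the age_mm and age_z columns in the table above.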

Inside a large language model

The lecture’s headline is that a large language model does one deceptively simple thing: predict the next token. Given a prefix of text, it produces a score (a logit) for every word in its vocabulary, then turns those scores into probabilities with the softmax function. We do not have a real model on this page, but the math is small enough to do by hand on a handful of mock logits, and seeing it run demystifies the whole pipeline.

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()        # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

vocab = ["cat", "dog", "sat", "ran", "the"]
logits = [2.0, 1.0, 0.5, 0.2, 3.1]         # the model's raw scores for the next word

probs = softmax(logits)
pd.DataFrame({"token": vocab, "logit": logits, "probability": probs.round(3)})
   token  logit  probability
0    cat    2.0        0.210
1    dog    1.0        0.077
2    sat    0.5        0.047
3    ran    0.2        0.035
4    the    3.1        0.631

The probabilities sum to 1 and the highest logit (“the”) gets the largest share. That is the model’s belief about the next word.

Temperature controls how adventurous the model is

To actually pick a token from that distribution, we sample. The dial that shapes the sampling is temperature: divide the logits by a temperature \(\tau\) before the softmax. A low temperature sharpens the distribution toward the top choice (predictable, good for code and facts), while a high temperature flattens it so rarer words get a real chance (creative, but riskier).

def softmax_with_temperature(logits, tau):
    return softmax(np.asarray(logits, dtype=float) / tau)

cool = softmax_with_temperature(logits, tau=0.5)   # sharper, more confident
hot = softmax_with_temperature(logits, tau=2.0)    # flatter, more exploratory

pd.DataFrame({
    "token": vocab,
    "tau=0.5": cool.round(3),
    "tau=1.0": probs.round(3),
    "tau=2.0": hot.round(3),
})
   token  tau=0.5  tau=1.0  tau=2.0
0    cat    0.098    0.210    0.237
1    dog    0.013    0.077    0.144
2    sat    0.005    0.047    0.112
3    ran    0.003    0.035    0.096
4    the    0.882    0.631    0.411

Read across a row. At \(\tau = 0.5\) the top token’s probability climbs and the long-shots fade; at \(\tau = 2.0\) the gap narrows and the rare words become plausible. Same model, same scores, very different personalities.
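Pushing the dial to its extremes shows what temperature is interpolating between. The sketch below reuses the same mock logits. The specific values 0.05 and 100 are arbitrary choices for illustration, and the softmax helper is redefined so the snippet runs on its own.

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())   # shift by the max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 0.2, 3.1])   # same mock scores as above

for tau in (0.05, 1.0, 100.0):
    probs = softmax(logits / tau)
    print(f"tau={tau:>6}: max prob = {probs.max():.3f}, min prob = {probs.min():.3f}")
```

As \(\tau\) approaches 0 almost all the probability lands on the top token, so sampling behaves like always picking the favorite. As \(\tau\) grows large, the five probabilities approach 0.2 each and sampling approaches a uniform draw.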

Greedy vs random sampling

With the probabilities in hand, greedy decoding always takes the single most likely token, which is repetitive but safe. Random sampling draws according to the probabilities, which is more varied and human. We can demonstrate both, and because our generator is seeded the sampled result is reproducible.

rng = np.random.default_rng(seed=11)

greedy_token = vocab[int(np.argmax(probs))]
sampled_tokens = rng.choice(vocab, size=8, p=probs)   # draw 8 times by probability

print("greedy choice:", greedy_token)
print("random samples:", sampled_tokens.tolist())   # tolist() gives plain Python strings
greedy choice: the
random samples: ['cat', 'the', 'the', 'cat', 'cat', 'the', 'cat', 'cat']

Measuring a language model: perplexity

How good is a language model? One core metric is perplexity, the exponential of the average cross-entropy loss. Cross-entropy is just the negative log-probability the model assigned to the words that actually came next, so a model that confidently predicts the truth has low loss and low perplexity. Lower is better.

# Suppose these are the probabilities a model gave to the TRUE next word
# at each of five positions in a sentence.
true_word_probs = np.array([0.5, 0.2, 0.8, 0.1, 0.4])

cross_entropy = -np.log(true_word_probs).mean()
perplexity = np.exp(cross_entropy)

print(f"average cross-entropy: {cross_entropy:.3f}")
print(f"perplexity:            {perplexity:.3f}")
average cross-entropy: 1.149
perplexity:            3.155

A perplexity of roughly 3.2 means the model was about as uncertain, on average, as if it had been choosing uniformly among about 3.2 words at each step. Train a better model and that number drops.
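The "choosing uniformly among k words" reading can be checked directly: a model that gives every true word probability 1/k has perplexity exactly k. The sketch below wraps the same calculation in a small helper. The function name `perplexity` and the choices of k are assumptions made for this example.

```python
import numpy as np

def perplexity(true_word_probs):
    # exp of the mean negative log-probability assigned to the true tokens
    return float(np.exp(-np.log(true_word_probs).mean()))

# A model that spreads probability evenly over k words gives each true word 1/k.
for k in (2, 10, 50):
    uniform = np.full(5, 1.0 / k)
    print(k, round(perplexity(uniform), 3))
```

Each printed perplexity equals its k, up to floating-point rounding, which is why perplexity is often described as an effective number of equally likely choices.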

What a real model call looks like

Everything above ran on mock numbers so it works offline. In practice you would call a hosted model through an SDK. The snippet below is illustrative only: it is not executed when this page renders, because it needs network access and credentials. It is here so you recognize the shape of a real call.

# Illustrative: a real API call needs a key and network access, so we do not run it here.
from anthropic import Anthropic

client = Anthropic()  # reads your API key from the environment
message = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=200,
    messages=[{"role": "user", "content": "Explain perplexity in one sentence."}],
)
print(message.content[0].text)

Week 8: Ethics, Fairness, and Accountability

Week 8 shifts focus from how these systems work to what they do to people. The lecture’s pillars are transparency, accountability, and fairness. Those are not just talking points; some of them are measurable, and a data scientist should know how to measure them. We keep the code small and the focus on the idea.

Important: The three pillars
  • Transparency: can we explain why the system decided what it decided?
  • Accountability: when it causes harm, who is responsible and how is it corrected?
  • Fairness: does it treat groups equitably, or does it bake in and amplify existing disadvantage?

A biased sample produces a biased view

Module 2 warned that a biased sample misleads you no matter how clever the later analysis. Fairness problems often start right there. We build a synthetic hiring dataset where the underlying qualification rate is identical across two groups, then draw a biased sample that over-represents positive outcomes for one group, and watch a naive audit reach the wrong conclusion.

rng = np.random.default_rng(seed=3)
N = 4000

# Ground truth: both groups are equally qualified.
group = rng.choice(["A", "B"], size=N)
qualified = rng.random(N) < 0.5   # 50% in BOTH groups, by construction

population = pd.DataFrame({"group": group, "qualified": qualified})
population.groupby("group")["qualified"].mean().round(3)
group
A    0.497
B    0.503
Name: qualified, dtype: float64

The truth is fair: both groups sit at about 0.50. Now suppose our data collection quietly favored qualified people in group A and unqualified people in group B, the kind of distortion a flawed hiring pipeline introduces.

# Keep a record with a probability that depends on group AND outcome: the bias.
keep_prob = np.where(
    (population["group"] == "A") & population["qualified"], 0.9,
    np.where((population["group"] == "B") & ~population["qualified"], 0.9, 0.3),
)
biased_sample = population[rng.random(N) < keep_prob]
biased_sample.groupby("group")["qualified"].mean().round(3)
group
A    0.767
B    0.257
Name: qualified, dtype: float64

The exact same underlying population now looks as if group A is far more qualified than group B. Nothing about the people changed; only the sampling did. A model trained on this sample would learn and then amplify a gap that does not exist.

Putting a number on the disparity

Fairness audits use concrete metrics so the conversation is not just impressions. A common one is the selection-rate ratio, sometimes called the disparate-impact ratio: the positive rate of the disadvantaged group divided by that of the advantaged group. A widely cited rule of thumb flags anything below 0.8 as evidence of adverse impact.

rates = biased_sample.groupby("group")["qualified"].mean()
ratio = rates.min() / rates.max()

print(f"group selection rates:\n{rates.round(3)}")
print(f"\ndisparate-impact ratio: {ratio:.2f}")
print("flagged for adverse impact" if ratio < 0.8 else "within the 0.8 guideline")
group selection rates:
group
A    0.767
B    0.257
Name: qualified, dtype: float64

disparate-impact ratio: 0.34
flagged for adverse impact

The ratio falls well under 0.8 and trips the flag, even though the true rates were identical. That is the lesson worth carrying out of this course: a metric computed on biased data will confidently certify a bias that the world does not actually contain. The fix is never a fancier model. It is going back to the data collection, the lesson Week 2 opened with and the reason fairness, sampling, and honesty are the same subject.
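As a sanity check on that conclusion, the same metric can be computed on the full, unbiased population. The sketch below rebuilds the population with the same seed and construction as above. The helper `disparate_impact_ratio` and its parameter names are hypothetical, introduced only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
N = 4000

# Same construction as above: random group labels, 50% qualified in both groups.
population = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=N),
    "qualified": rng.random(N) < 0.5,
})

def disparate_impact_ratio(df, group_col="group", outcome_col="qualified"):
    # Hypothetical helper: lowest group rate divided by highest group rate.
    rates = df.groupby(group_col)[outcome_col].mean()
    return rates.min() / rates.max()

ratio = disparate_impact_ratio(population)
print(f"ratio on the full population: {ratio:.2f}")
print("flagged for adverse impact" if ratio < 0.8 else "within the 0.8 guideline")
```

On the full population the ratio should sit close to 1 and pass the guideline, so the flag raised earlier comes from how the sample was drawn and not from the people in it.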