Module 1

Introduction and Background

Author

Yike Zhang

Published

June 1, 2026

This supplementary material page gets your hands on the keyboard so the rest of the course has somewhere to stand. Read it top to bottom, and when you reach a code block, read the code first and predict its output before you look at the result printed underneath.

Week 1

Why use a Python notebook?

Almost everything we do this summer happens inside a notebook: a document that mixes prose, code, and the output of that code in one scrollable page. The page you are reading right now is itself a notebook. Each grey block is a code cell, and the result printed underneath it is what Python produced when that cell ran.

The mental model worth keeping is that a notebook runs from top to bottom, and the cells share memory. A variable you define near the top is still available much later on the page. That is convenient, but it also means order matters: if you run cells out of order while experimenting, you can confuse yourself. When in doubt, restart and run everything from the top.

greeting = "Hello, EG4338"
print(greeting)
Hello, EG4338

Notice we did not have to declare a type or end the line with a semicolon. Python keeps the syntax light so you can spend your attention on the data, which is the whole point of the course.
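To make the earlier point about shared memory concrete, here is a minimal sketch that imitates running one cell twice. The variable name count and the "Cell A" and "Cell B" labels are illustrative assumptions; a real notebook would hold each group in its own cell.

```python
# Imagine each commented group below is a separate notebook cell.

# Cell A: set up a counter.
count = 0

# Cell B: add one to the counter.
count = count + 1

# Running Cell B a second time (without re-running Cell A) is the same as
# executing its line again: the counter keeps its old value and grows.
count = count + 1

print(count)  # 2, not 1, because Cell B effectively ran twice
```

If a number on the page ever looks wrong for this reason, restarting and running everything from the top resets the counter to a known state.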

Variables and the basic types

A variable is just a name pointing at a value. The four building-block types you will reach for constantly are integers, floating-point numbers, strings, and booleans.

n_students = 32          # int, a whole number
avg_score = 87.5         # float, a number with a decimal point
course = "Data Science"  # str, text in quotes
is_summer = True         # bool, either True or False

print(type(n_students), type(avg_score), type(course), type(is_summer))
<class 'int'> <class 'float'> <class 'str'> <class 'bool'>

Python works out the type from the value at run time, so you never declare it yourself; this is called dynamic typing. You can always ask what type something is with type(...), and that habit will save you a surprising amount of debugging later when a number is secretly a string.
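The "number that is secretly a string" problem is easy to reproduce. The sketch below uses made-up names (raw_score, score_value, x) to show what it looks like and one way to fix it with an explicit conversion.

```python
# A value read from a file or typed by a user often arrives as text.
raw_score = "87"            # looks like a number, but it is a str

print(type(raw_score))      # <class 'str'>
print(raw_score * 2)        # 8787  (string repetition, not multiplication)

# Convert explicitly before doing arithmetic.
score_value = int(raw_score)
print(type(score_value))    # <class 'int'>
print(score_value * 2)      # 174

# Dynamic typing also means a name can be rebound to a different type.
x = 10
x = "ten"
print(type(x))              # <class 'str'>
```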

Arithmetic looks the way you would expect, with one wrinkle worth memorizing: a single slash always gives a float, and a double slash does floor division, rounding the result down to a whole number.

print(7 / 2)    # true division, always a float
print(7 // 2)   # floor division, drops the remainder
print(7 % 2)    # modulo, the remainder itself
print(2 ** 10)  # exponent, 2 to the 10th
3.5
3
1
1024
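Floor division and modulo have a few edge cases that the positive-integer examples above do not show. The following sketch, with arbitrary values chosen for illustration, covers negative numbers, float operands, and the built-in divmod.

```python
print(-7 // 2)       # -4: floor division rounds down, toward negative infinity
print(-7 % 2)        # 1: the remainder takes the sign of the divisor
print(7.0 // 2)      # 3.0: with a float operand the result is a float
print(divmod(7, 2))  # (3, 1): quotient and remainder in one call

# The identity that ties // and % together:
a, b = -7, 2
print((a // b) * b + (a % b) == a)  # True
```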

Strings and f-strings

Text shows up everywhere in data work, from column names to the messy free-form text we will feed language models in Module 3. The most useful tool for building strings is the f-string: put an f before the opening quote and you can drop variables straight into the text inside curly braces.

name = "Ada"
score = 91.27

print(f"{name} scored {score} points")
print(f"{name} scored {score:.1f} points")   # round to 1 decimal place
print(f"{name.upper()} in {course}")          # call methods right inside the braces
Ada scored 91.27 points
Ada scored 91.3 points
ADA in Data Science
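The part after the colon inside the braces is a format specification, and .1f is only one of several useful ones. Here is a short sketch of a few others; the variable names and values are invented for the example.

```python
enrolled = 1234567
pass_rate = 0.875
label = "Ada"

print(f"{enrolled:,}")      # 1,234,567  (thousands separators)
print(f"{pass_rate:.1%}")   # 87.5%  (multiply by 100 and add a percent sign)
print(f"[{label:<8}]")      # [Ada     ]  (left-align in 8 characters)
print(f"[{label:>8}]")      # [     Ada]  (right-align in 8 characters)
print(f"{pass_rate=}")      # pass_rate=0.875  (name and value, handy when debugging)
```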

Strings carry a deep toolbox of methods. A few you will use this week and forever after:

raw = "  St. Mary's University  "
print(raw.strip())              # trim leading/trailing whitespace
print(raw.strip().lower())      # chain methods left to right
print("data,science,llm".split(","))  # split on a delimiter into a list
St. Mary's University
st. mary's university
['data', 'science', 'llm']
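A few more string methods in the same spirit, shown here as a sketch on an invented title string. Note the last line: string methods return new strings and leave the original unchanged.

```python
title = "intro to data science"

print(title.replace("intro", "introduction"))  # introduction to data science
print(title.title())                           # Intro To Data Science
print(title.startswith("intro"))               # True
print("data" in title)                         # True
print(" | ".join(["data", "science", "llm"]))  # data | science | llm

# Strings are immutable: every method above returned a new string.
print(title)                                   # intro to data science
```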

Lists, indexing, and slicing

A list is an ordered, changeable collection. This is your default container when you have several things and the order matters.

scores = [88, 92, 79, 95, 84]
print(scores[0])     # first element, counting starts at 0
print(scores[-1])    # last element, negative counts from the end
print(len(scores))   # how many elements
88
84
5

Slicing pulls out a run of elements with start:stop. The rule that trips up everyone at first is that the stop index is not included. We will meet this exact rule again in Week 3 when we compare pandas iloc to loc, so it is worth burning in now.

print(scores[1:3])   # positions 1 and 2, NOT 3
print(scores[:2])    # from the start up to (not including) 2
print(scores[2:])    # from position 2 to the end
[92, 79]
[88, 92]
[79, 95, 84]
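Slices also accept an optional third part, a step, and they are forgiving about indices that run past the end. The sketch below uses a separate list named values (an illustrative name) so the scores list above is left untouched.

```python
values = [88, 92, 79, 95, 84]

print(values[::2])       # [88, 79, 84]  every second element
print(values[::-1])      # [84, 95, 79, 92, 88]  a reversed copy
print(values[-2:])       # [95, 84]  the last two elements
print(len(values[1:4]))  # 3  (stop minus start, when both are in range)
print(values[3:100])     # [95, 84]  a stop past the end is clipped, not an error
```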

Lists are happy to grow and change in place.

scores.append(100)        # add to the end
scores[0] = 90            # overwrite a position
print(scores)
print(sorted(scores))     # a new sorted copy, original untouched
[90, 92, 79, 95, 84, 100]
[79, 84, 90, 92, 95, 100]
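Because lists change in place, two details are worth seeing once: assigning a list to a second name does not copy it, and the .sort() method (unlike sorted) modifies the list and returns None. A minimal sketch with hypothetical names marks, alias, and clone:

```python
marks = [90, 92, 79]

alias = marks           # a second name for the SAME list
clone = marks.copy()    # an independent copy

marks.append(100)
print(alias)            # [90, 92, 79, 100]  the alias sees the change
print(clone)            # [90, 92, 79]       the copy does not

result = marks.sort()   # sorts in place and returns None
print(marks)            # [79, 90, 92, 100]
print(result)           # None
```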

Dictionaries

A dictionary maps keys to values. Reach for it when each piece of data has a name rather than a position. A row of a spreadsheet is naturally a dictionary: column name to cell value.

student = {"name": "Ada", "score": 91, "major": "CS"}
print(student["name"])          # look up by key
student["year"] = "Junior"      # add a new key
print(student.keys())
print(student.values())
Ada
dict_keys(['name', 'score', 'major', 'year'])
dict_values(['Ada', 91, 'CS', 'Junior'])
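Looking up a key that does not exist with square brackets raises a KeyError, so it helps to know the safer alternatives. The sketch below uses a separate dictionary named record (an illustrative name) to show membership tests, .get with a default, and looping over key and value pairs.

```python
record = {"name": "Ada", "score": 91, "major": "CS"}

print("year" in record)               # False
print(record.get("year"))             # None
print(record.get("year", "unknown"))  # unknown

# .items() yields (key, value) pairs in insertion order.
for key, value in record.items():
    print(f"{key}: {value}")
# name: Ada
# score: 91
# major: CS
```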

Loops and conditionals

A for loop walks through a collection one element at a time. An if statement runs code only when a condition holds. Together they let you make decisions across a dataset.

scores = [88, 92, 79, 95, 84]

for s in scores:
    if s >= 90:
        print(f"{s}: A")
    elif s >= 80:
        print(f"{s}: B")
    else:
        print(f"{s}: needs review")
88: B
92: A
79: needs review
95: A
84: B
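Two built-in helpers make for loops more useful on real data: enumerate hands you a position along with each element, and zip walks two lists in step. The lists and names in this sketch are invented for the example.

```python
names = ["Ada", "Babbage", "Curie"]
points = [88, 92, 79]

# enumerate pairs each element with a running position.
for position, student_name in enumerate(names, start=1):
    print(position, student_name)
# 1 Ada
# 2 Babbage
# 3 Curie

# zip pairs up elements from two lists; here we also keep a running total.
total = 0
for student_name, p in zip(names, points):
    total += p
    print(f"{student_name}: {p}")

print(f"average: {total / len(points):.2f}")  # average: 86.33
```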

When the goal is to build a new list out of an old one, Python offers a tighter form called a list comprehension. It reads almost like the English sentence “the square of s for each s in scores”.

squared = [s ** 2 for s in scores]
passed = [s for s in scores if s >= 85]   # filter while you build
print(squared)
print(passed)
[7744, 8464, 6241, 9025, 7056]
[88, 92, 95]
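A comprehension is shorthand for a loop that appends to a list. The sketch below writes the same filter both ways to show they agree, then uses the same pattern to build a dictionary. The names points, passed_loop, passed_comp, and lengths are illustrative.

```python
points = [88, 92, 79, 95, 84]

# The long form: start empty, loop, test, append.
passed_loop = []
for p in points:
    if p >= 85:
        passed_loop.append(p)

# The comprehension: the same three ideas on one line.
passed_comp = [p for p in points if p >= 85]

print(passed_loop == passed_comp)  # True

# Curly braces with key: value build a dictionary instead of a list.
lengths = {word: len(word) for word in ["data", "science", "llm"]}
print(lengths)  # {'data': 4, 'science': 7, 'llm': 3}
```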

Functions

A function packages a piece of logic behind a name so you can reuse it and read your own code months later. Define it once with def, then call it as often as you like.

def letter_grade(score):
    """Return a letter grade for a 0-100 score."""
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    return "F"

print(letter_grade(91))
print([letter_grade(s) for s in scores])
A
['B', 'A', 'C', 'A', 'B']

The triple-quoted line just under def is a docstring. It is optional, but writing one short sentence about what the function does is a habit that pays for itself.
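Functions can also give a parameter a default value, which the caller may override by name. The function is_passing below is a hypothetical helper written for this example, not part of the course code; it also shows that the docstring is stored on the function itself.

```python
def is_passing(score, cutoff=85):
    """Return True when score is at or above cutoff."""
    return score >= cutoff

print(is_passing(88))             # True   (uses the default cutoff of 85)
print(is_passing(88, cutoff=90))  # False  (the caller overrides the default)
print(is_passing.__doc__)         # Return True when score is at or above cutoff.
```

In a notebook, help(is_passing) prints the same docstring along with the function signature.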

Importing libraries

Python on its own is deliberately small. The real power comes from libraries that other people have written, which you pull in with import. The whole second half of this course leans on a handful of them, and you will see this same import block at the top of nearly every page.

import numpy as np
import pandas as pd

print("numpy", np.__version__)
print("pandas", pd.__version__)
numpy 2.4.6
pandas 3.0.3

The as np part gives the library a short nickname so you do not have to type the full name every time. The nicknames np for numpy and pd for pandas are universal conventions. Use them, and any data scientist will instantly recognize your code.
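The same import forms work for the modules that ship with Python itself. Here is a small sketch using the standard library's math and statistics modules; the alias stats is just an illustration of the as syntax, not an established convention like np or pd.

```python
import math                 # import the whole module, use it as math.<name>
import statistics as stats  # import under a shorter alias
from math import sqrt       # import one name directly

print(math.floor(87.6))                  # 87
print(sqrt(16))                          # 4.0
print(stats.mean([88, 92, 79, 95, 84]))  # 87.6
```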

A first taste of data

Let us close with a 30-second preview of where Module 2 is headed. NumPy gives us fast arrays of numbers, and pandas gives us the DataFrame: a table with named columns, which is the single most important object in the course.

import numpy as np
import pandas as pd

# A reproducible random-number generator. The seed makes the "random"
# numbers come out the same every time the page is rendered, which is
# exactly what we want for teaching.
rng = np.random.default_rng(seed=42)

df = pd.DataFrame({
    "student": ["Ada", "Babbage", "Curie", "Dirac", "Euler"],
    "score": rng.integers(low=70, high=100, size=5),
    "hours_studied": rng.integers(low=2, high=12, size=5),
})
df
   student  score  hours_studied
0      Ada     72             10
1  Babbage     93              2
2    Curie     89              8
3    Dirac     83              4
4    Euler     82              2
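The claim in the comment about the seed can be checked directly. In this sketch, two generators built with the same seed produce identical numbers, and every value respects the bounds, with high excluded. The names first and second are illustrative.

```python
import numpy as np

# Two separate generators, same seed: the draws match element by element.
first = np.random.default_rng(seed=42).integers(low=70, high=100, size=5)
second = np.random.default_rng(seed=42).integers(low=70, high=100, size=5)

print((first == second).all())                  # True
print(first.min() >= 70 and first.max() < 100)  # True: low is included, high is not
```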

That object is a DataFrame. It already knows how to summarize itself:

df.describe()
           score  hours_studied
count   5.000000        5.00000
mean   83.800000        5.20000
std     7.981228        3.63318
min    72.000000        2.00000
25%    82.000000        2.00000
50%    83.000000        4.00000
75%    89.000000        8.00000
max    93.000000       10.00000

And it can answer a question with a single readable line. “Show me only the students who scored at least 85” looks like this:

df[df["score"] >= 85]
   student  score  hours_studied
1  Babbage     93              2
2    Curie     89              8
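The filter is less magical once it is split into two steps: the comparison produces a column of True and False values, and the square brackets keep only the rows marked True. The sketch below uses a small hand-typed table named demo (an illustrative name, with fixed values instead of random ones) so the original df is left alone.

```python
import pandas as pd

demo = pd.DataFrame({
    "student": ["Ada", "Babbage", "Curie"],
    "score": [72, 93, 89],
})

# Step 1: the comparison gives one True/False per row.
mask = demo["score"] >= 85
print(mask.tolist())             # [False, True, True]

# Step 2: indexing with the mask keeps only the True rows.
high = demo[mask]
print(high["student"].tolist())  # ['Babbage', 'Curie']

# Conditions combine with & (and) or | (or); each needs its own parentheses.
both = demo[(demo["score"] >= 85) & (demo["score"] < 90)]
print(both["student"].tolist())  # ['Curie']
```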

If that last line feels a little magical, good. Pulling exactly the rows you want out of a table is the heart of Week 3, and by the end of Module 2 you will write lines like that without thinking. For now, the takeaway from Week 1 is simply that the tools are installed, code cells run, and a table of data is something Python can hold and question.

Tip: Practice before next week

Re-open this page and change things. Swap the seed in default_rng, raise the cutoff in the filter from 85 to 95, add a classmate to the DataFrame. Nothing here can break in a way that a restart will not fix, so experiment freely. The fastest way to get comfortable is to make a small change, predict the output, then run it.