As I mentioned in the last post, to make our lessons more meaningful, we’ll be working with a fictional dataset that contains demographic, socioeconomic, and health variables. Think of it as data collected from a survey.
The dataset has 100 rows, and each row represents one person.
Included Variables
Demographic Variables
- ID – A unique ID number for each person (integer)
- State – The U.S. state where the person lives (string)
- City – The person’s city of residence (string)
- Year of Birth – The person’s birth year (integer)
- Age – The person’s age, based on the current year (integer)
- Sex – Gender ("Male" or "Female") (string)
- Race – Racial category ("White", "Black", "Hispanic", etc.) (string)
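Since Age is based on Year of Birth and the current year, the relationship can be sketched in a couple of lines. The value of current_year below is an assumption made for illustration, chosen so the result matches the age shown in the example row later in this post.

```python
year_of_birth = 1991   # hypothetical birth year for one person
current_year = 2025    # assumption: the year the survey was taken

# Age is the difference between the two years.
age = current_year - year_of_birth
print("Age:", age)  # Age: 34
```

This ignores whether the person's birthday has already passed in the current year, which is fine for a rough survey variable.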
Socioeconomic Variables
- Education – Highest level of education completed (e.g. "High School", "Bachelor's") (string)
- Income – Annual income in dollars (integer; some values may be missing)
- Employment Status – Whether they’re employed, unemployed, a student, or retired (string)
Health Variables
- Self-Rated Health – How the individual perceives their own health ("Excellent", "Good", "Fair", "Poor") (string)
- Hypertension – Whether the individual has high blood pressure ("Yes", "No") (string)
- BMI – Body Mass Index, calculated from height and weight (float; may contain missing values)
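Income and BMI may contain missing values. One common way to mark a missing value in plain Python is None. The sketch below assumes that convention; the dataset itself may use a different marker.

```python
income = None   # assumption: None stands in for a missing income
bmi = 24.5      # this person's BMI was recorded

# "is None" checks whether a value is missing.
income_missing = income is None
bmi_missing = bmi is None

print("Income missing?", income_missing)  # Income missing? True
print("BMI missing?", bmi_missing)        # BMI missing? False
```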
Example Data Rows
Here’s what two rows in the dataset might look like (the Year of Birth column is left out here).
ID | State | City | Age | Sex | Race | Education | Income | Employment Status | Self-Rated Health | Hypertension | BMI |
---|---|---|---|---|---|---|---|---|---|---|---|
123456 | Texas | Houston | 34 | Male | Black | Bachelor’s | 55000 | Employed | Good | No | 24.5 |
987654 | California | Los Angeles | 29 | Female | White | High School | 32000 | Student | Excellent | No | 22.1 |
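As a preview, the second row of the table could be written as a Python dictionary, which pairs each column name with its value. Dictionaries have not been covered in this series yet, so treat this as a sketch of where we are headed rather than something you need to understand now. The name row is just an illustrative choice.

```python
# One survey row: each column name (key) maps to that person's value.
row = {
    "ID": 987654,
    "State": "California",
    "City": "Los Angeles",
    "Age": 29,
    "Sex": "Female",
    "Race": "White",
    "Education": "High School",
    "Income": 32000,
    "Employment Status": "Student",
    "Self-Rated Health": "Excellent",
    "Hypertension": "No",
    "BMI": 22.1,
}

print("Lives in:", row["City"])  # Lives in: Los Angeles
print("BMI:", row["BMI"])        # BMI: 22.1
```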
Understanding Data Types
When working with survey data, you’ll often see a mix of data types.
- Strings: Text-based values like states and cities.
- Integers: Whole numbers like age and income.
- Floats: Decimal numbers, like BMI.
- Categorical Variables: Variables that take one of a limited set of values, like education level or employment status.
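To see these types in Python, you can pass a value to the built-in type() function. The sketch below uses values from the first example row. Plain Python has no separate categorical type, so a categorical value such as employment status is stored here as a string.

```python
state = "Texas"                  # string
age = 34                         # integer
bmi = 24.5                       # float
employment_status = "Employed"   # categorical value, stored as a string

print(type(state))              # <class 'str'>
print(type(age))                # <class 'int'>
print(type(bmi))                # <class 'float'>
print(type(employment_status))  # <class 'str'>
```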
Quick Practice
Let’s create a small version of our dataset using Python variables. We learned how to create variables in Python in the last post.
person_id = 123456 # Unique ID (integer)
state = "Texas" # State name (string)
age = 34 # Age in years (integer)
income = 55000 # Income in dollars (integer)
is_employed = True # Boolean variable for employment status
print("Person ID:", person_id)
print("Lives in:", state)
print("Age:", age)
print("Annual Income:", income)
print("Employed?", is_employed)
Breaking Down the Code
Let’s analyze each part of our code to understand what’s happening and how it relates to our dataset. Below are some questions I get from students.
1. What Comes Immediately After print?
After print, we see a string (in quotes) followed by a variable. The string inside print() is not tied to the dataset; it’s just a label that makes the output readable.
Each print() line has:
- A label in quotes (e.g., "Person ID:")
- A variable (e.g., person_id)
Example:
print("Person ID:", person_id)
- "Person ID:" → This is a string that acts as a label.
- person_id → This is a variable holding a value (an integer in this case).
- The comma (,) separates the label and the actual variable value.
Output:
Person ID: 123456
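Notice the space between the label and the number in the output. print() converts each comma-separated argument to text and joins them with a single space by default. The sketch below builds the same line by hand to show what is going on; the variable name line is only for illustration.

```python
person_id = 123456

# print() joins its arguments with a single space.
print("Person ID:", person_id)             # Person ID: 123456

# The same text built by hand: label + space + the number as text.
line = "Person ID:" + " " + str(person_id)
print(line)                                # Person ID: 123456

# The optional sep argument replaces the default space.
print("Person ID:", person_id, sep=" -> ")  # Person ID: -> 123456
```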
Keep in mind that the label does NOT change the variable name.
The following lines all print the same value. We’re just changing how we describe the output.
They all work because the variable name (is_employed) does not change; only the text inside print() changes.
print("Is the person employed?", is_employed)
print("Job Status:", is_employed)
print("Employment Check:", is_employed)
print("Works for a company?", is_employed)
2. Why Use Colons (:) in print Statements?
The colon (:) in print statements is just my preference for formatting the output.
Example:
print("Annual Income:", income)
Prints like this:
Annual Income: 55000
It separates the label from the value, which makes the output more readable.
But you could also do:
print("Annual Income", income)
Which prints:
Annual Income 55000
Both are fine, but the colon just makes it a little easier to read.
In the next post, we’ll start working with different data types and operations.
Recommended Python Books
- Fluent Python: Clear, Concise, and Effective Programming
- Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python
- Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter
- Learning Python: Powerful Object-Oriented Programming
- Python Crash Course
- Python Programming for Beginners
- Pandas Cookbook