Lab 5: Names names names

1 Overview

Today’s lab is all about names — specifically, baby names! The U.S. government has kept pretty good statistics about what first names people have been giving their babies for more than 100 years, so it’s a fun data source to get into.

Besides continuing to exercise and develop your general comfort and confidence handling real data sources with Python and pandas, you will also get good practice with parallel programming in today’s lab, like in the recent unit from class.

As usual, you will fill in a markdown file answering specific questions as you go, as well as turn in your working code. This lab has multiple parts, so be sure to pace yourself and reach out for help if you get stuck.

1.1 Deadlines

Milestone: 2359 on Wednesday, 25 March
Complete lab: 2359 on Wednesday, 1 April

1.2 Learning goals

Gain independence and practice with data download and wrangling using Python, pandas, and numpy
Write multiprocessing code to perform multiple tasks at once in parallel

2 Data acquisition (20 pts)

2.1 Markdown file to fill in

Run git pull in your sd212 directory.

You should see a new folder for this lab, with a lab05.md file for you to fill in and submit.

2.2 Baby names

The U.S. Social Security Administration runs a project that collects baby name statistics going back to 1910 on a state by state basis.

Go to this page and download the zip file marked for “state-specific data”.

That zip file has a bunch of txt files that will clutter up your directory. Instead do the following so that your txt files are in their own data subdirectory:

Go into your directory for this lab.
Make a subdirectory there called data
Extract the zip file in the sd212/lab05/data directory using the unzip command

Use your command line skills to look around at the data files and understand the general layout. Look at the documentation included within the zip file or on the website so you understand what the numbers mean.

Pick a (first) name, birth year, and state that will count as “your name”, “your state”, etc., for the purposes of this lab.

(It can be your actual name, but doesn’t need to be. It just needs to be in the dataset! Do something like

grep 'Daniel,' data/de.txt

and make sure you get at least a dozen or so entries for your chosen name and state.)

Answer a few questions:

What will be “your” name, birth year, and state, for the purposes of this lab?

Write your answer using the state abbreviation like: Madonna 1958 MI
Find yourself in the dataset. How many people with your name (and any sex) were born in your state in your birth year?
Which year had the most births with “your” name in your state?
Which state had the most births with your name in your birth year?

2.3 U.S. regions by state

Go to this page to find a nice table which shows all the U.S. states organized into five regions.

Download the table and save it as a plain-text file. Then use Python and/or bash to convert it to a nice CSV file that is comma-separated with a header row, etc. This should not be difficult! Save your file as regions.csv.
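As a rough sketch of the conversion step, assuming your saved text file has one state per line with tab-separated fields (the real table may be laid out differently, so check it first), Python's csv module can write the header row and handle the commas for you. The function name, column order, and sample rows below are hypothetical; the header names are just a guess that lines up with the example dataframe later in this lab.

```python
import csv
import io


def text_to_csv(raw_text, header):
    """Convert tab-separated text into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on tabs and trim stray whitespace around each field
        fields = [field.strip() for field in line.split("\t")]
        writer.writerow(fields)
    return out.getvalue()


# Hypothetical sample rows; the real downloaded table may look different.
sample = "Wyoming\tWY\tWest\nTexas\tTX\tSouthwest\n"
csv_text = text_to_csv(sample, ["State Name", "state", "Region"])
print(csv_text)
```

In a real script you would read the text from your saved file and write the result to regions.csv instead of printing it.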

Which region is “your” state in, and how many states in total are part of that region?

2.4 Submit

Get the ball rolling with your initial submission:

submit -c=sd212 -p=lab05 lab05.md regions.csv

club -csd212 -plab05 lab05.md regions.csv

or use the web interface

3 Get the data into pandas (30 pts)

Your first substantial goal is to gather all the data from the data directory and regions.csv into a single pandas dataframe that looks something like this:

        state sex  year      name  count State Name     Region
0          WY   F  1910      Mary     27    Wyoming       West
1          WY   F  1910  Margaret     22    Wyoming       West
2          WY   F  1910     Helen     13    Wyoming       West
3          WY   F  1910     Alice     10    Wyoming       West
4          WY   F  1910   Dorothy      9    Wyoming       West
...       ...  ..   ...       ...    ...        ...        ...
6541959    TX   M  2024    Ziaire      5      Texas  Southwest
6541960    TX   M  2024    Zubair      5      Texas  Southwest
6541961    TX   M  2024     Zyion      5      Texas  Southwest
6541962    TX   M  2024     Zylan      5      Texas  Southwest
6541963    TX   M  2024    Zymere      5      Texas  Southwest

[6541964 rows x 7 columns]

Create a program ingest.py that reads the .txt files from the data/ subdirectory, as well as the regions.csv file, and creates a single DataFrame with columns like shown above (perhaps in a different order).

Put your code in a function get_names() that takes no arguments and just returns the dataframe. Design it well so that if I create a separate file and do something like

from ingest import get_names

names = get_names()
print(names[names['year'] == 1985].sort_values(by=['count']).iloc[-20:])

Then it should work and print the 20 rows showing the most popular baby names in 1985.

Your get_names() function must use multiprocessing to read each file in parallel before combining them into a single dataframe.

I’m not going to tell you exactly how to do this! This kind of “data ingest” can feel tedious but it’s something you should be well prepared to handle by now. Here are some hints to get going in the right direction:

You’ll need to iterate over all the txt files in the data directory. Look back at the credit cards lab where we saw how to use the pathlib module to do something similar.
You will want to write a function to read in just a single file into a new dataframe, and return that dataframe. Note that these files don’t have a header line so you’ll need to specify the headers yourself when you call read_csv.
Then use a ProcessPoolExecutor from Python’s concurrent.futures module to launch your concurrent tasks, essentially reading in all the names files in parallel into separate dataframes. (Look back at your notes from class on how to do that.)
Put all the individual dataframes into a list, and then use pd.concat to combine them into one big dataframe
Separately, read in the regions.csv file you created earlier and use pd.merge to get the final big result as shown above.
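The hints above can be tied together in a small sketch like the following. It is only one possible shape, not the required design: the helper names read_one and read_all, the column names, and the tiny made-up files are all assumptions for illustration, and your real get_names() takes no arguments.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import tempfile

import pandas as pd

# Column names are an assumption based on the example output above.
COLUMNS = ["state", "sex", "year", "name", "count"]


def read_one(path):
    """Read a single headerless names file into its own DataFrame."""
    return pd.read_csv(path, names=COLUMNS)


def read_all(data_dir, regions_path):
    """Read every .txt file in parallel, then attach the region columns."""
    # Adjust the pattern if your extracted files are named differently.
    paths = sorted(Path(data_dir).glob("*.txt"))
    with ProcessPoolExecutor() as pool:
        frames = list(pool.map(read_one, paths))
    names = pd.concat(frames, ignore_index=True)
    regions = pd.read_csv(regions_path)
    # Assumes regions.csv has a "state" column holding the abbreviation.
    return pd.merge(names, regions, on="state")


if __name__ == "__main__":
    # Tiny made-up files standing in for the real data directory.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        (data / "wy.txt").write_text("WY,F,1910,Mary,27\nWY,F,1910,Margaret,22\n")
        (data / "tx.txt").write_text("TX,M,2024,Zubair,5\n")
        regions_file = Path(tmp) / "regions.csv"
        regions_file.write_text(
            "state,State Name,Region\nWY,Wyoming,West\nTX,Texas,Southwest\n"
        )
        print(read_all(data, regions_file))
```

The worker function lives at module level so that it can be sent to the worker processes, and the demo sits under the `if __name__ == "__main__":` guard so that importing the file does not start a pool.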

(Your rows might be in a different order, but check the number of rows and the column headers to see that you have it working correctly.)

3.1 Submit

Save your files and submit everything:

submit -c=sd212 -p=lab05 lab05.md regions.csv ingest.py

club -csd212 -plab05 lab05.md regions.csv ingest.py

or use the web interface

3.2 Milestone

For this lab, the milestone means everything up to this point, which includes the following auto-tests:

md_part3
md_part4

Remember: this milestone is not necessarily the half-way point. Keep going!

4 Who am I? (40 pts)

Now let’s do some data analysis! The goal of this part is to create a program who.py that uses demographic data to make a “guess” of a person’s sex, region, and age based on their first name only.

Here are some example runs:

roche@ubuntu$ python3 who.py
Name: Karen
Karen is most likely Female from the Midwest between 59 and 74 years old

roche@ubuntu$ python3 who.py
Name: Dante
Dante is most likely Male from the Northeast between 11 and 30 years old

There is a lot that goes into making this guess! Let’s break it down. I strongly suggest you work carefully through the sex and region first before thinking about the age (which is more difficult).

4.1 Read the name

This part seems simple — you just need to make an input() call to get the name that the user types in.

But the twist is that you need to do this in a multiprocessing way. Specifically, your program should be concurrently reading in all the dataframes from the files and combining them, while it waits for the user to type in their search name in the terminal.

Why does this make sense? We know that I/O is slow, and the very slowest kind of I/O is one where the computer has to wait for a human to type something in! By doing this concurrently, the application will feel more “snappy” because your code is doing a lot of prep work in the background while waiting for the human to type something in.

To do this, you need to go back to ingest.py from the first part of the lab. Basically, you want to have just one more task submitted to the ProcessPoolExecutor, where that new task is essentially just calling input() to read the chosen name from the terminal.

Don’t change the get_names() function you already have written — that would break the previous parts of the lab. Instead, just copy the logic into a new function get_names_and_who() in your who.py. This should do the same thing as get_names() (reading in the files in parallel and combining them into a single dataframe), but adding just one more concurrent task to call input and read the name from the terminal. When it’s all finished, your get_names_and_who() should return a tuple with the complete dataframe, and a single string for the name the user typed in.

Now test it! Add a few lines in your who.py to just call get_names_and_who() and then print out the dataframe.

If you did it correctly, then if you take your sweet time to type in the name, the dataframe should print out immediately when you hit enter. Try it!

4.2 Sex

Now let’s try and guess the birth sex based on the name.

To start out with, you have a massive Pandas dataframe with each state/year/sex/name combination listed separately.

You will want to:

Select only the rows of the dataframe that match the given name
Group those rows according to the sex column
Add up the counts for each sex
Sort so that you can extract the sex with the larger count.

These are all the same kinds of things we have seen before with pandas at various times.

The trickiest step is probably the grouping and adding up by group. Rather than telling you how to do that myself, I just searched the web for how to add one column according to another column and found this short and sweet StackOverflow answer. Go read it! StackOverflow is a great resource and definitely OK to use as long as you add a short comment with # to give the citation of where you got your information.
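To see the general shape of the four steps on something small, here is a sketch using a toy table. The names and counts are made up for illustration only; the column names are assumed to match the example dataframe from the ingest part.

```python
import pandas as pd

# Made-up counts for illustration only, not real statistics.
toy = pd.DataFrame({
    "name": ["Sam", "Sam", "Sam", "Alex"],
    "sex": ["M", "F", "M", "F"],
    "count": [10, 4, 7, 3],
})

# Keep the rows for one name, then add up the count column within each sex.
sam = toy[toy["name"] == "Sam"]
totals = sam.groupby("sex")["count"].sum()

# Sort descending so the first index label is the sex with the larger total.
likely_sex = totals.sort_values(ascending=False).index[0]
print(totals)
print(likely_sex)
```

Here the grouped result is a Series indexed by sex, which is why the answer comes from the index rather than from the values.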

After this, try modifying your who.py program so that it asks for a name and then gives just a sex prediction.

Which of the following names (list all) have more female births than male?
a. Justice
b. Kris
c. Elisha
d. Kerry
e. Robbie
f. Jaylin
(Enter just the letters of your choices.)

4.3 Region

Figuring out the most likely region will be very similar to discovering the most likely sex, but be careful that you first filter down to the most likely sex before finding the most likely region.

For example, the name “Avery” is overall most popular in the Southeast. But this name is more commonly applied to Females overall, and among female babies, “Avery” occurred more frequently in the Midwest. So for that example, you would want your program to predict Female from the Midwest.
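The order of the steps matters, and a toy table makes that easier to see. The counts below are made up and chosen so that the overall answer and the per-sex answer disagree, as in the Avery example; top_group is a hypothetical helper, and idxmax is used here as one alternative to sorting.

```python
import pandas as pd

# Made-up counts chosen so the overall and per-sex answers differ.
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "Region": ["Midwest", "Southeast", "Midwest", "Southeast"],
    "count": [60, 50, 10, 40],
})


def top_group(frame, column):
    """Return the label in `column` with the largest total count."""
    return frame.groupby(column)["count"].sum().idxmax()


overall_region = top_group(toy, "Region")  # ignores sex entirely
likely_sex = top_group(toy, "sex")
# Filter down to the most likely sex first, then pick the region.
region_for_sex = top_group(toy[toy["sex"] == likely_sex], "Region")
print(overall_region, likely_sex, region_for_sex)
```

In this toy data the Southeast has the most births overall (90 versus 70), but among the more common sex the Midwest is ahead (60 versus 50), so the prediction should be the Midwest.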

Get your who.py program working to give sex and region predictions.

What region is “Oleg” most likely from?

4.4 Age range

Now you are ready for the toughest part, calculating the age range.

The goal here is to find the smallest age range that covers at least 51% of the births for the given name and calculated sex and region.

A few details on what we are looking for here:

Assume no one ever dies. (So for example, we’ll suppose rather optimistically that all Helens born in Wyoming in 1925 are still alive.)
Don’t worry about birthdays; estimate age as simply (2026 - birthyear). Those Wyoming Helens from 1925 are all 101 years old.
By “smallest age range”, we mean the smallest span of years to cover at least 51% of the total.
If there are multiple possibilities with the same smallest span, return the one for the youngest people.

In the “Karen” example above, the shortest span is 15 years. Both (ages 66 to 81) and (ages 59 to 74) pass the 51% mark, but your program should print ages 59 to 74 since that’s the youngest/most recent.

(The reason to prefer younger ages here is because we aren’t really properly accounting for deaths.)

The approach I recommend to figure this out in code is something like this:

Use pandas to isolate the series that you want: the birth counts for the given name, sex, and region, organized by year.
First count the total number of births for the given name, sex, and region, as a single number.
Multiply that by 51 percent to get your target number
Now make a loop to try out each starting year between the earliest year in the data set and the current year.
Within this outer loop, make an inner loop for the ending year, from that starting year until the current year
Within the inner loop, actually add up all the births in that (starting,ending) year range.
If the total matches or exceeds your target number, break the inner loop and print out the (starting,ending) year range as a potential answer.
Once that works, use a few variables to just keep track of the shortest (starting,ending) year range rather than printing it out.
Finally, convert the best (starting,ending) year pair to ages and pat yourself on the back. You got it!
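The nested-loop idea above can be sketched on plain Python data before you wire it up to pandas. This is a minimal illustration under some assumptions: the births are given as a dictionary from year to count, the function name shortest_span is hypothetical, the loops only run over the years present in that dictionary, and the inner loop keeps a running total instead of re-adding the whole range each time.

```python
def shortest_span(births_by_year, fraction=0.51):
    """Return the (start, end) years of the shortest span covering at
    least `fraction` of all births, preferring the most recent span
    when several spans are equally short."""
    total = sum(births_by_year.values())
    target = fraction * total
    first, last = min(births_by_year), max(births_by_year)
    best = None
    for start in range(first, last + 1):
        running = 0
        for end in range(start, last + 1):
            running += births_by_year.get(end, 0)  # missing years count as zero
            if running >= target:
                # "<=" lets a later (younger) span replace an equally short one
                if best is None or end - start <= best[1] - best[0]:
                    best = (start, end)
                break  # a longer span from this start cannot be better
    return best


# Made-up counts: three different 2-year spans all reach 51% of 130 births.
toy = {2000: 30, 2001: 30, 2002: 10, 2003: 30, 2004: 30}
start, end = shortest_span(toy)
youngest, oldest = 2026 - end, 2026 - start  # age = 2026 - birth year
print(start, end, youngest, oldest)
```

With these toy numbers the spans starting in 2000, 2001, and 2002 are all equally short, and the sketch keeps the last one, 2002 to 2004, which corresponds to ages 22 to 24.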

4.5 Check yourself

Be sure to test your who.py program on plenty of names. Some less popular names have missing entries for various states and years, maybe even for entire regions — and your code should handle that!

Double-check that the formatting of what your program prints is also accurate.

And of course you should answer some questions:

What is the most likely sex, region, and age range for “your” name (for the purposes of this lab)?
If we released this tool on a website for public use, what potential ethical issues can you imagine might arise? Are there any groups or individuals who could be harmed by the use of this tool? Do you have any ideas on how to try and reduce that potential harm?

4.6 Submit

Save your files and submit everything so far:

submit -c=sd212 -p=lab05 lab05.md regions.csv ingest.py who.py

club -csd212 -plab05 lab05.md regions.csv ingest.py who.py

or use the web interface

5 Visualize (10 pts)

You probably knew this was coming! So far your who.py program is really cool in its attempt to predict sex, region, and age, but it is only giving a small glimpse of the overall data.

For example, “Riley” and “Maria” are both most likely to be female names, but the female/male ratio for “Riley” is close to even whereas for “Maria” it’s closer to 100:1.

Similarly, “Elijah” and “James” are both most likely to be born in the Southeast, but James is more evenly distributed around the country, whereas the prevalence of Elijah in the southeast is almost double what it is in any other region.

Your task is to write a program whoviz.py which reads in a name (like before) and creates a visualization that gives a richer picture than just the most likely categories.

It is open-ended exactly what this should look like, so be creative and come up with something cool! It’s okay if your visualization only covers some part of the information about that name (sex, location, and age range), but it would be really impressive if you could incorporate multiple aspects together!

Either way, your visualization must give a clear, easy to understand picture of the full distribution of that name along one or more categories.

Some ideas of what you could try:

Shading a map of the 50 U.S. states according to name distribution
A bar graph showing the relative frequency of the name among different age groups or generations (Silent gen, Boomers, Gen X, Millennials, Gen Z, etc)
Maybe use shading to show the male/female ratio for a name

Submit your code as whoviz.py which should read in a name from the user, then generate and pop up a visualization.

Run your own code for “your” name and save that image as me.png to submit as well.

Describe briefly what your visualization is showing.
Tell me another name I should try where your visualization is interesting or nice-looking in some way. (And explain your choice briefly.)

5.1 Submit

Save your files and submit everything so far:

submit -c=sd212 -p=lab05 lab05.md regions.csv ingest.py who.py whoviz.py me.png

club -csd212 -plab05 lab05.md regions.csv ingest.py who.py whoviz.py me.png

or use the web interface

SD 212 Spring 2026 / Labs