Lab 5: Names names names
1 Overview

Today’s lab is all about names — specifically, baby names! The U.S. government has kept pretty good statistics about what first names people have been giving their babies for more than 100 years, so it’s a fun data source to get into.
Besides continuing to exercise and develop your general comfort and confidence handling real data sources with Python and pandas, you will also get good practice with parallel programming in today’s lab, like in the recent unit from class.
As usual, you will fill in a markdown file answering specific questions as you go, as well as turn in your working code. This lab has multiple parts, so be sure to pace yourself and reach out for help if you get stuck.
1.1 Deadlines
- Milestone: 2359 on Wednesday, 25 March
- Complete lab: 2359 on Wednesday, 1 April
1.2 Learning goals
- Gain independence and practice with data download and wrangling using Python, pandas, and numpy
- Write multiprocessing code to perform multiple tasks at once in parallel
2 Data acquisition (20 pts)
2.1 Markdown file to fill in
Run git pull in your sd212 directory.
You should see a new folder for this lab, with a
lab05.md file for you to fill in and submit.
2.2 Baby names
The U.S. Social Security Administration runs a project that collects baby name statistics going back to 1910 on a state by state basis.
Go to this page and download the zip file marked for “state-specific data”.
That zip file has a bunch of txt files that will clutter up your
directory. Instead do the following so that your txt files are in their
own data subdirectory:
- Go into your directory for this lab.
- Make a subdirectory there called data
- Extract the zip file in the sd212/lab05/data directory using the unzip command
Use your command line skills to look around at the data files and understand the general layout. Look at the documentation included within the zip file or on the website so you understand what the numbers mean.
Pick a (first) name, birth year, and state that will count as “your name”, “your state”, etc., for the purposes of this lab.
(It can be your actual name, but doesn’t need to be. It just needs to be in the dataset! Do something like
grep 'Daniel,' data/de.txt
and make sure you get at least a dozen or so entries for your chosen name and state.)
Answer a few questions:
What will be “your” name, birth year, and state, for the purposes of this lab?
Write your answer using the state abbreviation like:
Madonna 1958 MI
Find yourself in the dataset. How many people with your name (and any sex) were born in your state in your birth year?
Which year had the most births with “your” name in your state?
Which state had the largest number of births with your name in your birth year?
2.3 U.S. regions by state
Go to this page to find a nice table which shows all the U.S. states organized into five regions.
Download the table and save it as a plain-text file. Then use Python
and/or bash to convert it to a nice CSV file that is comma-separated
with a header row, etc. This should not be difficult! Save your file as
regions.csv.
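One possible way to do the conversion in Python is sketched below. It is only an illustration under assumptions: it supposes your saved plain-text file has one state per line with tab-separated fields in the order state name, abbreviation, region (the real page may be laid out differently), and the function name text_to_csv and the header names are hypothetical choices, picked to line up with the dataframe columns used later in this lab.

```python
import csv
import io

def text_to_csv(text, header=("State Name", "state", "Region")):
    """Convert tab-separated lines into CSV text with a header row.

    ASSUMPTION: each data line has exactly len(header) tab-separated fields.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != len(header):
            continue  # skip blank lines or anything that is not a data row
        writer.writerow(fields)
    return out.getvalue()

# Two sample rows in the assumed layout
sample = "Wyoming\tWY\tWest\nTexas\tTX\tSouthwest\n"
print(text_to_csv(sample), end="")
```

If your downloaded text already contains its own header line, you would want to drop that line first so the header does not appear twice.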
- Which region is “your” state in, and how many states in total are part of that region?
2.4 Submit
Get the ball rolling with your initial submission:
submit -c=sd212 -p=lab05 lab05.md regions.csv
or
club -csd212 -plab05 lab05.md regions.csv
or use the web interface
3 Get the data into pandas (30 pts)
Your first substantial goal is to gather all the data from the
data directory and regions.csv into a single pandas dataframe that
looks something like this:
         state  sex  year      name  count  State Name     Region
0           WY    F  1910      Mary     27     Wyoming       West
1           WY    F  1910  Margaret     22     Wyoming       West
2           WY    F  1910     Helen     13     Wyoming       West
3           WY    F  1910     Alice     10     Wyoming       West
4           WY    F  1910   Dorothy      9     Wyoming       West
...        ...   ..   ...       ...    ...         ...        ...
6541959     TX    M  2024    Ziaire      5       Texas  Southwest
6541960     TX    M  2024    Zubair      5       Texas  Southwest
6541961     TX    M  2024     Zyion      5       Texas  Southwest
6541962     TX    M  2024     Zylan      5       Texas  Southwest
6541963     TX    M  2024    Zymere      5       Texas  Southwest

[6541964 rows x 7 columns]
Create a program ingest.py that reads the .txt files from the
data/ subdirectory, as well as the regions.csv file, and creates a
single DataFrame with columns like those shown above (perhaps in a
different order).
Put your code in a function get_names() that takes no arguments and
just returns the dataframe. Design it well so that if I create a separate
file and do something like
from ingest import get_names
names = get_names()
print(names[names['year'] == 1985].sort_values(by=['count']).iloc[-20:])
then it should work and print the 20 rows with the highest counts in 1985, that is, the most popular baby names that year.
Your get_names() function must use multiprocessing
to read each file in parallel before combining them into a single
dataframe.
I’m not going to tell you exactly how to do this! This kind of “data ingest” can feel tedious but it’s something you should be well prepared to handle by now. Here are some hints to get going in the right direction:
- You’ll need to iterate over all the txt files in the data directory. Look back at the credit cards lab where we saw how to use the pathlib module to do something similar.
- You will want to write a function to read in just a single file into a new dataframe, and return that dataframe. Note that these files don’t have a header line so you’ll need to specify the headers yourself when you call read_csv.
- Then use a ProcessPoolExecutor from Python’s concurrent.futures module to launch your concurrent tasks, essentially reading in all the names files in parallel into separate dataframes. (Look back at your notes from class on how to do that.)
- Put all the individual dataframes into a list, and then use pd.concat to combine them into one big dataframe.
- Separately, read in the regions.csv file you created earlier and use pd.merge to get the final big result as shown above.

(Your rows might be in a different order, but check the number of rows and the column headers to see that you have it working correctly.)
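The hints above can be strung together roughly as in the sketch below. This is not the required solution, just one possible shape under assumptions: the helper names (read_one, read_all, add_regions) and the COLUMNS list are hypothetical, the column order is taken from the sample dataframe shown earlier, and regions.csv is assumed to have a column named state holding the two-letter abbreviation.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd

# ASSUMPTION: column order matches the sample dataframe shown in this lab
COLUMNS = ["state", "sex", "year", "name", "count"]

def read_one(path):
    """Read a single header-less names file into its own dataframe."""
    return pd.read_csv(path, names=COLUMNS)

def read_all(data_dir="data"):
    """Read every .txt file in data_dir in parallel and stack the results."""
    paths = sorted(Path(data_dir).glob("*.txt"))
    with ProcessPoolExecutor() as pool:
        # map() hands one path to each worker task and keeps the input order
        frames = list(pool.map(read_one, paths))
    return pd.concat(frames, ignore_index=True)

def add_regions(names, regions_file="regions.csv"):
    """Attach region info; ASSUMES regions.csv has a 'state' column."""
    regions = pd.read_csv(regions_file)
    return names.merge(regions, on="state")
```

Calling add_regions(read_all()) from inside your own get_names() would then produce the combined dataframe. Note that read_one is a top-level function on purpose: worker processes need to be able to import the function they are asked to run.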
3.1 Submit
Save your files and submit everything:
submit -c=sd212 -p=lab05 lab05.md regions.csv ingest.py
or
club -csd212 -plab05 lab05.md regions.csv ingest.py
or use the web interface
3.2 Milestone
For this lab, the milestone means everything up to this point, which includes the following auto-tests:
md_part3
md_part4
Remember: this milestone is not necessarily the half-way point. Keep going!
4 Who am I? (40 pts)
Now let’s do some data analysis! The goal of this part is to create
a program who.py that uses demographic data to make a “guess” of a
person’s sex, region, and age based on their first name only.
Here are some example runs:
roche@ubuntu$ python3 who.py
Name: Karen
Karen is most likely Female from the Midwest between 59 and 74 years old

roche@ubuntu$ python3 who.py
Name: Dante
Dante is most likely Male from the Northeast between 11 and 30 years old
There is a lot that goes into making this guess! Let’s break it down. I strongly suggest you work carefully through the sex and region first before thinking about the age (which is more difficult).
4.1 Read the name
This part seems simple — you just need to make an input() call to
get the name that the user types in.
But the twist is that you need to do this in a multiprocessing way. Specifically, your program should be concurrently reading in all the dataframes from the files and combining them, while it waits for the user to type in their search name in the terminal.
Why does this make sense? We know that I/O is slow, and the very slowest kind of I/O is one where the computer has to wait for a human to type something in! By doing this concurrently, the application will feel more “snappy” because your code is doing a lot of prep work in the background while waiting for the human to type something in.
To do this, you need to go back to ingest.py from the first part of
the lab. Basically, you want to have just one more task submitted to
the ProcessPoolExecutor, where that new task is essentially just
calling input() to read the chosen name from the terminal.
Don’t change the get_names() function you already have written —
that would break the previous parts of the lab. Instead, just copy
the logic into a new function get_names_and_who() in your who.py.
This should do the same
thing as get_names() (reading in the files in parallel and combining
them into a single dataframe), but adding just one more concurrent task
to call input and read the name from the terminal.
When it’s all finished, your get_names_and_who() should return a
tuple with the complete dataframe, and a single string for the name the
user typed in.
Now test it! Add a few lines in your who.py to just call
get_names_and_who() and then print out the dataframe.
If you did it correctly, then if you take your sweet time to type in the name, the dataframe should print out immediately when you hit enter. Try it!
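Here is a minimal sketch of the overlap idea, with one deliberate difference from the description above: instead of submitting the prompt as its own pool task, it asks for the name in the main process right after the file-reading tasks have been submitted. The reason is that worker processes in a process pool normally do not have the terminal as their standard input, so an input() call inside a worker may fail with EOFError. The function name load_while_asking and its parameters are hypothetical; load_one stands in for whatever single-file reader you wrote.

```python
from concurrent.futures import ProcessPoolExecutor

def load_while_asking(paths, load_one, ask=input,
                      executor_cls=ProcessPoolExecutor):
    """Start loading every path in the background, then prompt for a name.

    Returns (list_of_results, name). With ProcessPoolExecutor, load_one
    must be a top-level function so it can be sent to the workers.
    """
    with executor_cls() as pool:
        # submit() returns right away, so the workers start reading now
        futures = [pool.submit(load_one, p) for p in paths]
        # the main process waits on the human while the workers keep going
        name = ask("Name: ").strip()
        # gather results in submission order; this only blocks if the
        # workers are still busy after the name has been typed
        results = [future.result() for future in futures]
    return results, name
```

In who.py you would then combine the returned list (for example with pd.concat) and merge in the regions before returning the dataframe together with the name.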
4.2 Sex
Now let’s try and guess the birth sex based on the name.
To start out with, you have a massive Pandas dataframe with each state/year/sex/name combination listed separately.
You will want to:
- Select only the rows of the dataframe that match the given name
- Group those rows according to the sex column
- Add up the counts for each sex
- Sort so that you can extract the sex with the larger count.
These are all the same kinds of things we have seen before with pandas at various times.
The trickiest step is probably the grouping and adding up by group.
Rather than me tell you how to do that, I just searched the web for how
to add one column according to another column and found this short and
sweet StackOverflow answer.
Go read it! StackOverflow is a great resource and definitely OK to use
as long as you add a short comment with # to give the citation of
where you got your information.
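To make the four steps concrete, here is a small sketch on a made-up dataframe. The counts are invented for illustration (they are not real SSA numbers), and the function name sex_totals is hypothetical. It uses groupby followed by sum, which is one common way to add up one column according to another.

```python
import pandas as pd

# Made-up counts, purely for illustration
toy = pd.DataFrame({
    "name":  ["Sam", "Sam", "Sam", "Alex"],
    "sex":   ["F", "M", "M", "M"],
    "count": [5, 7, 6, 9],
})

def sex_totals(df, name):
    """Total count per sex for one name, sorted smallest to largest."""
    rows = df[df["name"] == name]            # select rows for this name
    totals = rows.groupby("sex")["count"].sum()  # add up counts per sex
    return totals.sort_values()              # largest total ends up last

totals = sex_totals(toy, "Sam")
print(totals)
print("Most likely:", totals.index[-1])
```

If the name does not appear in the data at all, the result is empty and totals.index[-1] would raise an error, so a real program should check for that case first.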
After this, try modifying your who.py program so that it asks for a
name and then gives just a sex prediction.
Which of the following names (list all) have more female births than male?
a. Justice
b. Kris
c. Elisha
d. Kerry
e. Robbie
f. Jaylin
(Enter just the letters of your choices.)
4.3 Region
Figuring out the most likely region will be very similar to discovering the most likely sex, but be careful that you first filter down to the most likely sex before finding the most likely region.
For example, the name “Avery” is overall most popular in the Southeast. But this name is more commonly applied to Females overall, and among female babies, “Avery” occurred more frequently in the Midwest. So for that example, you would want your program to predict Female from the Midwest.
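The order of operations matters here, and a tiny made-up example shows why. In the sketch below the counts are invented (only the pattern mirrors the “Avery” situation), and the function names top_category and guess_sex_and_region are hypothetical. It assumes the region column is called Region, as in the sample dataframe.

```python
import pandas as pd

# Invented counts: Southeast wins overall, but Midwest wins among females
toy = pd.DataFrame({
    "name":   ["Avery"] * 4,
    "sex":    ["F", "F", "M", "M"],
    "Region": ["Midwest", "Southeast", "Midwest", "Southeast"],
    "count":  [10, 8, 1, 6],
})

def top_category(df, column):
    """Value of `column` with the largest total count."""
    return df.groupby(column)["count"].sum().idxmax()

def guess_sex_and_region(df, name):
    """Pick the most likely sex first, then the region within that sex."""
    rows = df[df["name"] == name]
    sex = top_category(rows, "sex")
    region = top_category(rows[rows["sex"] == sex], "Region")
    return sex, region

print(top_category(toy, "Region"))         # region ignoring sex
print(guess_sex_and_region(toy, "Avery"))  # sex first, then region
```

With these toy numbers the two print statements disagree about the region, which is exactly the trap described above.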
Get your who.py program working to give sex and region
predictions.
- What region is “Oleg” most likely from?
4.4 Age range
Now you are ready for the toughest part, calculating the age range.
The goal here is to find the smallest age range that covers at least 51% of the births for the given name and calculated sex and region.
A few details on what we are looking for here:
- Assume no one ever dies. (So for example, we’ll suppose rather optimistically that all Helens born in Wyoming in 1925 are still alive.)
- Don’t worry about birthdays; estimate age as simply (2026 - birthyear). Those Wyoming Helens from 1925 are all 101 years old.
- By “smallest age range”, we mean the smallest span of years to cover at least 51% of the total.
- If there are multiple possibilities with the same smallest span, return the one for the youngest people.

In the “Karen” example above, the shortest span is 15 years. Both (ages 66 to 81) and (ages 59 to 74) pass the 51% mark, but your program should print ages 59 to 74 since that’s the youngest/most recent.
(The reason to prefer younger ages here is because we aren’t really properly accounting for deaths.)
The approach I recommend to figure this out in code is something like this:
Use pandas to isolate the series that you want: the birth counts for the given name, sex, and region, organized by year.
First count the total number of births for the given name, sex, and region, as a single number.
Multiply that by 51 percent to get your target number.
Now make a loop to try out each starting year between the earliest year in the data set and the current year.
Within this outer loop, make an inner loop for the ending year, from that starting year until the current year
Within the inner loop, actually add up all the births in that (starting,ending) year range.
If the total matches or exceeds your target number, break the inner loop and print out the (starting,ending) year range as a potential answer.
Once that works, use a few variables to just keep track of the shortest (starting,ending) year range rather than printing it out.
Finally, convert the best (starting,ending) year pair to ages and pat yourself on the back. You got it!
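The nested-loop approach above can be sketched in plain Python as follows. To keep it independent of pandas, this version assumes the yearly birth counts have already been pulled out into a dictionary mapping year to count; the function names best_year_range and to_ages are hypothetical, and the sample numbers are invented. Instead of re-adding the whole range each time, the inner loop keeps a running total, which gives the same sums.

```python
CURRENT_YEAR = 2026  # the lab's convention: age = 2026 - birth year

def best_year_range(births, first_year, last_year=CURRENT_YEAR, share=0.51):
    """births maps year -> count. Return the shortest (start, end) year
    range covering at least `share` of all births, preferring the most
    recent range on ties. Returns None if there are no births at all."""
    total = sum(births.values())
    if total == 0:
        return None
    target = share * total
    best = None
    for start in range(first_year, last_year + 1):
        running = 0
        for end in range(start, last_year + 1):
            running += births.get(end, 0)  # missing years count as zero
            if running >= target:
                # "<=" lets a later range of equal length replace an
                # earlier one, so the youngest people win ties
                if best is None or end - start <= best[1] - best[0]:
                    best = (start, end)
                break
    return best

def to_ages(year_range):
    """Convert (start_year, end_year) to (youngest_age, oldest_age)."""
    start, end = year_range
    return CURRENT_YEAR - end, CURRENT_YEAR - start

# Invented counts: three different 3-year windows all reach the target
sample = {2000: 10, 2001: 10, 2002: 1, 2003: 10, 2004: 10}
years = best_year_range(sample, 2000, 2004)
print(years, to_ages(years))
```

Using births.get(end, 0) is one way to cope with names that have no entries in some years.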
4.5 Check yourself
Be sure to test your who.py program on plenty of names. Some less
popular names have missing entries for various states and years, maybe
even for entire regions — and your code should handle that!
Double-check that the formatting of what your program prints is also accurate.
And of course you should answer some questions:
What is the most likely sex, region, and age range for “your” name (for the purposes of this lab)?
If we released this tool on a website for public use, what potential ethical issues can you imagine might arise? Are there any groups or individuals who could be harmed by the use of this tool? Do you have any ideas on how to try and reduce that potential harm?
4.6 Submit
Save your files and submit everything so far:
submit -c=sd212 -p=lab05 lab05.md regions.csv ingest.py who.py
or
club -csd212 -plab05 lab05.md regions.csv ingest.py who.py
or use the web interface
5 Visualize (10 pts)
You probably knew this was coming! So far your who.py program is
really cool in its attempt to predict sex, region, and age, but it is
only giving a small glimpse of the overall data.
For example, “Riley” and “Maria” are both most likely to be female names, but the female/male ratio for “Riley” is close to even whereas for “Maria” it’s closer to 100:1.
Similarly, “Elijah” and “James” are both most likely to be born in the Southeast, but James is more evenly distributed around the country, whereas the prevalence of Elijah in the Southeast is almost double what it is in any other region.
Your task is to write a program whoviz.py which reads in a name
(like before) and creates a visualization that gives a richer
picture than just the most likely categories.
It is open-ended exactly what this should look like, so be creative and come up with something cool! It’s okay if your visualization only covers some part of the information about that name (sex, location, and age range), but it would be really impressive if you could incorporate multiple aspects together!
Either way, your visualization must give a clear, easy to understand picture of the full distribution of that name along one or more categories.
Some ideas of what you could try:
- Shading a map of the 50 U.S. states according to name distribution
- A bar graph showing the relative frequency of the name among different age groups or generations (Silent gen, Boomers, Gen X, Millennials, Gen Z, etc)
- Maybe use shading to show the male/female ratio for a name
Submit your code as whoviz.py which should read in a name from the
user, then generate and pop up a visualization.
Run your own code for “your” name and save that image as me.png to
submit as well.
Describe briefly what your visualization is showing.
Tell me another name I should try where your visualization is interesting or nice-looking in some way. (And explain your choice briefly.)
5.1 Submit
Save your files and submit everything so far:
submit -c=sd212 -p=lab05 lab05.md regions.csv ingest.py who.py whoviz.py me.png
or
club -csd212 -plab05 lab05.md regions.csv ingest.py who.py whoviz.py me.png
or use the web interface