31 January 2017

Boston Haikai 134 -- The Sad One About a Bird

Body of a bird
Squashed flat into the sidewalk
Ruins my lunch break
Jan 30 2017 -- Comm Ave

25 January 2017

Boston Haikai 133 -- Layers of lights

Three layers of lights:
Nighttime windows, jet liners,
And at last, faint stars
Jan 25 2017

Boston Haikai 132 -- Geese at rest

6 AM, at least,
Finds the geese at rest; floating
On glassy water
Jan 25 2017 -- Charles River

19 January 2017

Boston Haikai 131 -- Rain

All sensible birds
Are warm and out of the rain,
Watching me walk home.

11 January 2017

Boston Haikai 130 -- Home from the Holidays

Until next Christmas,
One thousand twenty-eight miles
Of salt-stained asphalt.
7 Jan 2017 -- I-90

Boston Haikai 129 -- Subway air

Down in the subway
Where the air bites the face less
But smells more like piss
11 Jan 2017 -- Kenmore Square

Boston Haikai 128 -- Orange snow

Fresh snow colored orange
By tungsten coil street lamps;
Night at five-thirty
11 Jan 2017 -- Comm. Ave.

06 January 2017

On Stereotypical Names

Because I am the kind of person that I am, I recently started to wonder if I could objectively determine the most stereotypical name associated with each state.
We all know that there are certain names from each state that are just so… state, you know? Names like Tyson Nielsen of UT, or Brendan Sullivan of MA. It’s a fun game to sit around with friends and try to think up the most stereotypical name for the states we love to tease.
‘But can’t math tell us more?’ I said to myself. ‘If math is good for anything, it must be able to help me make fun of people more effectively. But how?’
And so I set out to find a way to quantify how “state-y” a given name in a given state is. If I could find some good mathematical measure of “state-iness” I could run the formula over a large collection of census information and find the most “state-y” name for all the states. I did both of these things, and here is a summary of my attempt.

The Maps

Here are my official v1.0 maps of the most state-y names in each state, one for each sex. For personal relevancy, I restricted my analysis to names of people who were age 20-30 in 2010.
[Map: most distinctive boy names, for babies born between 1980 and 1990]
[Map: most distinctive girl names, for babies born between 1980 and 1990]

The Measure

It’s not immediately obvious how to measure statiness. Whatever we do, we should somehow capture the notion that the most state-y name is the one that is the most common in the state, without being common in the rest of the country. If 90% of people in every state are named Michael, we don’t want Michael to be the most state-y name for any of the states because it’s equally common everywhere.
On the flip side, we don’t want the one weird guy named “Zapron” in Montana to represent Montana. There may be more Zaprons in Montana than anywhere else, but there aren’t enough of them to make the name stereotypically Montana-ish.
My approach was statistical. I compare the abundance of a particular name in the given state to the abundance you’d expect if names were distributed mostly evenly and randomly across the country. When the actual abundance is much higher than the expected random abundance, the name gets a high state-iness rating in that state.
In theory this takes care of both Michael and Zapron. Because the abundance of Michaels is the same in every state, Michael doesn’t stick out in any one of them. Zapron is taken out because the measure accounts for small statistical fluctuations: the expected number of Zaprons in any state is very small, but finding one is not a terribly unlikely fluctuation, so it is discounted as well.

The Math

My mathematical model is as follows. The statiness S of a name is the negative log-likelihood that the name appears n_s times in the state according to a Poisson distribution whose mean \overline{n}_s matches the national average, adjusted for state size, normalized against the likelihood of attaining the expected value. Here n_{US} is the number of people with the name in the whole country, and s and US are the number of people in the state and in the country.
S = \log(n_s!)-\log(\overline{n}_s!)-(n_s-\overline{n}_s)\log(\overline{n}_s)
\overline{n}_s = n_{US}\times \frac{s}{US}
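To make the formula concrete, here is a minimal Python sketch of it. This is not the code linked at the end of the post: the function names and the toy numbers are invented for illustration, and since \overline{n}_s is usually not a whole number, the sketch assumes the factorial is extended with the log-gamma function, \log(x!) = lgamma(x + 1).

```python
from math import lgamma, log


def log_factorial(x):
    """log(x!), extended to non-integer x through the gamma function."""
    return lgamma(x + 1.0)


def statiness(n_s, n_us, s, us):
    """Poisson statiness S of a name in one state.

    n_s  -- people with the name in the state
    n_us -- people with the name in the whole country
    s    -- people in the state
    us   -- people in the whole country
    """
    expected = n_us * s / us  # \overline{n}_s in the formula
    if expected <= 0:
        raise ValueError("the name must appear at least once nationally")
    return (log_factorial(n_s) - log_factorial(expected)
            - (n_s - expected) * log(expected))


# Toy numbers (made up): a country of 1,000,000 people, a state of 10,000.
COUNTRY, STATE = 1_000_000, 10_000
michael = statiness(1_000, 100_000, STATE, COUNTRY)  # exactly as common as expected
zapron = statiness(1, 1, STATE, COUNTRY)             # one lone Zapron
tyson = statiness(200, 2_000, STATE, COUNTRY)        # ten times the expected count

print(michael, zapron, tyson)
```

With these toy numbers Michael scores zero, the lone Zapron scores a little above that, and Tyson scores far higher than both. One thing to keep in mind: as written, the formula also grows when a name is much rarer than expected, and the sketch does not filter for that.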
As I was searching for data to run my analysis on, I found a map made by someone with a similar goal. He mapped the most distinctive surnames in each state, using a slightly different measure than mine. His formula was
S' = \frac{n_s}{s} - \frac{n_{US}}{US}
The idea is roughly the same: find names which are more likely to be found in the particular state than the nation as a whole. In principle I can’t think of anything really wrong with this metric, but in practice I liked the results I got with my metric better. I think my metric produced more distinctive names, at least from the data I had. (Also subtracting probabilities is just wrong!)
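For comparison, the difference-of-shares formula S' is a one-liner. The sketch below uses invented toy numbers (a country of 1,000,000 and a state of 10,000) and a hypothetical function name. It is only meant to show how the formula behaves, not to reproduce the other author's analysis.

```python
def share_difference(n_s, n_us, s, us):
    """S': the name's share in the state minus its share in the country."""
    return n_s / s - n_us / us


COUNTRY, STATE = 1_000_000, 10_000

# A very common name that is slightly over-represented: 12% locally vs 10% nationally.
common = share_difference(1_200, 100_000, STATE, COUNTRY)

# A rarer name that is ten times over-represented: 2% locally vs 0.2% nationally.
rare = share_difference(200, 2_000, STATE, COUNTRY)

print(common, rare)
```

In this toy case the common name gets the larger S', even though the rarer name is ten times more frequent in the state than in the country. That is one way the two metrics can end up ranking names differently.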
As a note, I also tried a model where I used the negative log-likelihood from a binomial distribution of finding n_s people with a name after s trials with an individual probability of n_{US}/US. This yielded mostly the same results as the Poisson method, and I used the Poisson one to generate the results below.
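A binomial version might look something like the sketch below. The post does not say whether this variant was normalized the way the Poisson one is, so the sketch assumes the plain negative log-likelihood. The function names are hypothetical and the toy numbers are invented.

```python
from math import lgamma, log, log1p


def log_choose(n, k):
    """log of the binomial coefficient C(n, k), through log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def binomial_nll(n_s, n_us, s, us):
    """Negative log-likelihood of n_s successes in s trials with p = n_us / us."""
    if not 0 < n_us < us:
        raise ValueError("need 0 < n_us < us so that 0 < p < 1")
    p = n_us / us
    log_likelihood = (log_choose(s, n_s)
                      + n_s * log(p)
                      + (s - n_s) * log1p(-p))  # log1p(-p) == log(1 - p)
    return -log_likelihood


# Toy numbers (made up): a country of 1,000,000 people, a state of 10,000.
COUNTRY, STATE = 1_000_000, 10_000
michael = binomial_nll(1_000, 100_000, STATE, COUNTRY)
tyson = binomial_nll(200, 2_000, STATE, COUNTRY)

print(michael, tyson)
```

Without the normalization step, a name that lands exactly on its expected count no longer scores zero, but a heavily over-represented name still scores far higher.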

The Data

It turns out the Social Security Administration has some great datasets. For example, they have a dataset with baby given names broken out by state, year, and sorted by frequency for all years from 1916 to 2000. Jackpot!
The main caveat with this data is that it doesn’t list names with fewer than 5 appearances in any given state in a given year. The immediate issue, that we are missing potentially high-statiness names, isn’t huge because our metric discounts infrequent names. However, my code assumes that the number of people in each state and in the whole country is the same as the sum of the counts of all the names listed in the data. If there are too many unlisted names in every state, this assumption is wrong and that skews my analysis. Like a good scientist, I chose to ignore this issue and hope for the best.
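Here is a rough sketch of that assumption: state and national sizes are taken to be the sums of the listed counts. It assumes each record is a comma-separated line of the form state, sex, year, name, count. That layout, the function name `tally`, and the sample rows are assumptions made for illustration, not real data and not the code linked at the end of the post.

```python
import csv
from collections import Counter, defaultdict


def tally(lines, sex, first_year, last_year):
    """Sum name counts per state for one sex over a range of birth years."""
    by_state = defaultdict(Counter)
    for state, rec_sex, year, name, count in csv.reader(lines):
        if rec_sex == sex and first_year <= int(year) <= last_year:
            by_state[state][name] += int(count)

    # National count of each name: the sum over all states.
    national = Counter()
    for names in by_state.values():
        national.update(names)

    # "Population" of each state: the sum of the listed counts only.
    # Names below the reporting threshold are silently missing here.
    state_totals = {st: sum(names.values()) for st, names in by_state.items()}
    us_total = sum(state_totals.values())
    return by_state, national, state_totals, us_total


# Invented sample rows, not real SSA figures.
sample = [
    "UT,M,1985,Tyson,40",
    "UT,M,1985,Michael,300",
    "MA,M,1985,Brendan,60",
    "MA,M,1985,Michael,700",
    "MA,F,1985,Sarah,500",
    "MA,M,1995,Tyler,90",
]
by_state, national, state_totals, us_total = tally(sample, "M", 1980, 1990)
print(national["Michael"], state_totals, us_total)
```

The four return values map onto the symbols in the formulas: `by_state[state][name]` plays the role of n_s, `national[name]` of n_{US}, `state_totals[state]` of s, and `us_total` of US.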

The Commentary

I’m happy that these results seem to actually be interesting. There’s possibly room for improvement, but this is a good start.
The first thing I notice (besides the fact that the metric really loves ‘Tyler’) is that the results show geographic correlation. The Deep South loves William, the Northwest loves Tyler. All the usual geographic regions share some distinctive names. This suggests that the metric is working; the distinctive names are capturing cultural groupings that we already know about.
In the same vein, there are some names with obvious explanations. We know exactly why Spanish names dominate the Mexican border states, and why DC and MD’s lists turn out the way they did. This is more assurance that the names correspond to real world effects.
The big question is whether this is giving us good stereotypical names. The only real way to tell is to compare against my pre-existing stereotypes. Happily, the metric appears to work pretty well for the states I know. I imagine anyone who knows Utah is looking at “Tyson, Skyler, Trevor” and thinking “Of course!” Similarly, seeing “Brendan” at the top of MA’s list pleases the evil little stereotyper that lives in my brain. That said, I need people who are familiar with other states to help increase my confidence that these are good stereotypical names.
Interestingly, the girl names look a bit different. There seems to be less close geographical correlation. Look at all the places “Sarah” and “Amber” appear, even at the top of the list. I don’t have a good explanation for this, but I suspect that it’s a real effect and not an artifact of my metric because it shows up for one sex but not the other. Overall I am less confident that the girl names here represent good stereotypical state names, but I think that it’s maybe because of the way people name girls as opposed to something else.

The Future

There is plenty more to do with this data and analysis. I can increase the age range, or look into past naming trends. I am interested to see whether the geographical homogeneity of the girl names decreases as we go back in time.
We can also use a similar analysis to look at trends over time, and confirm once and for all that Ethel is an old person name.
I welcome suggestions for analyses to run. Also, here is the Python code I used in my analysis, in case someone wants to run their own analyses on the data.