06 January 2017

On Stereotypical Names

Because I am the kind of person that I am, I recently started to wonder if I could objectively determine the most stereotypical name associated with each state.

We all know that there are certain names from each state that are just so… state, you know? Names like Tyson Nielsen of UT, or Brendan Sullivan of MA. It’s a fun game to sit around with friends and try to think up the most stereotypical name for the states we love to tease.

‘But can’t math tell us more?’ I said to myself. ‘If math is good for anything, it must be able to help me make fun of people more effectively. But how?’

And so I set out to find a way to quantify how “state-y” a given name in a given state is. If I could find some good mathematical measure of “state-iness” I could run the formula over a large collection of census information and find the most “state-y” name for all the states. I did both of these things, and here is a summary of my attempt.

The Maps

Here are my official v1.0 maps of the most state-y names in each state, one for each sex. For personal relevancy, I restricted my analysis to names of people who were age 20-30 in 2010.

Most distinctive names for baby boys born between 1980 and 1990

Most distinctive names for baby girls born between 1980 and 1990

The Measure

It’s not immediately obvious how to measure statiness. Whatever we do, we should somehow capture the notion that the most state-y name is the one that is the most common in the state, without being common in the rest of the country. If 90% of people in every state are named Michael, we don’t want Michael to be the most state-y name for any of the states because it’s equally common everywhere.

On the flip side, we don’t want the one weird guy named “Zapron” in Montana to represent Montana. There may be more Zaprons in Montana than anywhere else, but there aren’t enough of them to make the name stereotypically Montana-ish.

My approach was statistical. I compare the abundance of a particular name in the given state to the abundance you’d expect if names were distributed mostly evenly and randomly across the country. When the actual abundance is much higher than the expected random abundance, the name gets a high state-iness rating in that state.

In theory this takes care of both Michael and Zapron. Because the abundance of Michaels is the same in every state, no one state’s Michaels stick out. Zapron is taken out because the measure takes into account small statistical fluctuations. The expected number of Zaprons in any state is very small, but finding one is not a terribly unlikely fluctuation, so it is discounted as well.

The Math

My mathematical model is as follows. The statiness $S$ of a name is the negative log-likelihood that a name appears $n_s$ times in the state according to a Poisson distribution with a mean that matches the national average, adjusted for state size, normalized against the likelihood of attaining the expected value.

$S = \log(n_s!)-\log(\overline{n}_s!)-(n_s-\overline{n}_s)\log(\overline{n}_s)$
$\overline{n}_s = n_{US}\times \frac{s}{US}$
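
For concreteness, here is a minimal Python sketch of that formula. It is not the code used to make the maps: the function and argument names (`statiness`, `state_pop`, `us_pop`) are invented for illustration, and the counts in the usage lines are made up. It uses `math.lgamma(x + 1)` for $\log(x!)$ so that a non-integer $\overline{n}_s$ is fine.

```python
import math

def expected_count(n_us, state_pop, us_pop):
    """Expected count of the name in the state: n_US * s / US."""
    return n_us * state_pop / us_pop

def statiness(n_s, n_us, state_pop, us_pop):
    """Poisson negative log-likelihood of n_s, minus that of the expected count."""
    n_bar = expected_count(n_us, state_pop, us_pop)
    # lgamma(x + 1) equals log(x!) and also accepts a non-integer n_bar.
    return (math.lgamma(n_s + 1) - math.lgamma(n_bar + 1)
            - (n_s - n_bar) * math.log(n_bar))

# Made-up example: a state holding 1% of a 100-million-person country.
STATE, US = 1_000_000, 100_000_000
michael = statiness(n_s=1_000, n_us=100_000, state_pop=STATE, us_pop=US)  # exactly the expected count
zapron = statiness(n_s=1, n_us=1, state_pop=STATE, us_pop=US)             # one person nationwide
regional = statiness(n_s=1_000, n_us=20_000, state_pop=STATE, us_pop=US)  # 5x the expected count
print(michael, zapron, regional)
```

With these made-up numbers the evenly spread name scores zero, the lone Zapron scores a little, and the regional name scores far higher. Note that the formula as written also grows when a name is much rarer than expected, so a ranking presumably only wants names with $n_s$ above $\overline{n}_s$.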

As I was searching for data to run my analysis on I found a map made by someone with a similar goal. He mapped the most distinctive surnames in each state, using a slightly different measure than mine. His formula was

$S' = \frac{n_s}{s} - \frac{n_{US}}{US}$

The idea is roughly the same: find names which are more likely to be found in the particular state than the nation as a whole. In principle I can’t think of anything really wrong with this metric, but in practice I liked the results I got with my metric better. I think my metric produced more distinctive names, at least from the data I had. (Also subtracting probabilities is just wrong!)
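
For comparison, the difference metric is a one-liner. This sketch uses hypothetical argument names and made-up counts; it only illustrates how $S'$ responds to absolute gaps in frequency.

```python
def rate_difference(n_s, n_us, state_pop, us_pop):
    """S' = n_s / s - n_US / US: state frequency minus national frequency."""
    return n_s / state_pop - n_us / us_pop

# Made-up counts for a state of 1 million in a country of 100 million.
STATE, US = 1_000_000, 100_000_000
common = rate_difference(20_000, 1_500_000, STATE, US)  # 2.0% in state vs 1.5% nationally
regional = rate_difference(3_000, 50_000, STATE, US)    # 0.3% in state vs 0.05% nationally
print(common > regional)
```

Here the common name wins under $S'$ even though the regional name is six times more frequent in the state than nationally. That may be one reason the two metrics pick different names, though nothing in this post tests that.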

As a note, I also tried a model where I used the negative log-likelihood from a binomial distribution of finding $n_s$ people with a name after $s$ trials with an individual probability of $n_{US}/US$. This yielded mostly the same results as the Poisson method, and I used the Poisson one to generate the results below.
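
A sketch of the binomial variant, under assumptions: the names are hypothetical, and since the text does not say whether this version was normalized the way the Poisson one is, the function below returns the raw negative log-likelihood.

```python
import math

def binomial_neg_log_likelihood(n_s, n_us, state_pop, us_pop):
    """-log P(n_s successes in state_pop trials), each with probability n_US / US."""
    p = n_us / us_pop
    # log of the binomial coefficient C(state_pop, n_s), via log-gamma.
    log_choose = (math.lgamma(state_pop + 1) - math.lgamma(n_s + 1)
                  - math.lgamma(state_pop - n_s + 1))
    log_prob = log_choose + n_s * math.log(p) + (state_pop - n_s) * math.log1p(-p)
    return -log_prob

# Tiny made-up check: 3 people with the name among 10, national frequency 20%.
print(binomial_neg_log_likelihood(n_s=3, n_us=20, state_pop=10, us_pop=100))
```

When the per-person probability is small and the state is large, a binomial is well approximated by a Poisson distribution with mean $s \times n_{US}/US$, which is consistent with the two methods giving mostly the same results.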

The Data

It turns out the Social Security Administration has some great datasets. For example, they have a dataset with baby given names broken out by state, year, and sorted by frequency for all years from 1916 to 2000. Jackpot!

The main caveat with this data is that it doesn’t list names with fewer than 5 appearances in any given state in a given year. The immediate issue, that we are missing potentially high-statiness names, isn’t huge because our metric discounts infrequent names. However, my code assumes that the number of names in each state and in the whole country is the same as the sum of all the names listed in the data. If there are too many unlisted names in every state, this assumption is wrong and that skews my analysis. Like a good scientist, I chose to ignore this issue and hope for the best.
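
To make that assumption concrete, here is a sketch of the tallying step. It assumes each line of the state files looks like `UT,M,1985,Tyson,120` (state, sex, birth year, name, count); the function name `tally` and the sample rows are invented, and the counts are not real data.

```python
import csv
from collections import Counter

def tally(lines, sex, first_year, last_year):
    """Sum listed counts per (state, name), per state, and per name nationally."""
    by_state_name = Counter()
    state_totals = Counter()
    name_totals = Counter()
    for state, row_sex, year, name, count in csv.reader(lines):
        if row_sex != sex or not first_year <= int(year) <= last_year:
            continue
        n = int(count)
        by_state_name[state, name] += n
        state_totals[state] += n   # stands in for s: only listed names are counted
        name_totals[name] += n     # stands in for n_US
    return by_state_name, state_totals, name_totals

# Invented rows, for illustration only.
sample = [
    "UT,M,1985,Tyson,120",
    "UT,M,1985,Michael,900",
    "MA,M,1985,Michael,2000",
    "MA,F,1985,Sarah,1500",
    "MA,M,1975,Brendan,300",
]
by_state_name, state_totals, name_totals = tally(sample, "M", 1980, 1990)
print(state_totals["UT"], name_totals["Michael"])
```

Because names with fewer than 5 appearances never show up in the files, totals built this way undercount every state by however many people carry unlisted names.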

The Commentary

I’m happy that these results seem to actually be interesting. There’s possibly room for improvement, but this is a good start.

The first thing I notice (besides the fact that the metric really loves ‘Tyler’), is that the results show geographic correlation. The Deep South loves William, the Northwest loves Tyler. All the usual geographic regions share some distinctive names. This suggests that the metric is working; the distinctive names are capturing cultural groupings that we already know about.

In the same vein, there are some names with obvious explanations. We know exactly why Spanish names dominate the Mexican border states, and why DC and MD’s lists turn out the way they did. This is more assurance that the names correspond to real world effects.

The big question is whether this is giving us good stereotypical names. The only real way to tell is to compare against my pre-existing stereotypes. Happily, the metric appears to work pretty well for the states I know. I imagine anyone who knows Utah is looking at “Tyson, Skyler, Trevor” and thinking “Of course!” Similarly, seeing “Brendan” at the top of MA’s list is pleasing the evil little stereotyper that lives in my brain. That said, I need people who are familiar with other states to help increase my confidence that these are good stereotypical names.

Interestingly, the girl names look a bit different. There seems to be less close geographical correlation. Look at all the places “Sarah” and “Amber” appear, even at the top of the list. I don’t have a good explanation for this, but I suspect that it’s a real effect and not an artifact of my metric because it shows up for one sex but not the other. Overall I am less confident that the girl names here represent good stereotypical state names, but I suspect that reflects the way people name girls rather than a flaw in the method.

The Future

There is plenty more to do with this data and analysis. I can increase the age range, or look into past naming trends. I am interested to see whether the geographical homogeneity of the girl names decreases as we go back in time.

We can also use a similar analysis to look at trends over time, and confirm once and for all that Ethel is an old person name.

I welcome suggestions for analyses to run. Also, here is the Python code I used in my analysis, in case someone wants to run their own analyses on the data.

16 September 2016

Why is North North? Part 2, The Hairy Ball Theorem

A few days ago I wrote too many words about how north, south, east, and west are defined.

The purpose then was to answer the question of why you can't walk north forever, but you can walk east forever. The answer was that the definition of north/south has a "coordinate singularity", a point where the definition breaks down and can't uniquely name different directions, and also that walking north by definition leads you to the singularity.

We also saw that the east/west scheme has a coordinate singularity at the poles, but that east/west was conveniently designed to never lead you to the singularity the way north and south do.

Then at the end I raised a question: can we do better? Is there a directional scheme we can choose for the earth so that there are no coordinate singularities anywhere? If so, what would that scheme look like?

Going abstract

To be able to answer this question we first need to think clearly about what constitutes a direction scheme. This concept may not be obvious, but it's pretty simple when you think about it. A direction scheme is a rule that points out a direction at every point on a sphere. Imagine arrows painted on every point of the sphere. When you stand at a point, wherever the arrow points is the direction associated with that scheme.

The direction scheme called "north." At every point there is an arrow pointing north. If you walk along the arrows you are heading north.

A hypothetical direction scheme called "crazynorth." Wherever you are on the sphere, if you walk along the red line there then you are heading crazynorth. We'd never actually use this scheme, but we want to consider all possibilities in general.

We'll also require that a direction scheme be smooth, that is it doesn't change abruptly at any point, because it would be confusing for your directions to suddenly switch just because you moved a millimeter.

A rule that assigns an arrow like this to every point on a surface is called a "vector field" by mathematicians.

One way to picture vector fields is to imagine that the surface in question has hair growing out of it that you then comb. Every hair has a corresponding arrow whose size and direction represent how parallel to the surface the hair is. If the hair is completely flat against the surface then we draw an arrow with length 1. If the hair is sticking straight out of the surface we draw no arrow.

Hairs on a surface and how they correspond to arrows (or vectors) on that surface.

Singularities (again)

What does a coordinate singularity mean in the context of vector fields? In the last post we said a coordinate singularity is a point where directions are not defined. When we interpret arrows from a vector field as directions, we can see that the directions are undefined when the arrows have zero length (so you can't tell which way they point), or they have infinite length (because that's just not defined), or when the surrounding vector field isn't smooth and changes abruptly at a point (because at the point of change there are multiple definitions for the direction).

Two kinds of singularity. In (a) there is a point with a zero length arrow. In (b) there is a point of discontinuity, where the vector field changes abruptly. In both cases we can't use the vector fields to uniquely distinguish directions. (a) is the kind of singularity east/west has, while north/south has the type in (b).
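
Both kinds can be seen numerically. The sketch below makes some choices of its own: the sphere has radius 1, the east arrows are taken to be $(-y, x, 0)$ (so they shrink near the poles), and the north arrows are all given length 1. The function names are invented for illustration.

```python
import math

def position(lat_deg, lon_deg):
    """Point on the unit sphere at the given latitude and longitude."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def east_arrow(lat_deg, lon_deg):
    """Eastward arrow (-y, x, 0); its length is cos(latitude)."""
    x, y, _ = position(lat_deg, lon_deg)
    return (-y, x, 0.0)

def north_arrow(lat_deg, lon_deg):
    """Unit-length arrow pointing along the meridian toward the north pole."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (-math.sin(lat) * math.cos(lon),
            -math.sin(lat) * math.sin(lon),
            math.cos(lat))

def length(v):
    return math.sqrt(sum(c * c for c in v))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

near_pole = 89.999  # degrees of latitude, just short of the pole
# Kind (a): the east arrow shrinks toward zero length.
east_length = length(east_arrow(near_pole, 0.0))
# Kind (b): north arrows on opposite sides of the pole point in nearly opposite directions.
north_agreement = dot(north_arrow(near_pole, 0.0), north_arrow(near_pole, 180.0))
print(east_length, north_agreement)
```

The first number is tiny and the second is close to -1: two points a hair's breadth apart, on either side of the pole, disagree completely about which way north is.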

Thus, our question about finding singularity-free direction schemes becomes "can we find a smoothly varying vector field on a sphere such that all the arrows are finite and not zero at every point?"

In terms of the hairy surface, a coordinate singularity then corresponds to a place where the hair sticks straight up.

Golly!

The Hairy Ball Theorem

(Try not to laugh too hard.) Now the best part. It turns out there is a definite answer to the question I just asked. The answer is no, it is not possible. Every smooth vector field on a sphere has at least one point where the arrow has zero length. In terms of hairy balls, every time you arrange the hair on a sphere there will be a point where the hair stands straight up.

An attempted vector field with singularities that look like hairy cowlicks at the poles.

This means that any direction scheme we try to devise on the surface of a sphere will have a point where the direction is not defined! We will always have the same problem we had with the north/south, east/west direction schemes. The usual north, south, east, west scheme is the best system we can get.

I'm not going to explain how you prove this theorem, but it is very general. The Hairy Ball Theorem applies to spheres, but also any shape that is sphere-like (in the sense that you can squeeze it into a sphere shape without tearing any holes). So you also can't find a singularity-free direction scheme for an egg, or a banana, or any simple 3D object. It doesn't apply to other shapes though: it's easy to find good direction schemes on a donut, or on an infinite plane.
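
The donut case is easy to check numerically. This sketch assumes a donut with a big radius of 2 and a tube radius of 1 (arbitrary choices), and uses the arrows that point the long way around the donut. Sampling a grid is only an illustration, not a proof.

```python
import math

def torus_arrow(theta, phi, big_r=2.0, small_r=1.0):
    """Arrow pointing the long way around the donut at tube angle theta, ring angle phi."""
    ring = big_r + small_r * math.cos(theta)  # distance from the donut's central axis
    return (-ring * math.sin(phi), ring * math.cos(phi), 0.0)

def shortest_arrow(samples=60, big_r=2.0, small_r=1.0):
    """Smallest arrow length found on a samples-by-samples grid of the donut."""
    angles = [2 * math.pi * i / samples for i in range(samples)]
    return min(
        math.sqrt(sum(c * c for c in torus_arrow(t, p, big_r, small_r)))
        for t in angles for p in angles
    )

print(shortest_arrow())
```

Each arrow has length equal to its distance from the donut's central axis, which is never less than the big radius minus the tube radius. So as long as the donut has a hole, no arrow shrinks to zero.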

And the Hairy Ball Theorem doesn't just apply to maps and globes. It implies that it is impossible to design a radio antenna that doesn't have a blind spot. It implies that there is always at least one wind cyclone on the earth (because there must be a point with no wind).

The center of a cyclone is a zero singularity in a surface wind pattern

Electromagnetic radiation acts like vectors on a sphere, so every antenna has a point with zero signal. Image courtesy https://de.wikipedia.org/wiki/Datei:Felder_um_Dipol.jpg

Math is cool, man.

15 September 2016

Why is North North?

My cousin recently asked an interesting question:

Why can you travel east indefinitely but you can only travel north until you reach the north pole?

As with all good questions, there are actually a bunch of good questions here all packed in together. As is my wont, I'm going to go way too deep, and hopefully come out the other side of this question with some answers. Chalk this up under the dangers of asking Luke a question.

What is North?

First, what is North, and why? How is that particular direction defined, and why is it useful?

The answer begins with a cool fact about rotations. Any constant rotation in three dimensions leaves an axis, or a line through space, fixed in place while everything else moves around it. Try rotating a few objects to see what this means. It's a law of geometry that this is so, and a law of physics that spinning objects prefer this kind of motion.

When objects experience constant rotation there is always a line through the object that doesn't move.

The earth spins (surprise!) and so there is an axis that is left fixed in space that passes through the center of the earth and picks out two special points where it intersects the surface. These two points are called poles, and we can use them to define directions.

Here's how the system works (it may be obvious, but I wouldn't be writing this if it weren't for pedantry): First name the two poles North and South. This is arbitrary as long as you keep them straight, which you can do using the fact that the stars look different in the northern hemisphere than in the south. Now at any arbitrary point on the earth, you can draw lines that directly connect that point and the two poles. Draw arrows on the lines that point towards the North pole. Now when you are walking towards the North pole on the most direct line you are walking North, and likewise for South. This is the definition.

How North is defined: at every point on earth, draw the most direct line to the north pole. The direction of the resulting lines is called North.

You might ask "how can I tell if I'm moving towards the pole if it is far away from me?" Good imaginary question! The answer goes back to the rotation of the earth. That fixed axis that has been useful also ensures that we can orient ourselves to a pole no matter where we are on the earth as long as we can see the night sky. If you look at the stars over a long period they appear to move because of the rotation of the earth. And the stars nearest the axis move the least. If you face the spot in the sky where the stars don't move at all then you are facing one of the poles. Useful!

A time lapse showing the rotation pole in the night sky. Courtesy http://www.eso.org/public/images/potw1534a/ via Creative Commons.

Even better, you can measure how far away you are from the pole with a pretty simple measurement.
Since the earth is a sphere you can measure locations with angles instead of miles or whatever. In particular, you might want to know the angle between your location and the pole. The diagram below shows how you can get this angle simply by measuring the angle between the horizon and the pole spot in the sky: that measurement is your latitude, and 90 degrees minus it is your angle from the pole.

You can find your latitude (which tells you how far you are from the North pole) by finding a pole star and measuring the angle it makes from the horizon.
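
The arithmetic is short enough to sketch. The function name is invented, and the earth is assumed to be a perfect sphere with a mean radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a perfectly spherical earth

def pole_sighting(altitude_deg):
    """From the pole spot's angle above the horizon, return
    (latitude in degrees, angle from the pole in degrees, distance to the pole in km)."""
    latitude = altitude_deg               # the pole's altitude equals your latitude
    angle_from_pole = 90.0 - altitude_deg
    distance_km = math.radians(angle_from_pole) * EARTH_RADIUS_KM
    return latitude, angle_from_pole, distance_km

print(pole_sighting(40.0))
```

If the pole spot sits right on the horizon you are at the equator, a quarter of the way around the earth from the pole. If it is directly overhead, you are standing on the pole.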

All together, this is how North is defined, and why it is defined that way. (And I haven't even mentioned how the earth's magnetic field also runs North/South.)

Singularity (is a cool word)

If you've been practicing your pedantry you may have already noticed a slight problem with the definition of North. There are two points on the earth where the definition fails. Since we defined North as the direction towards the North pole, North doesn't mean anything when you are standing at the pole! Also, since South is the shortest direction towards the South pole, every direction is South when you're at the North pole, so it's not defined either. Both poles have this problem.

A singularity is where a system of directions stops working. North and South both have singularities.

A point where a direction scheme fails like this is called a coordinate singularity. But note these singularities are not problems with the earth! You can cross the poles all you want. The problem is with the scheme we've defined for describing directions. You can go any direction from the poles or walk over them just fine, you'll just have problems describing what you're doing in the standard North/South scheme.

We'll come back to singularities later. But at least now you feel cool for reading something that legitimately uses the word singularity.

East!

Now that we've defined North, East is pretty easy. Face North, then point to the right at a 90 degree angle to the North/South line. That's it.

The direction East is defined at a point to be at a right angle relative to North at that point.

If, for some reason, you can't distinguish right and left, wait until morning. When the sun rises, East is the direction that is perpendicular to North and points towards the sun.

Of course, this definition also has singularities. Because it is defined relative to the direction North, it fails anywhere that definition fails. So you can't go North at the North pole, and you can't go East either.
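
The two definitions can be sketched with a little vector arithmetic. This is an illustration under assumptions: the earth is a unit sphere whose rotation axis is the z-axis, and the function names are made up. North is the part of the axis direction that lies flat along the ground; East is North turned 90 degrees to the right.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def north_and_east(lat_deg, lon_deg):
    """Unit North and East arrows at a point, or ValueError at a pole."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    up = (math.cos(lat) * math.cos(lon),
          math.cos(lat) * math.sin(lon),
          math.sin(lat))
    axis = (0.0, 0.0, 1.0)
    # Remove the straight-up part of the axis, leaving the part along the ground.
    along_up = sum(a * u for a, u in zip(axis, up))
    flat = tuple(a - along_up * u for a, u in zip(axis, up))
    size = math.sqrt(sum(c * c for c in flat))
    if size < 1e-9:
        raise ValueError("North and East are undefined at the poles")
    north = tuple(c / size for c in flat)
    east = cross(north, up)  # facing North, this points to the right
    return north, east

print(north_and_east(0.0, 0.0))
```

At the poles the axis points straight up, nothing is left along the ground, and the sketch has no choice but to give up, just as the definitions do.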

Are we done yet?

Now the original question, at long last! Both North and East have singularities where the directions become undefined. The difference between North and East is that North by definition leads you towards the singularity, while East by definition keeps you the same distance away. You can't go North forever because you will always eventually reach a point where North is meaningless. You can go East forever because you will never reach the singularity.

Can we do better?

One might ask if there is a better system, one that doesn't have confusing singularities so that all directions are defined everywhere. The answer is no, not really, because of something called the hairy ball theorem (seriously). But I'll leave that for another time.

Heh heh, hairy ball.