Showing posts with label Science. Show all posts

06 January 2017

On Stereotypical Names

Because I am the kind of person that I am, I recently started to wonder if I could objectively determine the most stereotypical name associated with each state.
We all know that there are certain names from each state that are just so… state, you know? Names like Tyson Nielsen of UT, or Brendan Sullivan of MA. It’s a fun game to sit around with friends and try to think up the most stereotypical name for the states we love to tease.
‘But can’t math tell us more?’ I said to myself. ‘If math is good for anything, it must be able to help me make fun of people more effectively. But how?’
And so I set out to find a way to quantify how “state-y” a given name in a given state is. If I could find some good mathematical measure of “state-iness” I could run the formula over a large collection of census information and find the most “state-y” name for all the states. I did both of these things, and here is a summary of my attempt.

The Maps

Here are my official v1.0 maps of the most state-y names in each state, one for each sex. For personal relevancy, I restricted my analysis to names of people who were age 20-30 in 2010.
Map of most distinctive boy names
Most distinctive baby boy names born between 1980 and 1990
Map of most distinctive girl names
Most distinctive baby girl names born between 1980 and 1990

The Measure

It’s not immediately obvious how to measure state-iness. Whatever we do, we should somehow capture the notion that the most state-y name is the one that is the most common in the state, without being common in the rest of the country. If 90% of people in every state are named Michael, we don’t want Michael to be the most state-y name for any of the states because it’s equally common everywhere.
On the flip side, we don’t want the one weird guy named “Zapron” in Montana to represent Montana. There may be more Zaprons in Montana than anywhere else, but there aren’t enough of them to make the name stereotypically Montana-ish.
My approach was statistical. I compare the abundance of a particular name in the given state to the abundance you’d expect if names were distributed mostly evenly and randomly across the country. When the actual abundance is much higher than the expected random abundance, the name gets a high state-iness rating in that state.
In theory this takes care of both Michael and Zapron. Because the abundance of Michaels is the same in every state, Michael doesn’t stick out anywhere. Zapron is taken out because the measure accounts for small statistical fluctuations. The expected number of Zaprons in any state is very small, but finding one is not a terribly unlikely fluctuation, so it is discounted as well.

The Math

My mathematical model is as follows. The state-iness S of a name is the negative log-likelihood that the name appears n_s times in the state according to a Poisson distribution whose mean \overline{n}_s is the national average adjusted for state size, normalized against the likelihood of attaining the expected value. Here n_{US} is the number of people with the name nationwide, and s and US are the total populations of the state and of the country.
S = \log(n_s!)-\log(\overline{n}_s!)-(n_s-\overline{n}_s)\log(\overline{n}_s)
\overline{n}_s = n_{US}\times \frac{s}{US}
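To make the formula concrete, here is a minimal Python sketch of it. This is not the code I used for the maps; the function names and the toy populations are made up for illustration, and I use `math.lgamma` to evaluate the log-factorials because \overline{n}_s is usually not a whole number.

```python
import math


def expected_count(n_us, state_pop, us_pop):
    """Expected count of a name in a state if it followed the national rate."""
    return n_us * state_pop / us_pop


def stateiness(n_s, n_us, state_pop, us_pop):
    """Poisson score S = log(n_s!) - log(n_bar!) - (n_s - n_bar) * log(n_bar)."""
    n_bar = expected_count(n_us, state_pop, us_pop)
    # lgamma(x + 1) equals log(x!) and also works for a non-integer n_bar.
    return (math.lgamma(n_s + 1) - math.lgamma(n_bar + 1)
            - (n_s - n_bar) * math.log(n_bar))


# Made-up numbers: a state holding 1% of a country of 1,000,000 people.
STATE, US = 10_000, 1_000_000

michael = stateiness(n_s=900, n_us=90_000, state_pop=STATE, us_pop=US)  # exactly as expected
zapron = stateiness(n_s=1, n_us=1, state_pop=STATE, us_pop=US)          # one lone Zapron
tyson = stateiness(n_s=300, n_us=5_000, state_pop=STATE, us_pop=US)     # six times the expected 50

print(f"Michael {michael:.2f}, Zapron {zapron:.2f}, Tyson {tyson:.2f}")
```

With these toy numbers the evenly spread name scores zero, the lone Zapron gets a small score, and the over-represented name scores far higher than either. One thing to keep in mind: as written, the formula also grows when a name is much rarer than expected, so a ranking built on it probably wants to consider only names with n_s above \overline{n}_s.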
As I was searching for data to run my analysis on, I found a map made by someone with a similar goal. He mapped the most distinctive surnames in each state, using a slightly different measure than mine. His formula was
S' = \frac{n_s}{s} - \frac{n_{US}}{US}
The idea is roughly the same: find names which are more likely to be found in the particular state than the nation as a whole. In principle I can’t think of anything really wrong with this metric, but in practice I liked the results I got with my metric better. I think my metric produced more distinctive names, at least from the data I had. (Also subtracting probabilities is just wrong!)
As a note, I also tried a model where I used the negative log-likelihood from a binomial distribution of finding n_s people with a name after s trials with an individual probability of n_{US}/US. This yielded mostly the same results as the Poisson method, and I used the Poisson one to generate the results below.
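For comparison, the two alternative measures mentioned above can be sketched the same way. Again, the function names and numbers are illustrative only, and the binomial version here is the plain negative log-likelihood without any normalization, which is an assumption on my part about the simplest way to write it down.

```python
import math


def frequency_difference(n_s, n_us, state_pop, us_pop):
    """S' = n_s/s - n_US/US: state frequency minus national frequency."""
    return n_s / state_pop - n_us / us_pop


def binomial_nll(n_s, n_us, state_pop, us_pop):
    """-log P(n_s hits in s trials) with per-person probability n_US/US."""
    p = n_us / us_pop  # must lie strictly between 0 and 1
    log_choose = (math.lgamma(state_pop + 1) - math.lgamma(n_s + 1)
                  - math.lgamma(state_pop - n_s + 1))
    log_prob = log_choose + n_s * math.log(p) + (state_pop - n_s) * math.log1p(-p)
    return -log_prob


# Made-up numbers: a state holding 1% of a country of 1,000,000 people.
STATE, US = 10_000, 1_000_000

# A fairly rare name at six times its expected count of 50.
rare_diff = frequency_difference(300, 5_000, STATE, US)
rare_nll = binomial_nll(300, 5_000, STATE, US)

# The same name sitting exactly at its expected count.
typical_nll = binomial_nll(50, 5_000, STATE, US)

print(rare_diff, rare_nll, typical_nll)
```

The binomial score behaves like the Poisson one here: it is small when the count is near its expectation and large when the count is far from it.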

The Data

It turns out the Social Security Administration has some great datasets. For example, they have a dataset with baby given names broken out by state, year, and sorted by frequency for all years from 1916 to 2000. Jackpot!
The main caveat with this data is that it doesn’t list names with fewer than 5 appearances in any given state in a given year. The immediate issue, that we are missing potentially high-state-iness names, isn’t huge because our metric discounts infrequent names. However, my code assumes that the number of people in each state and in the whole country is the same as the sum of all the names listed in the data. If there are too many unlisted names in every state, this assumption is wrong and that skews my analysis. Like a good scientist, I chose to ignore this issue and hope for the best.
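Here is a rough sketch of how the whole pipeline fits together, including the "totals are just the sum of the listed names" assumption described above. It is not my original script. The row layout (state, sex, year, name, count) is what I believe the state-level files look like, the sample rows are invented, and the filter that keeps only over-represented names is an extra assumption of this sketch.

```python
import csv
import io
import math
from collections import Counter

# Invented sample rows in the assumed layout: state,sex,year,name,count
SAMPLE = """\
UT,M,1985,Tyson,60
UT,M,1985,Michael,400
UT,M,1986,Tyson,55
MA,M,1985,Brendan,120
MA,M,1985,Michael,1500
MA,M,1985,Tyson,5
UT,F,1985,Sarah,300
"""


def load_counts(lines, sex, first_year, last_year):
    """Sum counts per (state, name) for one sex over an inclusive year range."""
    counts = Counter()
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        state, row_sex, year, name, count = row
        if row_sex == sex and first_year <= int(year) <= last_year:
            counts[state, name] += int(count)
    return counts


def stateiness(n_s, n_bar):
    """Poisson score from the post, given the observed and expected counts."""
    return (math.lgamma(n_s + 1) - math.lgamma(n_bar + 1)
            - (n_s - n_bar) * math.log(n_bar))


def rank_names(counts):
    """Return {state: [names, most state-y first]} using listed rows as totals."""
    state_totals, name_totals = Counter(), Counter()
    for (state, name), n in counts.items():
        state_totals[state] += n
        name_totals[name] += n
    us_total = sum(state_totals.values())

    scored = {}
    for (state, name), n_s in counts.items():
        n_bar = name_totals[name] * state_totals[state] / us_total
        if n_s > n_bar:  # assumption: only over-represented names compete
            scored.setdefault(state, []).append((stateiness(n_s, n_bar), name))
    return {state: [name for _, name in sorted(pairs, reverse=True)]
            for state, pairs in scored.items()}


counts = load_counts(io.StringIO(SAMPLE), sex="M", first_year=1980, last_year=1990)
top = rank_names(counts)
print(top)
```

To run it on real data you would pass an open file in place of the `io.StringIO` sample. Because the totals come only from listed rows, every name below the 5-per-year cutoff is silently missing from both the state and national populations, which is exactly the caveat above.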

The Commentary

I’m happy that these results seem to actually be interesting. There’s possibly room for improvement, but this is a good start.
The first thing I notice (besides the fact that the metric really loves ‘Tyler’), is that the results show geographic correlation. The Deep South loves William, the Northwest loves Tyler. All the usual geographic regions share some distinctive names. This suggests that the metric is working; the distinctive names are capturing cultural groupings that we already know about.
In the same vein, there are some names with obvious explanations. We know exactly why Spanish names dominate the Mexican border states, and why DC and MD’s lists turn out the way they did. This is more assurance that the names correspond to real world effects.
The big question is whether this is giving us good stereotypical names. The only real way to tell is to compare against my pre-existing stereotypes. Happily, the metric appears to work pretty well for the states I know. I imagine anyone who knows Utah is looking at “Tyson, Skyler, Trevor” and thinking “Of course!” Similarly, seeing “Brendan” at the top of MA’s list is pleasing the evil little stereotyper that lives in my brain. That said, I need people who are familiar with other states to help increase my confidence that these are good stereotypical names.
Interestingly, the girl names look a bit different. There seems to be less close geographical correlation. Look at all the places “Sarah” and “Amber” appear, even at the top of the list. I don’t have a good explanation for this, but I suspect that it’s a real effect and not an artifact of my metric because it shows up for one sex but not the other. Overall I am less confident that the girl names here represent good stereotypical state names, but I think that it’s maybe because of the way people name girls as opposed to something else.

The Future

There is plenty more to do with this data and analysis. I can increase the age range, or look into past naming trends. I am interested to see whether the geographical homogeneity of the girl names decreases as we go back in time.
We can also use a similar analysis to look at trends over time, and confirm once and for all that Ethel is an old person name.
I welcome suggestions for analyses to run. Also, here is the Python code I used in my analysis, in case someone wants to run their own analyses on the data.

24 September 2016

Science and Persuasion

This short essay was originally an answer to a question on Quora asked by some anonymous student. I think it is one of the best things I've written there, and is worth saving over here in a different format.

The question and its context were this:
Which are the best method to convince a religious person of a fact proven by science? I told my religion teacher that in the first five weeks of development of a fetus, "we are all females" but she denied everything. How can I convince her to believe in science?

PS: I know that the Y chromosome determines the sex of a person and that only the appearance of a male and a female fetus is similar in the first five weeks.

(There's a little confusion in my response and the question because the questioner added the postscript while I was writing. I then edited my answer to account for that. You'll figure it out.)

Here was my response.

Science and Persuasion

Convincing anyone of anything through a single conversation alone is nearly impossible. It almost never happens. It is even rarer when topics get controversial. Do not expect to ever find a "magic bullet" for convincing someone that you are right.

That said, let me offer you three points of advice for scientific persuasion. They won't allow you to win any argument, but I doubt you'll find much success without them.

1. Get your facts right.

First, you must get your facts right: according to "Sexual differentiation in humans" and everything I ever learned in school, human sex is determined immediately at conception depending on the chromosome carried by the sperm cell. Thus, in a very fundamental way we are not all female until five weeks. That five week number is just how long it takes for fetuses to start developing sexually-differentiated organs.

If your whole argument depends on getting someone to accept the authority of science but you get your facts wrong, you lower both your own credibility and the credibility of science's authority in future arguments.

(I see you have edited your question details to note that you know this bit about chromosomes. If you know this, why do you think that your interpretation of what "female" means is the correct one? Why do you think your teacher's disagreement means she "doesn't believe in science"? This is crucial, and I deal with it in my third point below. In any case, the first point still stands in general.)

2. Use scientific reasoning, not appeals to scientific authority.

Science is not a collection of facts. It is a collection of observations together with interpretations and arguments about those observations. If you aren't able to make this distinction clear you will misrepresent the scientific process and your arguments will be less convincing.

Remember that it takes a long time and a lot of arguing before anything resembling "scientific consensus" is established. Most of the "facts" we learn today in school are the result of decades or centuries of extremely smart people disagreeing about observations until they reach a mutually agreeable interpretation. And even then sometimes new observations arise that invalidate the previous interpretation and the process starts all over.

As an example, let's consider the question of heliocentrism vs. geocentrism. That the planets go around the sun is now something "everyone knows," but it took the brightest minds of the Renaissance about 100 years to convince themselves of this "fact." (Not to mention that the ancient Greeks argued about the same thing and couldn't agree.) And it took another 200 years after that to make observations that finally showed conclusively that the geocentric model is inconsistent.

Do you know why these scientists came to the conclusion that they did? Do you know the observations and arguments that they made that eventually convinced them? Do you know what the evidence that rules out the geocentric model is? Could you explain these arguments to someone who doesn't know them? If you try to convince someone of heliocentrism without knowing all of this, are you really convincing them of science?

But if you can't explain the reasoning that supports a "scientific fact" then your argument will sound like "Well my priest says your priest is wrong, so you should listen to me." Even if you are ultimately right, you won't be able to make a convincing argument.

3. Make sure you are arguing about facts, and not interpretations of facts.

This is the most difficult point, but also the most crucial. 

This is more difficult than it sounds because you may be arguing for an interpretation and think you're arguing for a fact. In your example, are you trying to convince your teacher that "fetuses do not develop sexually differentiated organs until five weeks" or that "fetuses are all female until five weeks (and therefore biological sex is not fundamental to human identity)"? These are very different claims and require very different kinds of arguments! If you are not clear in your own mind what you are arguing, you will have a hard time convincing anyone of anything.

And even if you are certain what you are actually arguing, the other person may not be arguing about the same thing! What we call "facts" are almost always tied up with interpretations.

Imagine that you meet someone who wants to convince you that average African American IQ scores are 15 to 18 points lower than those of White Americans. My guess is that you, like me, have some immediate reservations about this person. Your first thought may be something like "Okay, but why do you want to convince me of this? Next you're probably going to try to convince me that blacks are inherently inferior to whites or some other racist claim." You may or may not dispute the immediate fact, but there are a whole lot of interpretations closely related to the fact that you strongly disagree with, and so hearing the fact puts you on edge. You may even be tempted to argue against the fact so that you don't have to argue over the potentially racist interpretations. If you do that, won't you look like "you don't believe in science"?

This sort of thing happens all the time! I highly recommend reading this blog post/research paper [1] about how often conversations and surveys that look like they are about science knowledge are actually about religious belief (or the lack thereof). It examines some large scale surveys of science literacy/religious belief, especially about evolution, and concludes that
  "That work shows that there isn't relationship. What people say they “believe” about evolution is a measure of who they are, culturally. It’s not a measure of what they know about what’s known to science."
and
When subjects who are highly science literate but highly religious answer “False” to the NSF Indicator’s Evolution item, their response furnishes no reason to infer that they lack knowledge of the basic elements of the best scientific understanding of evolution.
and
For respondents who are below average in religiosity, a high score in “science literacy” predicts a higher probability of “believing” in “Naturalistic Evolution”—and so does a low score!
That is, when it comes to certain subjects, even if you think you are talking about "science facts" other people are talking about what they believe about religion. Even people who give the scientifically correct answer about these topics may not actually know anything about the science, but are telling you that they have low religiosity. (These, I suppose, are people who were convinced of scientific authority, but not scientific reasoning.)

This is very, very difficult to overcome. If you really want to convince someone of a scientific fact, leaving interpretation alone, you will have to put in tremendous effort to convince them of that. You must convey to them "I am not trying to attack your fundamental values. I recognize that there are many possible interpretations of this data. I really do just want to point out this scientific observation. I respect your values and interpretations, even if I disagree with some of them. If I can get you to agree to this scientific fact, I am not then going to use it against you to make your other beliefs look foolish."

This is hard. Maintaining trust in conversation when interpretations get controversial is one of the hardest tasks there is. It requires trying to really understand your conversational partner and their values. It requires asking lots of questions, and listening sincerely. If you assume that anyone who disagrees with you "doesn't believe in science," you have failed at this, and you will not find success until you can overcome this bias.

But it is possible. People can come together and learn from each other, and we should never stop trying. When we succeed, it is one of the peaks of human achievement. It is worth spending your life cultivating this skill, and I commend you for trying to learn it.

Footnotes