Thursday, January 14, 2010

Power law curve in surnames

There are purportedly six million unique surnames in the United States.

Think about that. Considering how many Smiths and Johnsons are running about, that means there are millions of surnames each held by only a handful of people. Indeed, the New York Times says that while 151,000 surnames were shared by a hundred or more Americans, four million were held by only one person. For some reason, I suspect that typos are responsible for at least a million of those. How many census records, for example, accidentally point to the surname "Smioth" or "Wikkiams"?

A few years ago, I wanted to know how many Kohses there were in the United States. I suspected there were about 200 or so. I was able to obtain a tally sheet from the United States Census Bureau, and while they did not preserve the exact 1990 census count, I learned at least that "Kohs" was about the 56,229th most frequently occurring surname in America.

Recently, though, I went back to review that file, and it's been updated with more detailed year 2000 census data. They now tally over 150,000 surnames having more than 100 members. The "Kohs" name has dropped to roughly 84,631st place (the rank is approximate, since many surnames tie at the same estimated number of members). They estimate about 206 people with that last name in the United States -- very close to my personal guess of "about 200". Funny that I've already collected at least 30 of them on Facebook.

While poking through the first 1,000 surnames, though, I made a fascinating discovery. After I charted the data, it was clear that the top 1,000 American surnames follow a power law curve in terms of distribution.
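For anyone who wants to try the same check, here is a minimal sketch of one way to do it, not necessarily how I charted it. A power law means count ≈ C × rank^(-alpha), which shows up as a straight line when both rank and count are plotted on logarithmic axes. The function name `fit_power_law` and the numbers in `synthetic_counts` are made up for illustration; they are not census figures, so swap in the real counts from the Census Bureau file.

```python
import math

def fit_power_law(counts):
    """Fit count ~= C * rank**(-alpha) by least squares on log-log axes.

    `counts` lists surname counts sorted from most to least common.
    Returns (alpha, C, r_squared).
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line
    intercept = mean_y - slope * mean_x    # log of the scale factor C
    r_squared = (sxy * sxy) / (sxx * syy)  # 1.0 means a perfect straight line
    return -slope, math.exp(intercept), r_squared

# Synthetic stand-in data (an assumption, NOT real census counts):
# an exact power law with alpha = 0.7 over the top 1,000 ranks.
synthetic_counts = [2_000_000 * rank ** -0.7 for rank in range(1, 1001)]

alpha, scale, r2 = fit_power_law(synthetic_counts)
print(f"alpha = {alpha:.3f}, C = {scale:,.0f}, R^2 = {r2:.4f}")
```

An R² close to 1 on real data is suggestive rather than conclusive. A straight line on log-log axes is a quick sanity check, not a rigorous test that the data follow a power law.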

Naïve as I may be, I momentarily thought I had made an impressive discovery. Of course, that is not the case. Academic studies have already examined the power law properties of surnames in Puerto Rico, Japan, England and Wales, Korea, and doubtless many other countries, including the United States.

