Phineas and Name Uniqueness

It’s been a while since I posted to this personal blog; so long, in fact, that I have had a child! We named him “Phineas Charles Ball”. (Photos are on Flickr.) “Phineas” is a fairly unusual name, although it’s become more familiar lately, and this post is my exploration of how “weird” this name actually is, and how name uniqueness trends have been developing over time.

As many of you already know, one of the most useful sources for analyzing baby name trends in the United States is the baby name data published by the Social Security Administration. These data have become especially high quality as social security numbers have become ubiquitous (at this point almost all children acquire one at birth). What you might not have realized is that some great raw data files are also available that go beyond what the website provides. The only limitation in these is that names used fewer than five times in a given year are not reported (for privacy reasons).

The first thing I wanted to plot was what most of us have noticed, qualitatively if not quantitatively: names have been becoming more unique. First I calculated the diversity as Shannon entropy. (I did a bit of a hack though: because I was limited to names seen 5 or more times, I only calculated the entropy of the most common 90% of names in a given year. This was close to the maximum possible: by 2011 nearly 1 in 10 girls had a name seen fewer than five times!)
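For anyone who wants to try the entropy calculation, here is a minimal sketch in Python. The function name `shannon_entropy` and the toy counts are made up for illustration (they are not SSA data), and the sketch leaves out the "most common 90%" truncation described above, so it is not necessarily how the original plots were produced.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a name distribution, given raw counts."""
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical toy counts, NOT taken from the SSA files.
few_popular = {"John": 70, "James": 20, "Robert": 10}
evenly_spread = {"Liam": 25, "Noah": 25, "Mason": 25, "Ethan": 25}

print(shannon_entropy(few_popular))    # lower: a few names dominate
print(shannon_entropy(evenly_spread))  # higher: names are spread evenly
```

Higher entropy means births are spread across more names more evenly, which is the sense of "more unique" used in this post.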

Another way to slice this data is to try to answer this question: “How many names are needed to cover half the population?” (Or 10%. Or 90%.)

In 1950 you could cover half the male population with just 24 names — in 2010 you needed 139. As a child I remember sadly eyeing prelabeled personalized souvenirs, knowing I wouldn’t find my name among the items. (This is especially true because my first name isn’t the most common spelling.) Selling this sort of prelabeled paraphernalia has become a lot more difficult — many more names are needed to cover the same fraction of the population!
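The "how many names cover half the population" question can be sketched as below. This is an illustration under assumptions: `names_needed` is a hypothetical helper, the counts are toy numbers, and the optional `total_births` argument is one possible way to account for rare names that are missing from the published files.

```python
def names_needed(counts, fraction, total_births=None):
    """Smallest number of top names whose combined count reaches `fraction`.

    `total_births` lets the caller pass the true number of births when rare
    names are missing from `counts`; otherwise the sum of `counts` is used.
    Returns None if the listed names never reach the target.
    """
    total = total_births if total_births is not None else sum(counts.values())
    target = fraction * total
    covered = 0
    ranked = sorted(counts.values(), reverse=True)
    for rank, n in enumerate(ranked, start=1):
        covered += n
        if covered >= target:
            return rank
    return None


# Hypothetical toy counts, NOT taken from the SSA files.
toy = {"John": 50, "James": 30, "Robert": 10, "Phineas": 6, "Ansel": 4}

print(names_needed(toy, 0.5))
print(names_needed(toy, 0.85))
```

Running the same function once per year of data would give one curve for each coverage level (10%, 50%, 90%).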

Some observations…

  1. Name uniqueness hasn’t been increasing monotonically. Names seem to have become slightly less unique between 1910 and 1950. After 1950 uniqueness increased, and really took off in the mid-1980s.
  2. Girl names are more unique than boy names (you probably already noticed this). It may be interesting to note that boy names today are as unique as girl names were in the early 1990s.
  3. You should take the early data with a grain of salt: the total applicant data shows that not all US citizens received social security numbers (SSNs), and especially few of those born before 1910 did. The program was created in 1935 and the legal uses of SSNs expanded gradually.

So Phineas’s name occurs in a context of increasing uniqueness: to have a rare name now is more common than it was when I was born, and much more common than when my parents were born. This particular name also happens to have become more popular lately. When we slice the data we find that in the latest years the uniqueness of “Phineas” is near the 80th percentile: one in five boys has a rarer name. It’s a bit unusual, but it’s not a dramatic outlier.
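One way to compute a "share of boys with a rarer name" figure is sketched below. The helper `share_with_rarer_name` and the counts are hypothetical, and the sketch assumes that every baby missing from the counts (names given fewer than five times) has a rarer name than the one being checked. The exact method behind the 80th percentile figure may differ.

```python
def share_with_rarer_name(counts, name, total_births):
    """Fraction of babies whose name is rarer than `name`.

    Assumption: babies whose names are absent from `counts` all have
    rarer names, which holds when `name` itself appears in `counts`.
    """
    threshold = counts[name]
    at_least_as_common = sum(n for n in counts.values() if n >= threshold)
    return (total_births - at_least_as_common) / total_births


# Hypothetical toy counts; 110 of the 1000 births have unlisted names.
boys = {"Jacob": 600, "Ethan": 200, "Phineas": 50, "Ansel": 40}

print(share_with_rarer_name(boys, "Phineas", total_births=1000))
```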

I’ll close with a list of famous Phineases: Phineas Gage (a famous case of frontal brain damage), Finny in “A Separate Peace”, P.T. Barnum (P. = Phineas!), and Phineas Flynn from the cartoon “Phineas and Ferb”. Also oft misremembered as Phineas: Phileas Fogg in Jules Verne’s “Around the World in Eighty Days”. Chris’s favorite find is Phineas Ball (1824-1894), waterworks engineer and mayor of Worcester, MA.

23andme’s First Patent

Update, June 1: 23andme has added an addendum to their announcement. In particular, the addendum clarifies and seems to promise that the patent will not be enforced with respect to performing interpretations: “Other entities can present information about the genetic associations covered in our patents without licensing fees.” This is reassuring news and it’s great to see 23andme outline such a limitation on patent enforcement! It allays my feared hypothetical situation regarding a “swiss cheese” effect on genome interpretation efforts (described below). — Madeleine

This morning I noticed a post from 23andme’s blog last night: Anne Wojcicki announced that 23andme expects to be awarded its first patent today. It touched on a lot of issues I care about, so I’ve written this personal post in response to it.

From what I understand, the 23andme patent seems to be a patent on genetic variant interpretation: specifically, on the interpretation of some variants (including one in the gene SGK1) as being associated with differences in an individual’s risk of developing Parkinson’s disease. Technical methods for determining the variants are listed, but they seem to be an enumeration of all extant methods for assessing genetic variants (including techniques used in whole genome sequencing).

In other words: this seems to be a patent regarding the reporting and usage of an observation that a naturally-occurring genetic variant is associated with a particular trait. As noted by 23andme’s announcement, these patents are controversial.

Patent Wars

While my first love is genetics, I am also a programmer — and in software, patents are very broadly hated by programmers. This American Life has an excellent episode documenting the tangled mess that is the software patent industry. It has become an arms race; even the most well-intentioned companies feel obligated to build up patent arsenals. Software patents are a different beast to biotechnology patents, but in some ways larger issues remain true: applied too broadly, in a field of rapid progress, patents have the potential to create a tangled web of litigation. The intended purpose of patents to protect innovation and encourage commercialization through exclusive access to innovation has instead become outright warfare.

A web of litigation in the mobile phone industry. ©2010 George Kokkinidis / Design Language, used with permission

I worry that this vision of patent warfare could exist in the realm of genome interpretation. The multitude of patents on the meaning of genetic variants seems to make the process of whole genome interpretation almost impossibly hazardous. I think it is vital to everybody that we are able to not merely return your “A’s, C’s, G’s, and T’s”, but also give you explanations like “you have A here, and according to these studies this means you are much less likely to be infected with stomach flu”. Will each one of those explanations run the risk of violating a patent? Will genome interpretations become like Swiss cheese as they must carefully avoid mentioning each of the patented genes (which are possibly the most important ones)? Is part of 23andme’s purpose here to build up its own arsenal of interpretations, as both defense and weapon against other interpretation efforts?

Will patents on the observed associations of genetic variants turn whole genome interpretation efforts into swiss cheese? Image credit: Madeleine Price Ball, CC-BY-SA

23andme is far from the first in this field (there are hundreds or thousands of patents like this one) and it is possible that they have no intention to engage in such wars. Nevertheless, as far as I am aware they have not released an assurance that the patent will not be used in this way (of course, neither has anyone else). In the software industry, some groups have made assurances regarding their patents — promises that the patents will only be used for defensive purposes (e.g. Twitter) or limits on their offensive uses (e.g. Red Hat). That said, such promises are easily broken.

Also troubling to me is the exact wording in the announcement itself:

“We believe patents should not be used to obstruct research or prevent individuals from knowing what’s in their genome. We believe that everyone has a right to know their genomes — their sequence of As, Ts, Cs, and Gs — and should be able to access them should they want to. This has been our guiding principle since day one, and 23andMe has pioneered the ability for individuals to have unfettered access to their genomes.”

I’m reading between the lines, but… if access to your genome means that you only have access to the uninterpreted sequence of A’s, T’s, C’s, and G’s — a completely unintelligible mess to the vast majority of humanity — then I think that falls short of “unfettered access”.

Patenting Nature

There is an important difference between software patents and gene interpretation patents. While software is clearly the product of design (hence the term “software engineer”), patents on the interpretations of genes are the product of discovery. Indeed, the word “discovery” dominates 23andme’s own announcement of the patent. As that announcement noted, whether this is patentable material is the subject of hot debate. Is this patenting a “law of nature”? While using the laws of nature is fundamental to any process, patent law has held that the “laws of nature” themselves are not patentable.

I am a researcher and not a lawyer, but I’ll try to summarize my understanding of the recent “Prometheus” case referenced by 23andme’s announcement. In a unanimous decision, the Supreme Court struck down the patentability of the act of monitoring the levels of a drug metabolite (the product of the drug as the body breaks it down) and the use of this information to adjust dosage of that drug. This correlation was held to be a “law of nature”, and therefore unpatentable. Some phrases from the decision that stood out to me were these:

“But to transform an unpatentable law of nature into a patent eligible application of such a law, a patent must do more than simply state the law of nature while adding the words ‘apply it.’”

“… the claimed processes are not patentable unless they have additional features that provide practical assurance that the processes are genuine applications of those laws rather than drafting efforts designed to monopolize the correlations.”

Patenting the observed naturally-occurring traits associated with a naturally occurring genetic variant strikes me as a very similar “law of nature”. Perhaps even more so: at least the drug itself was some level of non-natural engineering? This is far from resolved, however. The more relevant case, the “Myriad” case regarding a patent on BRCA variants and their associations with breast cancer risk, has been remanded to the Federal Circuit for reconsideration in light of the Prometheus case. I am optimistic that the act of reading and interpreting genetic variants will be held to be non-patentable, and that all my worries written here will be moot and forgotten... but this remains to be seen.

Cashing In On Crowdsourcing?

The discoveries made by 23andme have come from their “23andWe” program, a crowdsourcing of scientific research. A recent Nature Reviews Genetics article describes such programs as “participant centered initiatives”: “tools, programs and projects that empower participants to engage in the research process”. Crowdsourcing is a powerful tool to rapidly meet a goal, and an exciting consequence of the internet’s transformational facilitation of connecting and communicating. But it holds some darker questions: to what extent does such a program exist to benefit the participant, and to what extent is the participant used as a resource to benefit the organization? Although the lines might be fuzzy to draw, the ownership of and profit from user-generated data have become a clear motivation for companies (cf. Facebook).

The Personal Genome Project has a lot of overlap with 23andWe in style. We want to collect similar information from participants — we ask people (if they are willing) to share information regarding their health and traits, as well as genome data. But there is also a key difference between the two projects: we do not hold this data privately for our own research. We release the data publicly for all others to see, and this is something we are uniquely able to do due to our open consent process. We want everyone — including our participants — to have as much access to the data as we do, and the same potential to make interesting discoveries.

As such, I see Personal Genome Project participants as very much our “peers” in this research endeavor. For this reason I prefer to use the phrase “peer production” rather than “crowdsourcing” to describe some aspects of our work (a term that can also be applied to projects like Linux and Wikipedia): not merely a project that solicits participant contributions, but one that genuinely shares those contributions as freely as possible.

A Python GEDCOM Parser

Excited by my discovery of Mayflower ancestry (or perhaps by the apparent confirmation that my genealogy records weren’t totally made up), I decided to contact other individuals on 23andme who were predicted to share DNA fragments with me and seek out other cases of family overlap.

The task was rapidly daunting! On 23andme there aren’t really any tools; the method of choice appears to be listing all surnames in one’s ancestry. The GEDCOM format genealogy file my father has documented is huge: I currently have 384 ancestors in the document (and 163 surnames). Other genealogy buffs have similarly deep information, and I slowly realized that manually searching for overlaps between our lists was not at all practical.

My first “quick and dirty” attempt was to grep the file for last name matches. Little did I realize there are actually 1,547 individuals in my file! People who are not my direct ancestors (cousins and their spouses and children) are listed as well. On one hand this was really cool, more data is better… but on the other hand it meant a lot more thought was required.

To cut a long story short, I ended up finding an old GPL-licensed Python GEDCOM parser (linked here as “GEDCOM Parser”). I extensively improved it (in my humble opinion) and have uploaded the code to GitHub as “python-gedcom”. The end result was a module I can use to pull out direct ancestors, search on last name matches, and return the path between me and a given ancestor.

Applying this to a new 23andme match (who also had 160 surnames!) I found 27 potential surname matches among my ancestors — all in the New England area. (This might simply reflect that my New England ancestors are the most extensively documented region in my tree.) I sent my distant relative this list of names, along with dates & locations of birth & death (where available).
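The surname comparison itself boils down to a set intersection. Here is a minimal sketch, assuming each side's surnames are already available as plain Python lists; `shared_surnames` is a hypothetical helper (not part of python-gedcom), and the second list is invented for the example.

```python
def shared_surnames(mine, theirs):
    """Return surnames from `mine` that also appear in `theirs`.

    Comparison ignores case and surrounding whitespace. Spelling variants
    are NOT matched, so this will miss some real overlaps.
    """
    def key(name):
        return name.strip().casefold()

    their_keys = {key(name) for name in theirs}
    return sorted({name.strip() for name in mine if key(name) in their_keys})


# A few surnames from my tree, and a made-up list for the other person.
my_surnames = ["Price", "Arms", "Aitken", "Wales", "Strong", "Stoughton"]
their_surnames = ["STOUGHTON", "strong ", "Smith", "Jones"]

print(shared_surnames(my_surnames, their_surnames))
```

A surname match is only a candidate, of course; dates and locations are still needed to confirm that two people are really the same ancestor.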

From that, he found one definite overlap. Here’s the path from me to that ancestor, nine generations distant:

Gen 0  Madeleine Emily Price
Gen 1 . Paul Arms Price
Gen 2 .. Doris Madeline Arms
Gen 3 ... Howard William Arms
Gen 4 .... Jane Aitken
Gen 5 ..... Eliza Wales
Gen 6 ...... John Wales
Gen 7 ....... Lucy Strong
Gen 8 ........ Martha Stoughton
Gen 9 ......... John Stoughton
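The idea behind pulling out direct ancestors and finding a path like the one above can be sketched without any GEDCOM parsing. The code below is an illustration only: it assumes the family tree has already been loaded into a plain dictionary mapping each person to a `(father, mother)` pair, and the function names and toy tree are hypothetical rather than the python-gedcom API.

```python
def direct_ancestors(parents, person):
    """Set of all direct ancestors of `person` (parents, grandparents, ...)."""
    found = set()
    stack = [person]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent is not None and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def path_to_ancestor(parents, person, ancestor):
    """List of people from `person` up to `ancestor`, or None if no path."""
    if person == ancestor:
        return [person]
    for parent in parents.get(person, ()):
        if parent is None:
            continue
        rest = path_to_ancestor(parents, parent, ancestor)
        if rest is not None:
            return [person] + rest
    return None


# Toy tree with placeholder names; None marks an unknown parent.
parents = {
    "me": ("father", "mother"),
    "father": (None, "grandmother"),
    "grandmother": ("great-grandfather", None),
    "cousin": ("uncle", None),  # a relative who is not a direct ancestor
}

path = path_to_ancestor(parents, "me", "great-grandfather")
for gen, name in enumerate(path):
    print(f"Gen {gen} {'.' * gen} {name}")
```

Restricting surname searches to `direct_ancestors(parents, "me")` is what filters out the cousins and in-laws that a plain grep would also match.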

To be fair, I think it’s possible (even probable) that given our shared New England ancestry we have other points of overlap that we didn’t discover. I’m really pleased, though, at how tractable this task became once a program was applied: programming is a useful skill to have!

The Invisible Privilege of Not Being a Black Man

Mostly I post about science, but there was a very good Morning Edition item I’d like to share. It aired a couple days ago in the wake of the tragedy of Trayvon Martin’s death. I think it did an excellent job of discussing the topic, avoiding (justifiable) anger to provoke simple empathy. Because it’s a very emotionally-laden topic, I highly recommend you actually listen to it, not just read the transcript:

A mom’s advice to her young black sons

In light of the shooting death of Florida teen Trayvon Martin, Steve Inskeep speaks with writer Donna Britt and her sons Justin and Darrell Britt-Gibson about how she prepared them as young black men for a world that might view them with suspicion.

Yesterday, in his first time commenting on the case, Obama said: “If I had a son, he’d look like Trayvon.” That cuts to the painful core of this. People are more likely to believe a black man is dangerous and criminal. It’s not limited to “The South”: mistaken criminal assumptions about minorities happen in Boston and London. The consequences range from frustrating to horrifying.

The phrase “invisible privilege” has been used to describe a benefit that people on the favored side of a social divide aren’t usually conscious of. The story of Trayvon’s death brings my attention to a privilege I’m usually unaware of: I don’t have to live with this fear — the fear that my son or brother or husband could be mistaken for a dangerous criminal… and die for that mistake.

Genetics, the Mayflower, and Me

A distant cousin

I was recently contacted by a distant relative on 23andme, based on a shared last name in our family trees (you can list a set for others to see) and a shared fragment of DNA. We were able to trace our connection to ancestral siblings born in the 1720’s (eight generations between me and the parents of these siblings!). My distant cousin told me more about that branch of the tree — the mother of both siblings can be traced back another three generations to Constance Hopkins, a 14-year-old passenger on the Mayflower and daughter of Stephen Hopkins. There were only 102 passengers on the Mayflower, which sailed in 1620, and half of them died the first winter. My great-great-great-great-great-great-great-great-great-grandmother Constance survived, however, and had a dozen children with Nicholas Snow.

Genealogy just got real

One amazing thing about this connection is that the shared DNA proves, beyond reasonable doubt, that our genealogies are solid for every link descending from those 1720’s siblings! Non-paternity events are common enough to make me suspicious of family trees tracing into the distant past: with five father-to-child links in those eight generations, how sure could I have been that all those links were honest? If you assume a non-paternity rate of 10%, there’s only a 60% chance none of those generations had a little “shenanigans” (.9**5 = 0.59). A more optimistic non-paternity rate of 4% raises the estimate to 80% (.96**5 = 0.82), but you can see why it’s hard for me to get too excited about these things.
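The arithmetic above is easy to reproduce. This small sketch assumes each father-to-child link is independent and has the same non-paternity rate; the function name is made up for the example.

```python
def chance_all_links_true(non_paternity_rate, father_links):
    """Probability that every father-to-child link is biologically true,
    assuming the links are independent with a constant rate."""
    return (1 - non_paternity_rate) ** father_links


for rate in (0.10, 0.04):
    print(f"{rate:.0%} rate: {chance_all_links_true(rate, 5):.2f}")
# 10% rate: 0.59
# 4% rate: 0.82
```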

Now this genealogy is real to me in a way it wasn’t before — before this distant cousin contacted me it was just a hypothetical set of historical records, but now I can be quite confident that the links back to that pair of siblings are a true history.

That leaves three unconfirmed generations between me and Constance, only one of which was a father-to-child link (potential non-paternity event). It sounds like I can be fairly confident in stating I am a descendant of an original Mayflower passenger!

What are the chances of that anyway?

The Mayflower passengers were a tiny number of people …but… as Constance’s own records show, those that survived were extremely prolific. According to Wikipedia, between 1640 and 1790 the population of New England grew from ~13,700 to ~900,000 with almost no influx of immigration.[1] Assuming an average of 30 years per generation (which is true for my own ancestry to Constance Hopkins) that’s five generations: each individual in 1790 had about 32 ancestors from 1640 (2**5 = 32). That’s a bit high (mixing isn’t perfect) and I’m not sure we can trust that *no* immigration occurred. To be conservative let’s assume there’s effectively an average of 24 ancestors from 1640’s New England for each individual in 1790.

How many people in 1640 (out of 13,700) were Mayflower passengers or descendants? I’m guessing that the 50 survivors of the 1620 trip could’ve grown to 100 settlers & descendants by 1640. That means about 0.73% of New Englanders in 1640 were Mayflower descendants.

Putting these numbers together, we would predict 1 in 6 New Englanders in 1790 would have at least one of their 24 ancestors from 1640 be from the Mayflower (1 – 0.9927**24 = 0.16 =~ 1/6). The total US population at the time was 3.9 million, so this implies around 1 out of 26 US residents at that time was a Mayflower descendant (6 * 3,900,000 / 900,000 = 26).
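Here is the same back-of-envelope estimate as a short script, using only the numbers guessed above. It treats the 24 ancestors as independent draws from the 1640 population and assumes Mayflower descendants in 1790 all lived in New England; the function and parameter names are invented for the sketch.

```python
def mayflower_estimate(
    mayflower_1640=100,        # guessed settlers and descendants by 1640
    new_england_1640=13_700,
    ancestors_per_person=24,   # effective 1640 ancestors per 1790 resident
    new_england_1790=900_000,
    us_1790=3_900_000,
):
    """Return (share of 1790 New Englanders, share of 1790 US residents)
    with at least one Mayflower ancestor, under the assumptions above."""
    share_1640 = mayflower_1640 / new_england_1640
    p_new_england = 1 - (1 - share_1640) ** ancestors_per_person
    p_us = p_new_england * new_england_1790 / us_1790
    return p_new_england, p_us


p_ne, p_us = mayflower_estimate()
print(f"New England, 1790: {p_ne:.2f} (about 1 in {1 / p_ne:.0f})")
print(f"United States, 1790: about 1 in {1 / p_us:.0f}")
```

Without rounding to 1 in 6 along the way, the US figure comes out closer to 1 in 27 than 1 in 26, which is well within the slop of the guesses going in.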

At this point I’ll give up on guesswork. Immigration starts to play a stronger role in the growth of the United States after this point, but those immigrants mixed with the existing population. With seven more generations worth of mixing, the fraction of Mayflower descendants could easily be higher than it was in 1790. I’m guessing a fair number of people in the US can trace some ancestry to early New England, and if they can, I think there’s a good chance they have an ancestor descended from that tiny initial group that came over on the Mayflower.[2]

[1] The Wikipedia paragraph referenced gives the 1790 population as 700,000, but I decided to use the slightly higher numbers from the table below (“Estimated Population of American Colonies 1620 to 1780”). 1790’s number of 900,000 is based on an estimated growth rate of 2.32% per year, which is based on the New England population growth from 1740 to 1780 in that chart.

[2] As I have been recently reading 1493, I’m keenly aware of how these settlers can also be described as invaders who took advantage of a collapse in the native populations. I don’t think Plymouth and the first Thanksgiving should be romanticized, nor should I feel unfair guilt over it (it’s hardly my fault, and most of the native collapse happened before they got there), but the history of Plymouth is a bit more real to me now.

The most astounding and poetic fact I know about the world

A recent hubbub was stirred when Miley Cyrus tweeted a link to a photo & quote from Lawrence Krauss — a reflection on our common origin as stardust (…and also had a comment somewhat dismissive towards a religious figure). To him it was the most poetic thing he knew about the universe.

I thought I’d share a fact I find at least as astounding and poetic:

Butterfly on flower, by Ben124.

When I watch a butterfly resting on a flower, I know that the instructions that made that flower, the instructions that made the butterfly, the instructions that made me, even the instructions of the invisible and ubiquitous bacteria...

These instructions all speak the same language.

The Genetic Code. CC-BY-SA, derived from this.

Every single living thing uses the same DNA, the same genetic code, the same arbitrary correspondence of how to build proteins out of amino acids.

It has been four billion years since bacteria and I parted ways, but I can still take a piece of my own instructions and place it inside a bacterium, and the bacterium can read it, it can use it.

The sheer improbability that the parts making me can be used by a bacterium astounds me. That the instructions creating every form of life are written in the same language is a deeply powerful demonstration of our common origin: we are all distant cousins.

John Lauerman on Science Friday tomorrow!

John Lauerman (a.k.a. PGP16) will be on Science Friday tomorrow talking about his experience with the Personal Genome Project!

Genetic Test Reveals Unexpected Data (Science Friday)

For those who don’t know what I’m talking about, I recommend you read my recent PGP blog post about it and his own article — it’s possibly the most interesting case we’ve had to date:

Unexpected scary findings: the tale of John Lauerman’s whole genome sequencing (Personal Genome Project Blog)

DNA Nanorobots!

Congrats to Shawn & Ido! How can you go wrong with robots? A different approach to “DNA computing”: I find this way cooler than logic-gated genes, but maybe that’s just me.

And congrats to George — see, I named all the authors right there. Three authors! THREE AUTHORS! The technology is sweet too but I don’t know how many stars have to align to get such a short author list in a Science/Nature biotech article these days.

A Logic-Gated Nanorobot for Targeted Transport of Molecular Payloads (Science)

We describe an autonomous DNA nanorobot capable of transporting molecular payloads to cells, sensing cell surface inputs for conditional, triggered activation, and reconfiguring its structure for payload delivery. The device can be loaded with a variety of materials in a highly organized fashion and is controlled by an aptamer-encoded logic gate, enabling it to respond to a wide array of cues. We implemented several different logical AND gates and demonstrate their efficacy in selective regulation of nanorobot function. As a proof of principle, nanorobots loaded with combinations of antibody fragments were used in two different types of cell-signaling stimulation in tissue culture. Our prototype could inspire new designs with different selectivities and biologically active payloads for cell-targeting tasks.

Personal Genome Project Blog

A little belated, but for anyone interested in the Personal Genome Project who hasn’t already heard: we’ve created a blog! We’re hoping to post something at least once a week, although sometimes it’ll be a small update.

The blog’s name hasn’t been decided yet; the first post offers a vote on some ideas & suggestions in comments. Jason also posted an announcement of the GET conference, and I posted a slightly edited version of my X-carrier status analysis.

Personal Genome Project Blog