
A Python GEDCOM Parser

Excited by my discovery of Mayflower ancestry (or perhaps by the apparent confirmation that my genealogy records weren’t totally made up), I decided to contact other individuals on 23andme who were predicted to share DNA fragments with me and seek out other cases of family overlap.

The task quickly became daunting! 23andme doesn’t really offer any tools for this; the method of choice appears to be listing all surnames in one’s ancestry. The GEDCOM-format genealogy file my father has put together is huge: I currently have 384 ancestors in the document (and 163 surnames). Other genealogy buffs have similarly deep information, and I slowly realized that manually searching for overlaps between our lists was not at all practical.

My first “quick and dirty” attempt was to grep the file for last name matches. Little did I realize there are actually 1,547 individuals in my file! People who are not my direct ancestors (cousins and their spouses and children) are listed as well. On one hand this was really cool, more data is better… but on the other hand it meant a lot more thought was required.

To cut a long story short, I ended up finding an old GPL-licensed Python GEDCOM parser (linked here as “GEDCOM Parser”). I extensively improved it (in my humble opinion) and have uploaded the code to GitHub as “python-gedcom”. The end result is a module I can use to pull out direct ancestors, search for last name matches, and return the path between me and a given ancestor.
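To give a feel for those three operations, here is a toy sketch. It is not the python-gedcom API: the `PARENTS` dictionary and the function names are invented for illustration, the miniature tree lists only one parent per person, and a real GEDCOM file would first have to be parsed into something like it.

```python
from collections import deque

# Hypothetical miniature pedigree: child -> list of known parents.
# (A real GEDCOM file encodes these links through its individual
# and family records; this dict stands in for the parsed result.)
PARENTS = {
    "Madeleine Emily Price": ["Paul Arms Price"],
    "Paul Arms Price": ["Doris Madeline Arms"],
    "Doris Madeline Arms": ["Howard William Arms"],
    "Howard William Arms": ["Jane Aitken"],
}


def direct_ancestors(person, parents=PARENTS):
    """Return {ancestor: generations above `person`} via breadth-first search."""
    found = {}
    queue = deque([(person, 0)])
    while queue:
        current, gen = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in found:
                found[parent] = gen + 1
                queue.append((parent, gen + 1))
    return found


def surname_matches(ancestors, surnames):
    """Ancestors whose last word (taken as the surname) is in `surnames`."""
    wanted = {s.lower() for s in surnames}
    return sorted(a for a in ancestors if a.split()[-1].lower() in wanted)


def path_to(person, ancestor, parents=PARENTS):
    """Depth-first search; list of names from `person` up to `ancestor`, or None."""
    if person == ancestor:
        return [person]
    for parent in parents.get(person, []):
        rest = path_to(parent, ancestor, parents)
        if rest:
            return [person] + rest
    return None


def format_path(path):
    """One line per generation, with a dot for each step away from the start."""
    return "\n".join(f"Gen {i} {'.' * i} {name}" for i, name in enumerate(path))


if __name__ == "__main__":
    me = "Madeleine Emily Price"
    ancestors = direct_ancestors(me)
    print(surname_matches(ancestors, ["Arms", "Wales"]))
    print(format_path(path_to(me, "Jane Aitken")))
```

Restricting the surname search to direct ancestors is the step plain grep could not do, since grep also matches cousins and their spouses.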

Applying this to a new 23andme match (who also had 160 surnames!) I found 27 potential surname matches among my ancestors, all in the New England area. (This might simply reflect that New England is the most extensively documented region in my tree.) I sent my distant relative this list of names, along with dates & locations of birth & death (where available).

From that, he found one definite overlap. Here’s the path from me to that ancestor, nine generations distant:

Gen 0  Madeleine Emily Price
Gen 1 . Paul Arms Price
Gen 2 .. Doris Madeline Arms
Gen 3 ... Howard William Arms
Gen 4 .... Jane Aitken
Gen 5 ..... Eliza Wales
Gen 6 ...... John Wales
Gen 7 ....... Lucy Strong
Gen 8 ........ Martha Stoughton
Gen 9 ......... John Stoughton

To be fair, I think it’s possible (even probable) that given our shared New England ancestry we have other points of overlap that we didn’t discover. I’m really pleased, though, at how tractable this task became once a program was applied: programming is a useful skill to have!

The Invisible Privilege of Not Being a Black Man

Mostly I post about science, but there was a very good Morning Edition item I’d like to share. It aired a couple days ago in the wake of the tragedy of Trayvon Martin’s death. I think it did an excellent job of discussing the topic, avoiding (justifiable) anger to provoke simple empathy. Because it’s a very emotionally-laden topic, I highly recommend you actually listen to it, not just read the transcript:



A mom’s advice to her young black sons


In light of the shooting death of Florida teen Trayvon Martin, Steve Inskeep speaks with writer Donna Britt and her sons Justin and Darrell Britt-Gibson about how she prepared them as young black men for a world that might view them with suspicion.

Yesterday, in his first time commenting on the case, Obama said: “If I had a son, he’d look like Trayvon.” That cuts to the painful core of this. People are more likely to believe a black man is dangerous and criminal. It’s not limited to “The South”: mistaken criminal assumptions about minorities happen in Boston and London. The consequences range from frustrating to horrifying.

The phrase “invisible privilege” has been used to describe a benefit that people on the favored side of a social divide aren’t usually conscious of. The story of Trayvon’s death brings my attention to a privilege I’m usually unaware of: I don’t have to live with this fear — the fear that my son or brother or husband could be mistaken for a dangerous criminal… and die for that mistake.

Genetics, the Mayflower, and Me

A distant cousin

I was recently contacted by a distant relative on 23andme, based on a shared last name in our family trees (you can list a set for others to see) and a shared fragment of DNA. We were able to trace our connection to ancestral siblings born in the 1720s (eight generations between me and the parents of these siblings!). My distant cousin told me more about that branch of the tree: the mother of both siblings can be traced back another three generations to Constance Hopkins, a 14-year-old passenger on the Mayflower and daughter of Stephen Hopkins. There were only 102 passengers on the Mayflower, which sailed in 1620, and half of them died the first winter. My great-great-great-great-great-great-great-great-great-grandmother Constance survived, however, and had a dozen children with Nicholas Snow.

Genealogy just got real

One amazing thing about this connection is that the shared DNA proves, beyond reasonable doubt, that our genealogies are solid for every link descending from those 1720s siblings! Non-paternity events are common enough to make me suspicious of family trees tracing into the distant past: with five father-to-child links in those eight generations, how sure could I have been that all those links were honest? If you assume a non-paternity rate of 10%, there’s only a 60% chance none of those generations had a little “shenanigans” (.9**5 = 0.59). A more optimistic non-paternity rate of 4% raises the estimate to 80% (.96**5 = 0.82), but you can see why it’s hard for me to get too excited about these things.
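The arithmetic above is easy to replay for other rates. This small sketch assumes each father-to-child link is independent of the others; the function name is made up for the example.

```python
def chance_all_links_true(non_paternity_rate, father_links):
    """Probability that none of the father-to-child links hides a non-paternity event."""
    return (1 - non_paternity_rate) ** father_links


# The two rates discussed above, over five father-to-child links.
for rate in (0.10, 0.04):
    print(f"{rate:.0%} rate: {chance_all_links_true(rate, 5):.2f}")
# prints:
# 10% rate: 0.59
# 4% rate: 0.82
```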

Now this genealogy is real to me in a way it wasn’t before — before this distant cousin contacted me it was just a hypothetical set of historical records, but now I can be quite confident that the links back to that pair of siblings are a true history.

That leaves three unconfirmed generations between me and Constance, only one of which was a father-to-child link (potential non-paternity event). It sounds like I can be fairly confident in stating I am a descendant of an original Mayflower passenger!

What are the chances of that anyway?

The Mayflower passengers were a tiny number of people… but, as Constance’s own records show, those that survived were extremely prolific. According to Wikipedia, between 1640 and 1790 the population of New England grew from ~13,700 to ~900,000 with almost no influx of immigration.[1] Assuming an average of 30 years per generation (which is true for my own ancestry to Constance Hopkins) that’s five generations: each individual in 1790 had about 32 ancestors from 1640. That’s a bit high (mixing isn’t perfect), and I’m not sure we can trust that *no* immigration occurred. To be conservative let’s assume there’s effectively an average of 24 ancestors from 1640s New England for each individual in 1790.

How many people in 1640 (out of 13,700) were Mayflower passengers or descendants? I’m guessing that the 50 survivors of the 1620 trip could’ve grown to 100 settlers & descendants by 1640. That means about 0.73% of New Englanders in 1640 were Mayflower descendants.

Putting these numbers together, we would predict 1 in 6 New Englanders in 1790 would have at least one Mayflower passenger or descendant among their 24 ancestors from the 1640s (1 - 0.9927**24 = 0.16 =~ 1/6). The total US population at the time was 3.9 million, so this implies around 1 out of 26 US residents at that time was a Mayflower descendant (6 * 3,900,000 / 900,000 = 26).
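Here is the whole back-of-the-envelope chain in one place, so the guesses can be swapped out. The variable names are made up for the sketch; the inputs are the figures quoted above, and, like the arithmetic in the text, it treats all Mayflower descendants in 1790 as living in New England.

```python
# Inputs quoted in the text (the last two are guesses, not data).
NEW_ENGLAND_1640 = 13_700
NEW_ENGLAND_1790 = 900_000
US_1790 = 3_900_000
MAYFLOWER_1640 = 100          # guessed settlers and descendants by 1640
EFFECTIVE_ANCESTORS = 24      # conservative stand-in for the full 32

generations = (1790 - 1640) // 30        # 5 generations of 30 years
max_ancestors = 2 ** generations         # 32 ancestors if nobody repeats

# Share of 1640 New Englanders who were Mayflower passengers or descendants.
mayflower_fraction = MAYFLOWER_1640 / NEW_ENGLAND_1640

# Chance that at least one of the 24 ancestors falls in that group.
p_new_england = 1 - (1 - mayflower_fraction) ** EFFECTIVE_ANCESTORS

# Dilute across the whole 1790 US population.
p_us = p_new_england * NEW_ENGLAND_1790 / US_1790

print(f"Mayflower share in 1640: {mayflower_fraction:.2%}")
print(f"New Englanders in 1790:  about 1 in {1 / p_new_england:.1f}")
print(f"US residents in 1790:    about 1 in {1 / p_us:.1f}")
```

Carrying the unrounded probability through instead of rounding to 1 in 6 nudges the last figure from 1 in 26 to roughly 1 in 27, which is well within the slop of the guesses going in.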

At this point I’ll give up on guesswork. Immigration starts to play a stronger role in the growth of the United States after this point, but those immigrants mixed with the existing population. With seven more generations’ worth of mixing, the fraction of Mayflower descendants could easily be higher than it was in 1790. I’m guessing a fair number of people in the US can trace some ancestry to early New England, and if they can, I think there’s a good chance they have an ancestor descended from that tiny initial group that came over on the Mayflower.[2]



[1] The Wikipedia paragraph referenced gives the 1790 population as 700,000, but I decided to use the slightly higher numbers from the table below (“Estimated Population of American Colonies 1620 to 1780”). The 1790 figure of 900,000 is based on an estimated growth rate of 2.32% per year, which is based on the New England population growth from 1740 to 1780 in that chart.

[2] As I have recently been reading 1493, I’m keenly aware of how these settlers can also be described as invaders who took advantage of a collapse in the native populations. I don’t think Plymouth and the first Thanksgiving should be romanticized, nor should I feel unfair guilt over it (it’s hardly my fault, and most of the native collapse happened before they got there), but the history of Plymouth is a bit more real to me now.

The most astounding and poetic fact I know about the world

There was a recent hubbub when Miley Cyrus tweeted a link to a photo & quote from Lawrence Krauss: a reflection on our common origin as stardust (…which also included a comment somewhat dismissive towards a religious figure). To him it was the most poetic thing he knew about the universe.

I thought I’d share a fact I find at least as astounding and poetic:


Butterfly on flower, by Ben124.

When I watch a butterfly resting on a flower, I know that the instructions that made that flower, the instructions that made the butterfly, the instructions that made me, even the instructions of the invisible and ubiquitous bacteria…

These instructions all speak the same language.



The Genetic Code. CC-BY-SA, derived from this.

Every single living thing uses the same DNA, the same genetic code, the same arbitrary correspondence of how to build proteins out of amino acids.

It has been four billion years since bacteria and I parted ways, but I can still take a piece of my own instructions and place it inside a bacterium, and the bacterium can read it; it can use it.

The sheer improbability that the parts making me can be used by a bacterium astounds me. That the instructions creating every form of life are written in the same language is a deeply powerful demonstration of our common origin: we are all distant cousins.
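For the programmers in the audience, that shared language is small enough to fit in a lookup table. This is a minimal sketch of the standard codon table, with a `translate` helper invented for the example; real translation involves far more machinery than a dictionary lookup.

```python
from itertools import product

# The standard genetic code, written in the conventional T, C, A, G order.
# Each letter is a one-letter amino acid code and "*" marks a stop codon.
# (A few lineages and organelles use slight variants of this table.)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read DNA three letters at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Same input, same answer, whether the reader is a butterfly or a bacterium.
    print(translate("ATGGCCTAA"))  # prints MA
```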

John Lauerman on Science Friday tomorrow!

John Lauerman (a.k.a. PGP16) will be on Science Friday tomorrow talking about his experience with the Personal Genome Project!

Genetic Test Reveals Unexpected Data (Science Friday)

For those who don’t know what I’m talking about, I recommend you read my recent PGP blog post about it and his own article — it’s possibly the most interesting case we’ve had to date:

Unexpected scary findings: the tale of John Lauerman’s whole genome sequencing (Personal Genome Project Blog)

DNA Nanorobots!

Congrats to Shawn & Ido! How can you go wrong with robots? It’s a different approach to “DNA computing”, and I find this way cooler than logic-gated genes, but maybe that’s just me.

And congrats to George — see, I named all the authors right there. Three authors! THREE AUTHORS! The technology is sweet too but I don’t know how many stars have to align to get such a short author list in a Science/Nature biotech article these days.

A Logic-Gated Nanorobot for Targeted Transport of Molecular Payloads (Science)

We describe an autonomous DNA nanorobot capable of transporting molecular payloads to cells, sensing cell surface inputs for conditional, triggered activation, and reconfiguring its structure for payload delivery. The device can be loaded with a variety of materials in a highly organized fashion and is controlled by an aptamer-encoded logic gate, enabling it to respond to a wide array of cues. We implemented several different logical AND gates and demonstrate their efficacy in selective regulation of nanorobot function. As a proof of principle, nanorobots loaded with combinations of antibody fragments were used in two different types of cell-signaling stimulation in tissue culture. Our prototype could inspire new designs with different selectivities and biologically active payloads for cell-targeting tasks.

Personal Genome Project Blog

A little belated, but for anyone interested in the Personal Genome Project who hasn’t already heard: we’ve created a blog! We’re hoping to post something at least once a week, although sometimes it’ll be a small update.

The blog’s name hasn’t been decided yet; the first post offers a vote on some ideas and invites suggestions in the comments. Jason also posted an announcement of the GET conference, and I posted a slightly edited version of my X-carrier status analysis.

Personal Genome Project Blog

Cystic Fibrosis treatment — sometimes the specific variant matters

Great news for cystic fibrosis research! And also an interesting demonstration of how two different variants causing the same disease can end up having different clinical consequences.

Drug bests cystic-fibrosis mutation (Nature News)

In GET-Evidence (our genome/variant review system) we currently score each variant’s clinical effect individually. This tends to cause a lot of redundant labor — most of these scores (severity, treatability, and penetrance) are the same for all variants causing that disease. As we expand the system, one suggestion has been to add disease pages and have all variants refer to those pages for clinical importance scores. Generally this sounds like a good idea.

This case is an exception that shows how variant-level information will still be important. In this case CFTR-G551D is found to respond much more effectively to this treatment than CFTR-F508Del. (The latter is the one causing disease in most patients, unfortunately.) This means CFTR-G551D would score more highly on “treatability” than CFTR-F508Del. Even though both cause the same serious disease, the treatability of that disease depends on the particular genetic variant.
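One way to picture the combination is disease-level default scores that an individual variant can override. The sketch below is hypothetical throughout: it is not GET-Evidence’s actual schema, the class names are invented, and the numbers are placeholders on a made-up scale chosen only to show the ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Disease:
    """Hypothetical disease page holding scores shared by its variants."""
    name: str
    scores: dict


@dataclass
class Variant:
    """Hypothetical variant page; `overrides` holds variant-specific scores."""
    name: str
    disease: Disease
    overrides: dict = field(default_factory=dict)

    def score(self, key):
        # Prefer the variant-level score, else fall back to the disease page.
        return self.overrides.get(key, self.disease.scores[key])


# Placeholder values, not real GET-Evidence scores.
cf = Disease("Cystic fibrosis", {"severity": 4, "treatability": 2, "penetrance": 4})
g551d = Variant("CFTR-G551D", cf, overrides={"treatability": 3})
f508del = Variant("CFTR-F508Del", cf)

for variant in (g551d, f508del):
    print(variant.name, variant.score("severity"), variant.score("treatability"))
```

Most variants would carry no overrides at all, which removes the redundant labor, while a case like CFTR-G551D keeps its own treatability score.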

The decline of the microarray

Via David Nusinow on Google+:

When can we expect the last damn microarray paper? (http://jermdemo.blogspot.com/2012/01/when-can-we-expect-last-damn-microarray.html)

2016 for the last microarray paper? Sounds a bit optimistic, but we can dream. Of course there will likely be niche uses for microarrays for a while yet (e.g. cheap linkage analysis in a pedigree, as in my last post…) — the analysis here is just counting usage of the word in the title (presumably indicating the technology is a major focus of the paper).