CC0 all the media



I’ve released as CC0 all the pictures I’ve created and shared on Wikimedia Commons. I’ve been thinking about doing this for a while; Aaron’s death and — more specifically — Nina Paley’s release of Sita Sings the Blues as CC0 have pushed me into doing it. I’ve encountered the same issues she has — people ask me for permission due to legal concerns when I don’t think they need to. In particular, my chemical structure of DNA diagram has been a popular item for textbooks.

I study body-books

Theo Sanderson has made a text editor that checks if a body of text complies with using only the 1,000 most common English words. This was inspired by XKCD’s “Up-Goer Five” — a description of the Saturn V rocket created according to this rule. It reads like a Simple Wikipedia article (but even more extreme).
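For the curious, the core check such an editor performs can be sketched in a few lines of Python. This is only an illustration, not how Theo's editor is actually written: the function name `uncommon_words` and the tiny `ALLOWED` set are made up for the example, and a real version would load a full 1,000-word list.

```python
import re

def uncommon_words(text, allowed):
    """Return words in `text` that are not in `allowed`, in order of first appearance."""
    flagged = []
    # Split on anything that is not a letter or apostrophe, so "body-books"
    # is checked as "body" and "books".
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in allowed and word not in flagged:
            flagged.append(word)
    return flagged

# Hypothetical stand-in for the real 1,000-word list.
ALLOWED = {"i", "study", "body", "books", "we", "can", "read", "them"}

print(uncommon_words("I study body-books. We can sequence them.", ALLOWED))
```

Treating the hyphen as a word break is what lets a coinage like "body-book" pass the check when both halves are common words.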

Anyway, I’ve seen a couple friends describe their job using this constraint, so I figured I’d try my hand at it. It’s surprisingly intelligible, and I think I like the kenning of “body-book” to describe a genome.

I study body-books

Children often have bodies like their parents. One reason this is true is because we each have parts that tell our bodies how to grow. We get these parts from our parents, and they can be read like a book. I study these body-books.

Some body-books have words that cause people to grow in the same way. But sometimes people are different — even if their body-books have the same words — and so I also study what things make bodies different even if their body-books are the same.

We are able to study our body-books more than ever, because we can now read them very easily.

Another important thing about body-books: we think it will be possible to learn a lot from someone’s body-book, even if we aren’t able to do it now. Also, with computers it’s very easy to share body-books — and it’s very hard to hide them after they’re shared. This means if people give their body-books so others can study them, they might share things they didn’t know about and didn’t mean to share.

So another part of my job is making sure people learn this might happen. We want to share body-books with everyone so that everyone can study them, but only people who know the fears should share their body-books.

What can we do in Aaron’s wake?

The brother of my friend Noah Swartz committed suicide last Friday. I didn’t know Noah’s brother Aaron, so these are the terms in which I relate to it. The Swartz family is close to many of my friends: Mako and Mika live in Aaron’s former apartment/offices, and I’ve met both of Noah’s brothers through them. Noah’s a quiet guy, but a geek in his own right: crazy good at strategy games and an occasional host for college radio.

Noah’s brother was Aaron Swartz. Aaron’s in the news a lot right now, and with good reason. He was brilliant and he was unfairly treated. The Swartz family and Aaron’s partner aren’t going to have a lot of privacy these days, but I’m not sure they want it. They’re angry and they want you to know that Aaron’s death wasn’t just about depression:

“Aaron’s death is not simply a personal tragedy. It is the product of a criminal justice system rife with intimidation and prosecutorial overreach. Decisions made by officials in the Massachusetts U.S. Attorney’s office and at MIT contributed to his death. The US Attorney’s office pursued an exceptionally harsh array of charges, carrying potentially over 30 years in prison, to punish an alleged crime that had no victims. Meanwhile, unlike JSTOR, MIT refused to stand up for Aaron and its own community’s most cherished principles.”

It is difficult to explain what Aaron was actually prosecuted for, what he was facing, and why it was horribly wrong. The best summary I’ve heard so far was aired this morning on WBUR: http://www.wbur.org/2013/01/15/swartz-attorney-ortiz

It’s hard to know where to go from here. Here are some ideas.

  • Ask MIT for an apology. It’s too little and too late, but those who loved Aaron would like to see MIT acknowledge that its involvement in his prosecution was wrong.

  • Dedicate yourself to publishing Open Access. If you are in academia, you know what this is about. Aaron was convinced that knowledge is power, and our publications are purportedly our efforts to share knowledge. You may also wish to share copies of your pdfs on the web, and there is a Twitter movement advocating this (#pdftribute). I should note while this is common it is also technically illegal — an act of civil disobedience, albeit on a much smaller scale than Aaron’s alleged and unrealized liberation of JSTOR archives.

  • Give to GiveWell. Aaron believed we have a moral obligation to help others in the most efficient manner possible. He personally worked for structural change (he was a genius, so he had a reasonable chance of accomplishing this), but he was also a strong believer in GiveWell and in doing the greatest good by contributing to the developing world. My husband Chris and I donate a significant fraction of our income each year to GiveWell, and the Swartz family has asked that donations made in Aaron’s memory be made to that organization.

Finally, here are articles and links if you’d like to learn more about Aaron. I present these in chronological order.

Well, That’s Ironic

I’m lucky and grateful to have been recommended by George Church for Genome Technology’s Seventh Annual Young Investigators. The profile they wrote — “Madeleine Price Ball: Free the Data” — is really nice. Or at least it was, if I recall correctly. I talked about how important it is for scientists to share information freely (in particular, human genome and interpretation data).

How ironic is it that it’s behind a subscription block?

I had mixed feelings about the interview, as I knew this would happen. At least the GenomeWeb account doesn’t cost anything. It does, however, require a password containing at least one of each of the following: uppercase character, lowercase character, number, and punctuation. And… it does this all over “http”, not “https”. Since GenomeWeb is apparently encouraging you to send one of your favorite super-secure passwords all around the internets in plaintext, I’m reluctant to recommend making an account there.

Celebrating Seven Years with Seven Percent

(This is a joint blog post with Chris.)

Today is Giving Tuesday. It’s a great idea. Here in the US, something feels odd about following our national day of giving thanks (Thanksgiving) with the consumerism of Black Friday, Small Business Saturday and Cyber Monday. As we shop to find gifts for those we love, we feel it’s also important to celebrate giving to those we don’t know, who need it most. We hope this post inspires others to give more and to celebrate giving.

For several years now we’ve celebrated our wedding anniversary by giving a percentage of our yearly pre-tax income to charity — a percentage determined by the number of years we’ve been married. This year that percentage is 7%. Our 7th anniversary was October 29th, but we’ve waited to hear from our favorite source for charity advice, GiveWell, to make their yearly recommendations. Luckily they did this yesterday, giving us the opportunity to post this today.

This year we are closely following GiveWell’s advice and giving 90% of the 7% to three charities: GiveDirectly, the Against Malaria Foundation (AMF), and the Schistosomiasis Control Initiative (SCI). (The remaining 10% will be decided later, and will probably be advocacy and other nonprofits that may not be highly effective, but are close to our hearts.)
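The arithmetic behind the split is simple, and a small Python sketch makes it easy to redo in later years. The names `YEARS_MARRIED`, `SHARES` and `percent_of_income` are made up for this example; the shares themselves are the ones stated in this post.

```python
from fractions import Fraction

YEARS_MARRIED = 7  # the percentage of pre-tax income given equals years married

# Shares of the total gift as stated in this post; the remaining 10% is undecided.
SHARES = {
    "GiveDirectly": Fraction(50, 100),
    "Against Malaria Foundation": Fraction(30, 100),
    "Schistosomiasis Control Initiative": Fraction(10, 100),
}

def percent_of_income(years_married, shares):
    """Express each charity's share of the gift as a percentage of annual income."""
    return {name: float(share * years_married) for name, share in shares.items()}

allocation = percent_of_income(YEARS_MARRIED, SHARES)
for name, pct in allocation.items():
    print(f"{name}: {pct:.1f}% of income")
print(f"Undecided: {YEARS_MARRIED - sum(allocation.values()):.1f}% of income")
```

Using `Fraction` for the shares keeps the multiplication exact, so floating-point rounding only appears when the values are converted for display.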


Loiturerei village, Kenya. Taken by UK DFID, CC-BY-SA.

50% to GiveDirectly (3.5% of our annual income)

GiveDirectly is GiveWell’s only new recommendation this year, and we think it’s one of the most interesting charities out there. Its method is simply this: find the poorest people in Kenya (here’s how they do that) and give them money through the M-PESA money network.

There are all kinds of reasons why simply giving money to poor people directly might not be the best we can do (they might spend it on something we’d rather they didn’t, for example) but it does avoid the money’s impact being diluted by corruption or overhead. More importantly, GiveDirectly will be quantifying how much it helps. They will follow up with the recipients over the next year — using a randomized control trial for which they’ve pre-published the survey and analysis plan.

We’re hopeful that better interventions exist than GiveDirectly. But we want their project to succeed because it shares the commitment to measuring outcomes that we think is vital, and it can serve as a baseline to compare other charities to in the future (i.e. “Can you do something that creates more improvement to lives than GiveDirectly? Prove it.”).

30% to Against Malaria Foundation (2.1% of our annual income)

AMF distributes insecticide-treated nets for protecting against malaria infection. GiveWell estimates the cost per life saved is just under $2,500. Malaria is not usually fatal, so a fair amount of disability due to illness is also being prevented.

10% to Schistosomiasis Control Initiative (0.7% of our annual income)

GiveWell thinks that SCI — which concentrates on the “Neglected Tropical Diseases” (usually worms/parasites) — offers an extremely effective intervention at improving DALYs (see below). This is because the infections they focus on are readily treatable using very inexpensive drugs, yet often come with debilitating symptoms that don’t quite kill the “host”.


“For You!” By Nomadic Lass, CC-BY-SA.

Donating effectively

It’s hard to list all the reasons people choose to give, or do not. One issue we’ve seen raised is the belief that “charity doesn’t work”. We believe that simply isn’t true. It may be true for some — many — perhaps most! Government-managed foreign aid especially so: it’s only around 1% of the US budget and mainly goes to political allies. But there are non-governmental charities that demonstrate real improvements, and GiveWell supports these. Giving can work, but it’s important to find effective giving opportunities.

And for that reason, we waited for GiveWell’s latest recommendations. GiveWell looks for organizations that maximize the improvement to lives caused by each dollar you’re giving. This seems like it should be uncontroversial, but it’s not yet common to think about giving this way. Perhaps one reason for this is that it requires a way to measure outcomes and compare them against each other, and that’s very difficult. GiveWell is doing a fantastic job trying to do this all the same, though, using tools like the Disability-Adjusted Life Year (which is a measure of health that’s better than just measuring how long people live), randomized controlled trials, and the kind of statistics knowledge you have when you’re a charity review organization that was founded by a bunch of ex-quants. (A Businessweek article referred to GiveWell as Hedge Fund Analytics for Nonprofits.)

A second reason people are sometimes reluctant to think about donating effectively in this way is that for most of us, it’s going to involve donating to people far away instead of in our local communities. The price of living here in Boston, MA is very high, both for rent and food — in contrast, more than a third of the people in the world live on less than USD $2/day (most people don’t realize that this number is adjusted for the purchasing power of goods and services in the US!). When trying to decide whether to donate locally or globally, it’s clear that our money can do much more good in other countries than here in the US.

A third reason that people are reluctant to give to maximize outcomes is that we don’t have the same emotional connection to people across the world as we do to an individual call for help from someone we can see. Counter-intuitively, studies such as this one show that people have a strong bias towards giving more money to help a single identifiable victim than to help many “statistical” victims. The Internet has helped to reduce the effects of this emotional bias, with sites like Kiva giving a name and face to the global poor. Perhaps GiveDirectly could benefit from adopting a Kiva-style interface itself.

Closing thoughts

Each year we ratchet up the amount we give, and this year has brought us a new financial development: our first child. When people learn about our annual tradition they wonder how it will scale — will we be doing this on our 20th? Our 50th? Our 101st? (We hope to have that last problem!) As Yogi Berra said, “It’s tough to make predictions, especially about the future.” We know the responsibilities of parenthood will demand more of our finances, and balancing that with wanting to help others will be a lifetime project. Tithing (10%) is a very common tradition, and we want to at least reach that. Maybe we can go beyond it. For now we’ll take it one step at a time, and try to give a little more each year.

Personal Genome Project talk at 2012 Open Science Summit

Finally I have a video to point people to if they’re at all curious about what I work on.

This is a talk about the Personal Genome Project that I gave at the 2012 Open Science Summit. It’s an overview of the PGP’s motivations and goals, with updates on recent progress.

Because I was the last speaker before an already-delayed lunch, it’s fairly fast-paced — the talk itself is only 12 minutes long. Hope you enjoy it!

Phineas and Name Uniqueness

It’s been a while since I posted to this personal blog — so long, in fact, that I have had a child! We named him “Phineas Charles Ball”. (Photos are on Flickr.) “Phineas” is a fairly unusual name — although it’s become more familiar lately — and this post is my exploration on how “weird” this name actually is, and how name uniqueness trends have been developing over time.

As many of you already know, one of the most useful sources for analyzing baby name trends in the United States is the baby name data published by the Social Security Administration. These data have become especially high quality as social security numbers have become ubiquitous (at this point almost all children acquire one at birth). What you might not have realized is that some great raw data files are also available that go beyond what the website provides. The only limitation in these is that names used fewer than five times in a given year are not reported (for privacy reasons).

The first thing I wanted to plot was what most of us have noticed — qualitatively if not quantitatively — names have been becoming more unique. First I calculated the diversity as Shannon entropy. (I did a bit of a hack though: because I was limited to names seen 5 or more times, I only calculated the entropy of the most common 90% of names in a given year. This was close to the maximum possible — by 2011 nearly 1 in 10 girls has a name seen less than five times!)
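For readers who want to try this themselves, here is a minimal Python sketch of that kind of calculation. It is an illustration rather than the script used for the plots: the function name and arguments are made up, the total number of births is assumed to be known separately (it includes the unreported rare names), and renormalizing the probabilities over the kept names is one reasonable reading of the hack, not a confirmed detail.

```python
import math

def entropy_of_top_fraction(name_counts, total_births, fraction=0.9):
    """Shannon entropy (in bits) over the most common names covering `fraction` of births."""
    target = fraction * total_births
    kept, covered = [], 0
    # Walk names from most to least common until the target coverage is reached.
    for count in sorted(name_counts.values(), reverse=True):
        if covered >= target:
            break
        kept.append(count)
        covered += count
    # Assumption: probabilities are renormalized over the kept names so they sum to 1.
    return -sum((c / covered) * math.log2(c / covered) for c in kept)

# Toy data: 100 births, 5 of which have names too rare to be reported.
counts = {"A": 40, "B": 40, "C": 10, "D": 5}
print(entropy_of_top_fraction(counts, total_births=100))
```

A year where every child shares one name gives an entropy of zero, and the value grows as births spread more evenly across more names.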

Another way to slice this data is to try to answer this question: “How many names are needed to cover half the population?” (Or 10%. Or 90%.)

In 1950 you could cover half the male population with just 24 names — in 2010 you needed 139. As a child I remember sadly eyeing prelabeled personalized souvenirs, knowing I wouldn’t find my name among the items. (This is especially true because my first name isn’t the most common spelling.) Selling this sort of prelabeled paraphernalia has become a lot more difficult — many more names are needed to cover the same fraction of the population!
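The "how many names cover half the population" question reduces to a running total over names sorted by popularity. A minimal Python sketch, with a made-up function name and toy counts standing in for one year and sex of the real data:

```python
def names_needed(name_counts, total_births, fraction=0.5):
    """Count how many of the most common names are needed to cover `fraction` of births.

    Returns None if the reported names never reach the target, which can happen
    because names used fewer than five times are missing from the data.
    """
    target = fraction * total_births
    covered = 0
    ranked = sorted(name_counts.values(), reverse=True)
    for n, count in enumerate(ranked, start=1):
        covered += count
        if covered >= target:
            return n
    return None

# Toy data: 100 births, 20 of which have unreported rare names.
counts = {"James": 30, "John": 25, "Robert": 20, "Phineas": 5}
print(names_needed(counts, total_births=100))        # names needed for half
print(names_needed(counts, total_births=100, fraction=0.9))
```

Passing the total number of births separately matters: dividing by the sum of the reported counts would ignore everyone with an unreported name and make names look less diverse than they are.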

Some observations…

  1. Name uniqueness hasn’t been increasing monotonically. Names seem to have become slightly less unique between 1910 and 1950. After 1950 uniqueness increased, and really took off in the mid-1980s.
  2. Girl names are more unique than boy names (you probably already noticed this). It may be interesting to note that boy names today are as unique as girl names were in the early 1990s.
  3. You should take the early data with a grain of salt: the total applicant data shows that not all US citizens received social security numbers (SSNs), and especially few of those born before 1910 did. The program was created in 1935 and the legal uses of SSNs expanded gradually.

So Phineas’s name occurs in a context of increasing uniqueness: to have a rare name now is more common than it was when I was born, and much more common than when my parents were born. This particular name also happens to have become more popular lately. When we slice the data we find that in the latest years the uniqueness of “Phineas” is near 80th percentile — one in five boys has a rarer name. It’s a bit unusual, but it’s not a dramatic outlier.
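One way to put a number on "how many children have a rarer name" is sketched below in Python. This is an assumed definition, not necessarily the one behind the figure above: the function name is made up, ties count as "not rarer", and every birth missing from the reported counts is treated as having a rarer name.

```python
def share_with_rarer_name(name, name_counts, total_births):
    """Fraction of births whose name is strictly rarer than `name`."""
    own = name_counts[name]
    # Births with a name at least as common as this one (including the name itself).
    at_least_as_common = sum(c for c in name_counts.values() if c >= own)
    # Everything else is rarer, including births with unreported names.
    return (total_births - at_least_as_common) / total_births

# Toy data: 100 births, 10 of which have unreported rare names.
counts = {"Jacob": 60, "Phineas": 20, "Zed": 10}
print(share_with_rarer_name("Phineas", counts, total_births=100))
```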

I’ll close with a list of famous Phineases: Phineas Gage (a famous case of frontal brain damage), Finny in “A Separate Peace”, P.T. Barnum (P. = Phineas!), and Phineas Flynn from the cartoon “Phineas and Ferb”. Also oft misremembered as Phineas: Phileas Fogg in Jules Verne’s “Around the World in Eighty Days”. Chris’s favorite find is Phineas Ball (1824-1894), waterworks engineer and mayor of Worcester, MA.

23andme’s First Patent

Update, June 1: 23andme has added an addendum to their announcement. In particular, the addendum clarifies and seems to promise that the patent will not be enforced with respect to performing interpretations: “Other entities can present information about the genetic associations covered in our patents without licensing fees.” This is reassuring news and it’s great to see 23andme outline such a limitation on patent enforcement! It allays my feared hypothetical situation regarding a “swiss cheese” effect on genome interpretation efforts (described below). — Madeleine


This morning I noticed a post from 23andme’s blog last night: Anne Wojcicki announced that 23andme expects to be awarded its first patent today. It touched on a lot of issues I care about, so I’ve written this personal post in response to it.

From what I understand, the 23andme patent seems to be a patent on genetic variant interpretation: specifically, on the interpretation of some variants (including one in the gene SGK1) as being associated with differences in an individual’s risk of developing Parkinson’s disease. Technical methods for determining the variants are listed, but they seem to be an enumeration of all extant methods for assessing genetic variants (including techniques used in whole genome sequencing).

In other words: this seems to be a patent regarding the reporting and usage of an observation that a naturally-occurring genetic variant is associated with a particular trait. As noted by 23andme’s announcement, these patents are controversial.

Patent Wars

While my first love is genetics, I am also a programmer — and in software, patents are very broadly hated by programmers. This American Life has an excellent episode documenting the tangled mess that is the software patent industry. It has become an arms race; even the most well-intentioned companies feel obligated to build up patent arsenals. Software patents are a different beast to biotechnology patents, but in some ways larger issues remain true: applied too broadly, in a field of rapid progress, patents have the potential to create a tangled web of litigation. The intended purpose of patents to protect innovation and encourage commercialization through exclusive access to innovation has instead become outright warfare.

A web of litigation in the mobile phone industry. ©2010 George Kokkinidis / Design Language, used with permission

I worry that this vision of patent warfare could exist in the realm of genome interpretation. The multitude of patents on the meaning of genetic variants seems to make the process of whole genome interpretation almost impossibly hazardous. I think it is vital to everybody that we are able to not merely return your “A’s, C’s, G’s, and T’s”, but also give you explanations like “you have A here, and according to these studies this means you are much less likely to be infected with stomach flu”. Will each one of those explanations run the risk of violating a patent? Will genome interpretations become like Swiss cheese as they must carefully avoid mentioning each of the patented genes (which are possibly the most important ones)? Is part of 23andme’s purpose here to build up its own arsenal of interpretations, as both defense and weapon against other interpretation efforts?

Will patents on the observed associations of genetic variants turn whole genome interpretation efforts into swiss cheese? Image credit: Madeleine Price Ball, CC-BY-SA

23andme is far from the first in this field (there are hundreds or thousands of patents like this one) and it is possible that they have no intention to engage in such wars. Nevertheless, as far as I am aware they have not released an assurance that the patent will not be used in this way (of course, neither has anyone else). In the software industry, some groups have made assurances regarding their patents — promises that the patents will only be used for defensive purposes (e.g. Twitter) or limits on their offensive uses (e.g. Red Hat). That said, such promises are easily broken.

Also troubling to me is the exact wording in the announcement itself:

“We believe patents should not be used to obstruct research or prevent individuals from knowing what’s in their genome. We believe that everyone has a right to know their genomes — their sequence of As, Ts, Cs, and Gs — and should be able to access them should they want to. This has been our guiding principle since day one, and 23andMe has pioneered the ability for individuals to have unfettered access to their genomes.”

I’m reading between the lines, but… if access to your genome means that you only have access to the uninterpreted sequence of A’s, T’s, C’s, and G’s — a completely unintelligible mess to the vast majority of humanity — then I think that falls short of “unfettered access”.

Patenting Nature

There is an important difference between software patents and gene interpretation patents. While software is clearly the product of design (hence the term “software engineer”), patents on the interpretations of genes are the product of discovery. Indeed, the word “discovery” dominates 23andme’s own announcement of the patent. As that announcement noted, whether this is patentable material is the subject of hot debate. Is this patenting a “law of nature”? While using the laws of nature is fundamental to any process, patent law has held that the “laws of nature” themselves are not patentable.

I am a researcher and not a lawyer, but I’ll try to summarize my understanding of the recent “Prometheus” case referenced by 23andme’s announcement. In a unanimous decision, the Supreme Court struck down the patentability of the act of monitoring the levels of a drug metabolite (the product of the drug as the body breaks it down) and the use of this information to adjust dosage of that drug. This correlation was held to be a “law of nature”, and therefore unpatentable. Some phrases from the decision that stood out to me were these:

“But to transform an unpatentable law of nature into a patent eligible application of such a law, a patent must do more than simply state the law of nature while adding the words ‘apply it.’”

“… the claimed processes are not patentable unless they have additional features that provide practical assurance that the processes are genuine applications of those laws rather than drafting efforts designed to monopolize the correlations.”

Patenting the observed naturally-occurring traits associated with a naturally occurring genetic variant strikes me as a very similar “law of nature”. Perhaps even moreso — at least the drug itself was some level of non-natural engineering? This is far from resolved, however. The more relevant case — the “Myriad” case regarding a patent on BRCA variants and their associations with breast cancer risk — has been remanded to the Federal Circuit for reconsideration in light of the Prometheus case. I am optimistic that the act of reading and interpreting genetic variants will be held to be non-patentable, and that all my worries written here will be moot and forgotten …. but this remains to be seen.

Cashing In On Crowdsourcing?

The discoveries made by 23andme have come from their “23andWe” program — a crowdsourcing of scientific research. A recent Nature Reviews Genetics article describes such programs as “participant centered initiatives” — “tools, programs and projects that empower participants to engage in the research process”. Crowdsourcing is a powerful tool to rapidly meet a goal, and an exciting consequence of the internet’s transformational facilitation of connecting and communicating. But it holds some darker questions: to what extent does such a program exist to benefit the participant — and to what extent is the participant used as a resource to benefit the organization? Although the lines might be fuzzy to draw, the ownership and profit from user-generated data has become a clear motivation for companies (c.f. Facebook).

The Personal Genome Project has a lot of overlap with 23andWe in style. We want to collect similar information from participants — we ask people (if they are willing) to share information regarding their health and traits, as well as genome data. But there is also a key difference between the two projects: we do not hold this data privately for our own research. We release the data publicly for all others to see, and this is something we are uniquely able to do due to our open consent process. We want everyone — including our participants — to have as much access to the data as we do, and the same potential to make interesting discoveries.

As such, I see Personal Genome Project participants as very much our “peers” in this research endeavor. For this reason I prefer to use the phrase “peer production” rather than “crowdsourcing” to describe some aspects of our work (a term that can also be applied to projects like Linux and Wikipedia): not merely a project that solicits participant contributions, but one that genuinely shares those contributions as freely as possible.

A Python GEDCOM Parser

Excited by my discovery of Mayflower ancestry (or perhaps by the apparent confirmation that my genealogy records weren’t totally made up), I decided to contact other individuals on 23andme who were predicted to share DNA fragments with me and seek out other cases of family overlap.

The task quickly became daunting! On 23andme there aren’t really any tools; the method of choice appears to be listing all surnames in one’s ancestry. The GEDCOM-format genealogy file my father has documented is huge: I currently have 384 ancestors in the document (and 163 surnames). Other genealogy buffs have similarly deep information, and I slowly realized that manually searching for overlaps between our lists was not at all practical.

My first “quick and dirty” attempt was to grep the file for last name matches. Little did I realize there are actually 1,547 individuals in my file! People who are not my direct ancestors (cousins and their spouses and children) are listed as well. On one hand this was really cool, more data is better… but on the other hand it meant a lot more thought was required.

To cut a long story short, I ended up finding an old GPL-licensed Python GEDCOM parser (linked here as “GEDCOM Parser”). I extensively improved it (in my humble opinion) and have uploaded the code to GitHub as “python-gedcom”. The end result was a module I can use to pull out direct ancestors, search on last name matches, and return the path between me and a given ancestor.
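To give a feel for the two tree operations, here is a minimal Python sketch that works on a plain dictionary instead of a parsed GEDCOM file. It does not use or reproduce the python-gedcom API: the `PARENTS` data, the people in it and both function names are hypothetical, made up for illustration.

```python
from collections import deque

# Hypothetical toy pedigree: each person maps to a list of known parents.
PARENTS = {
    "me": ["father", "mother"],
    "father": ["paternal grandmother"],
    "paternal grandmother": ["great-grandfather"],
    "cousin": ["aunt"],  # in the file, but not a direct ancestor of "me"
}

def direct_ancestors(parents, start):
    """Collect everyone reachable from `start` by following parent links."""
    found = set()
    stack = list(parents.get(start, []))
    while stack:
        person = stack.pop()
        if person not in found:
            found.add(person)
            stack.extend(parents.get(person, []))
    return found

def path_to_ancestor(parents, start, ancestor):
    """Breadth-first search upward; returns the chain of people, or None."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == ancestor:
            return path
        for parent in parents.get(path[-1], []):
            queue.append(path + [parent])
    return None

print(sorted(direct_ancestors(PARENTS, "me")))
print(path_to_ancestor(PARENTS, "me", "great-grandfather"))
```

Filtering to direct ancestors first is what removes the cousins and in-laws that a plain grep over the file would also match.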

Applying this to a new 23andme match (who also had 160 surnames!) I found 27 potential surname matches among my ancestors, all in the New England area. (This might simply reflect that New England is the most extensively documented region in my tree.) I sent my distant relative this list of names, along with dates & locations of birth & death (where available).
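Once both sides are reduced to surname lists, finding candidates is a set intersection. A minimal Python sketch, assuming matching that ignores case and surrounding whitespace; the function name and the second list are made up for illustration, and spelling variants would need extra handling.

```python
def shared_surnames(mine, theirs):
    """Return surnames from `mine` that also appear in `theirs`, ignoring case."""
    normalized = {name.strip().lower() for name in theirs}
    return sorted(name for name in set(mine) if name.strip().lower() in normalized)

my_surnames = ["Price", "Arms", "Aitken", "Wales", "Strong", "Stoughton"]
their_surnames = ["stoughton", "Smith", "STRONG "]  # hypothetical list from a match

print(shared_surnames(my_surnames, their_surnames))
```

A shared surname is only a lead: the dates and locations still have to be compared by hand to confirm that both lists point to the same person.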

From that, he found one definite overlap. Here’s the path from me to that ancestor, nine generations distant:

Gen 0  Madeleine Emily Price
Gen 1 . Paul Arms Price
Gen 2 .. Doris Madeline Arms
Gen 3 ... Howard William Arms
Gen 4 .... Jane Aitken
Gen 5 ..... Eliza Wales
Gen 6 ...... John Wales
Gen 7 ....... Lucy Strong
Gen 8 ........ Martha Stoughton
Gen 9 ......... John Stoughton
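A listing in this style takes only a few lines to produce once the path is a list of names. This is a sketch with a made-up function name, not the code that generated the output above:

```python
def format_path(names):
    """Format a descendant-to-ancestor path, one line per generation."""
    # One dot per generation gives the indented look; generation 0 has no dots.
    return [f"Gen {gen} {'.' * gen} {name}" for gen, name in enumerate(names)]

path = ["Madeleine Emily Price", "Paul Arms Price", "Doris Madeline Arms"]
for line in format_path(path):
    print(line)
```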

To be fair, I think it’s possible (even probable) that given our shared New England ancestry we have other points of overlap that we didn’t discover. I’m really pleased, though, at how tractable this task became once a program was applied: programming is a useful skill to have!

The Invisible Privilege of Not Being a Black Man

Mostly I post about science, but there was a very good Morning Edition item I’d like to share. It aired a couple days ago in the wake of the tragedy of Trayvon Martin’s death. I think it did an excellent job of discussing the topic, avoiding (justifiable) anger to provoke simple empathy. Because it’s a very emotionally-laden topic, I highly recommend you actually listen to it, not just read the transcript:



A mom’s advice to her young black sons


In light of the shooting death of Florida teen Trayvon Martin, Steve Inskeep speaks with writer Donna Britt and her sons Justin and Darrell Britt-Gibson about how she prepared them as young black men for a world that might view them with suspicion.

Yesterday, in his first time commenting on the case, Obama said: “If I had a son, he’d look like Trayvon.” That cuts to the painful core of this. People are more likely to believe a black man is dangerous and criminal. It’s not limited to “The South”: mistaken assumptions of criminality about minorities happen in Boston and London. The consequences range from frustrating to horrifying.

The phrase “invisible privilege” has been used to describe a benefit that people on the favored side of a social divide aren’t usually conscious of. The story of Trayvon’s death brings my attention to a privilege I’m usually unaware of: I don’t have to live with this fear — the fear that my son or brother or husband could be mistaken for a dangerous criminal… and die for that mistake.