Archive for the 'Reproducible Research' Category

The nature of science in 2051

Right now scientific questions are chosen for study in a largely autocratic way. Typically grants for research on particular questions come from federal funding agencies, and scientists competitively apply with the money going to the chosen researcher via a peer review process.

I suspect, as the tools of online science become increasingly available, the real questions people face in their day to day lives will be more readily answered. If you think about all the things you do and decisions you make in a day, many of them don’t have a strong empirical basis. How you wash the dishes or do laundry, what foods are healthy, what environment to maintain in your house, what common illness remedies work best, who knows, but these types of questions, the ones that occur to you as you go about your daily business, aren’t prioritized in the investigatory model we have now for science. I predict that scientific investigation as a whole, not just that that is government funded, will move substantially toward providing answers to questions of local importance.

Regulatory steps toward open science and reproducibility: we need a science cloud

This past January Obama signed the America COMPETES Re-authorization Act. It contains two interesting sections that advance the notions of open data and the federal role in supporting online access to scientific archives: 103 and 104, which read in part:

• § 103: “The Director [of the Office of Science and Technology Policy at the Whitehouse] shall establish a working group under the National Science and Technology Council with the responsibility to coordinate Federal science agency research and policies related to the dissemination and long-term stewardship of the results of unclassified research, including digital data and peer-reviewed scholarly publications, supported wholly, or in part, by funding from the Federal science agencies.” (emphasis added)

This is a cause for celebration insofar as Congress has recognized that published articles are an incomplete communication of computational scientific knowledge, and the data (and code) must be included as well.

• § 104: Federal Scientific Collections: The Office of Science and Technology Policy “shall develop policies for the management and use of Federal scientific collections to improve the quality, organization, access, including online access, and long-term preservation of such collections for the benefit of the scientific enterprise.” (emphasis added)

I was very happy to see the importance of online access recognized, and hopefully this will include the data and code that underlies published computational results.

One step further in each of these directions: mention code explicitly and create a federally funded cloud not only for data but linked to code and computational results to enable reproducibility.

Generalize clinicaltrials.gov and register research hypotheses before analysis

Stanley Young is Director of Bioinformatics at the National Institute for Statistical Sciences, and gave a talk in 2009 on problems in modern scientific research. For example: 1 in 20 NIH-funded studies actually replicates; closed data and opacity; model selection for significance; multiple comparisons.. Here is the link to his talk: Everything Is Dangerous: A Controversy. There are a number of good examples in the talk and Young anticipates and is more intellectually coherent than the New Yorker article The Truth Wears Off if you were interested in that.

Idea: Generalize clinicaltrials.gov, where scientists register their hypotheses prior to carrying out their experiment. Why not do this for all hypothesis tests? Have a site where the hypotheses are logged and time stamped before researchers gather the data or carry out the actual hypothesis testing for the project. I’ve heard this idea mentioned occasionally and both Young and Lehrer mentions it as well.

Open peer review of science: a possibility

The Nature journal Molecular Systems Biology published an editorial “From Bench to Website” explaining their move to a transparent system of peer review. Anonymous referee reports, editorial decisions, and author responses are published alongside the final published paper. When this exchange is published, care is taken to preserve anonymity of reviewers and to not disclose any unpublished results. Authors also have the ability to opt out and request their review information not be published at all.

Here’s an example of the commentary that is being published alongside the final journal article.

Their move follows on a similar decision taken by The EMBO Journal (European Molecular Biology Organization) as described in an editorial here where they state that the “transparent editorial process will make the process that led to acceptance of a paper accessible to all, as well as any discussion of merits and issues with the paper.” Their reasoning cites problems in the process of scientific communication and they give an example by Martin Raff which was published as a letter to the editor called “Painful Publishing” (behind a paywall, apologies). Raff laments the power of the anonymous reviewers to demand often unwarranted additional experimentation as a condition of publication: “authors are so keen to publish in these select journals that they are willing to carry out extra, time consuming experiments suggested by referees, even when the results could strengthen the conclusions only marginally. All too often, young scientists spend many months doing such ‘referees’ experiments.’ Their time and effort would frequently be better spent trying to move their project forward rather than sideways. There is also an inherent danger in doing experiments to obtain results that a referee demands to see.”

Rick Trebino, physics professor at Georgia Tech, penned a note detailing the often incredible steps he went through in trying to publish a scientific comment: “How to Publish a Scientific Comment in 1 2 3 Easy Steps.” It describes deep problems in our scientific discourse today. The recent clinical trials scandal at Duke University is another example of failed scientific communication. Many efforts were made to print correspondences regarding errors in published papers that may have permitted problems in the research to have been addressed earlier.

The editorial in Molecular Systems Biology also announces that the journal is joining many others in adopting a policy of encouraging the upload of the data that underlies results in the paper to be published alongside the final article. They go one step further and provide links from the figure in the paper to its underlying data. They give an example of such linked figures here. My question is how this dovetails with recent efforts by Donoho and Gavish to create a system of universal figure-level identifiers for published results, and the work of Altman and King to design Universal Numerical Fingerprints (UNFs) for data citation.

My Symposium at the AAAS Annual Meeting: The Digitization of Science

Yesterday I held a symposium at the AAAS Annual Meeting in Washington DC, called “The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer,” that was intended to bring attention to how massive computation is changing the practice of science, particularly the lack of reproducibility of published computational scientific results. The fact is, most computational scientific results published today are unverified and unverifiable. I’ve created a page for the event here, with links to slide decks and abstracts. I couldn’t have asked for a better symposium, thanks to the wonderful speakers.

The first speaker was Keith A. Baggerly, who (now famously) tried to verify published results in Nature Medicine and uncovered a series of errors that led to the termination of clinical trials at Duke that were based on the original findings, and the resignation of one of the investigators (his slides). I then spoke about policies for realigning the IP framework scientists are under with their longstanding norms, to permit sharing of code and data (my slides). Fernando Perez described how computational scientists can learn about not only code sharing, quality control, and project management from the Open Source Software, but how they have in fact developed what is in effect a deeply successful system of peer review for code. Code is verified line by line before incorporated into the project, and there are software tools to enable the communication between reviewer and submitted, down to the line of code (his slides).

Michael Reich then presented GenePattern, an OS independent tool developed with Microsoft for creating data analysis pipelines and incorporating them into a Word doc. Once in the document, tools exist to click and recreate the figure from the pipeline and examine what’s been done to the data. Robert Gentlemen advocated the entire research paper as the unit of reproducibility, and David Donoho presented a method for assigning a unique identifier to figures within the paper, that creates a link for each figure and permits its independent reproduction (the slides). The final speaker was Mark Liberman, who showed how the human language technology community had developed a system of open data and code in their efforts to reduce errors in machine understanding of language (his slides). All the talks pushed on delineations of science from non-science, and it was probably best encapsulated with a quote Mark introduced from John Pierce, a Bell Labs executive in 1969, how “To sell suckers, one uses deceit and offers glamor.”

There was some informal feedback, with a prominent person saying that this session was “one of the most amazing set of presentations I have attended in recent memory.” Have a look at all the slides and abstracts, including links and extended abstracts.

Update: Here are some other blog posts on the symposium: Mark Liberman’s blog and Fernando Perez’s blog.

Letter Re Software and Scientific Publications – Nature

Mark Gerstein and I penned a reaction to two pieces published in Nature News last October, “Publish your computer code: it is good enough,” by Nick Barnes and “Computational Science…. Error” by Zeeya Merali. Nature declined to publish our note and so here it is.

Dear Editor,

We have read with great interest the recent pieces in Nature about the importance of computer codes associated with scientific manuscripts. As participants in the Yale roundtable mentioned in one of the pieces, we agree that these codes must be constructed robustly and distributed widely. However, we disagree with an implicit assertion, that the computer codes are a component separate from the actual publication of scientific findings, often neglected in preference to the manuscript text in the race to publish. More and more, the key research results in papers are not fully contained within the small amount of manuscript text allotted to them. That is, the crucial aspects of many Nature papers are often sophisticated computer codes, and these cannot be separated from the prose narrative communicating the results of computational science. If the computer code associated with a manuscript were laid out according to accepted software standards, made openly available, and looked over as thoroughly by the journal as the text in the figure legends, many of the issues alluded to in the two pieces would simply disappear overnight.

The approach taken by the journal Biostatistics serves as an exemplar: code and data are submitted to a designated “reproducibility editor” who tries to replicate the results. If he or she succeeds, the first page of the article is kitemarked “R” (for reproducible) and the code and data made available as part of the publication. We propose that high-quality journals such as Nature not only have editors and reviewers that focus on the prose of a manuscript but also “computational editors” that look over computer codes and verify results. Moreover, many of the points made here in relation to computer codes apply equally well to large datasets that underlie experimental manuscripts. These are often organized, formatted, and deposited into databases as an afterthought. Thus, one could also imagine a “data editor” who would look after these aspects of a manuscript. All in all, we have to come to the realization that current scientific papers are more complicated than just a few thousand words of narrative text and a couple of figures, and we need to update journals to handle this reality.

Yours sincerely,

Mark Gerstein (1,2,3)
Victoria Stodden (4)

(1) Program in Computational Biology and Bioinformatics,
(2) Department of Molecular Biophysics and Biochemistry, and
(3) Department of Computer Science,
Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520 Mark.Gerstein@Yale.edu

(4) Department of Statistics, Columbia University, 1255 Amsterdam Ave, New York, NY 10027
vcs@stodden.net

Open Data Dead on Arrival

In 1984 Karl Popper wrote a private letter to an inquirer he didn’t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that:

“Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his fellow men (or ‘to society’) to represent the results of his study as simply, clearly and modestly as he can. The worst thing that intellectuals can do — the cardinal sin — is to try to set themselves up as great prophets vis-a-vis their fellow men and to impress them with puzzling philosophies. Anyone who cannot speak simply and clearly should say nothing and continue to work until he can do so.”

Aside from the offensive sexism in referring to intellectuals as males, there is another way this imperative should be updated for intellectualism today. The movement to make data available online is picking up momentum — as it should — and open code is following suit (see http://mloss.org for example). But data should not be confused with facts, and applying the simple communication that Popper refers to beyond the written or spoken word is the only way open data will produce dividends. It isn’t enough to post raw data, or undocumented code. Data and code should be considered part of intellectual communication, and made as simple as possible for “fellow men” to understand. Just as knowledge of adequate English vocabulary is assumed in the nonquantitative communication Popper refers to, certain basic coding and data knowledge can be assumed as well. This means the same thing as it does in the literary case; the elimination of extraneous information and obfuscating terminology. No need to bury interested parties in an Enron-like shower of bits. It also means using a format for digital communication that is conducive to reuse, such as a flat text file or another non-proprietary format, for example pdf files cannot be considered acceptable to either data or code. Facilitating reproducibility must be the gold standard for data and code release.

And who are these “fellow men”?

Well, fellow men and women that is, but back to the issue. Much of the history of scientific communication has dealt with the question of demarcation of the appropriate group to whom the reasoning behind the findings would be communicated, the definition of the scientific community. Clearly, communication of very technical and specialized results to a layman would take intellectuals’ time away from doing what they do best, being intellectual. On the other hand some investment in explanation is essential for establishing a finding as an accepted fact — assuring others that sufficient error has been controlled for and eliminated in the process of scientific discovery. These others ought to be able to verify results, find mistakes, and hopefully build on the results (or the gaps in the theory) and thereby further our understanding. So there is a tradeoff. Hence the establishment of the Royal Society for example as a body with the primary purpose of discussing scientific experiments and results. Couple this with Newton’s surprise, or even irritation, at having to explain results he put forth to the Society in his one and only journal publication in their journal Philosophical Transactions (he called the various clarifications tedious, and sought to withdraw from the Royal Society and subsequently never published another journal paper. See the last chapter of The Access Principle). There is a mini-revolution underfoot that has escaped the spotlight of attention on open data, open code, and open scientific literature. That is, the fact that the intent is to open to the public. Not open to peers, or appropriately vetted scientists, or selected ivory tower mates, but to anyone. Never before has the standard for communication been “everyone,” in fact quite the opposite. Efforts had traditionally been expended narrowing and selecting the community privileged enough to participate in scientific discourse.

So what does public openness mean for science?

Recall the leaked files from the University of East Anglia’s Climatic Research Unit last November. Much of the information revealed concerned scientifically suspect (and ethically dubious) attempts not to reveal data and methods underlying published results. Although that tack seems to have softened now some initial responses defended the climate scientists’ right to be closed with regard to their methods due to the possibility of “denial of service attacks” – the ripping apart of methodology (recall all science is wrong, an asymptotic progression toward to truth at best) not with the intent of finding meaningful errors that halt the acceptance of findings as facts, but merely to tie up the climate scientists so they cannot attend to real research. This is the same tradeoff as described above. An interpretation of this situation cannot be made without the complicating realization that peer review — the review process that vets articles for publication — doesn’t check computational results but largely operates as if the papers are expounding results from the pre-computational scientific age. The outcome, if computational methodologies are able to remain closed from view, is that they are directly vetted nowhere. Hardly an acceptable basis for establishing facts. My own view is that data and code must be communicated publicly with attention paid to Popper’s admonition: as simply and clearly as possible, such that the results can be replicated. Not participating in dialog with those insufficiently knowledgable to engage will become part of our scientific norms, in fact this is enshrined in the structure of our scientific societies of old. Others can take up those ends of the discussion, on blogs, in digital forums. But public openness is important not just because taxpayers have a right to what they paid for (perhaps they do, but this quickly falls apart since not all the public are technically taxpayers and that seems a wholly unjust way of deciding who shall have access to scientific knowledge and who not, clearly we mean society), but because of the increasing inclusiveness of the scientific endeavor. How do we determine who is qualified to find errors in our scientific work? We don’t. Real problems will get noticed regardless of with whom they originate, many eyes making all bugs shallow. And I expect peer review for journal publishing to incorporate computational evaluation as well.

Where does this leave all the open data?

Unused, unless efforts are expended to communicate the meaning of the data, and to maximize the usability of the code. Data is not synonymous with facts – methods for understanding data, and turning its contents into facts, are embedded within the documentation and code. Take for granted that users understand the coding language or basic scientific computing functions, but clearly and modestly explain the novel contributions. Facilitate reproducibility. Without this data may be open, but will remain de facto in the ivory tower.

Ars technica article on reproducibility in science

John Timmer wrote an excellent article called “Keeping computers from ending science’s reproducibility.” I’m quoted in it. Here’s an excellent follow up blog post by Grant Jacobs, “Reproducible Research and computational biology.”

Code Repository for Machine Learning: mloss.org

The folks at mloss.org — Machine Leaning Open Source Software — invited a blog post on my roundtable on data and code sharing, held at Yale Law School last November. mloss.org’s philosophy is stated as:

“Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for a wide range of applications. Inspired by similar efforts in bioinformatics (BOSC) or statistics (useR), our aim is to build a forum for open source software in machine learning.”

The site is excellent and worth a visit. The guest blog Chris Wiggins and I wrote starts:

“As pointed out by the authors of the mloss position paper [1] in 2007, “reproducibility of experimental results is a cornerstone of science.” Just as in machine learning, researchers in many computational fields (or in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]

In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.”

keep reading at http://mloss.org/community/blog/2010/jan/26/data-and-code-sharing-roundtable/. We made an effort to reference efforts in other fields regarding reproducibility in computational science.

Video from "The Great Climategate Debate" held at MIT December 10, 2009

This is an excellent panel discussion regarding the leaked East Anglia docs as well as standards in science and the meaning of the scientific method. It was recorded on Dec 10, 2009, and here’s the description from the MIT World website: “The hacking of emails from the University of East Anglia’s Climate Research Unit in November rocked the world of climate change science, energized global warming skeptics, and threatened to derail policy negotiations at Copenhagen. These panelists, who differ on the scientific implications of the released emails, generally agree that the episode will have long-term consequences for the larger scientific community.”

Moderator: Henry D. Jacoby, Professor of Management, MIT Sloan School of Management, and Co-Director, Joint Program on the Science and Policy of Global Change, MIT.

Panelists:
Kerry Emanuel, Breene M. Kerr Professor of Atmospheric Science, Department of Earth, Atmospheric Science and Planetary Sciences, MIT;
Judith Layzer, Edward and Joyce Linde Career Development Associate Professor of Environmental Policy, Department of Urban Studies and Planning, MIT;
Stephen Ansolabehere, Professor of Political Science, MIT, and
Professor of Government, Harvard University;
Ronald G. Prinn, TEPCO Professor of Atmospheric Science, Department of Earth, Atmospheric and Planetary Sciences, MIT Director, Center for Global Change Science; Co-Director of the MIT Joint Program on the Science and Policy of Global Change;
Richard Lindzen, Alfred P. Sloan Professor of Meteorology, Department of Earth, Atmospheric and Planetary Sciences, MIT.

Video, running at nearly 2 hours, is available at http://mitworld.mit.edu/video/730.