Disrupt science: But not how you’d think

Two recent articles call for an openness revolution in science: one on GigaOM and the other in the Wall Street Journal. But they’ve got it all wrong. These folks are missing that the process of scientific discovery is not, at its core, an open process. It only becomes an open process at the point of publication.

I am not necessarily in favor of greater openness during the process of scientific collaboration. I am, however, necessarily in favor of openness in communicating discoveries at the time of publication. Publication is the point at which authors feel their work is ready for public presentation and scrutiny (that traditional publication does not actually give the public access to new scientific knowledge is a tragedy, and of course we should have Open Access). We even have a standard for the level of openness required at the time of scientific publication: reproducibility. This concept has been part of the scientific method since it was instigated by Robert Boyle in the 1660s: communicate sufficient information such that another researcher in the field can replicate your work, given appropriate tools and equipment.

If we’ve already been doing this for hundreds of years what’s the big deal now? Leaving aside the question of whether or not published scientific findings have actually been reproducible over the last few hundred years (for work on this question see e.g. Changing Order: Replication and Induction in Scientific Practice by Harry Collins), science, like many other areas of modern life, is being transformed by the computer. It is now tough to find any empirical scientific research not touched by the computer in some way, from simply storing and analyzing records to the radical transformation of scientific inquiry through massive multicore simulations of physical systems (see e.g. The Dynamics of Plate Tectonics and Mantle Flow: From Local to Global Scales).

This, combined with the facilitation of digital communication we call the Internet, is an enormous opportunity for scientific advancement. Not because we can collaborate or share our work pre-publication as the articles assert (all of which we can do, if we like) but because computational research captures far more of the tacit knowledge involved in replicating a scientific experiment than ever before, making our science potentially more verifiable. The code and data a scientist believes replicate his or her experiments can capture all digital manipulations of the data, which now comprise much of the scientific discovery process. Commonly used techniques, such as the simulations carried out in the papers included at sparselab.stanford.edu (see e.g. “Breakdown Point of Model Selection when the Number of Variables Exceeds the Number of Observations” or any of the SparseLab papers), can now be replicated by downloading the short scripts we included with publication (see e.g. Reproducible Research in Computational Harmonic Analysis). This is a far cry from Boyle’s physical experiments with the air pump: I’d argue that computational work leaves far less tacit knowledge to communicate, an advantage we’re not capitalizing on today.

Computational experiments are complex. Without communicating the data and code that generated the result it is nearly impossible to understand what was done. My thesis advisor famously paraphrased this idea, “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

It is simply wrong to assert that science has not adopted computational tools, as the GigaOM article does when it says “there is another world that makes these industries [traditional media players such as newspapers, magazines and book publishers] look like the most enthusiastic of early adopters [of technological progress]: namely, academic research.” Everywhere you look, computation is emerging as central to the scientific method. What needs to happen is a resolution of the credibility crisis in computational science resulting from this enthusiastic technological adoption: we need to establish the routine inclusion of data and code with publication, such that the results can be conveniently replicated and verified. Many new efforts to help data and code sharing are documented here, from workflow tracking to code portability to unique identifiers that permit the regeneration of results in the cloud.

This is an imperative issue: most of our published computational results today are unverified and unverifiable (I have written on this extensively, as have many others). The emphasis on the process of creative problem solving as the bottleneck in scientific discovery is misplaced. Scientific innovation is inherently creative, and perhaps more collaborative tools that permit openness will encourage greater innovation. Or perhaps it is the case, as Paul Graham has pointed out in another context, that innovative thinking is largely a solitary act. What is clear is that it is of primary importance to establish publication practices that facilitate the replication and verification of results by including data and code, and not to confound these two issues.

5 Responses to “Disrupt science: But not how you’d think”


  • Agree totally about reproducibility being the key framing issue, but I also think the concept of publication will change, so openness can happen after publication at the same time publication happens on smaller, more atomic scales.

  • This is a great post! I agree that reproducibility is crucial, and also that it is often wrongly lumped together with collaborative tools that (sometimes) also support openness. I was inspired to write a post exploring the current possibilities for reproducibility by looking at two examples of current efforts (in biology) to make research reproducible http://elnblog.axiope.com/?p=1022.

  • Victoria, thanks for an excellent & provocative post. You seem to set up a dichotomy between openness & reproducibility which I don’t understand. Disclosing your methods/code & data is necessary to be reproducible. The distinction you really seem to be focusing on, however, is the difference between being open before publication vs after publication.

    Of course publication isn’t such a clear dividing line — since those data and codes will be used in later publications by their author. The sooner one chooses to disclose the information, the higher the risks involved and the higher the potential rewards.

    The two pieces you cite emphasize technology that is *collaborative,* and present the thesis that science is slower and less efficient because it isn’t adopted. Let’s call this an “Open Science” thesis.

    You emphasize a technology that is *computational,* and present the thesis that its ubiquitous adoption has made science less reproducible & unverified (while a few clever folks have developed technology to make it easy to reproduce computational results). Let’s call this the “Reproducible Research” thesis.

    Is the difference between these theses of how to “disrupt science” one of goals, or just one of methods? Does it matter if I do all my research polymath style on blogs, vs working in secrecy until the day I publish when I release all my code and data to J. of Biostatistics and get an R-mark for reproducible? These approaches differ in (a) the tools they use (b) their position on the release-day timeline, but I believe they pursue the same outcome.

  • The link to your paper “Reproducible Research in Computational Harmonic Analysis” appears to be broken.
    (It was easy to find, of course, but I thought I’d let you know.)

    Neil

  • Apologies – here are the correct links: http://www.stanford.edu/~vcs/papers/RRCiSE-STODDEN2009.pdf and

    Reproducible Research in Computational Harmonic Analysis
    Comput. Sci. Eng. 11, 8 (2009)
    http://dx.doi.org/10.1109/MCSE.2009.15
