Monthly Archive for September, 2014

My input for the OSTP RFI on reproducibility

Until Sept 23 2014, the US Office of Science and Technology Policy in the Whitehouse was accepting comments on their “Strategy for American Innovation.” My submitted comments on one part of that RFI, section 11:

“11) Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?”

follow (corrected for typos).

This comment is directed at point 11, requesting comments on the reproducibility of scientific findings. I believe there are two threads to this issue: a traditional problem that has been in science for hundreds of years whose traditional solution has been the methods section in the scientific publication; secondly, a new issue that has arisen over the last twenty years as computation has assumed a central role in scientific research. This new element is not yet accommodated in scientific publication, and introduces serious consequences for reproducibility.

Putting aside the first issue of traditional reproducibility, for which longstanding solutions exist, I encourage the federal government, in concert with the scientific community, to consider how the current set of laws and funding agency practices do not support the production of reproducible computational science.

In all research that utilizes a computer, instructions for the research are stored in software and scientific data are stored digitally. A typical publication in computational research is based foundationally on data, and the computer instructions applied to the data that generated the scientific findings. The complexity of the data generation mechanism and the computational instructions is typically very large, too large to capture in a traditional scientific publication. Hence when computers are involved in the research process, scientific publication must shift from a scientific article to the triple of scientific paper, and the software and data from which the findings were generated. This triple has been referred to as a “research compendia” and its aim is to transmit research findings that others in the field will be able to reproduce by running the software on the data. Hence, data and software that permits others to reproducible the findings must be made available.

There are two primary laws come to bear on this idea of computational reproducibility. The first is copyright law, which adheres to software and to some degree to data. Software and data from scientific research should not receive the same legal protection as most original artistic works receive from copyright law. These objects should be made openly available by default (rather than closed by copyright law by default) with attribution for the creators.

Secondly, the Bayh-Dole Act from 1980 is having the effect of creating less transparency and less knowledge and technology transfer due to the use of the computer in scientific research. Bayh-Dole charges the institutions that support research, such as universities, to use the patent system for inventions that derive under its auspices. Since software may be patentable, this introduces a barrier to knowledge transfer and reproducibility. A research compendia would include code and would be made openly available, where as Bayh-Dole adds an incentive to create a barrier by introducing the option to patent software. Rather than openly available software, a request to license patented software would need to submitted to the University and appropriate rates negotiated. For the scientific community, this is equivalent to closed unusable code.

I encourage you to rethink the legal environment that attends to the digital objects produced by scientific research in support of research findings: the software; the data; and the digital article. Science, as a rule, demands that these be made openly available to society (as do scientists) and unfortunately they are frequently captured by external third parties, using copyright transfer and patents, that restrict access to knowledge and information that has arisen from federal funding. This retards American innovation and competitiveness.

Federal funding agencies and other government entities must financially support the sharing, access, and long term archiving of research data and code that supports published results. With guiding principles from the federal government, scientific communities should implement infrastructure solutions that support openly available reproducible computational research. There are best practices in most communities regarding data and code release for reproducibility. Federal action is needed since the scientific community faces a collection action problem: producing research compendia, as opposed to a published article alone, is historically unrewarded. In order to change this practice, the scientific community must move in concert. The levers exerted by the federal funding agencies are key to breaking this collective action problem.

Finally, I suggest a different wording for point 11 in your request. Scientific findings are not the level at which to think about reproducibility, it is better to think about enabling the replication of the research process that is associated with published results, rather than the findings themselves. This is what provides for research that is reproducible and reliable. When different processes are compared, whether or not they produce the same result, the availability of code and data will enable the reconciliation of differences in methods. Open data and code permit reproducibility in this sense and increase the reliability of the scholarly record by permitting error detection and correction.

I have written extensively on all these issues. I encourage you to look at http://stodden.net, especially the papers and talks.