Until September 23, 2014, the US Office of Science and Technology Policy in the White House was accepting comments on its “Strategy for American Innovation.” My submitted comments on one part of that RFI, section 11, follow (with minor corrections):
“11) Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?”
This comment is directed at point 11, requesting comments on the reproducibility of scientific findings. I believe there are two threads to this issue: first, a traditional problem that has existed in science for hundreds of years, whose longstanding solution has been the methods section of the scientific publication; second, a new issue that has arisen over the last twenty years as computation has assumed a central role in scientific research. This new element is not yet accommodated in scientific publication, and it introduces serious consequences for reproducibility.
Putting aside the first issue of traditional reproducibility, for which longstanding solutions exist, I encourage the federal government, in concert with the scientific community, to consider how the current set of laws and funding agency practices do not support the production of reproducible computational science.
In all research that utilizes a computer, instructions for the research are stored in software and scientific data are stored digitally. A typical publication in computational research rests foundationally on data and on the computer instructions applied to those data to generate the scientific findings. The complexity of the data generation mechanism and of the computational instructions is typically very large, too large to capture in a traditional scientific publication. Hence, when computers are involved in the research process, scientific publication must shift from the article alone to the triple of the scientific paper, the software, and the data from which the findings were generated. This triple has been referred to as a “research compendium,” and its aim is to transmit research findings that others in the field will be able to computationally reproduce by running the software on the data. Hence, data and software that permit others to reproduce the findings must be made available.
Two primary bodies of law bear on this idea of computational reproducibility. The first is copyright law, which adheres to software and, arguably to some degree, to data. Software and data from scientific research, in particular federally funded scientific research, should be made openly available by default (rather than closed by copyright law by default), with attribution for the creators. This allows the computational research process to conform more closely to the scientific method.
Second, the Bayh-Dole Act of 1980 charges the institutions that support research, such as universities, with using the patent system for inventions that arise under its auspices. Since software may be patentable, this introduces a potential barrier to knowledge transfer and reproducibility. Rather than software being openly available, a request to license patented software would need to be submitted to the university and appropriate rates negotiated. For the scientific community, this would likely be equivalent to closed, unusable code. Research codes need to be seen as accelerators of scientific research and discovery and as a community resource (with attribution), like scientific results themselves.
I encourage you to rethink the legal environment that attends to the digital objects produced by scientific research in support of research findings: the software, the data, and the digital article. Science, as a rule, demands that these be made openly available to society (as does the scientific community), yet they can be captured by external third parties, through copyright transfer and patents, restricting access to knowledge and information that has arisen from federal funding. This slows American innovation and competitiveness without offsetting gains.
Federal funding agencies and other government entities must financially support the sharing, access, and long-term archiving of the research data and code that support published results. With guiding principles from the federal government, scientific communities should implement infrastructure solutions that support openly available, reproducible computational research. Best practices regarding data and code release for reproducibility already exist in most communities. Federal action is needed because the scientific community faces a collective action problem: producing research compendia, as opposed to a published manuscript alone, has historically gone unrewarded. To change this practice, the scientific community must move in concert. The levers exerted by the federal funding agencies are key to breaking this collective action problem.
Finally, I suggest a different wording for point 11 in your request. Scientific findings are not the right level at which to think about reproducibility; it is more productive to think about enabling the replication of the research process associated with published results. This is what makes research reproducible and reliable. When different processes are compared, whether or not they produce the same result, the availability of code and data can provide the only way to reconcile differences in methods. Open data and code permit computational reproducibility and increase the reliability of the scholarly record by permitting error detection and correction.
I have written extensively on all these issues. I encourage you to look at http://stodden.net, especially my publications and talks.