Changes in the Research Process Must Come From the Scientific Community, not Federal Regulation

I wrote this piece as an invited policy article for a major journal but they declined to publish it. It’s still very much a draft and they made some suggestions, but since realistically I won’t be able to get back to this for a while and the text is becoming increasingly dated, I thought I would post it here. Enjoy!

Recent U.S. policy changes are mandating a particular vision of scientific communication: public access to data and publications for federally funded research. On February 22, 2013, the Office of Science and Technology Policy (OSTP) in the White House released an executive memorandum instructing the major federal funding agencies to develop plans to make both the datasets and research articles resulting from their grants publicly available [1]. On March 5, the House Science, Space, and Technology subcommittee convened a hearing on Scientific Integrity & Transparency, and on May 9, President Obama issued an executive order requiring government data to be made openly available to the public [2].

Many in the scientific community have demanded increased data and code disclosure in scholarly dissemination to address issues of reproducibility in computational science [3-19]. At first blush, these federal policy changes appear to support those scientific goals, but the scope of government action is limited in ways that can impair its ability to respond directly to these concerns. The scientific community should not rely on federal policy to bring about changes that enable reproducible computational research. These recent policy changes must instead be a catalyst for a well-considered update in research dissemination standards by the scientific community: computational science must move to publication standards that include the digital data and code sufficient to permit others in the field to computationally reproduce and verify the results. Authors and journals must be ready to use existing repositories and infrastructure to ensure the communication of reproducible computational discoveries.

The Credibility Crisis

Jon Claerbout, professor emeritus of geophysics at Stanford University, warned of the credibility crisis in computational science more than 20 years ago [20]. Paraphrasing Claerbout one might say that “an article about computational science in a scientific publication is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete set of software instructions and data which generated the figures and tables.” [21] Today, most published computational results are not directly verifiable, not because the science is inherently flawed but because there are no standards enabling access to the data and code that make replication possible [7, 22-24].

In a sense, reproducibility in computational science has hit the mainstream. Workshops and conference sessions on “reproducible research” can be found in fields from aeronautics to zoology [25, 26]. Publications and manifestoes on reproducibility come from many disparate voices. One of the effects of today’s widespread use of computation in scientific research is an increase in the complexity of workflows associated with scientific discoveries. A research process that uses computational tools and digital data introduces new potential sources of error: Were the methods described in the paper transcribed correctly into computer code? What were the parameter settings and input data files? How were the raw data filtered and prepared for analysis? Are the figures and tables produced by the code the same as those reported in the published article? The list goes on. Access to the data and code that produced the result is necessary, both for replication and validation purposes as well as for reconciling differences in independent implementations [3].
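These questions suggest concrete checks that a replicator can script. As a minimal, purely illustrative sketch (the file name, parameter values, and expected result below are all invented for the example, not drawn from any particular study), a published analysis might ship with a small verification script that records the exact parameter settings, fingerprints the input data, and confirms that rerunning the code reproduces the reported numbers:

```python
# verify_results.py -- hypothetical replication check for a published analysis.
# File names, parameters, and the expected value are placeholders for illustration.
import hashlib
import json

import numpy as np

# The exact settings reported in the (hypothetical) paper's Methods section.
PARAMS = {"seed": 42, "threshold": 0.05, "n_iterations": 1000}

def sha256(path):
    """Fingerprint an input file so readers can confirm they have the same raw data."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_analysis(data, params):
    """Stand-in for the paper's computation: filter the raw data, bootstrap a statistic."""
    rng = np.random.default_rng(params["seed"])           # fixed seed for determinism
    filtered = data[data > params["threshold"]]           # the filtering step from Methods
    boot = [rng.choice(filtered, filtered.size).mean()    # bootstrap resamples of the mean
            for _ in range(params["n_iterations"])]
    return float(np.mean(boot))

if __name__ == "__main__":
    print("input data hash:", sha256("raw_measurements.csv"))
    data = np.loadtxt("raw_measurements.csv", delimiter=",")
    result = run_analysis(data, PARAMS)
    print("parameters:", json.dumps(PARAMS))
    print("recomputed statistic:", result)
    # Compare against the value reported in the (hypothetical) paper's Table 1.
    assert abs(result - 0.3127) < 1e-4, "recomputed value disagrees with the published table"
```

A script like this directly answers three of the questions above: it pins down the parameter settings, identifies the input data unambiguously, and tests whether the regenerated numbers match the published ones.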

Recent Federal Policy Changes: A Catalyst

Recent federal government actions have the potential to help the scientific community close this credibility gap. Along with the OSTP public access executive memorandum, the May 9 Presidential executive order entitled “Making Open and Machine Readable the New Default for Government Information” instructs the federal government to issue an Open Data Policy binding on federal agencies. The executive order pertains to government datasets created within agencies, many of which are or could be used in research.

Congress is currently considering the public availability of data and publications arising from federally funded research in the reauthorization of the America COMPETES Act, which funds research programs at several federal agencies including the National Science Foundation and the Department of Energy. At the March 5 House Science, Space, and Technology subcommittee hearing on Scientific Integrity & Transparency, the chairman’s opening remarks stated:

The growing lack of scientific integrity and transparency has many causes but one thing is very clear: without open access to data, there can be neither integrity nor transparency from the conclusions reached by the scientific community. Furthermore, when there is no reliable access to data, the progress of science is impeded and leads to inefficiencies in the scientific discovery process. Important results cannot be verified, and confidence in scientific claims dwindles.

As with the OSTP executive memo and Presidential executive order, this also appears to be a boon to scientific transparency. The bill is being drafted as this article goes to press, but there are in fact many barriers to legislating reproducible computational research that make the federal government’s efforts a partial solution at best.

The Limits of Federal Policy for Scientific Research

Because of the federal government’s central role in sponsoring scientific research, federal policy makers can wield enormous influence over how that research is carried out and how it is accessed. However, policy makers’ incentives and their range of potential actions can differ from those of the scientific community.

The federal government is responsible to American voters as a whole. Scientific research is highly internationally collaborative, and as one House member asked during the March 5 hearing, “As we move into this era of wanting to share data, we also have to maintain our competitive advantage, and we do have competitor nations that every day are trying to get to our data, get to the research institutions. How [do we] move forward in an open transparent way but maintaining our competitive advantage and protecting those discoveries that we’re making?” In addition, the policy rhetoric around public access to data and publications seems to be oriented primarily around accelerating economic growth rather than scientific integrity, as can be seen in phrases such as “[a]ccess to digital data sets resulting from federally funded research allows companies to focus resources and efforts on understanding and exploiting discoveries” and “[o]penness in government strengthens our democracy, promotes the delivery of efficient and effective services to the public, and contributes to economic growth.”

Federal funding agency regulation directly affects only federally funded research. Not all research is funded by federal agencies: the policy changes discussed in this article regulate only federal research grants and government-produced datasets.

Lobbying groups can exert enormous influence over the government rulemaking process. Non-researchers such as commercial publishers or rare disease research advocates maintain a strong presence on Capitol Hill. Since the scientific research community tends not to lobby, this can skew the outcome of the policymaking process away from alignment with longstanding scientific norms. Policymakers should not be assumed to have a deep knowledge of how the scientific research process works, and scientists should be engaged in the policymaking process when it concerns how science is carried out and disseminated.

The scope of potential federal action is constrained by legal barriers. When scientists discuss resolving the credibility crisis, their calls are for conveniently available data and code. Federal policy makers, by contrast, call for “public access” to data and to publications. These responses differ in two principal ways: federal policy makes data more widely available than scientists might have proposed, and it neglects to mention software, methods, and reproducibility as a rationale for sharing. While I believe few scientists would quibble with making the results of their research publicly available, reproducibility is a cornerstone of the scientific method, and for computational science it requires access to the code as well as the data. One possible reason code is largely absent from the federal policy discussion is that software is potentially patentable. The Bayh-Dole Act, passed in 1980, gives research institutions title to government-funded inventions, including those embodied in software, and charges them with using the patent system to encourage disclosure and commercialization of the inventions, making software harder to regulate than data or publications.

In short, the scientific community must meet its goals for reproducibility and transparency itself, rather than relying solely on the federal government.

Responses from the Scientific Community and Ways Forward

The scientific community has taken a number of steps to bring about reproducible computational science. Some journals require authors to make the data and code required to reproduce the published results available upon request [11, 27]. I am involved with a Presidential Initiative of the American Statistical Association to develop a “portal” to make digital scholarly objects associated with statistics research publicly available. Many universities and funding agencies have created online repositories to support the dissemination of digital scholarly objects. A nascent community of software developers is making an increasing number of tools available for scientific workflow management and the dissemination of reproducible computational results [26, 28].

It is not costless to prepare data and code for release [29, 30]. Researchers are currently rewarded for manuscript publication, not data and code production, which has the effect of penalizing those who practice reproducible research [31]. One way to break this collective action problem is for community members to all move to new practices at the same time, and the changes in federal policy provide such an opportunity. That time is now. Journals must update their publication requirements to include the deposit of the data and code needed to replicate all results in the manuscript. The data and code can be deposited at any repository that can reasonably guarantee a stable URL, to be provided in the text of the published paper. Scientists should expect to be able to replicate computational results, including when evaluating research for the purpose of making hiring and promotion decisions. When complex architectures have been used, sharing the code allows others to inspect and perhaps adapt the methods [32]. Reasons data or code cannot be shared, such as patient privacy, should be disclosed to journal editors at the time of submission and exceptions agreed upon early in the process [33]. The scientific community has an opportunity to extend the title of the Presidential executive order, “Make Open the New Default for Published Computational Science,” and take steps toward really reproducible computational research.
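To make the stable-URL requirement concrete, here is a minimal sketch of what verifying a deposited replication package could look like from a reader’s side. Everything specific in it is assumed for illustration: the repository URL, the archive name, the published checksum, and the convention that the package exposes a single run_all.py entry point are hypothetical, not features of any existing repository.

```python
# replicate.py -- hypothetical end-to-end check of a deposited replication package.
# The URL, checksum, and run_all.py convention below are illustrative placeholders.
import hashlib
import subprocess
import tarfile
import urllib.request

# Stable URL printed in the text of the (hypothetical) published paper.
ARCHIVE_URL = "https://example-repository.org/deposit/12345/replication.tar.gz"
# Checksum published alongside the article so readers can detect silent changes.
EXPECTED_SHA256 = "0123456789abcdef..."

def fetch_and_verify(url, expected_sha256, dest="replication.tar.gz"):
    """Download the deposited archive and confirm it matches what the authors published."""
    urllib.request.urlretrieve(url, dest)
    with open(dest, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checksum mismatch: got {digest}")
    return dest

if __name__ == "__main__":
    archive = fetch_and_verify(ARCHIVE_URL, EXPECTED_SHA256)
    with tarfile.open(archive) as tar:
        tar.extractall("replication")  # unpack the deposited data, code, and run script
    # Assumed convention: the package provides one entry point that regenerates
    # every figure and table in the manuscript.
    subprocess.run(["python", "replication/run_all.py"], check=True)
```

The design point is that a stable URL plus a published checksum plus a single entry point is enough for any reader, referee, or hiring committee to rerun the full analysis without corresponding with the authors.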

Policy deliberations are currently ongoing, and input from the scientific community can shape outcomes that will affect the scientific research process. Congress is considering open data and publication requirements in the reauthorization of the America COMPETES Act. The OSTP executive memorandum directs the federal funding agencies to deliver a plan to the OSTP by August 22 of this year and directs each agency to “use a transparent process for soliciting views from stakeholders, including federally funded researchers, universities, libraries, publishers, users of federally funded research results, and civil society groups, and take such views into account.”

References

[1] White House Office of Science and Technology Policy Executive Memorandum, “Increasing Access to the Results of Federally Funded Scientific Research,” Feb 22, 2013. http://www.whitehouse.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research

[2] Presidential Executive Order, “Making Open and Machine Readable the New Default for Government Information,” May 9, 2013. http://www.whitehouse.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-

[3] Donoho, D., A. Maleki, I. Ur Rahman, M. Shahram, V. Stodden, “Reproducible Research in Computational Harmonic Analysis,” Computing in Science and Engineering, Vol 11, Issue 1, 2009, p. 8-18. http://www.computer.org/csdl/mags/cs/2009/01/mcs2009010008-abs.html

[4] King, G., “Replication, Replication,” Political Science and Politics, Vol 28, p. 443-499, 1995. http://gking.harvard.edu/files/abs/replication-abs.shtml

[5] Begley, C., L. Ellis, “Drug Development: Raise Standards for Preclinical Cancer Research,” Nature, 483 (7391), Mar 29, 2012. http://www.nature.com/nature/journal/v483/n7391/full/483531a.html

[6] Nature Publishing Group, “Announcement: Reducing our Irreproducibility,” Nature, 496 (398), April 25, 2013. http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852

[7] Alsheikh-Ali, A, W. Qureshi, M. Al-Mallah, and J. P. A. Ioannidis, “Public Availability of Published Research Data in High-Impact Journals,” PLoS ONE, 6(9), 2011. http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0024357

[8] Bailey, D., J. Borwein, “Exploratory, Experimentation, and Computation,” Notices of the AMS, 58(10), Nov, 2011. http://www.ams.org/notices/201110/rtx111001410p.pdf

[9] Nature Publishing Group, “Must Try Harder,” Nature, 483 (509), Mar 29, 2012. http://www.nature.com/nature/journal/v483/n7391/full/483509a.html

[10] Morin, A., J. Urban, P. Adams, I. Foster, A. Sali, D. Baker, P. Sliz, “Shining Light into Black Boxes,” Science, 336 (6078), April 13, 2012. https://www.sciencemag.org/content/336/6078/159.summary

[11] Hanson, B., A. Sugden, B. Alberts, “Making Data Maximally Available,” Science, 331 (6018), Feb 11, 2011. http://www.sciencemag.org/content/331/6018/649.short

[12] Fanelli, D., “Redefine Misconduct as Distorted Reporting,” Nature, 494 (7436), Feb 13, 2013. http://www.nature.com/news/redefine-misconduct-as-distorted-reporting-1.12411

[13] Barnes, N., “Publish your computer code: it is good enough,” Nature, 467 (753), Oct 13, 2010. http://www.nature.com/news/2010/101013/full/467753a.html

[14] Yong, E., “Replication Studies: Bad Copy,” Nature, 485 (7398), May 16, 2012. http://www.nature.com/news/replication-studies-bad-copy-1.10634

[15] Yong, E., “Nobel laureate challenges psychologists to clean up their Act,” Nature, Oct 3, 2012. http://www.nature.com/news/nobel-laureate-challenges-psychologists-to-clean-up-their-act-1.11535

[16] “Devil in the details,” Nature editorial, Vol 470, p. 305-306, Feb. 17, 2011. http://www.nature.com/nature/journal/v470/n7334/full/470305b.html

[17] “Reproducible Research,” Special Issue, Computing in Science and Engineering, Vol 14, Issue 4, 2012, p. 11-56.
http://www.computer.org/portal/web/csdl/abs/mags/cs/2012/04/mcs201204toc.htm

[18] Stodden, V., I. Mitchell, and R. LeVeque, “Reproducible Research for Scientific Computing: Tools and Strategies for Changing the Culture,” Computing in Science and Engineering, Vol 14, Issue 4, p. 13-17, 2012. http://www.computer.org/csdl/mags/cs/2012/04/mcs2012040013-abs.html

[19] Committee on Responsibilities of Authorship in the Biological Sciences, National Research Council, Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences, National Academies Press, Washington, D.C., U.S.A., 2003. http://www.nap.edu/openbook.php?record_id=10613&page=27

[19a] Kovacevic, J., “How to Encourage and Publish Reproducible Research,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., IV, 2007.

[19b] Baggerly, K., D. Barry, “Reproducible Research,” 2011. http://magazine.amstat.org/blog/2011/01/01/scipolicyjan11/

[20] Claerbout, J., M. Karrenbach, “Electronic Documents Give Reproducible Research a New Meaning,” 1992. http://sepwww.stanford.edu/doku.php?id=sep:research:reproducible:seg92

[21] Buckheit, J., D.L. Donoho, “Wavelab and Reproducible Research,” Wavelets and Statistics, A. Antoniadis, ed., Springer-Verlag, 1995, pp. 55-81.

[22] Ioannidis, J., “Why Most Published Research Findings are False,” PLoS Medicine, vol. 2, no. 8, 2005, pp. 696-701.

[23] Nosek, B. A., Spies, J. R., Motyl, M., “Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability.” Perspectives on Psychological Science, 7, 615-631, 2012. http://pps.sagepub.com/content/7/6/615.abstract

[24] Stodden, V., “Enabling Reproducible Research: Open Licensing For Scientific Innovation,” International Journal of Communications Law and Policy, Issue 13, 2009. http://ijclp.net/old_website/article.php?doc=1&issue=13_2009

[25] Yale Roundtable Participants, “Reproducible Research,” Computing in Science and Engineering, Vol 12, Issue 5, 2010. http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2010.113

[26] Stodden, V., D. H. Bailey, J. Borwein, R. J. LeVeque, W. Rider and W. Stein, “Setting the Default to Reproducible: Reproducibility in Computational and Experimental Mathematics,” Feb 2, 2013, http://stodden.net/icerm_report.pdf

[27] Stodden, V., P. Guo, Z. Ma, “Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals,” PLoS ONE, 8(6), 2013. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067111

[28] ICERM Reproducibility Workshop Wiki: http://wiki.stodden.net/ICERM_Reproducibility_in_Computational_and_Experimental_Mathematics:_Readings_and_References

[29] Stodden, V., “The Scientific Method in Practice: Reproducibility in the Computational Sciences”, MIT Sloan Research Paper No. 4773-10. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1550193

[30] Wilson, G., D. Aruliah, C. Brown, N. Chue Hong, M. Davis, R. Guy, S. Haddock, K. Huff, I. Mitchell, M. Plumbley, B. Waugh, E. White, P. Wilson, “Best Practices for Scientific Computing,” arXiv:1210.0530, 2012.

[31] Stodden, V., J. Borwein, and D.H. Bailey, “‘Setting the Default to Reproducible’ in Computational Science Research,” SIAM News, June 3, 2013. http://www.siam.org/news/news.php?id=2078

[32] LeVeque, R., “Top Ten Reasons To Not Share Your Code (and why you should anyway),” SIAM News, April 1, 2013. http://www.siam.org/news/news.php?id=2064

[33] Borwein, J., D. Bailey, V. Stodden, “Set the Default to ‘Open’,” Notices of the AMS, June/July, 2013. http://www.ams.org/notices/201306/rnoti-p679.pdf
