Changes in the Research Process Must Come From the Scientific Community, not Federal Regulation

I wrote this piece as an invited policy article for a major journal but they declined to publish it. It’s still very much a draft and they made some suggestions, but since realistically I won’t be able to get back to this for a while and the text is becoming increasingly dated, I thought I would post it here. Enjoy!

Recent U.S. policy changes are mandating a particular vision of scientific communication: public access to data and publications for federally funded research. On February 22, 2013, the Office of Science and Technology Policy (OSTP) in the White House released an executive memorandum instructing the major federal funding agencies to develop plans to make both the datasets and research articles resulting from their grants publicly available [1]. On March 5, the House Science, Space, and Technology subcommittee convened a hearing on Scientific Integrity & Transparency, and on May 9, President Obama issued an executive order requiring government data to be made openly available to the public [2].

Many in the scientific community have demanded increased data and code disclosure in scholarly dissemination to address issues of reproducibility and credibility in computational science [3-19]. At first blush, the federal policy changes appear to support these scientific goals, but the scope of government action is limited in ways that impair its ability to respond directly to these concerns. The scientific community cannot rely on federal policy to bring about changes that enable reproducible computational research. These recent policy changes must be a catalyst for a well-considered update in research dissemination standards by the scientific community: computational science must move to publication standards that include the digital data and code sufficient to permit others in the field to replicate and verify the results. Authors and journals must be ready to use existing repositories and infrastructure to ensure the communication of reproducible computational discoveries.

The Credibility Crisis

Jon Claerbout, professor emeritus of geophysics at Stanford University, warned of the credibility crisis in computational science more than 20 years ago [20]. Paraphrasing Claerbout, one might say that “an article about computational science in a scientific publication is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete set of software instructions and data which generated the figures and tables.” [21] Today, most published computational results are not directly verifiable, not because the science is inherently flawed but because there are no standards enabling access to the data and code that make replication possible [7, 22-24].

In a sense, reproducibility in computational science has hit the mainstream. Workshops and conference sessions on “reproducible research” can be found in fields from aeronautics to zoology [25, 26]. Publications and manifestos on reproducibility are coming from many disparate voices. One of the effects of today’s widespread use of computation in scientific research is an increase in the complexity of workflows associated with scientific discoveries. A research process that uses computational tools and digital data introduces new potential sources of error: Were the methods described in the paper transcribed correctly into computer code? What were the parameter settings and input data files? How were the raw data filtered and prepared for analysis? Are the figures and tables produced by the code the same as reported in the published article? The list goes on. Access to the data and code that produced the result is necessary, both for replication and validation purposes as well as for reconciling differences in independent implementations [3].

Recent Federal Policy Changes: A Catalyst

Recent federal government actions have the potential to help the scientific community close this credibility gap. Along with the OSTP public access executive memorandum, the May 9 Presidential executive order entitled “Making Open and Machine Readable the New Default for Government Information” instructs the federal government to issue an Open Data Policy binding on federal agencies. The executive order pertains to government datasets created within agencies, many of which are, or can be, used for research.

Congress is currently considering the public availability of data and publications arising from federally funded research in the reauthorization of the America COMPETES Act, which funds research programs at several federal agencies including the National Science Foundation and the Department of Energy. At the March 5 Congressional Science, Space, and Technology subcommittee hearing on Scientific Integrity & Transparency, the chairman’s opening remarks stated:

The growing lack of scientific integrity and transparency has many causes but one thing is very clear: without open access to data, there can be neither integrity nor transparency from the conclusions reached by the scientific community. Furthermore, when there is no reliable access to data, the progress of science is impeded and leads to inefficiencies in the scientific discovery process. Important results cannot be verified, and confidence in scientific claims dwindles.

As with the OSTP executive memo and Presidential executive order, this also appears to be a boon to scientific transparency. The bill is being drafted as this article goes to press, but there are in fact many barriers to legislating reproducible computational research that make the federal government’s efforts a partial solution at best.

The Limits of Federal Policy for Scientific Research

Because of the Federal government’s central role in sponsoring scientific research, federal policy makers can wield enormous influence over how that research is carried out and how it is accessed. Policy makers’ incentives and their range of potential actions, however, can differ from those of the scientific community.

The Federal Government is responsible to American voters as a whole. Researchers cannot expect policymakers to necessarily act in the best interests of the scientific community, even if we believe scientific progress benefits all of society. Scientific research is highly internationally collaborative, and as one House member asked during the March 5 hearing, “As we move into this era of wanting to share data, we also have to maintain our competitive advantage, and we do have competitor nations that every day are trying to get to our data, get to the research institutions… how [do we] we move forward in an open transparent way but maintaining our competitive advantage and protecting those discoveries that we’re making?” In addition, the policy rhetoric around public access to data and publication seems to be oriented primarily around accelerating economic growth rather than scientific integrity, as can be seen in phrases such as “[a]ccess to digital data sets resulting from federally funded research allows companies to focus resources and efforts on understanding and exploiting discoveries” and “[o]penness in government strengthens our democracy, promotes the delivery of efficient and effective services to the public, and contributes to economic growth.”

Federal funding agency regulation only directly affects federally funded research. Not all research is funded by the federal funding agencies. The policy changes discussed in this article directly regulate federal research grants and government produced datasets.

Lobbying groups can exert enormous influence over the government rule making process. Lobbyists can have a substantial impact on rulemaking, even in science policy. Non-researchers such as commercial publishers or rare disease research advocates often maintain a strong presence on Capitol Hill and are in regular contact with legislative aides drafting bills. Since the scientific community tends not to lobby, this can skew the outcome of the policy making process in ways that depart from longstanding scientific norms. Policymakers shouldn’t be assumed to have a deep knowledge of how the scientific research process works and scientists should be engaged in the policy making process when it concerns how science is carried out and disseminated.

The scope of potential federal action is constrained by legal barriers. When scientists discuss resolving the credibility crisis, their calls are for conveniently available data and code. Federal policy makers, by contrast, call for “public access” to data and to publications. These responses differ in two principal ways. Federal policy makes data more widely available than scientists might have proposed, and neglects to mention software, methods, and reproducibility as a rationale for sharing. While I believe few scientists would quibble with making the results of their research publicly available, reproducibility is a cornerstone of the scientific method. One possible reason code is largely absent from the federal policy discussion is that software is potentially patentable. The Bayh-Dole Act, passed in 1980, gives research institutions title to government-funded inventions, including those embodied in software, and charges them with using the patent system to encourage disclosure and commercialization of the inventions, making software harder to regulate than data or publications.

In short, the scientific community must meet its goals for reproducibility and transparency itself, rather than relying solely on the federal government.

Responses from the Scientific Community and Ways Forward

Aside from authoring calls and proposing standards for greater dissemination in computational science, the scientific community has taken a number of steps to bring about reproducible computational science. Some journals, such as this one, require authors to make the data and code required to reproduce the published results available upon request [11, 27]. I am involved with a Presidential Initiative of the American Statistical Association to develop a “portal” to make digital scholarly objects associated with statistics research publicly available. Many universities and funding agencies have created online repositories to support the dissemination of digital scholarly objects. A nascent community of software developers is making an increasing number of tools available for scientific workflow management and the dissemination of reproducible computational results [26, wiki].

It is not costless to prepare data and code for release [29,30]. Researchers are currently rewarded for manuscript publication, not data and code production, which has the effect of penalizing those who practice reproducible research [31]. One way to break this collective action problem is for community members to all move to new practices at the same time, and the changes in federal policy provide such an opportunity. This time is now. Journals must update their publication requirements to include the deposit of data and code that can replicate all results in the manuscript [gentleman]. These data and codes can be deposited at any repository that can reasonably guarantee a stable URL, to be provided in the text of the published paper. Scientists should expect to be able to replicate computational results, including when evaluating research for the purpose of making hiring and promotion decisions. When complex architectures have been used, sharing the code allows others to inspect and perhaps adapt the methods [32]. Reasons data or code cannot be shared, such as patient privacy, should be disclosed to journal editors at the time of submission and exceptions agreed upon early in the process [33]. The scientific community has an opportunity to extend the title of the Presidential executive order and “Make Open the New Default for Published Computational Science” and take steps toward really reproducible computational research.

Policy deliberations are currently ongoing and input from the scientific community can impact outcomes that will affect the scientific research process. Congress is considering open data and publication requirements in the reauthorization of the America COMPETES Act. The OSTP executive memorandum directs the federal funding agencies to deliver a plan to the OSTP on August 22 of this year and directs each agency to “use a transparent process for soliciting views from stakeholders, including federally funded researchers, universities, libraries, publishers, users of federally funded research results, and civil society groups, and take such views into account.”


  • Victoria, thanks for sharing this piece. I think you tackle a very important issue in that these challenges should be addressed by the scientific community and scientific publishers as well and not left to federal mandates alone.

    I’d like to offer a little push-back on what I see as the two main reasons you highlight that changes should not be left to government: valuing reproducibility over economics, and the Bayh-Dole Act.

    While I agree that the federal government and the research community may have differing specific objectives in regards to the value of sharing data and code, I think you might be painting a bit of a false dichotomy in your argument here. The government may measure the value in terms of economic benefit while researchers measure it more in terms of reproducibility — but this does not necessarily imply any conflict of objectives. It may well be that the policies that best promote reproducibility also promote economic benefit.

    With regards to the Bayh-Dole act, I suspect you are quite right that this patent incentive is partly the reason the current memos have chosen not to address this issue, but I must quite disagree that this means we should expect researchers & journals to address the issue instead. In contrast to your other arguments in which the federal government is not best placed to exert the right influence, it is entirely the best-placed body to address the problem here — ideally by removing or substantially overhauling software patent law. Researchers & journals, by contrast, are much more likely to have their hands tied by pressure from technology transfer departments at their institutions. While we non-staff researchers can take advantage of their ownership of software to release it openly anyway and encourage others to do so, surely this is one case where federal action could have an invaluable contribution.

    Lastly, you mention ‘this journal’ having a policy of sharing on request, while pointing to the federal mandates taking a much more aggressive stance on being publicly available. You are not suggesting that ‘available on request’ is a decent policy for a journal to have? (Though obviously more palatable to most academics, every study I have seen shows embarrassingly low response rates to such requests. Formal repositories offer a much better solution).

    Overall I support your call for engagement from researchers and publishers, which is obviously happening as well, but feel you are too harsh on the value provided by federal mandates. Thanks for sharing this piece!

  • Victoria,

    Thanks for sharing your work freely, even if you couldn’t get it published in a timely manner. I’d put it on my CV in any case :)

    Following Dr. Boettiger’s discussion about software and where it fits in the federal open access policies, I’ll mention that the definition of ‘research data’ used by the government can certainly be construed to include software. From the OSTP Public Access Memo: “For purposes of this memorandum, data is defined, consistent with OMB circular A-110, as the digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications…” I think that encompasses software!

    This is not to say the Bayh-Dole Act doesn’t complicate matters with regards to patenting such software, of course, but this definition does give an opening for funding agencies to ask about software.

    I am left to wonder how many researchers look to patent their software…typically making scientific/research software something marketable would require a lot of work, presuming that software even has economic value.


  • Great article. A shame it wasn’t ‘published’. At least now it’s published :-)

    I think you only need to drop the ‘computational’ throughout the text and it would be very much up to date :-)

    In fact, we are working at our institution with our library and computing center to develop standards for text, data and software infrastructure. The plan is that when you arrive at a new university, it’s not “here’s your email and your webspace” any more, but “here’s your access codes for your papers, data and software. Oh, and if you want email and webspace, here’s the URL”.

    It’s quite embarrassing that of the three things scholars produce: text, software and data, not even one is supported by even a token of institutional infrastructure. And this is 2013, not 1913.

