Archive for the 'Open Data' Category

My input for the OSTP RFI on reproducibility

Until Sept 23 2014, the US Office of Science and Technology Policy in the Whitehouse was accepting comments on their “Strategy for American Innovation.” My submitted comments on one part of that RFI, section 11:

“11) Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?”

follow (corrected for typos).

This comment is directed at point 11, requesting comments on the reproducibility of scientific findings. I believe there are two threads to this issue: a traditional problem that has been in science for hundreds of years whose traditional solution has been the methods section in the scientific publication; secondly, a new issue that has arisen over the last twenty years as computation has assumed a central role in scientific research. This new element is not yet accommodated in scientific publication, and introduces serious consequences for reproducibility.

Putting aside the first issue of traditional reproducibility, for which longstanding solutions exist, I encourage the federal government, in concert with the scientific community, to consider how the current set of laws and funding agency practices do not support the production of reproducible computational science.

In all research that utilizes a computer, instructions for the research are stored in software and scientific data are stored digitally. A typical publication in computational research is based foundationally on data, and the computer instructions applied to the data that generated the scientific findings. The complexity of the data generation mechanism and the computational instructions is typically very large, too large to capture in a traditional scientific publication. Hence when computers are involved in the research process, scientific publication must shift from a scientific article to the triple of scientific paper, and the software and data from which the findings were generated. This triple has been referred to as a “research compendia” and its aim is to transmit research findings that others in the field will be able to reproduce by running the software on the data. Hence, data and software that permits others to reproducible the findings must be made available.

There are two primary laws come to bear on this idea of computational reproducibility. The first is copyright law, which adheres to software and to some degree to data. Software and data from scientific research should not receive the same legal protection as most original artistic works receive from copyright law. These objects should be made openly available by default (rather than closed by copyright law by default) with attribution for the creators.

Secondly, the Bayh-Dole Act from 1980 is having the effect of creating less transparency and less knowledge and technology transfer due to the use of the computer in scientific research. Bayh-Dole charges the institutions that support research, such as universities, to use the patent system for inventions that derive under its auspices. Since software may be patentable, this introduces a barrier to knowledge transfer and reproducibility. A research compendia would include code and would be made openly available, where as Bayh-Dole adds an incentive to create a barrier by introducing the option to patent software. Rather than openly available software, a request to license patented software would need to submitted to the University and appropriate rates negotiated. For the scientific community, this is equivalent to closed unusable code.

I encourage you to rethink the legal environment that attends to the digital objects produced by scientific research in support of research findings: the software; the data; and the digital article. Science, as a rule, demands that these be made openly available to society (as do scientists) and unfortunately they are frequently captured by external third parties, using copyright transfer and patents, that restrict access to knowledge and information that has arisen from federal funding. This retards American innovation and competitiveness.

Federal funding agencies and other government entities must financially support the sharing, access, and long term archiving of research data and code that supports published results. With guiding principles from the federal government, scientific communities should implement infrastructure solutions that support openly available reproducible computational research. There are best practices in most communities regarding data and code release for reproducibility. Federal action is needed since the scientific community faces a collection action problem: producing research compendia, as opposed to a published article alone, is historically unrewarded. In order to change this practice, the scientific community must move in concert. The levers exerted by the federal funding agencies are key to breaking this collective action problem.

Finally, I suggest a different wording for point 11 in your request. Scientific findings are not the level at which to think about reproducibility, it is better to think about enabling the replication of the research process that is associated with published results, rather than the findings themselves. This is what provides for research that is reproducible and reliable. When different processes are compared, whether or not they produce the same result, the availability of code and data will enable the reconciliation of differences in methods. Open data and code permit reproducibility in this sense and increase the reliability of the scholarly record by permitting error detection and correction.

I have written extensively on all these issues. I encourage you to look at, especially the papers and talks.

Changes in the Research Process Must Come From the Scientific Community, not Federal Regulation

I wrote this piece as an invited policy article for a major journal but they declined to publish it. It’s still very much a draft and they made some suggestions, but since realistically I won’t be able to get back to this for a while and the text is becoming increasingly dated, I thought I would post it here. Enjoy!

Recent U.S. policy changes are mandating a particular vision of scientific communication: public access to data and publications for federally funded research. On February 22, 2013, the Office of Science and Technology Policy (OSTP) in the Whitehouse released an executive memorandum instructing the major federal funding agencies to develop plans to make both the datasets and research articles resulting from their grants publicly available [1]. On March 5, the House Science, Space, and Technology subcommittee convened a hearing on Scientific Integrity & Transparency and on May 9, President Obama issued an executive order requiring government data to be made openly available to the public [2].

Many in the scientific community have demanded increased data and code disclosure in scholarly dissemination to address issues of reproducibility and credibility in computational science [3-19]. At first blush, the federal policies changes appear to support these scientific goals, but the scope of government action is limited in ways that impair its ability to respond directly to these concerns. The scientific community cannot rely on federal policy to bring about changes that enable reproducible computational research. These recent policy changes must be a catalyst for a well-considered update in research dissemination standards by the scientific community: computational science must move to publication standards that include the digital data and code sufficient to permit others in the field to replicate and verify the results. Authors and journals must be ready to use existing repositories and infrastructure to ensure the communication of reproducible computational discoveries.
Continue reading ‘Changes in the Research Process Must Come From the Scientific Community, not Federal Regulation’

Data access going the way of journal article access? Insist on open data

The discussion around open access to published scientific results, the Open Access movement, is well known. The primary cause of the current situation — journal publishers owning copyright on journal articles and therefore charging for access — stems from authors signing their copyright over to the journals. I believe this happened because authors really didn’t realize what they were doing when they signed away ownership over their work, and had they known they would not have done so. I believe another solution would have been used, such as granting the journal a license to publish i.e. like Science’s readily available alternative license. At some level authors were entering into binding legal contracts without an understanding of the implications and without the right counsel.

I am seeing a similar situation arising with respect to data. It is not atypical for a data producing entity, particularly those in the commercial sphere, to require that researchers with access to the data sign a non-disclosure agreement. This seems to be standard for Facebook data, Elsevier data, and many many others. I’m witnessing researchers grabbing their pens and signing, and like in the publication context, feeling themselves powerless to do otherwise. Again, they are without the appropriate counsel. Even the general counsel’s office at their institution typically sees the GC’s role as protecting the institution against liability, rather than the larger concern of protecting the scholar’s work and the integrity of the scholarly record. What happens when research from these protected datasets is published, and questioned? How can others independently verify the findings? They’ll need access to the data.

There are many legitimate reasons such data may not be able to be publicly released, for example protection of subjects’ privacy (see what happened when Harvard released Facebook data from a study). But as scientists we should be mindful of the need for our published findings to be reproducible. Some commercial data do not come with privacy concerns, only concerns from the company that they are still able to sell the data to other commercial entities, and sometimes not even that. Sometimes lawyers simply want an NDA to minimize any risk to the commercial entity that might arise should the data be released. To me, that seems perfectly rational since they are not stewards of scientific knowledge.

It is also perfectly rational for authors publishing findings based on these data to push back as hard as possible to ensure maximum reproducibility and credibility of their results. Many companies share data with scientists because they seek to deepen goodwill and ties with the academic community, or they are interested in the results of the research. As researchers we should condition our acceptance of the data on its release when the findings are published, if there are no privacy concerns associated with the data. If there are privacy concerns I can imagine ensuring we can share the data in a “walled garden” within which other researchers, but not the public, will be able to access the data and verify results. There are a number of solutions that can bridge the gap between open access to data and an access-blocking NDA (e.g. differential privacy) and as scientists the integrity and reproducibility of our work is a core concern that we have responsibility for in this negotiation for data.

A few template data sharing agreements between academic researchers and data producing companies would be very helpful, if anyone feels like taking a crack at drafting them (Creative Commons?). Awareness of the issue is also important, among researchers, publishers, funders, and data producing entities. We cannot unthinkingly default to a legal situation regarding data that is anathema to scientific progress, as we did with access to scholarly publications.