This past January, President Obama signed the America COMPETES Reauthorization Act. It contains two interesting sections, 103 and 104, that advance the notions of open data and the federal role in supporting online access to scientific archives. They read in part:
• § 103: “The Director [of the Office of Science and Technology Policy at the White House] shall establish a working group under the National Science and Technology Council with the responsibility to coordinate Federal science agency research and policies related to the dissemination and long-term stewardship of the results of unclassified research, including digital data and peer-reviewed scholarly publications, supported wholly, or in part, by funding from the Federal science agencies.” (emphasis added)
This is a cause for celebration insofar as Congress has recognized that published articles are an incomplete communication of computational scientific knowledge, and the data (and code) must be included as well.
• § 104: Federal Scientific Collections: The Office of Science and Technology Policy “shall develop policies for the management and use of Federal scientific collections to improve the quality, organization, access, including online access, and long-term preservation of such collections for the benefit of the scientific enterprise.” (emphasis added)
I was very happy to see the importance of online access recognized, and hopefully this will include the data and code that underlie published computational results.
I would go one step further in each of these directions: mention code explicitly, and create a federally funded cloud that not only holds data but also links it to code and computational results, to enable reproducibility.
Stanley Young is Director of Bioinformatics at the National Institute of Statistical Sciences, and he gave a talk in 2009 on problems in modern scientific research. For example: 1 in 20 NIH-funded studies actually replicates; closed data and opacity; model selection for significance; multiple comparisons. Here is the link to his talk: Everything Is Dangerous: A Controversy. There are a number of good examples in the talk, and Young anticipates, and is more intellectually coherent than, the New Yorker article The Truth Wears Off, if you were interested in that.
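To make the multiple comparisons problem concrete, here is a small back-of-the-envelope sketch. It is not taken from Young's talk; the function names are my own, and it assumes the tests are independent and each is run at the same significance level.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across num_tests independent
    tests of true null hypotheses, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold that keeps the familywise rate at or below alpha
    (the standard Bonferroni correction)."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        print(m, round(familywise_error_rate(m), 3), bonferroni_threshold(m))
```

Under these assumptions, running 20 tests at the 0.05 level gives roughly a 64% chance of at least one spurious "significant" result, which is why unreported multiple testing is so corrosive.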
Idea: Generalize clinicaltrials.gov, where scientists register their hypotheses prior to carrying out their experiments. Why not do this for all hypothesis tests? Have a site where the hypotheses are logged and time stamped before researchers gather the data or carry out the actual hypothesis testing for the project. I’ve heard this idea mentioned occasionally, and both Young and Lehrer mention it as well.
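As a rough illustration of what logging and time stamping a hypothesis could look like, here is a minimal sketch. Everything in it is hypothetical (the function names, the record fields, the in-memory list standing in for a site); it is not how clinicaltrials.gov works, and a real registry would need a trusted third party to hold the records.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(record):
    # Hash a canonical JSON form so any later edit changes the digest.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def register_hypothesis(registry, researcher, hypothesis, now=None):
    """Append a time-stamped, hashed hypothesis record to the registry.

    `registry` is a plain list here, a stand-in for a real service.
    """
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "researcher": researcher,
        "hypothesis": hypothesis,
        "registered_at": timestamp,
    }
    entry = dict(record, digest=_digest(record))
    registry.append(entry)
    return entry


def verify_entry(entry):
    """Check that an entry has not been altered since registration."""
    record = {k: v for k, v in entry.items() if k != "digest"}
    return _digest(record) == entry["digest"]


if __name__ == "__main__":
    registry = []
    entry = register_hypothesis(
        registry, "A. Researcher", "Treatment X lowers outcome Y"
    )
    print(entry["registered_at"], verify_entry(entry))
```

The point of the digest is only that a hypothesis rewritten after the data come in no longer matches what was registered; the time stamp is only as trustworthy as whoever runs the registry.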
The Nature journal Molecular Systems Biology published an editorial “From Bench to Website” explaining their move to a transparent system of peer review. Anonymous referee reports, editorial decisions, and author responses are published alongside the final published paper. When this exchange is published, care is taken to preserve anonymity of reviewers and to not disclose any unpublished results. Authors also have the ability to opt out and request their review information not be published at all.
Here’s an example of the commentary that is being published alongside the final journal article.
Their move follows on a similar decision taken by The EMBO Journal (European Molecular Biology Organization) as described in an editorial here where they state that the “transparent editorial process will make the process that led to acceptance of a paper accessible to all, as well as any discussion of merits and issues with the paper.” Their reasoning cites problems in the process of scientific communication and they give an example by Martin Raff which was published as a letter to the editor called “Painful Publishing” (behind a paywall, apologies). Raff laments the power of the anonymous reviewers to demand often unwarranted additional experimentation as a condition of publication: “authors are so keen to publish in these select journals that they are willing to carry out extra, time consuming experiments suggested by referees, even when the results could strengthen the conclusions only marginally. All too often, young scientists spend many months doing such ‘referees’ experiments.’ Their time and effort would frequently be better spent trying to move their project forward rather than sideways. There is also an inherent danger in doing experiments to obtain results that a referee demands to see.”
Rick Trebino, physics professor at Georgia Tech, penned a note detailing the often incredible steps he went through in trying to publish a scientific comment: “How to Publish a Scientific Comment in 1 2 3 Easy Steps.” It describes deep problems in our scientific discourse today. The recent clinical trials scandal at Duke University is another example of failed scientific communication. Many efforts were made to publish correspondence about errors in the published papers; had it appeared, the problems in the research might have been addressed earlier.
The editorial in Molecular Systems Biology also announces that the journal is joining many others in adopting a policy of encouraging authors to upload the data that underlie the results in a paper, to be published alongside the final article. They go one step further and provide links from a figure in the paper to its underlying data. They give an example of such linked figures here. My question is how this dovetails with recent efforts by Donoho and Gavish to create a system of universal figure-level identifiers for published results, and the work of Altman and King to design Universal Numerical Fingerprints (UNFs) for data citation.