Archive for the 'Open Science' Category

My input for the OSTP RFI on reproducibility

Until September 23, 2014, the US Office of Science and Technology Policy in the White House was accepting comments on its “Strategy for American Innovation.” My submitted comments on one part of that RFI, section 11:

“11) Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?”

follow (corrected for typos).

This comment is directed at point 11, requesting comments on the reproducibility of scientific findings. I believe there are two threads to this issue: first, a traditional problem that has existed in science for hundreds of years, whose longstanding solution has been the methods section of the scientific publication; second, a new issue that has arisen over the last twenty years as computation has assumed a central role in scientific research. This new element is not yet accommodated in scientific publication, and it has serious consequences for reproducibility.

Putting aside the first issue of traditional reproducibility, for which longstanding solutions exist, I encourage the federal government, in concert with the scientific community, to consider how the current set of laws and funding agency practices do not support the production of reproducible computational science.

In all research that utilizes a computer, instructions for the research are stored in software and scientific data are stored digitally. A typical publication in computational research is based foundationally on data and on the computer instructions applied to the data that generated the scientific findings. The complexity of the data generation mechanism and the computational instructions is typically very large, too large to capture in a traditional scientific publication. Hence when computers are involved in the research process, scientific publication must shift from the scientific article alone to the triple of scientific paper, software, and data from which the findings were generated. This triple has been referred to as a “research compendium,” and its aim is to transmit research findings that others in the field will be able to reproduce by running the software on the data. Hence, data and software that permit others to reproduce the findings must be made available.
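To make the idea of a research compendium concrete, here is a minimal sketch (in Python, with entirely hypothetical directory and file names of my own choosing) of one possible on-disk layout pairing the article with its code and data; it is an illustration, not a community standard.

    # Sketch: one possible minimal layout for a research compendium
    # (article + code + data). All names here are illustrative only.
    import os

    def make_compendium(root="my-compendium"):
        for sub in ("paper", "code", "data", "results"):
            os.makedirs(os.path.join(root, sub), exist_ok=True)
        with open(os.path.join(root, "README.txt"), "w") as f:
            f.write("Research compendium: the article, plus the code and data\n"
                    "needed to regenerate every figure and table.\n"
                    "Example entry point: python code/run_analysis.py\n")

    if __name__ == "__main__":
        make_compendium()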

There are two primary bodies of law that come to bear on this idea of computational reproducibility. The first is copyright law, which adheres to software and, to some degree, to data. Software and data from scientific research should not receive the same legal protection that most original artistic works receive from copyright law. These objects should be made openly available by default (rather than closed by copyright law by default), with attribution for their creators.

Secondly, the Bayh-Dole Act of 1980 is having the effect of creating less transparency and less knowledge and technology transfer, due to the use of the computer in scientific research. Bayh-Dole charges the institutions that support research, such as universities, to use the patent system for inventions that arise under its auspices. Since software may be patentable, this introduces a barrier to knowledge transfer and reproducibility. A research compendium would include code and would be made openly available, whereas Bayh-Dole adds an incentive to create a barrier by introducing the option to patent software. Rather than the software being openly available, a request to license the patented software would need to be submitted to the university and appropriate rates negotiated. For the scientific community, this is equivalent to closed, unusable code.

I encourage you to rethink the legal environment that attends to the digital objects produced by scientific research in support of research findings: the software, the data, and the digital article. Science, as a rule, demands that these be made openly available to society (as do scientists), yet unfortunately they are frequently captured by external third parties, through copyright transfer and patents, that restrict access to knowledge and information arising from federal funding. This retards American innovation and competitiveness.

Federal funding agencies and other government entities must financially support the sharing, access, and long-term archiving of research data and code that support published results. With guiding principles from the federal government, scientific communities should implement infrastructure solutions that support openly available, reproducible computational research. There are best practices in most communities regarding data and code release for reproducibility. Federal action is needed since the scientific community faces a collective action problem: producing research compendia, as opposed to a published article alone, has historically gone unrewarded. In order to change this practice, the scientific community must move in concert. The levers exerted by the federal funding agencies are key to resolving this collective action problem.

Finally, I suggest a different wording for point 11 in your request. Scientific findings are not the right level at which to think about reproducibility; it is better to think about enabling the replication of the research process associated with published results, rather than the findings themselves. This is what makes research reproducible and reliable. When different processes are compared, whether or not they produce the same result, the availability of code and data will enable the reconciliation of differences in methods. Open data and code permit reproducibility in this sense and increase the reliability of the scholarly record by permitting error detection and correction.

I have written extensively on all these issues. I encourage you to look at http://stodden.net, especially the papers and talks.

Changes in the Research Process Must Come From the Scientific Community, not Federal Regulation

I wrote this piece as an invited policy article for a major journal but they declined to publish it. It’s still very much a draft and they made some suggestions, but since realistically I won’t be able to get back to this for a while and the text is becoming increasingly dated, I thought I would post it here. Enjoy!

Recent U.S. policy changes are mandating a particular vision of scientific communication: public access to data and publications for federally funded research. On February 22, 2013, the Office of Science and Technology Policy (OSTP) in the White House released an executive memorandum instructing the major federal funding agencies to develop plans to make both the datasets and research articles resulting from their grants publicly available [1]. On March 5, the House Science, Space, and Technology subcommittee convened a hearing on Scientific Integrity & Transparency, and on May 9, President Obama issued an executive order requiring government data to be made openly available to the public [2].

Many in the scientific community have demanded increased data and code disclosure in scholarly dissemination to address issues of reproducibility and credibility in computational science [3-19]. At first blush, the federal policy changes appear to support these scientific goals, but the scope of government action is limited in ways that impair its ability to respond directly to these concerns. The scientific community cannot rely on federal policy to bring about changes that enable reproducible computational research. These recent policy changes must instead be a catalyst for a well-considered update of research dissemination standards by the scientific community: computational science must move to publication standards that include the digital data and code sufficient to permit others in the field to replicate and verify the results. Authors and journals must be ready to use existing repositories and infrastructure to ensure the communication of reproducible computational discoveries.
Continue reading ‘Changes in the Research Process Must Come From the Scientific Community, not Federal Regulation’

Data access going the way of journal article access? Insist on open data

The discussion around open access to published scientific results, the Open Access movement, is well known. The primary cause of the current situation — journal publishers owning copyright on journal articles and therefore charging for access — stems from authors signing their copyright over to the journals. I believe this happened because authors didn’t really realize what they were doing when they signed away ownership of their work; had they known, they would not have done so. I believe another solution would have been used, such as granting the journal a license to publish, like Science’s readily available alternative license. At some level authors were entering into binding legal contracts without an understanding of the implications and without the right counsel.

I am seeing a similar situation arising with respect to data. It is not atypical for a data-producing entity, particularly one in the commercial sphere, to require that researchers with access to the data sign a non-disclosure agreement. This seems to be standard for Facebook data, Elsevier data, and many, many others. I’m witnessing researchers grabbing their pens and signing, and, as in the publication context, feeling themselves powerless to do otherwise. Again, they are without the appropriate counsel. Even the general counsel’s office at their institution typically sees its role as protecting the institution against liability, rather than the larger concern of protecting the scholar’s work and the integrity of the scholarly record. What happens when research from these protected datasets is published, and questioned? How can others independently verify the findings? They’ll need access to the data.

There are many legitimate reasons such data cannot be publicly released, for example protection of subjects’ privacy (see what happened when Harvard released Facebook data from a study). But as scientists we should be mindful of the need for our published findings to be reproducible. Some commercial data do not come with privacy concerns, only the company’s concern that it still be able to sell the data to other commercial entities, and sometimes not even that. Sometimes lawyers simply want an NDA to minimize any risk to the commercial entity that might arise should the data be released. To me, that seems perfectly rational, since they are not stewards of scientific knowledge.

It is also perfectly rational for authors publishing findings based on these data to push back as hard as possible to ensure maximum reproducibility and credibility of their results. Many companies share data with scientists because they seek to deepen goodwill and ties with the academic community, or because they are interested in the results of the research. As researchers we should condition our acceptance of the data on its release when the findings are published, if there are no privacy concerns associated with the data. If there are privacy concerns, I can imagine sharing the data in a “walled garden” within which other researchers, but not the public, can access the data and verify results. There are a number of solutions that can bridge the gap between open access to data and an access-blocking NDA (e.g., differential privacy), and as scientists the integrity and reproducibility of our work is a core concern for which we bear responsibility in this negotiation over data.
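For readers unfamiliar with differential privacy, the sketch below illustrates one of its standard building blocks, the Laplace mechanism: a query answer computed on sensitive records is released with calibrated noise rather than as an exact value. This is a minimal sketch of the concept only, not a production privacy system; the dataset, query, and epsilon value are invented for the example.

    # Minimal sketch of the Laplace mechanism from differential privacy:
    # release a noisy count rather than the exact count over sensitive records.
    import numpy as np

    def private_count(records, predicate, epsilon=0.5):
        # A counting query has sensitivity 1 (one person changes the count
        # by at most 1), so Laplace noise with scale 1/epsilon suffices.
        true_count = sum(1 for r in records if predicate(r))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return true_count + noise

    # Hypothetical usage: how many subjects in a made-up dataset are over 65?
    ages = [23, 71, 45, 67, 34, 80, 52]
    print(private_count(ages, lambda a: a > 65, epsilon=0.5))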

A few template data sharing agreements between academic researchers and data producing companies would be very helpful, if anyone feels like taking a crack at drafting them (Creative Commons?). Awareness of the issue is also important, among researchers, publishers, funders, and data producing entities. We cannot unthinkingly default to a legal situation regarding data that is anathema to scientific progress, as we did with access to scholarly publications.

Don’t expect computer scientists to be on top of every use that’s found for computers, including scientific investigation

Computational scientists need to understand and assert their computational needs, and see that they are met.

I just read this excellent interview with Donald Knuth, inventor of TeX and of the concept of literate programming, as well as author of the famous textbook The Art of Computer Programming. When asked for comments on (the lack of) software development using multicore processing, he says something very interesting: that multicore technology isn’t that useful, except in a few applications such as “rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc.” This caught my eye because parallel processing is a key advance for data processing. Statistical analysis typically executes the same operations record by record through the data, making it ideal for multithreaded or multicore execution. This isn’t some obscure corner of science either – most science carried out today has some element of digital data processing (although of course not always at scales that warrant implementing parallel processing).
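To make the data-parallelism point concrete: many statistical computations apply the same operation independently to each record (or block of records), so they map almost trivially onto multiple cores. The sketch below, using Python’s standard multiprocessing module, is a generic illustration of that pattern under assumptions of my own (the per-record statistic and the data are stand-ins), not any particular analysis.

    # Sketch: an "embarrassingly parallel" per-record computation, the kind
    # of data processing that maps naturally onto multiple cores.
    from multiprocessing import Pool
    import math

    def per_record_statistic(x):
        # Stand-in for whatever is computed per record (e.g., a transformed score).
        return math.log1p(x) ** 2

    if __name__ == "__main__":
        data = range(1_000_000)          # stand-in for a large dataset
        with Pool() as pool:             # one worker per available core by default
            results = pool.map(per_record_statistic, data, chunksize=10_000)
        print(sum(results) / len(results))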

Knuth then says that “all these applications [that use parallel processing] require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.” As the state of our scientific knowledge changes so does our problem solving ability, requiring modification of code used to generate scientific discovery. If I’m reading him correctly, Knuth seems to think this makes such applications less relevant to mainstream computer science.

The discussion reminded me of comments made at the “Workshop on Algorithms for Modern Massive Datasets” at Stanford in June 2010. Researchers in scientific computation (a specialized subdiscipline of computational science; see the Institute for Computational and Mathematical Engineering at Stanford or UT Austin’s Institute for Computational Engineering and Sciences for examples) were lamenting the direction computer hardware architecture was taking toward facilitating certain particular problems, such as specific techniques for matrix inversion and hot topics in linear algebra.

As scientific discovery transforms into a deeply computational process, we computational scientists must be prepared to partner with computer scientists to develop tools suited to the needs of scientific knowledge creation, or develop these skills ourselves. I’ve written elsewhere on the need to develop software that natively supports scientific ends (especially for workflow sharing; see e.g. http://stodden.net/AMP2011 ) and this applies to hardware as well.

The nature of science in 2051

Right now scientific questions are chosen for study in a largely autocratic way. Typically, grants for research on particular questions come from federal funding agencies, and scientists apply competitively, with the money going to the chosen researchers via a peer review process.

I suspect that, as the tools of online science become increasingly available, the real questions people face in their day-to-day lives will be more readily answered. If you think about all the things you do and the decisions you make in a day, many of them don’t have a strong empirical basis: how you wash the dishes or do laundry, what foods are healthy, what environment to maintain in your house, what common illness remedies work best. Who knows? But these types of questions, the ones that occur to you as you go about your daily business, aren’t prioritized in the investigatory model we have now for science. I predict that scientific investigation as a whole, not just that which is government funded, will move substantially toward providing answers to questions of local importance.

Regulatory steps toward open science and reproducibility: we need a science cloud

This past January President Obama signed the America COMPETES Reauthorization Act. It contains two interesting sections that advance the notions of open data and the federal role in supporting online access to scientific archives: sections 103 and 104, which read in part:

• § 103: “The Director [of the Office of Science and Technology Policy at the White House] shall establish a working group under the National Science and Technology Council with the responsibility to coordinate Federal science agency research and policies related to the dissemination and long-term stewardship of the results of unclassified research, including digital data and peer-reviewed scholarly publications, supported wholly, or in part, by funding from the Federal science agencies.” (emphasis added)

This is a cause for celebration insofar as Congress has recognized that published articles are an incomplete communication of computational scientific knowledge, and the data (and code) must be included as well.

• § 104: Federal Scientific Collections: The Office of Science and Technology Policy “shall develop policies for the management and use of Federal scientific collections to improve the quality, organization, access, including online access, and long-term preservation of such collections for the benefit of the scientific enterprise.” (emphasis added)

I was very happy to see the importance of online access recognized, and hopefully this will include the data and code that underlies published computational results.

One step further in each of these directions: mention code explicitly and create a federally funded cloud not only for data but linked to code and computational results to enable reproducibility.

Generalize clinicaltrials.gov and register research hypotheses before analysis

Stanley Young is Director of Bioinformatics at the National Institute of Statistical Sciences, and he gave a talk in 2009 on problems in modern scientific research, for example: only 1 in 20 NIH-funded studies actually replicates; closed data and opacity; model selection for significance; multiple comparisons. Here is the link to his talk: Everything Is Dangerous: A Controversy. There are a number of good examples in the talk, and Young anticipates, and is more intellectually coherent than, the New Yorker article The Truth Wears Off, if you were interested in that.

Idea: generalize clinicaltrials.gov, where scientists register their hypotheses prior to carrying out their experiments. Why not do this for all hypothesis tests? Have a site where the hypotheses are logged and time-stamped before researchers gather the data or carry out the actual hypothesis testing for the project. I’ve heard this idea mentioned occasionally, and both Young and Lehrer mention it as well.
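As a toy illustration of the time-stamping idea (not how clinicaltrials.gov actually works), a registry could simply record the hypothesis text, a hash of it, and a timestamp before any data are collected; the file name and fields below are hypothetical.

    # Toy sketch of time-stamped hypothesis registration: record a hash of the
    # hypothesis text before any data are gathered or analyzed.
    import hashlib, json
    from datetime import datetime, timezone

    def register_hypothesis(text, registry_path="hypothesis_registry.jsonl"):
        entry = {
            "registered_at": datetime.now(timezone.utc).isoformat(),
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "hypothesis": text,
        }
        with open(registry_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    print(register_hypothesis(
        "H1: treatment X reduces mean recovery time relative to placebo."))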

Open peer review of science: a possibility

The Nature journal Molecular Systems Biology published an editorial “From Bench to Website” explaining their move to a transparent system of peer review. Anonymous referee reports, editorial decisions, and author responses are published alongside the final published paper. When this exchange is published, care is taken to preserve anonymity of reviewers and to not disclose any unpublished results. Authors also have the ability to opt out and request their review information not be published at all.

Here’s an example of the commentary that is being published alongside the final journal article.

Their move follows on a similar decision taken by The EMBO Journal (European Molecular Biology Organization) as described in an editorial here where they state that the “transparent editorial process will make the process that led to acceptance of a paper accessible to all, as well as any discussion of merits and issues with the paper.” Their reasoning cites problems in the process of scientific communication and they give an example by Martin Raff which was published as a letter to the editor called “Painful Publishing” (behind a paywall, apologies). Raff laments the power of the anonymous reviewers to demand often unwarranted additional experimentation as a condition of publication: “authors are so keen to publish in these select journals that they are willing to carry out extra, time consuming experiments suggested by referees, even when the results could strengthen the conclusions only marginally. All too often, young scientists spend many months doing such ‘referees’ experiments.’ Their time and effort would frequently be better spent trying to move their project forward rather than sideways. There is also an inherent danger in doing experiments to obtain results that a referee demands to see.”

Rick Trebino, physics professor at Georgia Tech, penned a note detailing the often incredible steps he went through in trying to publish a scientific comment: “How to Publish a Scientific Comment in 1 2 3 Easy Steps.” It describes deep problems in our scientific discourse today. The recent clinical trials scandal at Duke University is another example of failed scientific communication. Many efforts were made to print correspondences regarding errors in published papers that may have permitted problems in the research to have been addressed earlier.

The editorial in Molecular Systems Biology also announces that the journal is joining many others in adopting a policy of encouraging the upload of the data that underlies results in the paper to be published alongside the final article. They go one step further and provide links from the figure in the paper to its underlying data. They give an example of such linked figures here. My question is how this dovetails with recent efforts by Donoho and Gavish to create a system of universal figure-level identifiers for published results, and the work of Altman and King to design Universal Numerical Fingerprints (UNFs) for data citation.
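The idea behind a data fingerprint such as a UNF is that the identifier is computed from the normalized content of the data itself, so anyone holding the same dataset can recompute and verify it. The sketch below is a deliberately simplified toy version of that idea; it is not the actual UNF algorithm, whose normalization rules are more involved.

    # Toy content-based fingerprint for a numeric dataset: normalize values to a
    # fixed precision, serialize, and hash. This only illustrates the idea behind
    # a UNF; the real UNF specification defines its own normalization rules.
    import hashlib

    def toy_fingerprint(values, digits=7):
        normalized = ",".join(format(float(v), f".{digits}e") for v in values)
        return hashlib.sha256(normalized.encode("ascii")).hexdigest()[:16]

    data = [3.14159265, 2.71828183, 1.41421356]
    print(toy_fingerprint(data))   # same data yields the same fingerprint anywhere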

Science and Video: a roadmap

Once again I find myself in the position of having collected slides from talks, and having audio from the sessions. I need a simple way to pin these together so they form a coherent narrative, and I need a common sharing platform. We don’t really have to see the speaker to understand the message, but we need the slides and the audio to play in tandem, with the slides changing at the correct points. Some of the files are quite large: slide decks can be over 100MB, and right now the audio file I have is 139MB (SlideShare has size limits that don’t accommodate this).

I’m writing because I feel the messages are important and need to be available to a wider audience. This is often our culture, our heritage, our technology, our scientific knowledge, and our shared understanding. These presentations need to be available not just on principled open access grounds; it is imperative that other scientists hear these messages as well, amplifying scientific communication.

At a bar the other night a friend and I came up with the idea of S-SPAN: a C-SPAN for science. Talks and conferences could be filmed and shared widely on an internet platform. Of course such platforms exist, and some even target scientific talks, but the content also needs to be marshalled and directed onto the website. Some of the best stuff I’ve ever seen has floated off into the ether.

So, I make an open call for these two tasks: a simple tool to pin together slides and audio (and slides and video), and an effort to collate video from scientific conference talks, filming them where recordings don’t exist, all onto a common distribution platform. S-SPAN could start as raw and underproduced as C-SPAN, but I am sure it would develop from there.

I’m looking at you, YouTube.

My Symposium at the AAAS Annual Meeting: The Digitization of Science

Yesterday I held a symposium at the AAAS Annual Meeting in Washington DC, called “The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer,” that was intended to bring attention to how massive computation is changing the practice of science, particularly the lack of reproducibility of published computational scientific results. The fact is, most computational scientific results published today are unverified and unverifiable. I’ve created a page for the event here, with links to slide decks and abstracts. I couldn’t have asked for a better symposium, thanks to the wonderful speakers.

The first speaker was Keith A. Baggerly, who (now famously) tried to verify published results in Nature Medicine and uncovered a series of errors that led to the termination of clinical trials at Duke that were based on the original findings, and to the resignation of one of the investigators (his slides). I then spoke about policies for realigning the IP framework scientists operate under with their longstanding norms, to permit sharing of code and data (my slides). Fernando Perez described how computational scientists can learn not only about code sharing, quality control, and project management from the open source software community, but also how that community has in fact developed what is in effect a deeply successful system of peer review for code. Code is verified line by line before being incorporated into the project, and there are software tools to enable communication between reviewer and submitter, down to the line of code (his slides).

Michael Reich then presented GenePattern, an OS-independent tool developed with Microsoft for creating data analysis pipelines and incorporating them into a Word document. Once in the document, tools exist to click and recreate the figure from the pipeline and examine what’s been done to the data. Robert Gentleman advocated the entire research paper as the unit of reproducibility, and David Donoho presented a method for assigning a unique identifier to figures within the paper, which creates a link for each figure and permits its independent reproduction (the slides). The final speaker was Mark Liberman, who showed how the human language technology community had developed a system of open data and code in its efforts to reduce errors in machine understanding of language (his slides). All the talks pushed on delineations of science from non-science, and this was probably best encapsulated by a quote Mark introduced from John Pierce, a Bell Labs executive, writing in 1969: “To sell suckers, one uses deceit and offers glamor.”

There was some informal feedback, with a prominent person saying that this session was “one of the most amazing set of presentations I have attended in recent memory.” Have a look at all the slides and abstracts, including links and extended abstracts.

Update: Here are some other blog posts on the symposium: Mark Liberman’s blog and Fernando Perez’s blog.

Letter Re Software and Scientific Publications – Nature

Mark Gerstein and I penned a reaction to two pieces published in Nature News last October, “Publish your computer code: it is good enough,” by Nick Barnes and “Computational Science…. Error” by Zeeya Merali. Nature declined to publish our note and so here it is.

Dear Editor,

We have read with great interest the recent pieces in Nature about the importance of computer codes associated with scientific manuscripts. As participants in the Yale roundtable mentioned in one of the pieces, we agree that these codes must be constructed robustly and distributed widely. However, we disagree with an implicit assertion, that the computer codes are a component separate from the actual publication of scientific findings, often neglected in preference to the manuscript text in the race to publish. More and more, the key research results in papers are not fully contained within the small amount of manuscript text allotted to them. That is, the crucial aspects of many Nature papers are often sophisticated computer codes, and these cannot be separated from the prose narrative communicating the results of computational science. If the computer code associated with a manuscript were laid out according to accepted software standards, made openly available, and looked over as thoroughly by the journal as the text in the figure legends, many of the issues alluded to in the two pieces would simply disappear overnight.

The approach taken by the journal Biostatistics serves as an exemplar: code and data are submitted to a designated “reproducibility editor” who tries to replicate the results. If he or she succeeds, the first page of the article is kitemarked “R” (for reproducible) and the code and data made available as part of the publication. We propose that high-quality journals such as Nature not only have editors and reviewers that focus on the prose of a manuscript but also “computational editors” that look over computer codes and verify results. Moreover, many of the points made here in relation to computer codes apply equally well to large datasets that underlie experimental manuscripts. These are often organized, formatted, and deposited into databases as an afterthought. Thus, one could also imagine a “data editor” who would look after these aspects of a manuscript. All in all, we have to come to the realization that current scientific papers are more complicated than just a few thousand words of narrative text and a couple of figures, and we need to update journals to handle this reality.

Yours sincerely,

Mark Gerstein (1,2,3)
Victoria Stodden (4)

(1) Program in Computational Biology and Bioinformatics,
(2) Department of Molecular Biophysics and Biochemistry, and
(3) Department of Computer Science,
Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520 Mark.Gerstein@Yale.edu

(4) Department of Statistics, Columbia University, 1255 Amsterdam Ave, New York, NY 10027
vcs@stodden.net

Open Data Dead on Arrival

In 1984 Karl Popper wrote a private letter to an inquirer he didn’t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that:

“Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his fellow men (or ‘to society’) to represent the results of his study as simply, clearly and modestly as he can. The worst thing that intellectuals can do — the cardinal sin — is to try to set themselves up as great prophets vis-a-vis their fellow men and to impress them with puzzling philosophies. Anyone who cannot speak simply and clearly should say nothing and continue to work until he can do so.”

Aside from the offensive sexism in referring to intellectuals as males, there is another way this imperative should be updated for intellectualism today. The movement to make data available online is picking up momentum — as it should — and open code is following suit (see http://mloss.org for example). But data should not be confused with facts, and applying the simple communication Popper refers to beyond the written or spoken word is the only way open data will produce dividends. It isn’t enough to post raw data or undocumented code. Data and code should be considered part of intellectual communication, and made as simple as possible for “fellow men” to understand. Just as knowledge of adequate English vocabulary is assumed in the nonquantitative communication Popper refers to, certain basic coding and data knowledge can be assumed as well. This means the same thing as it does in the literary case: the elimination of extraneous information and obfuscating terminology. There is no need to bury interested parties in an Enron-like shower of bits. It also means using a format for digital communication that is conducive to reuse, such as a flat text file or another non-proprietary format; PDF files, for example, cannot be considered acceptable for either data or code. Facilitating reproducibility must be the gold standard for data and code release.
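As a small, hypothetical illustration of “conducive to reuse”: release tabular results as a documented plain-text file rather than in a binary or proprietary format. The column names and contents below are invented for the example.

    # Sketch: write results as documented plain text (CSV), a format any
    # downstream user or program can read. Contents are invented.
    import csv

    rows = [("run-01", 0.82, 120), ("run-02", 0.79, 118)]

    with open("results.csv", "w", newline="") as f:
        f.write("# accuracy and sample size per experimental run\n")
        f.write("# generated by analysis.py (illustrative provenance note)\n")
        writer = csv.writer(f)
        writer.writerow(["run_id", "accuracy", "n"])
        writer.writerows(rows)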

And who are these “fellow men”?

Well, fellow men and women that is, but back to the issue. Much of the history of scientific communication has dealt with the question of demarcating the appropriate group to whom the reasoning behind the findings would be communicated: the definition of the scientific community. Clearly, communication of very technical and specialized results to a layman would take intellectuals’ time away from doing what they do best, being intellectual. On the other hand, some investment in explanation is essential for establishing a finding as an accepted fact — assuring others that sufficient error has been controlled for and eliminated in the process of scientific discovery. These others ought to be able to verify results, find mistakes, and hopefully build on the results (or the gaps in the theory) and thereby further our understanding. So there is a tradeoff. Hence the establishment of the Royal Society, for example, as a body with the primary purpose of discussing scientific experiments and results. Couple this with Newton’s surprise, or even irritation, at having to explain results he put forth to the Society in his one and only journal publication, in their journal Philosophical Transactions (he called the various clarifications tedious, sought to withdraw from the Royal Society, and subsequently never published another journal paper; see the last chapter of The Access Principle). There is a mini-revolution underfoot that has escaped the spotlight of attention on open data, open code, and open scientific literature: the fact that the intent is to open to the public. Not open to peers, or appropriately vetted scientists, or selected ivory tower mates, but to anyone. Never before has the standard for communication been “everyone”; in fact, quite the opposite: efforts have traditionally been expended narrowing and selecting the community privileged enough to participate in scientific discourse.

So what does public openness mean for science?

Recall the leaked files from the University of East Anglia’s Climatic Research Unit last November. Much of the information revealed concerned scientifically suspect (and ethically dubious) attempts not to reveal data and methods underlying published results. Although that tack seems to have softened now, some initial responses defended the climate scientists’ right to be closed with regard to their methods due to the possibility of “denial of service attacks” – the ripping apart of methodology (recall all science is wrong, an asymptotic progression toward the truth at best) not with the intent of finding meaningful errors that halt the acceptance of findings as facts, but merely to tie up the climate scientists so they cannot attend to real research. This is the same tradeoff as described above. An interpretation of this situation cannot be made without the complicating realization that peer review — the review process that vets articles for publication — doesn’t check computational results, but largely operates as if the papers are expounding results from the pre-computational scientific age. The outcome, if computational methodologies are able to remain closed from view, is that they are directly vetted nowhere. Hardly an acceptable basis for establishing facts. My own view is that data and code must be communicated publicly, with attention paid to Popper’s admonition: as simply and clearly as possible, such that the results can be replicated. Not participating in dialog with those insufficiently knowledgeable to engage will become part of our scientific norms; in fact, this is enshrined in the structure of our scientific societies of old. Others can take up those ends of the discussion, on blogs, in digital forums. But public openness is important not just because taxpayers have a right to what they paid for (perhaps they do, but this quickly falls apart since not all the public are technically taxpayers, and that seems a wholly unjust way of deciding who shall have access to scientific knowledge and who shall not; clearly we mean society), but because of the increasing inclusiveness of the scientific endeavor. How do we determine who is qualified to find errors in our scientific work? We don’t. Real problems will get noticed regardless of with whom they originate, many eyes making all bugs shallow. And I expect peer review for journal publishing to incorporate computational evaluation as well.

Where does this leave all the open data?

Unused, unless efforts are expended to communicate the meaning of the data and to maximize the usability of the code. Data is not synonymous with facts – methods for understanding data, and turning its contents into facts, are embedded within the documentation and code. Take for granted that users understand the coding language or basic scientific computing functions, but clearly and modestly explain the novel contributions. Facilitate reproducibility. Without this, data may be open, but will remain, de facto, in the ivory tower.

Ars technica article on reproducibility in science

John Timmer wrote an excellent article called “Keeping computers from ending science’s reproducibility.” I’m quoted in it. Here’s an excellent follow up blog post by Grant Jacobs, “Reproducible Research and computational biology.”

Code Repository for Machine Learning: mloss.org

The folks at mloss.org — Machine Learning Open Source Software — invited a blog post on my roundtable on data and code sharing, held at Yale Law School last November. mloss.org’s philosophy is stated as:

“Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for a wide range of applications. Inspired by similar efforts in bioinformatics (BOSC) or statistics (useR), our aim is to build a forum for open source software in machine learning.”

The site is excellent and worth a visit. The guest blog Chris Wiggins and I wrote starts:

“As pointed out by the authors of the mloss position paper [1] in 2007, “reproducibility of experimental results is a cornerstone of science.” Just as in machine learning, researchers in many computational fields (or in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]

In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.”

keep reading at http://mloss.org/community/blog/2010/jan/26/data-and-code-sharing-roundtable/. We made an effort to reference efforts in other fields regarding reproducibility in computational science.

Video from "The Great Climategate Debate" held at MIT December 10, 2009

This is an excellent panel discussion regarding the leaked East Anglia docs as well as standards in science and the meaning of the scientific method. It was recorded on Dec 10, 2009, and here’s the description from the MIT World website: “The hacking of emails from the University of East Anglia’s Climate Research Unit in November rocked the world of climate change science, energized global warming skeptics, and threatened to derail policy negotiations at Copenhagen. These panelists, who differ on the scientific implications of the released emails, generally agree that the episode will have long-term consequences for the larger scientific community.”

Moderator: Henry D. Jacoby, Professor of Management, MIT Sloan School of Management, and Co-Director, Joint Program on the Science and Policy of Global Change, MIT.

Panelists:
Kerry Emanuel, Breene M. Kerr Professor of Atmospheric Science, Department of Earth, Atmospheric Science and Planetary Sciences, MIT;
Judith Layzer, Edward and Joyce Linde Career Development Associate Professor of Environmental Policy, Department of Urban Studies and Planning, MIT;
Stephen Ansolabehere, Professor of Political Science, MIT, and
Professor of Government, Harvard University;
Ronald G. Prinn, TEPCO Professor of Atmospheric Science, Department of Earth, Atmospheric and Planetary Sciences, MIT; Director, Center for Global Change Science; Co-Director of the MIT Joint Program on the Science and Policy of Global Change;
Richard Lindzen, Alfred P. Sloan Professor of Meteorology, Department of Earth, Atmospheric and Planetary Sciences, MIT.

Video, running at nearly 2 hours, is available at http://mitworld.mit.edu/video/730.

My answer to the Edge Annual Question 2010: How is the Internet Changing the Way You Think?

At the end of every year editors at my favorite website The Edge ask intellectuals to answer a thought-provoking question. This year it was “How is the internet changing the way you think?” My answer is posted here:
http://www.edge.org/q2010/q10_15.html#stodden

Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave, comments posted here, asked for feedback on implementation issues. The second wave requested input on Features and Technology (our post is here). For the third and final wave on Management, Chris Wiggins, Matt Knepley, and I posted the following comments:

Q1: Compliance. What features does a public access policy need to ensure compliance? Should this vary across agencies?

One size does not fit all research problems across all research communities, and a heavy-handed general release requirement across agencies could result in de jure compliance – release of data and code as per the letter of the law – without the extra effort necessary to create usable data and code facilitating reproducibility (and extension) of the results. One solution to this barrier would be to require grant applicants to formulate plans for release of the code and data generated through their research proposal, if funded. This creates a natural mechanism by which grantees (and peer reviewers), who best know their own research environments and community norms, contribute complete strategies for release. This would allow federal funding agencies to gather data on needs for release (repositories, further support, etc.); understand which research problem characteristics engender which particular solutions, and which solutions are most appropriate in which settings; and uncover as-yet unrecognized problems particular researchers may encounter. These data would permit federal funding agencies to craft release requirements that are more sensitive to the barriers researchers face and the demands of their particular research problems, and to implement strategies for enforcement of these requirements. This approach also permits researchers to address confidentiality and privacy issues associated with their research.

Examples:

One exemplary precedent by a UK funding agency is the January 2007 “Policy on data management and sharing”
(http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm)
adopted by The Wellcome Trust (http://www.wellcome.ac.uk/About-us/index.htm) according to which “the Trust will require that the applicants provide a data management and sharing plan as part of their application; and review these data management and sharing plans, including any costs involved in delivering them, as an integral part of the funding decision.” A comparable policy statement by US agencies would be quite useful in clarifying OSTP’s intent regarding the relationship between publicly-supported research and public access to the research products generated by this support.

Continue reading ‘Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the second wave of the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave, comments posted here and on the OSTP site here (scroll to the second-to-last comment), asked for feedback on implementation issues. The second wave requested input on Features and Technology, and Chris Wiggins and I posted the following comments:

We address each of the questions for phase two of OSTP’s forum on public access in turn. The answers generally depend on the community involved and (particularly for question 7, which asks for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial, however, in (i) providing a centralized repository for accessing agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).

Continue reading ‘Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

Nathan Myhrvold advocates for Reproducible Research on CNN

On yesterday’s edition of Fareed Zakaria’s GPS on CNN, former Microsoft CTO and current Intellectual Ventures CEO Nathan Myhrvold said that reproducible research is an important response for climate science in the wake of Climategate, the recent file leak from a major climate modeling center in England (I blogged my response to the leak here). The video is here, see especially 16:27, and the transcript is here.

The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf:

Open access to our body of federally funded research, including not only published papers but also any supporting data and code, is imperative, not just for scientific progress but for the integrity of the research itself. We list below nine focus areas and recommendations for action.

Continue reading ‘The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

The Climate Modeling Leak: Code and Data Generating Published Results Must be Open and Facilitate Reproducibility

On November 20, documents including emails and code spanning more than a decade were leaked from the Climatic Research Unit (CRU) at the University of East Anglia in the UK.

The Leak Reveals a Failure of Reproducibility of Computational Results

It appears as though the leak came about through a long battle to get the CRU scientists to reveal the code and data associated with published results, and highlights a crack in the scientific method as practiced in computational science. Publishing standards have not yet adapted to the relatively new computational methods used pervasively across scientific research today.

Other branches of science have long-established methods for bringing reproducibility into their practice. Deductive or mathematical results are published only with proofs, and there are long-established standards for an acceptable proof. Empirical science contains clear mechanisms for communicating methods with the goal of facilitating replication. Computational methods are a relatively new addition to the scientist’s toolkit, and the scientific community is only just establishing similar standards for verification and reproducibility in this new context. Peer review and journal publishing have generally not yet adapted to the use of computational methods and still operate as suited to the deductive or empirical branches, creating a growing credibility gap in computational science.

The key point emerging from the leak of the CRU docs is that without the code and data it is all but impossible to tell whether the research is right or wrong, and this community’s lack of awareness of reproducibility and blustery demeanor does not inspire confidence in their production of reliable knowledge. This leak and the ensuing embarrassment would not have happened if code and data that permit reproducibility had been released alongside the published results. When mature, computational science will produce routinely verifiable results.

Verifying Computational Results without Clear Communication of the Steps Taken is Near-Impossible

The frequent near-impossibility of verifying computational results when reproducibility is not considered a research goal is shown by the miserable travails of “Harry,” a CRU employee with access to their system who was trying to reproduce the temperature results. The leaked documents contain logs of his unsuccessful attempts. It seems reasonable to conclude that CRU’s published results aren’t reproducible if Harry, an insider, was unable to reproduce them after four years.

This example also illustrates why a decision to leave reproducibility to others, beyond a cursory description of methods in the published text, is wholly inadequate for computational science. Harry seems to have had access to the data and code used, and he couldn’t replicate the results. The merging and preprocessing of data in preparation for modeling and estimation encompass a potentially very large number of steps, and a change in any one could produce different results. Just as when fitting models or running simulations, parameter settings and function invocation sequences must be communicated, because the final results are the culmination of many decisions; without this information, each small step must be guessed and matched to the original work – a Herculean task. Responding with raw data when questioned about computational results is merely a canard, not a serious attempt to facilitate reproducibility.
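One lightweight way to communicate “parameter settings and function invocation sequences” is to have the analysis script record them alongside its output. The sketch below is a generic pattern of my own construction, not CRU’s workflow: fix the random seed and write the exact parameters used to a sidecar file so the run can be repeated.

    # Sketch of a reproducibility-minded run script: fix the seed and save the
    # exact parameters next to the result, so the computation can be re-run with
    # identical settings. Parameter names are illustrative.
    import json, random

    params = {"seed": 42, "n_samples": 10_000, "smoothing_window": 5}

    def run_analysis(p):
        random.seed(p["seed"])
        draws = [random.gauss(0, 1) for _ in range(p["n_samples"])]
        return sum(draws) / len(draws)   # stand-in for the real computation

    result = run_analysis(params)
    with open("run_record.json", "w") as f:
        json.dump({"params": params, "result": result}, f, indent=2)
    print("result:", result)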

The story of Penn State professor of meteorology Michael Mann’s famous hockey stick temperature time series estimates is an example where lack of verifiability had important consequences. In February 2005 two panels examined the integrity of his work and debunked the results, largely based on work done by Peter Bloomfield, a statistics professor at North Carolina State University, and Ed Wegman, statistics professor at George Mason University. (See also this site for further explanation of statistical errors.) Release of the code and data used to generate the results in the hockey stick paper likely would have caught the errors earlier, avoided the convening of the panels to assess the papers, and prevented the widespread promulgation of incorrect science. The hockey stick is a dramatic illustration of global warming and became something of a logo for the U.N.’s Intergovernmental Panel on Climate Change (IPCC). Mann was an author of the 2001 IPCC Assessment report, and was a lead author on the “Copenhagen Diagnosis,” a report released Nov 24 and intended to synthesize the hundreds of research papers about human-induced climate change that have been published since the last assessment by the IPCC two years ago. The report was prepared in advance of the Copenhagen climate summit scheduled for Dec 7-18. Emails between CRU researchers and Mann are included in the leak, which happened right before the release of the Copenhagen Diagnosis (a quick search of the leaked emails for “Mann” provided 489 matches).

These reports are important in part because of their impact on policy, as CBS news reports, “In global warming circles, the CRU wields outsize influence: it claims the world’s largest temperature data set, and its work and mathematical models were incorporated into the United Nations Intergovernmental Panel on Climate Change’s 2007 report. That report, in turn, is what the Environmental Protection Agency acknowledged it “relies on most heavily” when concluding that carbon dioxide emissions endanger public health and should be regulated.”

Discussions of Appropriate Level of Code and Data Disclosure on RealClimate.org, Before and After the CRU Leak

For years researchers had requested the data and programs used to produce Mann’s Hockey Stick result, and those requests were resisted. The repeated requests for code and data culminated in Freedom of Information (FOI) requests, in particular those made by Willis Eschenbach, who tells his story of the requests he made for underlying code and data up until the time of the leak. It appears that a file, FOI2009.zip, was placed on CRU’s FTP server, and then comments alerting people to its existence were posted on several key blogs.

The thinking regarding disclosure of code and data in one part of the climate change community is illustrated in this fascinating discussion on the blog RealClimate.org in February. (Thank you to Michael Nielsen for the pointer.) RealClimate.org has 5 primary authors, one of whom is Michael Mann, and its primary author is Gavin Schmidt, who was described earlier this year as one of the “computer jockeys for Nasa’s James Hansen, the world’s loudest climate alarmist.” In this RealClimate blog post from November 27, Where’s the Data, the position now seems to be very much in favor of data release, but the first comment asks for the steps taken in reconstructing the results as well. This is right – reproducibility of results should be the concern, but it does not yet appear to be taken seriously (as also argued here).

Policy and Public Relations

The Hill’s Blog Briefing Room reported that Senator Inhofe (R-Okla.) would investigate whether the IPCC “cooked the science to make this thing look as if the science was settled, when all the time of course we knew it was not.” With the current emphasis on evidence-based policy making, Inhofe’s review should recommend code and data release and require reliance on verified scientific results in policy making. The Federal Research Public Access Act should be modified to include reproducibility in publicly funded research.

A dangerous ramification of the leak could be an undermining of public confidence in science and the conduct of scientists. My sense is that had this climate modeling community made its code and data readily available in a way that facilitated reproducibility of results, not only would they have avoided this embarrassment, but the discourse would have been about scientific methods and results rather than about potential evasions of FOIA requests, whether or not data were fudged, or whether scientists acted improperly in squelching dissent or manipulating journal editorial boards. Perhaps data release is becoming an accepted norm, but code release for reproducibility must follow. The issue here is verification and reproducibility, without which it is all but impossible to tell whether the core science done at CRU was correct or not, even for peer reviewing scientists.

Software and Intellectual Lock-in in Science

In a recent discussion with a friend, a hypothesis occurred to me: that increased levels of computation in scientific research could cause greater intellectual lock-in to particular ideas.

Examining how ideas change in scientific thinking isn’t new. Thomas Kuhn for example caused a revolution himself in how scientific progress is understood with his 1962 book The Structure of Scientific Revolutions. The notion of technological lock-in isn’t new either, see for example Paul David’s examination of how we ended up with the non-optimal QWERTY keyboard (“Clio and the Economics of QWERTY,” AER, 75(2), 1985) or Brian Arthur’s “Competing Technologies and Lock-in by Historical Events: The Dynamics of Allocation Under Increasing Returns” (Economic Journal, 99, 1989).

Computer-based methods are relatively new to scientific research, and are reaching even the most seemingly uncomputational corners of the humanities, like English literature and archaeology. Did Shakespeare really write all the plays attributed to him? Let’s see if word distributions by play are significantly different. Can we use signal processing to “see” artifacts without unearthing them, and thereby preserve their features?
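To make the stylometry example concrete: a crude version of the word-distribution comparison just counts relative word frequencies in two texts and measures how far apart the resulting distributions are on their common words. The sketch below is only an illustration of the basic idea (with made-up snippets standing in for full plays); real authorship studies use far more careful feature selection and inference.

    # Crude sketch of authorship-style comparison: distance between relative
    # word-frequency distributions of two texts over their shared common words.
    from collections import Counter
    import re

    def word_freqs(text):
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def distance(text_a, text_b, top_n=50):
        fa, fb = word_freqs(text_a), word_freqs(text_b)
        common = sorted(set(fa) & set(fb), key=lambda w: -(fa[w] + fb[w]))[:top_n]
        # Sum of squared differences in relative frequency over common words.
        return sum((fa[w] - fb[w]) ** 2 for w in common)

    play_one = "to be or not to be that is the question"
    play_two = "now is the winter of our discontent made glorious summer"
    print(distance(play_one, play_two))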

Software has the property of encapsulating ideas and methods for scientific problem solving. Software also has a second property, brittleness: it breaks before it bends. Computing hardware has grown steadily in capability, speed, reliability, and capacity, but as Jaron Lanier describes in his essay on The Edge, trends in software are “a macabre parody of Moore’s Law,” and the “moment programs grow beyond smallness, their brittleness becomes the most prominent feature, and software engineering becomes Sisyphean.” My concern is that as ideas become increasingly manifest as code, with all the scientific advancement that can imply, it becomes more difficult to adapt, modify, and change the underlying scientific approaches. We become, as scientists, more locked into particular methods for solving scientific questions and particular ways of thinking.

For example, what happens when an approach to solving a problem is encoded in software and becomes a standard tool? Many such tools exist and are vital to research – just look at the list from Andrej Sali’s highly regarded lab at UCSF, or the statistical packages in the widely used language R. David Donoho laments the now widespread use of test cases he released online to illustrate his methods for particular types of data: “I have seen numerous papers and conference presentations referring to ‘Blocks,’ ‘Bumps,’ ‘HeaviSine,’ and ‘Doppler’ as standards of a sort (this is a practice I object to but am powerless to stop; I wish people would develop new test cases which are more appropriate to illustrate the methodology they are developing).” Code and ideas should be reused and built upon, but at what point does the cost of recoding outweigh the scientific cost of not improving the method? In fact, perhaps counterintuitively, it’s hardware that is routinely upgraded and replaced, not the seemingly ephemeral software.
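As a concrete illustration of how such test cases get baked in, here is a minimal sketch (my own, not Donoho’s code) that generates two of these signals using the formulas as they are commonly quoted in the wavelet literature; anyone relying on the exact constants should check the original Donoho–Johnstone paper.

    # Generate the "HeaviSine" and "Doppler" test signals as commonly defined
    # in the wavelet-shrinkage literature (formulas quoted from memory; verify
    # against the original paper before reuse).
    import numpy as np

    def heavisine(t):
        # A sinusoid with two jump discontinuities.
        return 4 * np.sin(4 * np.pi * t) - np.sign(t - 0.3) - np.sign(0.72 - t)

    def doppler(t, eps=0.05):
        # An oscillation whose frequency increases as t approaches 0.
        return np.sqrt(t * (1 - t)) * np.sin(2 * np.pi * (1 + eps) / (t + eps))

    t = np.linspace(0, 1, 2048)  # sample the unit interval
    for name, signal in {"HeaviSine": heavisine(t), "Doppler": doppler(t)}.items():
        print(name, round(signal.min(), 2), round(signal.max(), 2))

Once code like this circulates, every new denoising method tends to be benchmarked on the same few curves, which is exactly the lock-in worry.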

In his essay Lanier argues that the brittle state of software today results from metaphors used by the first computer scientists – electronic communications devices that sent signals down a wire. That metaphor is itself an example of intellectual lock-in, now hardened into the way we encode ideas as machine instructions.

My Interview with ITConversations on Reproducible Research

On September 30, I was interviewed by Jon Udell from ITConversations.org in his Interviews with Innovators series, on Reproducibility of Computational Science.

Here’s the blurb: “If you’re a writer, a musician, or an artist, you can use Creative Commons licenses to share your digital works. But how can scientists license their work for sharing? In this conversation, Victoria Stodden — a fellow with Science Commons — explains to host Jon Udell why scientific output is different and how Science Commons aims to help scientists share it freely.”

Optimal Information Disclosure Levels: Data.gov and "Taleb's Criticism"

I was listening to the audio recording of last Friday’s “Scientific Data for Evidence Based Policy and Decision Making” symposium at the National Academies, and was struck by the earnest effort on the part of members of the White House to release governmental data to the public. Beth Noveck, Obama’s Deputy Chief Technology Officer for Open Government, frames the effort with a slogan: “Transparency, Participation, and Collaboration.” A plan is being developed by the White House in collaboration with the OMB to implement these three principles via a “massive release of data in open, downloadable, accessible for machine readable formats, across all agencies, not only in the White House,” says Beth. “At the heart of this commitment to transparency is a commitment to open data and open information.”

Vivek Kundra, Chief Information Officer in the White House’s Open Government Initiative, was even more explicit, saying that “the dream here is that you have a grad student, sifting through these datasets at 3 in the morning, who finds, at the intersection of multiple datasets, insight that we may not have seen, or developed a solution that we may not have thought of.”

This is an extraordinary vision. The discussion comes hot on the heels of a debate in Congress over how much information members are willing to release to the public in advance of voting on a bill. Last Wednesday CBS reported, with regard to the health care bill, that “[t]he Senate Finance Committee considered for two hours today a Republican amendment — which was ultimately rejected — that would have required the ‘legislative’ language of the committee’s final bill, along with a cost estimate for the bill, to be posted online for 72 hours before the committee voted on it. Instead, the committee passed a similar amendment, offered by Committee Chair Max Baucus (D-Mont.), to put online the ‘conceptual’ or ‘plain’ language of the bill, along with the cost estimate.” What is remarkable is the sense this gives that somehow the public won’t understand the raw text of the bill (I noticed no compromise position offered that would make both versions available, which seems an obvious solution).

The White House’s efforts have the potential to test this hypothesis: if given more information, will people pull things out of context and promulgate misinformation? The White House is betting that they won’t, and Kundra does state that the White House is accompanying dataset release with efforts to provide contextual metadata for each dataset while safeguarding national security and individual privacy rights.

This sense of limits in openness isn’t unique to governmental issues, and in my research on data and code sharing among scientists I’ve termed the concern “Taleb’s criticism.” In a 2008 essay on The Edge website, Taleb worries about the dangers that can result from people using statistical methodology without a clear understanding of the techniques. An example of concern about Taleb’s criticism appeared on UCSF’s EVA website, a repository of programs for automatic protein structure prediction. The UCSF researchers won’t release their code publicly because, as stated on their website, “We are seriously concerned about the ‘negative’ aspect of the freedom of the Web being that any newcomer can spend a day and hack out a program that predicts 3D structure, put it on the web, and it will be used.” Like the congressmen, these researchers seem to find openness scary because people may misuse the information.

It could be argued, and for scientific research should be argued, that an open dialog of an idea’s merits is preferable to no dialog at all, and misinformation can be countered and exposed. Justice Brandeis famously elucidated this point in Whitney v. California (1927), writing that “If there be time to expose through discussion the falsehood and fallacies, to avert the evil by the processes of education, the remedy to be applied is more speech, not enforced silence.” Data.gov is an experiment in context and may bolster trust in the public release of complex information. Speaking of the Data.gov project, Noveck explained that “the notion of making complex information more accessible to people and to make greater sense of that complex information was really at the heart.” This is a very bold move and it will be fascinating to see the outcome.

Crossposted on Yale Law School’s Information Society Project blog.

What's New at Science Foo Camp 2009

SciFoo is a wonderful annual gathering of thinkers about science. It’s an unconference, so sessions are proposed and led by whoever chooses to speak. Here are my reactions to a couple of the talks.

In Pete Worden’s discussion of modeling future climate change, I wondered about the reliability of simulation results. Worden conceded that several models make the same kinds of predictions he showed, and that they can give wildly opposing results. We need to develop machinery to quantify error in simulation models just as we routinely do for conventional statistical modeling: simulation is often the only empirical tool we have for guiding policy responses to some of our most pressing issues.
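One simple starting point, offered as my own illustration rather than anything proposed at the session, is to treat the competing model runs as an ensemble and report their spread alongside the mean projection, so that the disagreement Worden described is at least quantified rather than hidden.

    # Quantify disagreement across simulation models by reporting the ensemble
    # mean and spread. The numbers below are made-up placeholder projections.
    import numpy as np

    years = [2025, 2050, 2075, 2100]
    ensemble = np.array([        # rows: models, columns: years (hypothetical values)
        [0.8, 1.1, 1.6, 2.2],
        [0.7, 1.0, 1.3, 1.7],
        [0.9, 1.4, 2.0, 2.9],
    ])

    mean = ensemble.mean(axis=0)            # central estimate across models
    spread = ensemble.std(axis=0, ddof=1)   # between-model disagreement
    for y, m, s in zip(years, mean, spread):
        print(f"{y}: {m:.2f} +/- {s:.2f} (model spread)")

Ensemble spread is of course a crude proxy for true error, since the models may share biases, but it is the kind of routine uncertainty reporting the statistical world takes for granted.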

But the newest idea I heard was Bob Metcalfe’s call for us to imagine what to do with a coming overabundance of energy. Metcalfe likened solving energy scarcity to the early days of Internet development: because of the generative design of Internet technology, we now have things that were unimagined in the early discussions, such as YouTube and online video. According to Metcalfe, we need to envision our future as including a “squanderable abundance” of energy, and use Internet lessons such as standardization and distribution of power sources to get there, rather than building for energy conservation.

Cross posted on The Edge.

Bill Gates to Development Researchers: Create and Share Statistics

I was recently in Doha, Qatar, presenting my research on global communication technology use and democratic tendency at ICTD09. I spoke right before the keynote by Bill Gates, whose main point was that when you engage in a goal-oriented activity such as development, progress can only be made when you measure the impact of your efforts.

Gates paints a positive picture, as measured by deaths before age five. In the 1880s, he says, about 30% of children in most countries died before their fifth birthday; the number of child deaths then gradually fell to 20 million in 1960 and 10 million in 2006. Gates attributes the decline to rising income levels (40% of the decrease) and medical innovation such as vaccines (60% of the decrease).

This is an example of Gates’ mantra: you can only improve what you can measure. For example, an outbreak of measles tells you your vaccine system isn’t functioning. In his example about childhood deaths, he says we are getting somewhere because we are measuring the value for money spent on the problem.

Gates thinks the wealthy in the world need to be exposed to these problems, ideally through intermingling or, since that is unlikely to happen, through statistics and data visualization. Collect data, then communicate it. In short, Gates advocates creating statistics by measuring development efforts, and changing the world by exposing people to these data.

Wolfram|Alpha Demoed at Harvard: Limits on Human Understanding?

Yesterday Stephen Wolfram gave the first demo of Wolfram|Alpha, coming in May, which he modestly describes as a system to make our stock of human knowledge computable. It includes not just facts but also our algorithmic knowledge. He says, “Given all the methods, models, and equations that have been created from science and analysis – take all that stuff and package it so that we can walk up to a website and ask it a question and have it generate the knowledge that we want. … like interacting with an expert.”

It’s ambitious, but so are Wolfram’s previous projects: Mathematica and MathWorld. I remember relying on MathWorld as a grad student – it was excellent, and I remember when it suddenly disappeared so the content could be published as a book. In 2002 he published A New Kind of Science, arguing that all processes, including thought, can be viewed as computations, and that a simple set of rules can describe a complex system. This thinking is clearly evident in Wolfram|Alpha; here are some key examples.
Continue reading ‘Wolfram|Alpha Demoed at Harvard: Limits on Human Understanding?’

Stuart Shieber and the Future of Open Access Publishing

Back in February Harvard adopted a mandate requiring its faculty members to make their research papers available within a year of publication. Stuart Shieber is a computer science professor at Harvard and was responsible for proposing the policy. He has since been named director of Harvard’s new Office for Scholarly Communication.

On November 12 Shieber gave a talk entitled “The Future of Open Access — and How to Stop It,” giving an update on where things stand after the adoption of the open access mandate. Open access isn’t just something that makes sense from an ethical standpoint: Shieber points out that (for-profit) journal subscription costs have risen out of proportion with inflation and with the costs of nonprofit journals. He notes that the cost per published page in a commercial journal is six times that of the nonprofits. With the current library budget cuts, open access, meaning both access to articles directly on the web and shifting subscriptions away from for-profit journals, appears financially unavoidable.

Here’s the business model for an Open Access (OA) journal: authors pay a fee upfront in order for their paper to be published. Then the issue of the journal appears on the web (possibly also in print) without an access fee. Conversely, traditional for-profit publishing doesn’t charge the author to publish, but keeps the journal closed and charges subscription fees for access.

Shieber recaps Harvard’s policy:

1. The faculty member grants permission to the University to make the article available through an OA repository.

2. There is a waiver for articles: a faculty member can opt out of the OA mandate at his or her sole discretion. For example, if you have a prior agreement with a publisher you can abide by it.

3. The author themselves deposits the article in the repository.

Shieber notes that the policy is also valuable because it allows Harvard to make a collective statement of principle, systematically provides metadata about articles, clarifies the rights accruing to each article, allows the university to facilitate the article deposit process and to negotiate collectively, and, by being opt-out rather than opt-in, might increase rights retention at the author level.

So the concern Shieber set up in his talk is whether standards for research quality and peer review will be weakened. Here’s how the dystopian argument runs:

1. all universities enact OA policies
2. all articles become OA
3. libraries cancel subscriptions
4. prices go up on remaining journals
5. these remaining journals can’t recoup their costs
6. publishers can’t adapt their business model
7. so the journals, and the logistics of peer review they provide, disappear

Shieber counters this argument: steps 1 through 5 are good, because journals will start to feel some competitive pressure. What would be bad is if publishers cannot change their way of doing business. Shieber thinks that even if that is so, it will have the effect of pushing us towards OA journals, which provide the same services, including peer review, as the traditional commercial journals.

But does the process of getting there cause a race to the bottom? The argument goes like this: since OA journals are paid per article published, they will just publish everything, thereby destroying standards. Shieber argues this won’t happen because there is price discrimination among journals – authors will pay more to publish in the more prestigious journals. For example, PLoS charges about $3k per article, BioMed Central about $1,000, and Scientific Publishers International $96. Shieber also argues that Harvard should have a fund to support faculty who wish to publish in an OA journal and have no other way to pay the fee.

This seems to imply that researchers with sufficient grant funding, or those covered by his proposed Harvard publication-fee subsidy, would be immune to fee pressure and would simply submit to the most prestigious journal, working their way down the chain until the paper is accepted. It also means that editors and reviewers decide what constitutes the best scientific articles by determining acceptance.

But is democratic representation in science a goal of OA? Missing from Shieber’s described market for scientific publications is any kind of feedback from readers. The content of these journals, and the determination of prestige, is defined solely by the editors and reviewers. Maybe this is a good thing. But maybe there’s an opportunity to open this up by allowing readers a voice in the market. This could be done through ads or a very small fee on articles – both would give OA publishers an incentive to respond to the preferences of readers. Perhaps OA journals should be commercial in the sense of profit-maximizing: they might then have a reason to listen to readers and might be more effective at maximizing their prestige.

This vision of OA publishing still effectively excludes researchers who are unable to secure grants or are not affiliated with a university that offers a publication subsidy. The dream behind OA publishing is that everyone can read the articles, but to fully engage in the intellectual debate, quality research must still find its way into print, and at the appropriate level of prestige, regardless of the affiliation of the researcher. This is the other side of OA that is very important for researchers from the developing world and for thinkers whose research is not mainstream (see, for example, Garrett Lisi, a high-impact researcher who is unaffiliated with any institution).

The OA publishing model Shieber describes is a clear step forward from the current model, in which journals are accessible only to affiliates of universities that have paid the subscription fees. It might be worth continuing to move toward an OA system where not only can anyone access publications, but any quality research can be published, regardless of the author’s affiliation and wealth. To get around the financial constraints, one approach might be to let journals fund themselves through ads, or to provide subsidies to certain researchers. This also opens up the question of who decides what counts as quality research.

Justice Scalia: Populist

Justice Scalia (HLS 1960) is speaking at the inaugural Herbert W. Vaughan Lecture today at Harvard Law School. It’s packed – I arrived at 4pm for the 4:30 talk, joined the end of a long line, and was immediately told the auditorium was full and relegated to an overflow room with video. I’m lucky to have been early enough to even see it live.

The topic of the talk hasn’t been announced and we’re all waiting with palpable anticipation in the air. The din is deafening.

Scalia takes the podium. The title of his talk is “Methodology of Originalism.”

His subject is the intersection of constitutional law and history. He notes that the orthodox view of constitutional interpretation, up to the time of the Warren Court, was that the constitution is no different from any other legal text. That is, it bears a static meaning that doesn’t change from generation to generation, although it gets applied to new situations. The application to pre-existing phenomena doesn’t change over time, but these applications do provide the data upon which to decide the cases on the new phenomena.

Things changed when the Warren Court held, in New York Times Co. v. Sullivan, 376 U.S. 254 (1964), that good-faith libel of public figures is permissible because it is good for democracy. Scalia says this might be so, but that such a change should be made by statute and not by the court. He argues this is respectful of the democratic system, in that the laws are reflections of people’s votes. This is the first, and perhaps the best known, of two ways Scalia comes across as populist in this talk. In a question at the end he says that the whole theory of democracy is that a justice is not supposed to be writing a constitution but just reflecting what the American people have decided. If you believe in democracy, he explains, you believe in majority rule. In liberal democracies like ours we have made exceptions and given protection to certain minorities, such as religious or political minorities. But his key point is that the people made these exceptions, i.e., they were adopted in a democratic fashion.

But doesn’t originalism require you to know the original meaning of a document? And isn’t history a science unto itself, and different from law? Scalia responds first that history is central to the law, at the very least because the meanings of words change over time. So inquiry into the past certainly has to do with the law, and vice versa. He notes that the only way to assign meaning to many of the phrases in the constitution is through historical understanding: for example “letters of marque and reprisal,” “habeas corpus,” and so on. Secondly, he gives a deeply non-elitist argument about the quality of expert versus nonexpert reasoning. This is the second way Scalia expresses a populist sentiment.

In District of Columbia v. Heller, 554 U.S. ___ (2008), the petitioners contended that the term “bear arms” had only a military meaning, although previous cases show this isn’t true. But this case was about more than the historical usage of words: the 2nd Amendment doesn’t say “the people shall have the right to keep and bear arms,” for example, but that “the right of the people to keep and bear arms shall not be infringed” – as if this were a pre-existing right. So Scalia argues that there was a place for historical inquiry here, and that it showed there was such a pre-existing right: in the English Bill of Rights of 1689 (found by Blackstone). With that history it is hard to read the 2nd Amendment as no more than the right to join a militia, a reading that remains consistent with the amendment’s prologue about a well regulated militia keeping arms. This goes much further than just lexicography.

So what can be expected of judges? Scalia argues, like Churchill’s argument for democracy, that all an originalist need show is that originalism beats the alternatives. He says this isn’t hard to do, since inquiry into original meaning is not as difficult as opponents suggest. One place to look when the framers’ intent is not clear is the states’ older interpretations. And in the vast majority of cases, including the most controversial ones, the originalist interpretation is clear. His examples of cases with clear original intent are abortion, a right to engage in homosexual sodomy, assisted suicide, and prohibition of the death penalty (the death penalty was historically the only penalty for a felony) – these rights are not found in the constitution. Determining whether there should be (and hence is, for a non-originalist judge) a right to abortion or same-sex marriage or whatnot requires moral philosophy, which Scalia says is harder than historical inquiry.

As further evidence of the symbiotic relationship between law and history, he notes that history departments have legal historians and law schools have historical experts.

Scalia gives the case of Thompson v. Oklahoma, 487 U.S. 815 (1988), as an example of a situation in which historical reasoning played little part, and he uses it as a baseline to argue that the role of historical reasoning in Supreme Court opinions is increasing. The briefs in Thompson were of no help with historical questions since they did not touch on the history of the 8th Amendment, but Scalia says this isn’t surprising since the history of the clause had been written out of the argument by previous thinking. Another case, Morrison v. Olson, 487 U.S. 654 (1988), considered a challenge to the statute creating the independent counsel. Scalia thinks these questions could benefit from historical clarification, and the briefing in Morrison did touch on historical questions such as what the term “inferior officers” meant at the time of the founding. Two briefs authored by HLS faculty (Cox, Fried) provided useful historical material, but the historical referencing was sparse and none of these briefs were written by scholars of legal history.

In contrast, in Heller, two decades later, there was again little historical context to draw on, but in this case many amicus briefs focused on historical arguments and material. This is a very different situation from that of 20 years ago. There were several briefs from legal historians, each containing detailed discussions of the historical right to bear arms in England and here at the time of the founding. Such material was the heart of those briefs, not relegated to a footnote as it likely would have been 20 years ago, and was in Morrison. Scalia thinks this reinforces the originalist approach, by showing how easy it is compared to other approaches.

Scalia eschews amicus briefs in general, especially insofar as they repeat the arguments made by the parties, because their pretense of scholarly impartiality may convince judges to sign on to briefs that are anything but impartial. “Disinterested scholarship and advocacy do not mix well.”

Scalia takes on a second argument made against the use of history in the courts – that the history used is “law office history,” that is, the selection of data favorable to the position being advanced without regard for contradictory data or relevance. Here the charge is not incompetence but tendentiousness: advocates cannot be trusted to present an unbiased view. But of course! says Scalia, since they are advocates. Insofar as the criticism is directed at the court, however, it is essential that the adjudicator is impartial. “Of course a judicial opinion can give a distorted picture of historical truth, but this would be an inadequate historical opinion and not that which is expected” from the Court. Scalia admonishes that one must review the historical evidence in detail rather than raise the “know nothing” cry.

This is Scalia’s second populist argument: it is deeply non-elitist, since it implies that nonprofessional historians are capable of coming up with good historical understanding. It dovetails with the notion of opening knowledge and the respect for autonomy in allowing individuals to evaluate reasoning and data and come to their own conclusions (and even be right sometimes). Scalia notes that he sees the role of the Court as drawing conclusions from these facts, which is different from the role of the historian.

But he feels quite differently about the conclusions of experts in other fields. For example, in overruling Dr. Miles Medical Co. v. John D. Park and Sons, 220 U.S. 373 (1911), and holding that resale price maintenance isn’t a per se violation of the Sherman Act, he didn’t feel uncomfortable, since that is the almost uniform view of professional economists. Scalia seems to be saying that experts are probably right more often than nonexperts, but nonexperts can also contribute. He frames this in terms of expertise in judicial analysis, and says there is a difference between historical analysis and, say, the type of engineering analysis that might be required in patent cases. He distinguishes between types of subject matter that are more or less susceptible to successful nonexpert analysis.

Scalia then advocates for submission of analysis to public scrutiny with the data open, thus allowing suspect conclusions to be challenged. The originalist will reach substantive results he doesn’t personally favor, and the reasoning process should be open. Scalia notes that this is more honest than judges who reason morally, who will never disagree with their own opinions.

There was a question that got the audience laughing at the end. The questioner claims to have approached a Raytheon manufacturing facility to buy a missile or tank, since in his view the 2nd Amendment is about keeping the government scared of the people, and somehow having a gun when the government has more advanced weaponry misses the point. Scalia thinks this is outside the scope of the 2nd Amendment because “You can’t bear a tank!”

A2K3: Opening Scientific Research Requires Societal Change

In the A2K3 panel on Open Access to Science and Research, Eve Gray, from the Centre for Educational Technology at the University of Cape Town, sees the Open Access movement as a real societal change. Accordingly she shows us a picture of Nelson Mandela and asks us to think about his release from prison and the amount of change it ushered in. She also asks us to consider whether Mandela is an international person or a local person. She sees a parallel between how South African society changed with Mandela and the change people are advocating toward open access to research knowledge. She shows a worldmapper.org map of countries distorted by the amount of (copyrighted) scientific research publications; South Africa looks small. She blames this on South Africa’s willingness to uphold colonial traditions in copyright law and norms of knowledge dissemination. She says this happens almost unquestioningly, and that to rise in the research world in South Africa you are expected to publish in ‘international’ journals – the prestigious journals are not South African, she says. (I am familiar with this attitude from my own experience in Canada. The top American journals and schools were considered the holy grail. When I asked about attending a top American graduate school I was laughed at by a professor and told that maybe it could happen, if perhaps I had an Olympic gold medal.) She states that for real change in this area to come about, people have to recognize that they must mediate a “complex meshing” of policies: at the university level, at the various government levels, and at the level of norms and individual scientists – just as Mandela had to mediate a large number of complex policies at a variety of different levels in order to bring about the change he did.

Legal Barriers to Open Science: my SciFoo talk

I had an amazing time participating at Science Foo Camp this year. This is a unique conference: there are 200 invitees comprising some of the most innovative thinkers about science today. Most are scientists but not all – there are publishers, science reporters, scientific entrepreneurs, writers on science, and so on. I met old friends there and found many amazing new ones.

One thing that I was glad to see was the level of interest in Open Science. Some of the top thinkers in this area were there, and I’d guess at least half the participants are highly motivated by this problem. There were sessions on reporting negative results, the future of the scientific method, and reproducibility in science. I organized a session with Michael Nielsen on overcoming barriers in open science. I spoke about the legal barriers, and O’Reilly Media has made the talk available here.

I have papers forthcoming on this topic you can find on my website.

A2K3 Kaltura Award

I am honored and humbled to win the A2K3 Kaltura prize for best paper. Peter Suber posts about it here and gives the abstract. His post also includes a link to a draft of the paper, which can also be found here: Enabling Reproducible Research: Open Licensing For Scientific Innovation. I’d love comments and feedback although please be aware that since the paper is forthcoming in the International Journal of Communications Law and Policy it will very likely undergo changes. Thank you to Kaltura.com and the entire A2K3 committee. I’m very happy to be here in Geneva and enjoying every minute. :)

A2K3: A World Trade Agreement for Knowledge?

Thiru Balasubramaniam, Geneva Representative for Knowledge Ecology International, presents a proposal (from a forthcoming paper by James Love and Manon Ress) for a WTO treaty on knowledge (so far all WTO agreements extend to private goods only). Since information is a public good (nonrival and nonexcludable), we will have a “market failure” if single countries act alone: hence the undersupply of global public goods. The WTO creates binding agreements, so an agreement covering public goods such as knowledge would create large collective benefits and impose high costs on acting against them. Such a WTO agreement would outline and influence norms. Why do this within the WTO? Because it has strong enforcement mechanisms. Are we really undersupplying open and free knowledge? I can think of several scientific examples. Balasubramaniam doesn’t dig into what such an agreement would look like, and it seems quite complex, but thinking about it might provide a coherent framework for approaching free information issues globally.

A2K3: Tim Hubbard on Open Science

In the first panel at A2K3, on the history, impact, and future of the global A2K movement, Tim Hubbard, a genetics researcher, laments that scientists tend to carry out their work in a closed way and thus very little data is released. In fact he claims that biologists used to deliberately mess up images so that they could not be reproduced! But apparently journals are more demanding now and this problem has largely been corrected (for example Nature’s 2006 standards on image fraud). He says that openness in science needs to happen before publication, the traditional time when scientists release their work. But this is a tough problem: data must be released in such a way that others can understand and use it. This parallels the argument made in the opening remarks about the value of net neutrality as preserving an innovation platform: in order for data to be used, it must be open in the sense that it permits further innovation. He says we now have open genome data, but privacy issues are pertinent: even summaries of the data can be backsolved to identify individuals. He asks for better encryption algorithms to protect privacy. In the meantime he proposes two other solutions. We could simply stop worrying about the privacy of our genetic data, just as we don’t hide our race or gender. Failing that, he wants to mine the UK’s National Health Service’s patient records through an “honest broker,” an intermediary that runs the programs and scripts researchers submit; the data are hidden from the researcher and accessed only through the intermediary. Another problem this solves is the sheer size of the released data, which can prevent interested people from moving or analyzing it. This has broad implications, as Hubbard points out: the government could access its CCTV video recordings to find drivers who’ve let their insurance lapse, but not track other, possibly privacy-violating, aspects of drivers’ visible presence on the road. Hubbard is touching on what might be the most important part of the Access to Knowledge movement – how to make access meaningful without destroying incentives to be open.
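To make the “honest broker” idea concrete, here is a minimal sketch of the pattern as I understand it from the talk; it is my own illustration, and the record fields, the disclosure threshold, and the interface are all hypothetical rather than anything Hubbard or the NHS has specified.

    # Sketch of an "honest broker": researchers submit queries; the broker runs
    # them against records the researchers never see and refuses to release
    # results computed over groups that are too small. All data are fabricated.
    MIN_GROUP_SIZE = 5   # hypothetical disclosure-control threshold

    _HIDDEN_RECORDS = [  # stand-in patient records, never returned directly
        {"age": 54, "smoker": True,  "condition": "copd"},
        {"age": 61, "smoker": False, "condition": "copd"},
        {"age": 47, "smoker": True,  "condition": "asthma"},
        {"age": 39, "smoker": False, "condition": "asthma"},
        {"age": 72, "smoker": True,  "condition": "copd"},
        {"age": 58, "smoker": True,  "condition": "copd"},
    ]

    def run_query(selector, statistic):
        """Apply a researcher-submitted selector and statistic to hidden data."""
        subset = [r for r in _HIDDEN_RECORDS if selector(r)]
        if len(subset) < MIN_GROUP_SIZE:
            return None  # refuse: the group is too small to release safely
        return statistic(subset)

    # Researcher side: average age of smokers with COPD.
    result = run_query(
        selector=lambda r: r["smoker"] and r["condition"] == "copd",
        statistic=lambda rows: sum(r["age"] for r in rows) / len(rows),
    )
    print(result)  # None here: only three matching records, below the threshold

A real broker would also vet the submitted code, log queries, and apply stronger disclosure controls than a simple group-size cutoff, but the division of labor is the point: the analysis travels to the data, not the other way around.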