I love this idea: http://nomoresopa.com/wp/. It’s an Android app that allows you to scan a product’s barcode and it will tell you whether the company that makes the product supports the Stop Online Piracy Act. What’s really happening here is the ability to get product information at the time of purchase decision. You could, for example, find out what a manufacturer’s parent companies are. Did you know that Cascadian Farm, makers of breakfast cereals and carried by Whole Foods, is owned by General Mills? This information is easily found on the General Mills website but not obvious when you’re looking at the box in the store. This kind of information can now be made readily available to consumers so they can make better and hopefully less biased choices. Love it.
Two recent articles call for an openness revolution in science: one on GigaOM and the other in the Wall Street Journal. But they’ve got it all wrong. These folks are missing that the process of scientific discovery is not, at its core, an open process. It only becomes an open process at the point of publication.
I am not necessarily in favor of greater openness during the process of scientific collaboration. I am, however, necessarily in favor of openness in communication of the discoveries at the time of publication. Publication is the point at which authors feel their work is ready for public presentation and scrutiny (that traditional publication does not actually give the public access to new scientific knowledge is a tragedy and of course we should have Open Access). We even have a standard for the level of openness required at the time of scientific publication: reproducibility. This concept has been part of the scientific method since it was instigated by Robert Boyle in the 1660s: communicate sufficient information such that another researcher in the field can replicate your work, given appropriate tools and equipment.
If we’ve already been doing this for hundreds of years what’s the big deal now? Leaving aside the question of whether or not published scientific findings have actually been reproducible over the last few hundred years (for work on this question see e.g. Changing Order: Replication and Induction in Scientific Practice by Harry Collins), science, like many other areas of modern life, is being transformed by the computer. It is now tough to find any empirical scientific research not touched by the computer in some way, from simply storing and analyzing records to the radical transformation of scientific inquiry through massive multicore simulations of physical systems (see e.g. The Dynamics of Plate Tectonics and Mantle Flow: From Local to Global Scales).
This, combined with the facilitation of digital communication we call the Internet, is an enormous opportunity for scientific advancement. Not because we can collaborate or share our work pre-publication as the articles assert – all of which we can do, if we like – but because computational research captures far more of the tacit knowledge involved in replicating a scientific experiment than ever before, making our science potentially more verifiable. The code and data a scientist believes replicate his or her experiments can capture all digital manipulations of the data, which now comprise much of the scientific discovery process. Commonly used techniques, such as the simulations carried out in papers included in sparselab.stanford.edu (see e.g. “Breakdown Point of Model Selection when the Number of Variables Exceeds the Number of Observations” or any of the SparseLab papers), can now be replicated by downloading the short scripts we included with publication (see e.g. Reproducible Research in Computational Harmonic Analysis). This is a far cry from Boyle’s physical experiments with the air pump, and I’d argue one with enormously lower levels of tacit knowledge to communicate, an advantage we’re not capitalizing on today.
Computational experiments are complex. Without communicating the data and code that generated the result it is nearly impossible to understand what was done. My thesis advisor famously paraphrased this idea, “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
It is simply wrong to assert that science has not adopted computational tools, as the GigaOM article does when it says “there is another world that makes these industries [traditional media players such as newspapers, magazines and book publishers] look like the most enthusiastic of early adopters [of technological progress]: namely, academic research.” Everywhere you look computation is emerging as central to the scientific method. What needs to happen is a resolution of the credibility crisis in computational science resulting from this enthusiastic technological adoption: we need to establish the routine inclusion of the data and code with publication, such that the results can be conveniently replicated and verified. Many new efforts to help data and code sharing are documented here, from workflow tracking to code portability to unique identifiers that permit the regeneration of results in the cloud.
This is an imperative issue: most of our published computational results today are unverified and unverifiable (I have written on this extensively, as have many others). The emphasis on the process of creative problem solving as the bottleneck in scientific discovery is misplaced. Scientific innovation is inherently creative – perhaps more collaborative tools that permit openness will encourage greater innovation. Or perhaps it is the case, as Paul Graham has pointed out in another context, that innovative thinking is largely a solitary act. What is clear is that it is of primary importance to establish publication practices that facilitate the replication and verification of results by including data and code, and not to confound these two issues.
Computational scientists need to understand and assert their computational needs, and see that they are met.
I just read this excellent interview with Donald Knuth, inventor of TeX and the concept of literate programming, as well as author of the famous textbook, The Art of Computer Programming. When asked for comments on (the lack of) software development using multicore processing, he says something very interesting – that multicore technology isn’t that useful, except in a few applications such as “rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc.” This caught my eye because parallel processing is a key advance for data processing. Statistical analysis of data typically executes line by line through the data, making it ideal for multithreaded applications. This isn’t some obscure part of science either – most science carried out today has some element of digital data processing (although of course not always at scales that warrant implementing parallel processing).
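To make the "line by line" point concrete, here is a minimal Python sketch of how a simple statistic splits across workers: each worker summarizes its own chunk of rows, and the partial results are combined at the end. The function names (`chunk_stats`, `parallel_mean_variance`) and the choice of a thread pool are my own assumptions for illustration; in CPython, threads will not speed up pure-Python arithmetic like this, so read it as a picture of the decomposition rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(rows):
    """Return (count, sum, sum of squares) for one chunk of rows."""
    return len(rows), sum(rows), sum(x * x for x in rows)


def parallel_mean_variance(data, n_workers=4):
    """Mean and population variance from per-chunk partial sums.

    Hypothetical helper: each chunk is processed independently,
    which is what makes row-wise statistics easy to parallelize.
    """
    if not data:
        raise ValueError("data must not be empty")
    size = -(-len(data) // n_workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(chunk_stats, chunks))
    # Combine step: partial sums simply add up.
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    return mean, ss / n - mean * mean
```

Calling `parallel_mean_variance([1.0, 2.0, 3.0, 4.0])` gives the same mean and variance as a single pass over the data; the only thing that changes is how the work is divided.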
Knuth then says that “all these applications [that use parallel processing] require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.” As the state of our scientific knowledge changes so does our problem solving ability, requiring modification of code used to generate scientific discovery. If I’m reading him correctly, Knuth seems to think this makes such applications less relevant to mainstream computer science.
The discussion reminded me of comments made at the “Workshop on Algorithms for Modern Massive Datasets” at Stanford in June 2010. Researchers in scientific computation (a specialized subdiscipline of computational science, see the Institute for Computational and Mathematical Engineering at Stanford or UT Austin’s Institute for Computational Engineering and Sciences for examples) were lamenting the direction computer hardware architecture was taking toward facilitating certain particular problems, such as particular techniques for matrix inversion and hot topics in linear algebra.
As scientific discovery transforms into a deeply computational process, we computational scientists must be prepared to partner with computer scientists to develop tools suited to the needs of scientific knowledge creation, or develop these skills ourselves. I’ve written elsewhere on the need to develop software that natively supports scientific ends (especially for workflow sharing; see e.g. http://stodden.net/AMP2011 ) and this applies to hardware as well.
Right now scientific questions are chosen for study in a largely autocratic way. Typically grants for research on particular questions come from federal funding agencies, and scientists competitively apply with the money going to the chosen researcher via a peer review process.
I suspect, as the tools of online science become increasingly available, the real questions people face in their day to day lives will be more readily answered. If you think about all the things you do and decisions you make in a day, many of them don’t have a strong empirical basis. How you wash the dishes or do laundry, what foods are healthy, what environment to maintain in your house, what common illness remedies work best: who knows? These types of questions, the ones that occur to you as you go about your daily business, aren’t prioritized in the investigatory model we have now for science. I predict that scientific investigation as a whole, not just the part that is government funded, will move substantially toward providing answers to questions of local importance.
This past January Obama signed the America COMPETES Reauthorization Act. It contains two interesting sections that advance the notions of open data and the federal role in supporting online access to scientific archives: 103 and 104, which read in part:
• § 103: “The Director [of the Office of Science and Technology Policy at the White House] shall establish a working group under the National Science and Technology Council with the responsibility to coordinate Federal science agency research and policies related to the dissemination and long-term stewardship of the results of unclassified research, including digital data and peer-reviewed scholarly publications, supported wholly, or in part, by funding from the Federal science agencies.” (emphasis added)
This is a cause for celebration insofar as Congress has recognized that published articles are an incomplete communication of computational scientific knowledge, and the data (and code) must be included as well.
• § 104: Federal Scientific Collections: The Office of Science and Technology Policy “shall develop policies for the management and use of Federal scientific collections to improve the quality, organization, access, including online access, and long-term preservation of such collections for the benefit of the scientific enterprise.” (emphasis added)
I was very happy to see the importance of online access recognized, and hopefully this will include the data and code that underlies published computational results.
One step further in each of these directions: mention code explicitly and create a federally funded cloud not only for data but linked to code and computational results to enable reproducibility.
Generalize clinicaltrials.gov and register research hypotheses before analysis
Stanley Young is Director of Bioinformatics at the National Institute for Statistical Sciences, and gave a talk in 2009 on problems in modern scientific research. For example: 1 in 20 NIH-funded studies actually replicates; closed data and opacity; model selection for significance; multiple comparisons. Here is the link to his talk: Everything Is Dangerous: A Controversy. There are a number of good examples in the talk, and Young anticipates, and is more intellectually coherent than, the New Yorker article The Truth Wears Off, if you are interested in that.
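The multiple comparisons problem is easy to see with a little arithmetic. Here is a small Python sketch, assuming the tests are independent and each is run at the same significance level (both are my simplifying assumptions, not something taken from Young's talk). The function names are made up for illustration.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(n, round(familywise_error_rate(n), 3), bonferroni_threshold(n))
```

With 20 independent tests at the 0.05 level, the chance of at least one spurious "significant" result is roughly 64 percent, which is why unreported extra comparisons matter so much.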
Idea: Generalize clinicaltrials.gov, where scientists register their hypotheses prior to carrying out their experiment. Why not do this for all hypothesis tests? Have a site where the hypotheses are logged and time stamped before researchers gather the data or carry out the actual hypothesis testing for the project. I’ve heard this idea mentioned occasionally and both Young and Lehrer mention it as well.
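As a rough illustration of what logging and time stamping could look like, here is a minimal Python sketch. Everything in it is hypothetical: the in-memory list standing in for the registry site, the record fields, and the function names are assumptions of mine, not a description of how clinicaltrials.gov works.

```python
import hashlib
from datetime import datetime, timezone


def _digest(text):
    """SHA-256 fingerprint of the hypothesis text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def register_hypothesis(registry, researcher, hypothesis, now=None):
    """Append a time-stamped record to a (hypothetical) registry list."""
    timestamp = now or datetime.now(timezone.utc)
    record = {
        "researcher": researcher,
        "hypothesis": hypothesis,
        "sha256": _digest(hypothesis),
        "registered_at": timestamp.isoformat(),
    }
    registry.append(record)
    return record


def matches_registration(record, reported_hypothesis):
    """Check that the hypothesis reported later is the one registered."""
    return _digest(reported_hypothesis) == record["sha256"]
```

The fingerprint lets a reader confirm, after the fact, that the hypothesis reported in the paper is word for word the one logged before the data came in. A real registry would also need trusted timestamps and some tolerance for harmless rewording, which this sketch ignores.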
Imagine a cell phone database that includes terms of service, prices, fees, rates, different calling plans, quality of services, coverage maps etc. – “smart disclosure,” as the term is being used in federal government circles, means making data available such that it can be used and analyzed. Part of smart disclosure would mean collecting information from consumers as well, such as user experiences, bills, service complaints. This is the vision presented by Joel Gurin, chief of the FCC’s Consumer and Governmental Affairs Bureau, at the Open Gov R&D Summit organized by the White House. He notes that right away you run into issues of privacy and proprietary data that still need to be worked out.
He gives two examples of when it has worked. One is healthcare.gov: the government has collected and presented data but has become the intermediary in presenting this data [I took a brief look at this site and don't see where to download data]. Another example is brightscope: they analyzed government released pension and 401(k) fees to create a ranking product they sell to HR managers so that folks can understand the appropriateness of the fees they pay.
The potential is enormous: imagine openness in FCC data. Gurin asks, how do we let many brightscopes bloom?
Christopher Meyer, vice president for external affairs and information services for the Consumers Union, gives an example of failure through database mismanagement. There was a spike in their dataset of consumer complaints about acceleration problems in Toyota cars. They didn’t look at the data and didn’t notice this before Toyota issued the official recall. They’d like to do better, and have better organization in their data and better tools for issue detection through consumer complaints, with a mechanism to permit the manufacturer to respond early.
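To give a sense of how simple a first-pass issue detector could be, here is a Python sketch that flags a period whose complaint count sits far above the recent average. This is my own illustration, not anything Consumers Union described: the window length, the threshold of three standard deviations, and the floor of 1.0 on the standard deviation are all arbitrary assumptions.

```python
from statistics import mean, stdev


def find_spikes(counts, window=6, threshold=3.0):
    """Return indices where a count exceeds the trailing mean by more
    than `threshold` standard deviations of the trailing window."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the spread so a perfectly flat history does not make
        # every tiny increase look like a spike (an assumption).
        sigma = max(stdev(history), 1.0)
        if counts[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    monthly_complaints = [10, 12, 11, 9, 10, 11, 40, 12]
    print(find_spikes(monthly_complaints))
```

Run per product and per complaint category, even something this crude would surface a sudden jump for a human to look at, which is the step that was missed.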
Peter Levin is CTO of the Department of Veterans Affairs and has a take on open gov tailored to his department: He’s restructuring the IT infrastructure within the VA to facilitate access. For example, the VA just processed their first paperless claim and is reducing claim turnaround time from 165 days to 40 days.
He is also focusing his efforts on emotional paths to engagement rather than numbers and figures. I hope they can provide both, but I see his comments as a reaction to and criticism of open data in general. Levin gives the analogy of the introduction of the telephone – the phone was fundamentally social in nature and hence caught on beyond folks’ expectations, whereas a simple communicator of facts would not have. That encapsulates his vision for tech changes at the VA.
James Hamilton of Northwestern suggests the best way to help reporting on government info and the communication of govt activities would be to improve the implementation of the Freedom of Information Act, in particular for journalists. The aim is to improve govt accountability. He also advocates machine learning techniques to automatically analyze comments and draw meaning from data in a variety of formats, like text analysis. He believes this software exists and is in use by the govt (even if that is true I am doubtful of how well it works) and a big improvement would be to make this software open source (he references Gary King’s software on text clustering too, which is open and has been repurposed by AP for example).
George Strawn from the National Coordination Office (NITRD) notes that there are big problems even combining data within agencies, let alone putting together datasets from disparate sources. He says in his experience agency directors aren’t getting the data they need, data that is theoretically available, to make their decisions.
I’m here at the National Archives attending the Open Government Research and Development Summit, organized by the Office of Science and Technology Policy in the White House. It’s a series of panel discussions to address questions about the impact and future directions of Obama’s open gov initiative, in particular how to develop a deeper research agenda with the resulting gov data (see the schedule here).
Aneesh Chopra, our country’s first CTO, gave a framing talk in which he listed 5 questions he’d like to have answered through this workshop.
1. big data: how do we strengthen capacity to understand massive data?
2. new products: what constitutes high value data?
3. open platforms: what are the policy implications of enabling 3rd party apps?
4. international collaboration: what models translate to strengthen democracy internationally?
5. digital norms: what works and what doesn’t work in public engagement?
He hopes the rest of the workshop will not only address these questions but also coalesce around recommendations. Chopra wants to be able to set innovation prizes to move towards solutions to these questions.
Here’s a vid of Keith Baggerly explaining his famous case study of why we need code and data to be made openly available in computational science: http://videolectures.net/cancerbioinformatics2010_baggerly_irrh. This is the work that resulted in the termination of clinical trials at Duke last November and the resignation of Anil Potti. Patients had been assigned into groups and actually given drugs before the trials were stopped. The story is shocking.
It’s also a good example of why traditional publishing doesn’t capture enough detail for reproducibility without the inclusion of data and code. Baggerly’s group at M.D. Anderson was able to make reproducing these results, what he has labeled “forensic biostatistics,” a priority and they spent an enormous amount of time doing this. We certainly need independent verification of results, but doing so can often require knowledge of the methodology contained only in the code and data. In addition, Donoho et al. (earlier version here) make the point that even when findings are independently replicated, open code and data are necessary to understand the reason for discrepancies in results. In a section in the paper listing and addressing objections we say:
Objection: True Reproducibility Means Reproducibility from First Principles.
Argument: It proves nothing if I point and click and see a bunch of numbers as expected. It only proves something if I start from scratch and build your system and in my implementation I get your results.
Response: If you exactly reproduce my results from scratch, that is quite an achievement! But it proves nothing if your implementation fails to give my results since we won’t know why. The only way we’d ever get to the bottom of such discrepancy is if we both worked reproducibly.
(ps. Audio and slides for a slightly shorter version of Baggerly’s talk here)