Coronavirus diaries – Champaign edition

Illinois locked down relatively early, under a statewide order that took effect March 21, 2020. Like everyone under lockdown, we no longer went to work, to restaurants or bars, or visited friends. So after a period of reclusiveness I started going on daily walks and runs. I’d been doing this inconsistently throughout my years in Champaign, but this time is different. Usually I’m the only person I see exercising: an occasional dog walker, and once in a while I pass a runner (always tempted to high-five them). But now, during the lockdown, there are a lot of people out wandering. Solo. Almost all with earbuds in. It’s gotten to the point that I pass the same people on their daily walks every day. It’s been so humid and so hot over the last few weeks that we are all now compressed into the couple of hours just before sunset, madly swerving around each other, engrossed in our own podcast worlds.

Posted in Diary, Fascinating People, health | Leave a comment

Why Distinguishing Experiential and Inferential Computational Knowledge is Important

Scientific communities are grappling with how to assess knowledge derived from computational processes: the interpretability of machine learning models; the generalizability of data science claims; the verifiability of computational science. There are two broad categories of “knowing” at play here, and they distinguish approaches to understanding when a claim can be deemed a contribution to our stock of knowledge. I’m calling these categories Experiential and Inferential computational knowledge, in a direct extension of the general experiential and logical ways of knowing about our world.

The first general class of problems data science tries to solve is classification-type problems, which appear in many online and industry applications such as image classification or recommender systems. The second is process-type problems, such as understanding drug effectiveness or the impact of policy changes. The motivation of the researcher differs between these two broad problem classes. One is effectively seeking an engineering solution that produces reliable predictions or outcomes in a circumscribed context, with little concern for how those predictions are made; the other is seeking a generalizable scientific solution where elucidating the underlying mechanisms that generate the outcome is of key interest. In the first case it isn’t a question of whether the model is right but whether it’s giving reliable outputs. There we’re learning by absorbing data and finding patterns, much as we learn about the world noncomputationally: how the people you know act, what a broken heart is, or even what happens if you slam on the car brakes. Cause and effect reasoning. Explaining why we expect a certain outcome isn’t part of the learning process, but we still feel we know something given a set of experiences. Arguably this is the most common way we learn.

This is different from inferential or explainable knowledge, where I can present the chain of logic that I purport leads to a result and you can learn why that result happens, or is true. That knowledge is directly transferable by communication and does not require experience (although it may rely on other knowledge elements). So, is one type of knowledge more valuable? More reliable? To be prioritized? Possibly. My argument is that we have an analogy in computationally-enabled and data-enabled research, and should expect different justifications to convince us to accept a result depending on which of the two types of computational knowledge it is. We should also expect different generalizability for the two types. Experiential knowledge generalizes by analogy, so understanding the context under which it holds is key; inferential knowledge generalizes by explanation, so understanding the reasoning that generated the result is key. So what does that mean for computationally generated knowledge? It depends on which model it follows.
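As a toy sketch of the distinction (entirely hypothetical data and functions), compare a predictor that only memorizes experiences with one that produces a communicable explanation:

```python
# Hypothetical illustration: two ways of "knowing" a relationship y = 2x + 1
# that is only ever observed through example data.
train = [(x, 2 * x + 1) for x in range(0, 10)]

# Experiential: a nearest-neighbor "black box" that memorizes examples and
# predicts by analogy to the closest one seen. It produces no explanation.
def predict_experiential(x):
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Inferential: fit an explicit line by least squares; the chain of reasoning
# (slope, intercept) is communicable and checkable without the training data.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / \
        sum((x - mean_x) ** 2 for x, _ in train)
intercept = mean_y - slope * mean_x

def predict_inferential(x):
    return slope * x + intercept

# Inside the training range both do fine; far outside it, the memorizer can
# only repeat its nearest experience, while the explanation extrapolates.
print(predict_experiential(4.2), predict_inferential(4.2))   # 9 vs 9.4
print(predict_experiential(100), predict_inferential(100))   # 19 vs 201.0
```

The memorizer generalizes only by analogy to the context it has seen; the fitted line generalizes by its stated reasoning, which is exactly what a reader can scrutinize.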

For inferential computational knowledge to generalize, how and why the result holds is an essential part of the communication, and these aspects are in part embedded in the computational steps that generated the result. The code, the inputs (e.g. empirical data, function settings and parameters, machine state variables), the expected output, and the expected variability of the output are unavoidable epistemological components, along with the noncomputational logical reasoning (e.g. statistical tests used). For experiential knowledge this is not necessarily the case. In theory all I need is the black box that produces the output from the inputs and a clear understanding of what outputs different types of inputs will produce (for most machine learning problems, or black box problems in general, this latter component is very hard, making misuse of black boxes almost guaranteed). What are the characteristics of the data that I need to verify before this black box can be applied?
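A minimal sketch, assuming a made-up training set and tolerance, of what such a check on input characteristics might look like before applying a black box:

```python
# Hypothetical sketch: before trusting a black box on a new input, check that
# the input resembles the data the box was built from (a crude context guard).
train_inputs = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2]  # what the box has "experienced"

lo, hi = min(train_inputs), max(train_inputs)
margin = 0.1 * (hi - lo)  # assumed tolerance; a real check needs domain knowledge

def in_context(x):
    # Only trust the black box's output when x lies near the training range.
    return (lo - margin) <= x <= (hi + margin)

print(in_context(1.05))  # True: inside the experiential context
print(in_context(5.0))   # False: outside it, so the output is untrusted
```

A real deployment would compare whole distributions, not just ranges, but the point is the same: experiential knowledge only travels as far as its context does.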

We need to treat the two types of computational knowledge differently, and not tacitly assume computationally generated output from data is necessarily scientific and therefore generalizable. We need to improve computational disclosure for computational knowledge we intend as a generalizable result (inferential knowledge), so we have a verifiable computational chain of reasoning as well as a logical chain, and improve contextual disclosure for experiential, data-based computational output. Without doing so, the ways of computationally knowing are ways of knowing nothing.

Posted in Machine Learning, Reproducible Research, Scientific Method | Leave a comment

Fascinating People: Jason Kohn

Jason Kohn is a documentary maker and director, interviewed recently by Pure Nonfiction about his latest release, Love Means Zero, on the superstar tennis coach Nick Bollettieri: https://purenonfiction.net/83-jason-kohn-love-means-zero/ . The first half of the interview is Kohn’s origin story and explores his relationship to filmmaking and directing; the second half is about his new film. Kohn comes across as someone captivated by people he perceives as marching to the beat of their own drum (and perhaps even unaware that they are doing so). He discusses his first film, Manda Bala (Send a Bullet), made ten years ago and based on his observations while living in Brazil. He notes that his films are about narratives, as all films are, and anyone looking to be informed or educated about, say, world events will be disappointed. He also tells a story about how he became a film director: a childhood friend and business partner printed up business cards for them with that moniker when they were starting out. Kohn wonders where his friend got the chutzpah to do that, but he accepted the label, even though, growing up middle class on Long Island, his sense was that becoming a filmmaker was limited to the privileged or the genius, of which he considered himself neither. I’m not so sure about the latter, and recommend this interview and his films if you haven’t seen them.

Posted in Fascinating People, Film | Leave a comment

Paul Meier: Still Saving Millions of Lives

You may never have heard of Paul Meier, but perhaps I can convince you he is one of the 20th century’s greatest heroes. As I write from my “shelter in place” during the violent COVID-19 pandemic, I’m seeing news stories about potential treatments. The hope they convey for a treatment to end the death, suffering, uncertainty, and fear is overwhelming. Meier is the reason we aren’t routinely administering potential treatments to people who are ill or dying from the coronavirus, and he’s the reason we live in a society with so many safe and effective medical treatments. He’s primarily responsible for the idea, and arguably the implementation, of the Randomized Clinical Trial, and an important reason we won’t have a vaccine available to the public, even one discovered today, until 2021 or 2022.

Continue reading
Posted in Fascinating People, Open Science, Scientific Method, Statistics | Leave a comment

Back to blogging?

This blog’s been on hold while I prepared for and (successfully) went through the tenure process. I’d like to start posting again, and I’m shocked that the break has been almost six years.

Posted in Uncategorized | 2 Comments

My input for the OSTP RFI on reproducibility

Until September 23, 2014, the US Office of Science and Technology Policy in the White House was accepting comments on its “Strategy for American Innovation.” My submitted comments on one part of that RFI, section 11, follow (with minor corrections):

“11) Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?”

Continue reading
Posted in Economics, Intellectual Property, Law, Open Data, Open Science, OSTP, Reproducible Research, Scientific Method | 2 Comments

Mistakes by Piketty are OK (even good?)

In an email conversation I tried to make some points about the criticism Piketty has come under for apparently having mistakes in his data. I think the concerns are real but misplaced. Here’s why:

There was a point made by Fernando Perez in the reproducibility session at DataEDGE this month that I think sums up my perspective on this pretty well: making good-faith mistakes is human and honest (and OK), but the important point is that we need to be able to verify findings.

Continue reading
Posted in Uncategorized | Leave a comment

Changes in the Research Process Must Come From the Scientific Community, not Federal Regulation

I wrote this piece as an invited policy article for a major journal but they declined to publish it. It’s still very much a draft and they made some suggestions, but since realistically I won’t be able to get back to this for a while and the text is becoming increasingly dated, I thought I would post it here. Enjoy!

Recent U.S. policy changes are mandating a particular vision of scientific communication: public access to data and publications for federally funded research. On February 22, 2013, the Office of Science and Technology Policy (OSTP) in the White House released an executive memorandum instructing the major federal funding agencies to develop plans to make both the datasets and research articles resulting from their grants publicly available [1]. On March 5, the House Science, Space, and Technology subcommittee convened a hearing on Scientific Integrity & Transparency, and on May 9, President Obama issued an executive order requiring government data to be made openly available to the public [2].

Many in the scientific community have demanded increased data and code disclosure in scholarly dissemination to address issues of computational reproducibility in computational science [3-19]. At first blush, these federal policy changes appear to support those scientific goals, but the scope of government action is limited in ways that can impair its ability to respond directly to these concerns. The scientific community should not rely on federal policy to bring about changes that enable reproducible computational research. These recent policy changes must be a catalyst for a well-considered update of research dissemination standards by the scientific community: computational science must move to publication standards that include the digital data and code sufficient to permit others in the field to computationally reproduce and verify the results. Authors and journals must be ready to use existing repositories and infrastructure to ensure the communication of reproducible computational discoveries.
Continue reading

Posted in Intellectual Property, Law, Open Data, Open Science, OSTP, Reproducible Research | 5 Comments

Peanut allergic reaction

I’ve used this blog for my professional interests, thinking that my personal life just isn’t all that interesting. I still don’t think my personal life is of broad interest, but I’m going to describe what happened to me after an accidental exposure to peanuts yesterday. I’m motivated for two reasons. One, I couldn’t find much personal discussion of these allergic responses and I think it would be helpful to have more (there was lots of discussion from moms, about potential causes, or badly written pseudoscience, but very few actual stories). Two, I was supposed to meet a friend last night and had to cancel because of this, and his reaction made me realize that these severe allergic reactions don’t seem to be well understood or accepted in general.

After the jump I’ll go into detail. If you are squeamish, don’t read on, and if you know me professionally you may not want to continue, in order to permit some dignity to persist in our future interactions. But I think the story is important for others who suffer from this. I remember lying there wondering whether this or that symptom was normal. Turns out, yes, but that’s not easy information to find.

Continue reading

Posted in anaphylaxis, health, personal | 2 Comments

What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine

There’s been an enormous amount of buzz since a study was released this week questioning the methodology in a published paper. The paper under fire is Reinhart and Rogoff’s “Growth in a Time of Debt” and the firing is being done by Herndon, Ash, and Pollin in their article “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Herndon, Ash, and Pollin claim to have found “spreadsheet errors, omission of available data, weighting, and transcription” in the original research which, when corrected, significantly reduce the magnitude of the original findings. These corrections were possible because of openness in economics, and this openness needs to be extended to make all computational publications reproducible.

How did this come about?
Continue reading

Posted in Uncategorized | 7 Comments

Data access going the way of journal article access? Insist on open data

The discussion around open access to published scientific results, the Open Access movement, is well known. The primary cause of the current situation — journal publishers owning copyright on journal articles and therefore charging for access — stems from authors signing their copyright over to the journals. I believe another solution should have been used, such as granting the journal a license to publish, e.g. Science’s readily available alternative license. At some level, authors were potentially entering into binding legal contracts without an understanding of the implications and without the right counsel.

I am seeing a similar situation arising with respect to data. It is not atypical for a data-producing entity, particularly one in the commercial sphere, to require that researchers with access to the data sign a non-disclosure agreement. This seems to be standard for Facebook data, Elsevier data, and many, many others. I’m witnessing researchers grabbing their pens and signing, and, as in the publication context, feeling themselves powerless to do otherwise. Again, they are without the appropriate counsel. Even the general counsel’s office at their institution typically sees the GC’s role as protecting the institution against liability, rather than the larger concern of protecting the scholar’s work and the integrity of the scholarly record. What happens when research from these protected datasets is published, and questioned? How can others independently verify the findings? They might need access to the data.

There are many legitimate reasons such data may not be publicly releasable, for example protection of subjects’ privacy (see what happened when Harvard released Facebook data from a study). But as scientists we should be mindful of the need for our published findings to be reproducible. Some commercial data come with no privacy concerns, only the company’s concern that it remain able to sell the data to other commercial entities, and sometimes not even that. Sometimes lawyers simply want an NDA to minimize any risk to the commercial entity that might arise should the data be released. They are not stewards of scientific knowledge, after all.

It is also perfectly rational for authors publishing findings based on these data to push back as hard as possible to ensure maximum reproducibility and credibility of their results. Many companies share data with scientists because they seek to deepen goodwill and ties with the academic community, or because they are interested in the results of the research. As researchers we should condition our acceptance of the data on its release when the findings are published, if there are no privacy concerns associated with the data. If there are privacy concerns, I can imagine sharing the data in a “walled garden” within which other researchers, but not the public, can access the data and verify results. There are a number of solutions that can bridge the gap between open access to data and an access-blocking NDA (e.g. differential privacy), and as scientists the integrity and reproducibility of our work is a core concern for which we bear responsibility in this negotiation for data.
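For the curious, here is a minimal sketch of the Laplace mechanism, one building block of differential privacy. The dataset, query, and epsilon value are illustrative assumptions, and a real deployment should use a vetted library with careful privacy accounting:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variate.
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon=0.5):
    # Counting queries have sensitivity 1: adding or removing one person
    # changes the count by at most 1, so noise scale is 1/epsilon.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical "walled garden" data: release a noisy count, not the records.
ages = [23, 35, 41, 29, 52, 38, 61, 44]
print(private_count(ages, lambda a: a > 40))  # true count is 4, plus noise
```

The noisy answer lets outsiders sanity-check an aggregate claim without anyone ever seeing an individual record, which is exactly the gap between full openness and an NDA.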

A few template data sharing agreements between academic researchers and data producing companies would be very helpful, if anyone feels like taking a crack at drafting them (Creative Commons?). Awareness of the issue is also important, among researchers, publishers, funders, and data producing entities. We cannot unthinkingly default to a legal situation regarding data that is anathema to scientific progress, as we did with access to scholarly publications.

Posted in Intellectual Property, Law, Open Data, Open Science, Reproducible Research, Scientific Method | 6 Comments

Getting Beyond Marketing: Scan and Tell

I love this idea: http://nomoresopa.com/wp/. It’s an Android app that lets you scan a product’s barcode and tells you whether the company that makes the product supports the Stop Online Piracy Act. What’s really happening here is the ability to get product information at the time of the purchase decision. You could, for example, find out what a manufacturer’s parent companies are. Did you know that Cascadian Farm, maker of breakfast cereals carried by Whole Foods, is owned by General Mills? This information is easily found on the General Mills website but not obvious when you’re looking at the box in the store. This kind of information can now be made readily available to consumers so they can make better and hopefully less biased choices. Love it.

Posted in Uncategorized | 1 Comment

Disrupt science: But not how you’d think

Two recent articles call for an openness revolution in science: one on GigaOM and the other in the Wall Street Journal. But they’ve got it all wrong. These folks are missing that the process of scientific discovery is not, at its core, an open process. It becomes an open process at the point of publication.

I am not necessarily in favor of greater openness during the process of scientific collaboration. I am, however, necessarily in favor of openness in the communication of discoveries at the time of publication. Publication is the point at which authors feel their work is ready for public presentation and scrutiny (that traditional publication does not actually give the public access to new scientific knowledge is a tragedy, and of course we should have Open Access). We even have a standard for the level of openness required at the time of scientific publication: reproducibility. This concept has been part of the scientific method since it was instigated by Robert Boyle in the 1660s: communicate sufficient information such that another researcher in the field can replicate your work, given appropriate tools and equipment.

If we’ve already been doing this for hundreds of years what’s the big deal now? Leaving aside the question of whether or not published scientific findings have actually been reproducible over the last few hundred years (for work on this question see e.g. Changing Order: Replication and Induction in Scientific Practice by Harry Collins), science, like many other areas of modern life, is being transformed by the computer. It is now tough to find any empirical scientific research not touched by the computer in some way, from simply storing and analyzing records to the radical transformation of scientific inquiry through massive multicore simulations of physical systems (see e.g. The Dynamics of Plate Tectonics and Mantle Flow: From Local to Global Scales).

This, combined with the facilitation of digital communication we call the Internet, is an enormous opportunity for scientific advancement. Not because we can collaborate or share our work pre-publication, as the articles assert (“all of which we can do, if we like”), but because computational research captures far more of the tacit knowledge involved in replicating a scientific experiment than ever before, making our science potentially more verifiable. The code and data a scientist provides to replicate his or her experiments can capture all the digital manipulations of the data, which now comprise much of the scientific discovery process. Commonly used techniques, such as the simulations carried out in the papers included in sparselab.stanford.edu (see e.g. “Breakdown Point of Model Selection when the Number of Variables Exceeds the Number of Observations” or any of the SparseLab papers), can now be replicated by downloading the short scripts we included with publication (see e.g. Reproducible Research in Computational Harmonic Analysis). This is a far cry from Boyle’s physical experiments with the air pump, and I’d argue one with enormously lower levels of tacit knowledge to communicate, an advantage we’re not capitalizing on today.

Computational experiments are complex. Without communicating the data and code that generated a result, it is nearly impossible to understand what was done. My thesis advisor famously paraphrased this idea: “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

It is simply wrong to assert that science has not adopted computational tools, as the GigaOM article does when it says “there is another world that makes these industries [traditional media players such as newspapers, magazines and book publishers] look like the most enthusiastic of early adopters [of technological progress]: namely, academic research.” Everywhere you look, computation is emerging as central to the scientific method. What needs to happen is a resolution of the credibility crisis in computational science resulting from this enthusiastic technological adoption: we need to establish the routine inclusion of the data and code with publication, such that the results can be conveniently replicated and verified. Many new efforts to help data and code sharing are documented here, from workflow tracking to code portability to unique identifiers that permit the regeneration of results in the cloud.

This is an imperative issue: most of our published computational results today are unverified and unverifiable (I have written on this extensively, as have many others). The emphasis on the process of creative problem solving as the bottleneck in scientific discovery is misplaced. Scientific innovation is inherently creative, so perhaps more collaborative tools that permit openness will encourage greater innovation. Or perhaps it is the case, as Paul Graham has pointed out in another context, that innovative thinking is largely a solitary act. What is clear is that it is of primary importance to establish publication practices that facilitate the replication and verification of results by including data and code, and not to confound these two issues.

Posted in Uncategorized | 7 Comments

Don’t expect computer scientists to be on top of every use that’s found for computers, including scientific investigation

Computational scientists need to understand and assert their computational needs, and see that they are met.

I just read this excellent interview with Donald Knuth, inventor of TeX and of the concept of literate programming, as well as author of the famous textbook The Art of Computer Programming. When asked for comments on (the lack of) software development using multicore processing, he says something very interesting: “that multicore technology isn’t that useful, except in a few applications such as rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc.” This caught my eye because parallel processing is a key advance for data processing. Statistical analysis typically proceeds record by record through the data, making it ideal for multithreaded applications. This isn’t some obscure part of science either – most science carried out today has some element of digital data processing (although of course not always at scales that warrant implementing parallel processing).
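A minimal sketch of this kind of record-by-record parallelism (the data and function names are hypothetical): each chunk of records is summarized independently and the partial results are combined at the end, an embarrassingly parallel map followed by a reduce.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk):
    # Each chunk can be summarized with no knowledge of the other chunks.
    return sum(chunk), len(chunk)

def parallel_mean(records, workers=4):
    # Split the records into roughly equal chunks, one per worker.
    size = max(1, len(records) // workers)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize, chunks))
    # Combine the independent partial results.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

data = list(range(1, 1001))
print(parallel_mean(data))  # 500.5
```

(In CPython the GIL keeps threads from speeding up pure-Python arithmetic; in practice process pools or numerical libraries do the heavy lifting, but the chunk-and-combine shape is the same.)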

Knuth then says that “all these applications [that use parallel processing] require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.” As the state of our scientific knowledge changes so does our problem solving ability, requiring modification of code used to generate scientific discovery. If I’m reading him correctly, Knuth seems to think this makes such applications less relevant to mainstream computer science.

The discussion reminded me of comments made at the Workshop on Algorithms for Modern Massive Datasets at Stanford in June 2010. Researchers in scientific computation (a specialized subdiscipline of computational science, see the Institute for Computational and Mathematical Engineering at Stanford or UT Austin’s Institute for Computational Engineering and Sciences for examples) were lamenting the direction computer hardware architecture was taking toward facilitating certain particular problems, such as particular techniques for matrix inversion and hot topics in linear algebra.

As scientific discovery transforms into a deeply computational process, we computational scientists must be prepared to partner with computer scientists to develop tools suited to the needs of scientific knowledge creation, or develop these skills ourselves. I’ve written elsewhere on the need to develop software that natively supports scientific ends (especially for workflow sharing; see e.g. http://stodden.net/AMP2011 ) and this applies to hardware as well.

Posted in Open Science, Reproducible Research, Uncategorized | 3 Comments

The nature of science in 2051

Right now, scientific questions are chosen for study in a largely autocratic way. Typically, grants for research on particular questions come from federal funding agencies, and scientists apply competitively, with the money going to the chosen researchers via a peer review process.

I suspect that, as the tools of online science become increasingly available, the real questions people face in their day-to-day lives will be more readily answered. If you think about all the things you do and decisions you make in a day, many of them don’t have a strong empirical basis. How you wash the dishes or do laundry, what foods are healthy, what environment to maintain in your house, what common illness remedies work best: who knows? But these types of questions, the ones that occur to you as you go about your daily business, aren’t prioritized in the investigatory model we have now for science. I predict that scientific investigation as a whole, not just that which is government funded, will move substantially toward providing answers to questions of increasingly local importance.

Posted in Open Science, Reproducible Research, Statistics | 1 Comment

Regulatory steps toward open science and reproducibility: we need a science cloud

This past January, Obama signed the America COMPETES Reauthorization Act. It contains two interesting sections that advance the notions of open data and the federal role in supporting online access to scientific archives, sections 103 and 104, which read in part:

“§ 103: The Director [of the Office of Science and Technology Policy at the White House] shall establish a working group under the National Science and Technology Council with the responsibility to coordinate Federal science agency research and policies related to the dissemination and long-term stewardship of the results of unclassified research, including digital data and peer-reviewed scholarly publications, supported wholly, or in part, by funding from the Federal science agencies.” (emphasis added)

This is a cause for celebration insofar as Congress has recognized that published articles are an incomplete communication of computational scientific knowledge, and the data (and code) must be included as well.

“§ 104: Federal Scientific Collections: The Office of Science and Technology Policy shall develop policies for the management and use of Federal scientific collections to improve the quality, organization, access, including online access, and long-term preservation of such collections for the benefit of the scientific enterprise.” (emphasis added)

I was very happy to see the importance of online access recognized, and hopefully this will include the data and code that underlies published computational results.

One step further in each of these directions: mention code explicitly and create a federally funded cloud not only for data but linked to code and computational results to advance computational reproducibility.

Posted in Law, Open Science, OSTP, Peer Review, Reproducible Research, Scientific Method, Technology | Leave a comment

Generalize clinicaltrials.gov and register research hypotheses before analysis

Stanley Young is Director of Bioinformatics at the National Institute of Statistical Sciences, and gave a talk in 2009 on problems in modern scientific research. For example: only 1 in 20 NIH-funded studies actually replicates; closed data and opacity; model selection for significance; multiple comparisons. Here is the link to his talk: Everything Is Dangerous: A Controversy. There are a number of good examples in the talk, and Young anticipates, and is more intellectually coherent than, the New Yorker article The Truth Wears Off, if you were interested in that.
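The multiple comparisons point is worth a quick back-of-the-envelope calculation: test enough true-null hypotheses at the 0.05 level and a spurious “significant” finding becomes nearly certain.

```python
# Probability of at least one false positive when testing k independent
# true-null hypotheses, each at significance level alpha.
alpha = 0.05
for k in (1, 20, 100):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>3} tests: P(at least one false positive) = {p_any_false_positive:.2f}")
# 1 test: 0.05; 20 tests: ~0.64; 100 tests: ~0.99
```

With 20 unregistered hypothesis tests, the odds favor finding something “significant” even when nothing is there, which is precisely why the analysis plan matters as much as the data.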

Idea: Generalize clinicaltrials.gov, where scientists register their hypotheses prior to carrying out their experiments. Why not do this for all hypothesis tests? Have a site where hypotheses are logged and time-stamped before researchers gather the data or carry out the actual hypothesis testing for the project. I’ve heard this idea mentioned occasionally, and both Young and Lehrer mention it as well.
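A minimal sketch of the registry idea (the hypothesis text and function names are hypothetical): log a hash and timestamp of each hypothesis before data collection, so the claim can later be shown to predate the analysis.

```python
import hashlib
import time

registry = []  # in reality, an append-only public log

def register_hypothesis(text):
    # Store only a hash and timestamp; the hypothesis itself can stay
    # private until publication without weakening the later proof.
    entry = {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "registered_at": time.time(),
    }
    registry.append(entry)
    return entry

def verify_hypothesis(text, entry):
    # Anyone can later check a disclosed hypothesis against the logged hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest() == entry["sha256"]

h = "Drug X reduces 30-day mortality relative to placebo."
entry = register_hypothesis(h)
print(verify_hypothesis(h, entry))                 # True
print(verify_hypothesis(h + " (revised)", entry))  # False: post-hoc edits detected
```

The hash-first design also sidesteps one objection to preregistration: researchers need not reveal their ideas to competitors, only commit to them.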

Posted in Open Science, Peer Review, Reproducible Research, Scientific Method, Statistics, Talks | Leave a comment

Smart disclosure of gov data

Imagine a cell phone database that includes terms of service, prices, fees, rates, different calling plans, quality of service, coverage maps, etc. “Smart disclosure,” as the term is used in federal government circles, means making data available such that it can be used and analyzed. Part of smart disclosure means collecting information from consumers as well, such as user experiences, bills, and service complaints. This is the vision of Joel Gurin, chief of the FCC’s Consumer and Governmental Affairs Bureau, speaking at the Open Gov R&D Summit organized by the White House. He notes that right away you run into issues of privacy and proprietary data that still need to be worked out.

He gives two examples of when it has worked. One is healthcare.gov: the government has collected and presented data, becoming the intermediary in presenting it [I took a brief look at this site and don’t see where to download the data]. Another example is BrightScope: they analyzed government-released pension and 401(k) fees to create a ranking product they sell to HR managers, so that folks can understand the appropriateness of the fees they pay.

The potential is enormous: imagine openness in FCC data. Gurin asks, how do we let many BrightScopes bloom?

Christopher Meyer, vice president for external affairs and information services at the Consumers Union, gives an example of failure through database mismanagement. There was a spike in their dataset of consumer complaints about acceleration problems in Toyota cars, but they didn’t look at the data and so didn’t notice it before Toyota issued the official recall. They’d like to do better: better-organized data and better tools for detecting issues in consumer complaints, with a mechanism to let the manufacturer respond early.

Posted in Uncategorized | 3 Comments

Open Gov and the VA

Peter Levin is CTO of the Department of Veterans Affairs and has a take on open gov tailored to his department: he’s restructuring the IT infrastructure within the VA to facilitate access. For example, the VA just processed its first paperless claim and is reducing claim turnaround time from 165 days to 40 days.

He is also focusing his efforts on emotional paths to engagement rather than numbers and figures. I hope they can provide both, but I see his comments as a reaction to, and criticism of, open data in general. Levin gives the analogy of the introduction of the telephone: the phone was fundamentally social in nature and hence caught on beyond people’s expectations, whereas a simple communicator of facts would not have. That encapsulates his vision for tech changes at the VA.

James Hamilton of Northwestern suggests the best way to help reporting on government information, and the communication of government activities, would be to improve the implementation of the Freedom of Information Act, particularly for journalists. The aim is to improve government accountability. He also advocates machine learning techniques, such as text analysis, to automatically analyze comments and draw meaning from data in a variety of formats. He believes this software exists and is in use by the government (even if that is true, I am doubtful of how well it works) and that a big improvement would be to make this software open source (he references Gary King’s software on text clustering, which is open and has been repurposed by the AP, for example).

George Strawn of the National Coordination Office (NITRD) notes that there are big problems even in combining data within agencies, let alone putting together datasets from disparate sources. He says that in his experience agency directors aren’t getting the data they need, data that is theoretically available, to make their decisions.

Posted in Uncategorized | Leave a comment

Open Gov Summit: Aneesh Chopra

I’m here at the National Archives attending the Open Government Research and Development Summit, organized by the Office of Science and Technology Policy in the White House. It’s a series of panel discussions addressing the impact and future directions of Obama’s open gov initiative, in particular how to develop a deeper research agenda with the resulting government data (see the schedule here).

Aneesh Chopra, our country’s first CTO, gave a framing talk in which he listed 5 questions he’d like to have answered through this workshop.

1. Big data: how do we strengthen capacity to understand massive data?
2. New products: what constitutes high-value data?
3. Open platforms: what are the policy implications of enabling third-party apps?
4. International collaboration: what models translate to strengthen democracy internationally?
5. Digital norms: what works and what doesn’t in public engagement?

He hopes the rest of the workshop will not only address these questions but coalesce around recommendations. Chopra wants to be able to set innovation prizes to move toward solutions to these questions.

Posted in Uncategorized | 1 Comment

A case study in the need for open data and code: Forensic Bioinformatics

Here’s a video of Keith Baggerly explaining his famous case study of why we need code and data to be made openly available in computational science: http://videolectures.net/cancerbioinformatics2010_baggerly_irrh. This is some of the work that resulted in the termination of clinical trials at Duke last November and the resignation of Anil Potti. Patients had been assigned into groups and actually given drugs before the trials were stopped. The story is shocking.

It’s also a good example of why traditional publishing doesn’t capture enough detail for reproducibility without the inclusion of data and code. Baggerly’s group at M.D. Anderson made reproducing these results, work he has labeled “forensic bioinformatics,” a priority, and they spent an enormous amount of time doing so. We certainly need independent verification of results, but doing so can often require knowledge of the methodology contained only in the code and data. In addition, Donoho et al. (earlier version here) make the point that even when findings are independently replicated, open code and data are necessary to understand the reason for discrepancies in results. In a section of the paper listing and addressing objections we write:

Objection: True Reproducibility Means Reproducibility from First Principles.

Argument: It proves nothing if I point and click and see a bunch of numbers as expected. It only proves something if I start from scratch and build your system and in my implementation I get your results.

Response: If you exactly reproduce my results from scratch, that is quite an achievement! But it proves nothing if your implementation fails to give my results since we won’t know why. The only way we’d ever get to the bottom of such discrepancy is if we both worked reproducibly.

(ps. Audio and slides for a slightly shorter version of Baggerly’s talk here)

Posted in Uncategorized | Leave a comment

Open peer review of science: a possibility

The Nature journal Molecular Systems Biology published an editorial “From Bench to Website” explaining their move to a transparent system of peer review. Anonymous referee reports, editorial decisions, and author responses are published alongside the final published paper. When this exchange is published, care is taken to preserve anonymity of reviewers and to not disclose any unpublished results. Authors also have the ability to opt out and request their review information not be published at all.

Here’s an example of the commentary that is being published alongside the final journal article.

Their move follows on a similar decision taken by The EMBO Journal (European Molecular Biology Organization) as described in an editorial here where they state that the “transparent editorial process will make the process that led to acceptance of a paper accessible to all, as well as any discussion of merits and issues with the paper.” Their reasoning cites problems in the process of scientific communication and they give an example by Martin Raff which was published as a letter to the editor called “Painful Publishing” (behind a paywall, apologies). Raff laments the power of the anonymous reviewers to demand often unwarranted additional experimentation as a condition of publication: “authors are so keen to publish in these select journals that they are willing to carry out extra, time consuming experiments suggested by referees, even when the results could strengthen the conclusions only marginally. All too often, young scientists spend many months doing such ‘referees’ experiments.’ Their time and effort would frequently be better spent trying to move their project forward rather than sideways. There is also an inherent danger in doing experiments to obtain results that a referee demands to see.”

Rick Trebino, a physics professor at Georgia Tech, penned a note detailing the often incredible steps he went through in trying to publish a scientific comment: “How to Publish a Scientific Comment in 1 2 3 Easy Steps.” It describes deep problems in our scientific discourse today. The recent clinical trials scandal at Duke University is another example of failed scientific communication: many efforts were made to publish correspondence regarding errors in the papers, which might have allowed problems in the research to be addressed earlier.

The editorial in Molecular Systems Biology also announces that the journal is joining many others in adopting a policy of encouraging the upload of the data that underlies results in the paper to be published alongside the final article. They go one step further and provide links from the figure in the paper to its underlying data. They give an example of such linked figures here. My question is how this dovetails with recent efforts by Donoho and Gavish to create a system of universal figure-level identifiers for published results, and the work of Altman and King to design Universal Numerical Fingerprints (UNFs) for data citation.

Posted in Open Science, Peer Review, Reproducible Research, Technology | 3 Comments

Chris Wiggins: Science is social

I had the pleasure of watching my friend and professor of applied physics and applied math Chris Wiggins give an excellent short talk at NYC’s social media week at Google. The video is available here: http://livestre.am/BUDx.

Chris makes the often forgotten point that science is inherently social. If a discovery isn’t publicly communicated, and hence added to our stock of knowledge, it isn’t science. He notes reproducibility as a manifestation of this openness in communication. (As another example of openness, Karl Popper suggested that if you’re interested in working in an international community, you should become a scientist.) Chris showcases many new web-based sharing tools and how they augment our fundamental norms rather than change them, hence his disagreement with the title of the session, “Research Gone Social”: science has always been social.

Posted in Uncategorized | 1 Comment

Science and Video: a roadmap

Once again I find myself in the position of having collected slides from talks, along with audio from the sessions. I need a simple way to pin these together so they form a coherent narrative, and I need a common sharing platform. We don’t really need to see the speaker to understand the message, but we need the audio to play in tandem with the slides, with the slides changing at the correct points. Some of the files are quite large: slide decks can be over 100MB, and right now the audio file I have is 139MB (slideshare has size limits that don’t accommodate this).

I’m writing because I feel the messages are important and need to be available to a wider audience. This is our culture, our heritage, our technology, our scientific knowledge, and our shared understanding. These presentations need to be available not just on principled open access grounds; it is imperative that other scientists hear these messages as well, amplifying scientific communication.

At a bar the other night a friend and I came up with the idea of S-SPAN: a C-SPAN for science. Talks and conferences could be filmed and shared widely on an internet platform. Of course such platforms exist, and some even target scientific talks, but the content also needs to be marshalled and directed onto the website. Some of the best stuff I’ve ever seen has floated into the ether.

So, I make an open call for these two tasks: a simple tool to pin together slides and audio (and slides and video), and an effort to collate video from scientific conference talks, and to film them where video doesn’t exist, all onto a common distribution platform. S-SPAN could start as raw and underproduced as C-SPAN, but I am sure it would develop from there.
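For the pinning task, much of the machinery already exists: ffmpeg’s concat demuxer can turn a set of slide images plus per-slide timings into a video track, which can then be muxed with the session audio. A minimal sketch of the idea (the file names and timings here are invented; it assumes ffmpeg is installed, and only constructs the command rather than running it):

```python
import shlex

def build_playlist(slides):
    """slides: list of (image_filename, duration_in_seconds).
    Returns the text of an ffmpeg concat-demuxer playlist."""
    lines = ["ffconcat version 1.0"]
    for name, duration in slides:
        lines.append(f"file '{name}'")
        lines.append(f"duration {duration}")
    # concat-demuxer quirk: repeat the last file so its duration is honored
    lines.append(f"file '{slides[-1][0]}'")
    return "\n".join(lines) + "\n"

def build_ffmpeg_command(playlist_path, audio_path, out_path):
    """Command that muxes the slide playlist with an audio file into a video."""
    cmd = [
        "ffmpeg", "-f", "concat", "-safe", "0", "-i", playlist_path,
        "-i", audio_path, "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-shortest", out_path,
    ]
    return shlex.join(cmd)

playlist = build_playlist([("slide01.png", 45), ("slide02.png", 90)])
print(playlist)
print(build_ffmpeg_command("playlist.txt", "talk_audio.mp3", "talk.mp4"))
```

The hard part, of course, is not the encoding but the pinning interface: marking the moment in the audio at which each slide should appear. That is the simple tool I am asking for.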

I’m looking at you, YouTube.

Posted in Conferences, Open Science, Talks, Technology | 10 Comments

My Symposium at the AAAS Annual Meeting: The Digitization of Science

Yesterday I held a symposium at the AAAS Annual Meeting in Washington DC, called “The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer,” that was intended to bring attention to how massive computation is changing the practice of science, particularly the lack of reproducibility of published computational scientific results. The fact is, most computational scientific results published today are unverified and unverifiable. I’ve created a page for the event here, with links to slide decks and abstracts. I couldn’t have asked for a better symposium, thanks to the wonderful speakers.

The first speaker was Keith A. Baggerly, who (now famously) tried to verify published results in Nature Medicine and uncovered a series of errors that led to the termination of clinical trials at Duke based on the original findings, and the resignation of one of the investigators (his slides). I then spoke about policies for realigning the IP framework scientists work under with their longstanding norms, to permit sharing of code and data (my slides). Fernando Perez described how computational scientists can learn not only about code sharing, quality control, and project management from the open source software community, but how that community has in fact developed what is in effect a deeply successful system of peer review for code. Code is verified line by line before being incorporated into the project, and there are software tools to enable communication between reviewer and submitter, down to the line of code (his slides).

Michael Reich then presented GenePattern, an OS-independent tool developed with Microsoft for creating data analysis pipelines and incorporating them into a Word document. Once in the document, tools exist to click and recreate a figure from its pipeline and examine what’s been done to the data. Robert Gentleman advocated the entire research paper as the unit of reproducibility, and David Donoho presented a method for assigning a unique identifier to figures within a paper, creating a link for each figure and permitting its independent reproduction (the slides). The final speaker was Mark Liberman, who showed how the human language technology community has developed a system of open data and code in its efforts to reduce errors in machine understanding of language (his slides). All the talks pushed on delineations of science from non-science, perhaps best encapsulated by a quote Mark introduced from John Pierce, a Bell Labs executive, in 1969: “To sell suckers, one uses deceit and offers glamor.”

There was some informal feedback, with a prominent person saying that this session was “one of the most amazing set of presentations I have attended in recent memory.” Have a look at all the slides and abstracts, including links and extended abstracts.

Update: Here are some other blog posts on the symposium: Mark Liberman’s blog and Fernando Perez’s blog.

Posted in Conferences, Intellectual Property, Law, Open Science, Reproducible Research, Scientific Method, Software, Statistics, Talks, Technology | Leave a comment

Letter Re Software and Scientific Publications – Nature

Mark Gerstein and I penned a reaction to two pieces published in Nature News last October, “Publish your computer code: it is good enough,” by Nick Barnes and “Computational Science…. Error” by Zeeya Merali. Nature declined to publish our note and so here it is.

Dear Editor,

We have read with great interest the recent pieces in Nature about the importance of computer codes associated with scientific manuscripts. As participants in the Yale roundtable mentioned in one of the pieces, we agree that these codes must be constructed robustly and distributed widely. However, we disagree with an implicit assertion, that the computer codes are a component separate from the actual publication of scientific findings, often neglected in preference to the manuscript text in the race to publish. More and more, the key research results in papers are not fully contained within the small amount of manuscript text allotted to them. That is, the crucial aspects of many Nature papers are often sophisticated computer codes, and these cannot be separated from the prose narrative communicating the results of computational science. If the computer code associated with a manuscript were laid out according to accepted software standards, made openly available, and looked over as thoroughly by the journal as the text in the figure legends, many of the issues alluded to in the two pieces would simply disappear overnight.

The approach taken by the journal Biostatistics serves as an exemplar: code and data are submitted to a designated “reproducibility editor” who tries to replicate the results. If he or she succeeds, the first page of the article is kitemarked “R” (for reproducible) and the code and data made available as part of the publication. We propose that high-quality journals such as Nature not only have editors and reviewers that focus on the prose of a manuscript but also “computational editors” that look over computer codes and verify results. Moreover, many of the points made here in relation to computer codes apply equally well to large datasets that underlie experimental manuscripts. These are often organized, formatted, and deposited into databases as an afterthought. Thus, one could also imagine a “data editor” who would look after these aspects of a manuscript. All in all, we have to come to the realization that current scientific papers are more complicated than just a few thousand words of narrative text and a couple of figures, and we need to update journals to handle this reality.

Yours sincerely,

Mark Gerstein (1,2,3)
Victoria Stodden (4)

(1) Program in Computational Biology and Bioinformatics,
(2) Department of Molecular Biophysics and Biochemistry, and
(3) Department of Computer Science,
Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520 Mark.Gerstein@Yale.edu

(4) Department of Statistics, Columbia University, 1255 Amsterdam Ave, New York, NY 10027
vcs@stodden.net

Posted in Open Science, Reproducible Research, Scientific Method, Software, Technology | 8 Comments

Startups Awash in Data: Quantitative Thinkers Needed

Unix logs everything, which makes web-based data collection easy; in fact, it’s almost difficult not to do. As a result, internet startups often find themselves gathering enormous amounts of data: site use patterns, click-streams, user demographics and preference functions, purchase histories, and so on. Many of these companies know they are sitting on a goldmine, but how do they extract the relevant information from these scads of data? More precisely, how do they better predict user behavior and preferences?

Statisticians, particularly through machine learning, have been working on this problem for a long time. Since arriving in New York City from Silicon Valley I’ve observed an enormous amount of quantitative talent here, at least in part due to the influence of the finance industry. But these quantitative skills are precisely what’s needed to make sense of the data collected by startups, and here NYC looks to have an edge over Silicon Valley. Friends Evan Korth, Hilary Mason, and Chris Wiggins (two professors and a former professor) are building bridges to connect these two worlds. Their primary effort, HackNY, is a summer program pairing quantitatively talented students with startups in need. (Wiggins’ mantra is to “get the kids off the street” by giving them alternatives to entering the finance profession.)

The New York startup scene is distinguishing itself from Silicon Valley through efforts to make direct use of the abundance of quantitative skills available here. Hilary and Chris created an excellent guideline for data-driven analysis in the startup context, “A Taxonomy of Data Science”: Obtain, Scrub, Explore, Model, and iNterpret. These data often measure phenomena in new ways, use novel data structures, and provide new opportunities for innovative data research and model building. Lots of data and lots of skill: great for statisticians and anyone interested in learning from data, as well as for those collecting it.
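Hilary and Chris’s taxonomy is concrete enough to sketch end to end in a few lines. A toy pass through all five steps (the records and field names below are invented for illustration, standing in for the click-stream data a startup might actually obtain):

```python
from statistics import fmean

# Obtain: in practice, scraped logs or an API dump; here, inlined records
raw = [
    {"ads_shown": 1, "clicks": 2},
    {"ads_shown": 2, "clicks": 4},
    {"ads_shown": 3, "clicks": 6},
    {"ads_shown": 4, "clicks": None},  # a corrupted record
]

# Scrub: drop records with missing values
clean = [r for r in raw if None not in r.values()]

# Explore: simple summaries before modeling anything
xs = [r["ads_shown"] for r in clean]
ys = [r["clicks"] for r in clean]
print(len(clean), fmean(xs), fmean(ys))

# Model: one-variable least-squares fit, clicks ~ ads_shown
xbar, ybar = fmean(xs), fmean(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

# iNterpret: state the finding in plain terms
print(f"each ad shown is associated with about {slope:.1f} more clicks")
```

The point of the taxonomy is that the unglamorous early steps (Obtain, Scrub) dominate the effort in practice, while the statistical training startups are hiring for lives in the last three.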

Posted in Software, Startups, Statistics, Technology | 3 Comments

Open Data Dead on Arrival

In 1984 Karl Popper wrote a private letter to an inquirer he didn’t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that:

“Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his fellow men (or ‘to society’) to represent the results of his study as simply, clearly and modestly as he can. The worst thing that intellectuals can do — the cardinal sin — is to try to set themselves up as great prophets vis-a-vis their fellow men and to impress them with puzzling philosophies. Anyone who cannot speak simply and clearly should say nothing and continue to work until he can do so.”

Aside from the offensive sexism of referring to intellectuals as males, there is another way this imperative should be updated for intellectualism today. The movement to make data available online is picking up momentum — as it should — and open code is following suit (see http://mloss.org for example). But data should not be confused with facts, and extending the simple communication Popper refers to beyond the written or spoken word is the only way open data will produce dividends. It isn’t enough to post raw data or undocumented code. Data and code should be considered part of intellectual communication, and made as simple as possible for “fellow men” to understand. Just as adequate English vocabulary is assumed in the nonquantitative communication Popper refers to, certain basic coding and data knowledge can be assumed as well. This means the same thing as it does in the literary case: the elimination of extraneous information and obfuscating terminology. There is no need to bury interested parties in an Enron-like shower of bits. It also means using a format for digital communication that is conducive to reuse, such as a flat text file or another non-proprietary format; PDF files, for example, cannot be considered an acceptable format for either data or code. Facilitating reproducibility must be the gold standard for data and code release.
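Concretely, “as simply and clearly as possible” can mean nothing more than a flat CSV file with a short human-readable codebook alongside it. A minimal sketch (the variable names, station identifiers, and values here are hypothetical, purely to show the shape of a reusable release):

```python
import csv
import io

# The codebook travels with the data: units, definitions, provenance
codebook = """\
temperature_c: daily mean surface temperature, degrees Celsius
station_id: station identifier as used in the paper
source: raw instrument logs, quality-controlled as described in the paper
"""

rows = [
    {"station_id": "10381", "temperature_c": 11.2},
    {"station_id": "10381", "temperature_c": 12.7},
]

# Flat, non-proprietary text: any tool or language can read this back
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["station_id", "temperature_c"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(codebook)
print(csv_text)
```

Nothing here is clever, and that is the point: a reader needs no proprietary software and no guesswork about units or meaning to reuse the release.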

And who are these “fellow men”?

Well, fellow men and women that is, but back to the issue. Much of the history of scientific communication has dealt with the question of demarcating the appropriate group to whom the reasoning behind findings would be communicated: the definition of the scientific community. Clearly, communicating very technical and specialized results to laymen would take intellectuals’ time away from doing what they do best, being intellectual. On the other hand, some investment in explanation is essential for establishing a finding as an accepted fact — assuring others that sufficient error has been controlled for and eliminated in the process of scientific discovery. These others ought to be able to verify results, find mistakes, and ideally build on the results (or the gaps in the theory) and thereby further our understanding. So there is a tradeoff. Hence, for example, the establishment of the Royal Society as a body whose primary purpose was discussing scientific experiments and results. Couple this with Newton’s surprise, even irritation, at having to explain results he put forth to the Society in his one and only journal publication, in their journal Philosophical Transactions (he called the various clarifications tedious, sought to withdraw from the Royal Society, and subsequently never published another journal paper; see the last chapter of The Access Principle). There is a mini-revolution underfoot that has escaped the spotlight of attention on open data, open code, and open scientific literature: the fact that the intent is to open to the public. Not open to peers, or appropriately vetted scientists, or selected ivory tower mates, but to anyone. Never before has the standard for communication been “everyone”; in fact, quite the opposite. Efforts had traditionally been expended narrowing and selecting the community privileged enough to participate in scientific discourse.

So what does public openness mean for science?

Recall the files leaked from the University of East Anglia’s Climatic Research Unit last November. Much of the information revealed concerned scientifically suspect (and ethically dubious) attempts not to reveal data and methods underlying published results. Although that tack seems to have softened now, some initial responses defended the climate scientists’ right to be closed with regard to their methods, due to the possibility of “denial of service attacks”: the ripping apart of methodology (recall all science is wrong, an asymptotic progression toward truth at best) not with the intent of finding meaningful errors that halt the acceptance of findings as facts, but merely to tie up the climate scientists so they cannot attend to real research. This is the same tradeoff as described above. An interpretation of this situation cannot be made without the complicating realization that peer review — the review process that vets articles for publication — doesn’t check computational results, but largely operates as if papers were expounding results from the pre-computational scientific age. The outcome, if computational methodologies are allowed to remain closed from view, is that they are directly vetted nowhere. Hardly an acceptable basis for establishing facts. My own view is that data and code must be communicated publicly, with attention paid to Popper’s admonition: as simply and clearly as possible, such that the results can be replicated. Not participating in dialog with those insufficiently knowledgeable to engage will remain part of our scientific norms; in fact this is enshrined in the structure of our scientific societies of old. Others can take up those ends of the discussion, on blogs and in digital forums.
But public openness is important not just because taxpayers have a right to what they paid for (perhaps they do, but this argument quickly falls apart since not all the public are technically taxpayers, and that seems a wholly unjust way of deciding who shall have access to scientific knowledge and who shall not; clearly we mean society), but because of the increasing inclusiveness of the scientific endeavor. How do we determine who is qualified to find errors in our scientific work? We don’t. Real problems will get noticed regardless of with whom they originate, many eyes making all bugs shallow. And I expect peer review for journal publishing to incorporate computational evaluation as well.

Where does this leave all the open data?

Unused, unless efforts are expended to communicate the meaning of the data and to maximize the usability of the code. Data is not synonymous with facts – methods for understanding data, and turning its contents into facts, are embedded within the documentation and code. Take for granted that users understand the coding language or basic scientific computing functions, but clearly and modestly explain the novel contributions. Facilitate reproducibility. Without this, data may be open but will remain de facto in the ivory tower.

Posted in Developing world, Intellectual Property, Open Science, Reproducible Research, Scientific Method, Software, Statistics, Technology | 6 Comments

Ars technica article on reproducibility in science

John Timmer wrote an excellent article called “Keeping computers from ending science’s reproducibility.” I’m quoted in it. Here’s an excellent follow up blog post by Grant Jacobs, “Reproducible Research and computational biology.”

Posted in Open Science, Reproducible Research, Scientific Method, Software, Statistics, Technology, Uncategorized | Tagged | Leave a comment

Code Repository for Machine Learning: mloss.org

The folks at mloss.org — Machine Learning Open Source Software — invited a blog post on my roundtable on data and code sharing, held at Yale Law School last November. mloss.org’s philosophy is stated as:

“Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for a wide range of applications. Inspired by similar efforts in bioinformatics (BOSC) or statistics (useR), our aim is to build a forum for open source software in machine learning.”

The site is excellent and worth a visit. The guest blog Chris Wiggins and I wrote starts:

“As pointed out by the authors of the mloss position paper [1] in 2007, “reproducibility of experimental results is a cornerstone of science.” Just as in machine learning, researchers in many computational fields (or in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]

In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, Yale Information Society Project Fellow Victoria Stodden convened a roundtable on the topic on November 21, 2009. Attendees included statistical scientists such as Robert Gentleman (co-developer of R) and David Donoho, among others.”

keep reading at http://mloss.org/community/blog/2010/jan/26/data-and-code-sharing-roundtable/. We made an effort to reference work in other fields on reproducibility in computational science.

Posted in Conferences, Open Science, Reproducible Research, Software | Leave a comment

Video from "The Great Climategate Debate" held at MIT December 10, 2009

This is an excellent panel discussion regarding the leaked East Anglia docs as well as standards in science and the meaning of the scientific method. It was recorded on Dec 10, 2009, and here’s the description from the MIT World website: “The hacking of emails from the University of East Anglia’s Climate Research Unit in November rocked the world of climate change science, energized global warming skeptics, and threatened to derail policy negotiations at Copenhagen. These panelists, who differ on the scientific implications of the released emails, generally agree that the episode will have long-term consequences for the larger scientific community.”

Moderator: Henry D. Jacoby, Professor of Management, MIT Sloan School of Management, and Co-Director, Joint Program on the Science and Policy of Global Change, MIT.

Panelists:
Kerry Emanuel, Breene M. Kerr Professor of Atmospheric Science, Department of Earth, Atmospheric Science and Planetary Sciences, MIT;
Judith Layzer, Edward and Joyce Linde Career Development Associate Professor of Environmental Policy, Department of Urban Studies and Planning, MIT;
Stephen Ansolabehere, Professor of Political Science, MIT, and
Professor of Government, Harvard University;
Ronald G. Prinn, TEPCO Professor of Atmospheric Science, Department of Earth, Atmospheric and Planetary Sciences, MIT; Director, Center for Global Change Science; Co-Director of the MIT Joint Program on the Science and Policy of Global Change;
Richard Lindzen, Alfred P. Sloan Professor of Meteorology, Department of Earth, Atmospheric and Planetary Sciences, MIT.

Video, running at nearly 2 hours, is available at http://mitworld.mit.edu/video/730.

Posted in Open Science, Reproducible Research, Scientific Method, Talks | Leave a comment

My answer to the Edge Annual Question 2010: How is the Internet Changing the Way You Think?

At the end of every year editors at my favorite website The Edge ask intellectuals to answer a thought-provoking question. This year it was “How is the internet changing the way you think?” My answer is posted here:
http://www.edge.org/q2010/q10_15.html#stodden

Posted in Open Science, Reproducible Research, Scientific Method, shameless self-promotion, Statistics, Technology | Leave a comment

Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave, comments posted here, asked for feedback on implementation issues. The second wave requested input on Features and Technology (our post is here). For the third and final wave on Management, Chris Wiggins, Matt Knepley, and I posted the following comments:

Q1: Compliance. What features does a public access policy need to ensure compliance? Should this vary across agencies?

One size does not fit all research problems across all research communities, and a heavy-handed general release requirement across agencies could result in de jure compliance – release of data and code as per the letter of the law – without the extra effort necessary to create usable data and code that facilitate reproducibility (and extension) of the results. One solution to this barrier would be to require grant applicants to formulate plans for release of the code and data generated through their research proposal, if funded. This creates a natural mechanism by which grantees (and peer reviewers), who best know their own research environments and community norms, contribute complete strategies for release. This would allow federal funding agencies to gather data on needs for release (repositories, further support, etc.); understand which research problem characteristics engender which solutions, and which solutions are most appropriate in which settings; and uncover as-yet-unrecognized problems particular researchers may encounter. These data would permit federal funding agencies to craft release requirements that are more sensitive to the barriers researchers face and the demands of their particular research problems, and to implement strategies for enforcing those requirements. This approach also permits researchers to address confidentiality and privacy issues associated with their research.

Examples:

One exemplary precedent by a UK funding agency is the January 2007 “Policy on data management and sharing”
(http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm)
adopted by The Wellcome Trust (http://www.wellcome.ac.uk/About-us/index.htm) according to which “the Trust will require that the applicants provide a data management and sharing plan as part of their application; and review these data management and sharing plans, including any costs involved in delivering them, as an integral part of the funding decision.” A comparable policy statement by US agencies would be quite useful in clarifying OSTP’s intent regarding the relationship between publicly-supported research and public access to the research products generated by this support.

Continue reading

Posted in Intellectual Property, Law, Open Science, OSTP, Reproducible Research, Scientific Method, Uncategorized | Leave a comment

Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the second wave of the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave, comments posted here and on the OSTP site here (scroll to the second-last comment), asked for feedback on implementation issues. The second wave requested input on Features and Technology, and Chris Wiggins and I posted the following comments:

We address each of the questions for phase two of OSTP’s forum on public access in turn. The answers generally depend on the community involved and (particularly for question 7, asking for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial, however, in (i) providing a centralized repository for access to agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).

Continue reading

Posted in Intellectual Property, Law, Open Science, OSTP, Reproducible Research, Scientific Method, Software, Statistics, Technology | 4 Comments

Nathan Myhrvold advocates for Reproducible Research on CNN

On yesterday’s edition of Fareed Zakaria’s GPS on CNN, former Microsoft CTO and current CEO of Intellectual Ventures Nathan Myhrvold said reproducible research is an important response for climate science in the wake of Climategate, the recent file leak from a major climate modeling center in England (I blogged my response to the leak here). The video is here (see especially 16:27), and the transcript is here.

Posted in Intellectual Property, Open Science, Reproducible Research, Scientific Method | Leave a comment

The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf:

Open access to our body of federally funded research, including not only published papers but also any supporting data and code, is imperative, not just for scientific progress but for the integrity of the research itself. We list below nine focus areas and recommendations for action.

Continue reading

Posted in Intellectual Property, Law, Open Science, OSTP, Reproducible Research, Scientific Method, Uncategorized | 4 Comments

The Climate Modeling Leak: Code and Data Generating Published Results Must be Open and Facilitate Reproducibility

On November 20, documents including email and code spanning more than a decade were leaked from the Climatic Research Unit (CRU) at the University of East Anglia in the UK.

The Leak Reveals a Failure of Reproducibility of Computational Results

It appears the leak came about through a long battle to get the CRU scientists to reveal the code and data associated with published results, and it highlights a crack in the scientific method as practiced in computational science. Publishing standards have not yet adapted to the relatively new computational methods used pervasively across scientific research today.

Other branches of science have long-established methods for bringing reproducibility into their practice. Deductive or mathematical results are published only with proofs, and there are long-established standards for what constitutes an acceptable proof. Empirical science has clear mechanisms for communicating methods with the goal of facilitating replication. Computational methods are a relatively new addition to the scientist’s toolkit, and the scientific community is only just establishing similar standards for verification and reproducibility in this new context. Peer review and journal publishing have generally not yet adapted to the use of computational methods and still operate as suited to the deductive or empirical branches, creating a growing credibility gap in computational science.

Verifying Computational Results without Clear Communication of the Steps Taken is Near-Impossible

The frequent near-impossibility of verification of computational results when reproducibility is not considered a research goal is shown by the miserable travails of “Harry,” a CRU employee with access to their system who was trying to reproduce the temperature results. The leaked documents contain logs of his unsuccessful attempts. Harry apparently was unable to reproduce CRU’s published results after four years of documented effort.

This example also illustrates why a decision to leave reproducibility to others, beyond a cursory description of methods in the published text, is wholly inadequate for computational science. Harry seems to have had access to the data and code used and couldn’t computationally replicate the results. The merging and preprocessing of data in preparation for modeling and estimation encompasses a potentially very large number of steps, and a change in any one could produce different results. Fitting models, running simulations, parameter settings, and function invocation sequences must be communicated, again because the final results are a culmination of many decisions and each small step must match the original work – a Herculean task. Responding with raw data when questioned about computational results is likely a canard, not intended to seriously facilitate computational reproducibility.
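The point generalizes: a published number is the culmination of every preprocessing and parameter choice, and a single undocumented step can silently change it. A minimal sketch (the data, function name, and parameter are all hypothetical, invented purely for illustration) of how one small, unreported preprocessing decision yields two different "results" from identical raw data:

```python
# Illustrative only: two analysts compute a mean anomaly from the same
# raw station readings, differing in one undocumented preprocessing step.

def mean_anomaly(readings, baseline, drop_missing=True):
    """Average deviation from a baseline. Missing values are either
    dropped entirely or imputed with the baseline, depending on one
    small choice that a 'cursory description of methods' would omit."""
    if drop_missing:
        vals = [r for r in readings if r is not None]
    else:
        vals = [r if r is not None else baseline for r in readings]
    return sum(v - baseline for v in vals) / len(vals)

raw = [14.1, 14.9, None, 15.3, 14.7, None, 15.0]
baseline = 14.0

a = mean_anomaly(raw, baseline, drop_missing=True)   # analyst A's choice
b = mean_anomaly(raw, baseline, drop_missing=False)  # analyst B's choice
print(round(a, 3), round(b, 3))  # same raw data, different published result
```

With many such steps chained together, matching the original work without the original scripts becomes the Herculean task described above; releasing only the raw data resolves none of it.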

The story of Penn State professor of meteorology Michael Mann’s famous hockey stick temperature time series estimates is an example where lack of verifiability had important consequences. Release of the code and data used to generate the results in the hockey stick paper would perhaps have avoided the convening of panels to assess the papers. The hockey stick is a dramatic illustration of global warming and became something of an informal logo for the U.N.’s Intergovernmental Panel on Climate Change (IPCC). Mann was an author of the 2001 IPCC Assessment Report, and was a lead author on the “Copenhagen Diagnosis,” a report released Nov 24 and intended to synthesize the hundreds of research papers about human-induced climate change published since the last IPCC assessment two years ago. The report was prepared in advance of the Copenhagen climate summit scheduled for Dec 7-18. Emails between CRU researchers and Mann are included in the leak, which happened right before the release of the Copenhagen Diagnosis (a quick search of the leaked emails for “Mann” returned 489 matches).

These reports are important in part because of their impact on policy. As CBS News reports, “In global warming circles, the CRU wields outsize influence: it claims the world’s largest temperature data set, and its work and mathematical models were incorporated into the United Nations Intergovernmental Panel on Climate Change’s 2007 report. That report, in turn, is what the Environmental Protection Agency acknowledged it ‘relies on most heavily’ when concluding that carbon dioxide emissions endanger public health and should be regulated.”

Discussions of Appropriate Level of Code and Data Disclosure on RealClimate.org, Before and After the CRU Leak

For years researchers had requested the data and programs used to produce Mann’s Hockey Stick result, and those requests were resisted. The repeated requests for code and data culminated in Freedom of Information (FOI) requests, in particular those made by Willis Eschenbach, who tells the story of the requests he made for underlying code and data up until the time of the leak. It appears that a file, FOI2009.zip, was placed on CRU’s FTP server and comments alerting people to its existence were then posted on several key blogs.

The importance of disclosure of code and data in one part of the climate change community is illustrated in this fascinating discussion on the blog RealClimate.org in February. (Thank you to Michael Nielsen for the pointer.) RealClimate.org has five primary authors, among them Michael Mann and Gavin Schmidt. In the RealClimate blog post from November 27, Where’s the Data, the position now seems to be very much in favor of data release, but the first comment asks for the steps taken in reconstructing the results as well. This is right: reproducibility of results should be the concern (as argued here, for example).

Policy and Public Relations

A dangerous ramification of the leak could be an undermining of public confidence in science and the conduct of scientists. My sense is that making code and data readily available in a way that facilitates reproducibility of results can help avoid distractions from the real science, such as questions of whether FOIA requests were evaded, whether data were fudged, or whether scientists acted improperly in squelching dissent or manipulating journal editorial boards. Perhaps data release is becoming an accepted norm, but code release for computational reproducibility must follow. The issue here is verification and reproducibility, which is important for understanding and assessing whether the core science done at CRU was correct, even for peer-reviewing scientists.

Posted in Law, Open Science, Reproducible Research, Scientific Method, Software, Statistics, Technology, Uncategorized | 7 Comments

Software and Intellectual Lock-in in Science

In a recent discussion with a friend, a hypothesis occurred to me: that increased levels of computation in scientific research could cause greater intellectual lock-in to particular ideas.

Examining how ideas change in scientific thinking isn’t new. Thomas Kuhn for example caused a revolution himself in how scientific progress is understood with his 1962 book The Structure of Scientific Revolutions. The notion of technological lock-in isn’t new either, see for example Paul David’s examination of how we ended up with the non-optimal QWERTY keyboard (“Clio and the Economics of QWERTY,” AER, 75(2), 1985) or Brian Arthur’s “Competing Technologies and Lock-in by Historical Events: The Dynamics of Allocation Under Increasing Returns” (Economic Journal, 99, 1989).

Computer-based methods are relatively new to scientific research, and are reaching even the most seemingly uncomputational edges of the humanities, like English literature and archaeology. Did Shakespeare really write all the plays attributed to him? Let’s see if word distributions by play are significantly different. Can we use signal processing to “see” artifacts without unearthing them, thereby preserving their features?
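The word-distribution idea can be made concrete with a toy sketch: tally a fixed set of common function words (authors tend to use these at characteristic rates) in two texts and compare the profiles with a Pearson chi-square statistic. The word list, texts, and helper names here are invented for illustration; real stylometry uses far larger feature sets and proper significance testing.

```python
from collections import Counter

# A tiny, illustrative set of function words; real studies use hundreds.
FUNCTION_WORDS = ["the", "and", "of", "to", "in"]

def word_profile(text):
    """Counts of each function word in a (pre-tokenized, lowercase) text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in FUNCTION_WORDS]

def chi_square(obs_a, obs_b):
    """Pearson chi-square statistic comparing two frequency profiles."""
    total_a, total_b = sum(obs_a), sum(obs_b)
    stat = 0.0
    for a, b in zip(obs_a, obs_b):
        if a + b == 0:
            continue  # word absent from both texts contributes nothing
        expected_a = (a + b) * total_a / (total_a + total_b)
        expected_b = (a + b) * total_b / (total_a + total_b)
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat

play_a = "the cat and the dog of the town went to the well in the dark"
play_b = "to be and not to be in the question of souls and the mind"
print(chi_square(word_profile(play_a), word_profile(play_b)))
```

A large statistic relative to the chi-square distribution with (number of words − 1) degrees of freedom would suggest the two texts draw on function words at different rates, i.e., possibly different hands.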

Software has the property of encapsulating ideas and methods for scientific problem solving. Software also has a second property: brittleness. It breaks before it bends. Computing hardware has grown steadily in capability, speed, reliability, and capacity, but as Jaron Lanier describes in his essay on The Edge, trends in software are “a macabre parody of Moore’s Law” and the “moment programs grow beyond smallness, their brittleness becomes the most prominent feature, and software engineering becomes Sisyphean.” My concern is that as ideas become increasingly manifest as code, with all the scientific advancement that can imply, it becomes more difficult to adapt, modify, and change the underlying scientific approaches. As scientists, we become more locked into particular methods for solving scientific questions and particular ways of thinking.

For example, what happens when an approach to solving a problem is encoded in software and becomes a standard tool? Many such tools exist, and are vital to research – just look at the list at Andrej Sali’s highly regarded lab at UCSF, or the statistical packages in the widely used language R, for example. David Donoho laments the now widespread use of test cases he released online to illustrate his methods for particular types of data, “I have seen numerous papers and conference presentations referring to “Blocks,” “Bumps,” “HeaviSine,” and “Doppler” as standards of a sort (this is a practice I object to but am powerless to stop; I wish people would develop new test cases which are more appropriate to illustrate the methodology they are developing).” Code and ideas should be reused and built upon, but at what point does the cost of recoding outweigh the scientific cost of not improving the method? In fact, perhaps counterintuitively, it’s hardware that is routinely upgraded and replaced, not the seemingly ephemeral software.
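Donoho’s test cases are a vivid example of how little code it takes to become a standard: each signal is a few lines, yet once released they anchored an entire literature. A sketch of the “Doppler” test case in its commonly cited form from the Donoho–Johnstone wavelet shrinkage papers, f(t) = √(t(1−t)) · sin(2π(1+ε)/(t+ε)) with ε = 0.05 (the function name and sampling convention here are my own):

```python
import math

def doppler(n, eps=0.05):
    """The 'Doppler' test signal of Donoho and Johnstone:
    sqrt(t*(1-t)) * sin(2*pi*(1+eps)/(t+eps)),
    sampled at n equispaced points t = 1/n, 2/n, ..., 1."""
    signal = []
    for i in range(1, n + 1):
        t = i / n
        amplitude = math.sqrt(t * (1 - t))          # envelope, peaks at t = 0.5
        phase = 2 * math.pi * (1 + eps) / (t + eps)  # oscillates faster near t = 0
        signal.append(amplitude * math.sin(phase))
    return signal

y = doppler(1024)  # the canonical sample size in the original papers was 2048
```

The signal’s increasingly rapid oscillation near t = 0 is exactly what made it a good stress test for adaptive smoothers, and exactly why it calcified into a benchmark whether or not it suits the method being illustrated.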

In his essay Lanier argues that the brittle state of software today results from metaphors used by the first computer scientists – electronic communications devices that sent signals on a wire. It’s an example of intellectual lock-in itself that’s become hardened in how we encode ideas as machine instructions now.

Posted in Open Science, Reproducible Research, Scientific Method, Software, Technology | 2 Comments

My Interview with ITConversations on Reproducible Research

On September 30, I was interviewed by Jon Udell from ITConversations.org in his Interviews with Innovators series, on Reproducibility of Computational Science.

Here’s the blurb: “If you’re a writer, a musician, or an artist, you can use Creative Commons licenses to share your digital works. But how can scientists license their work for sharing? In this conversation, Victoria Stodden — a fellow with Science Commons — explains to host Jon Udell why scientific output is different and how Science Commons aims to help scientists share it freely.”

Posted in Intellectual Property, Law, Open Science, Reproducible Research, Scientific Method, shameless self-promotion, Statistics, Technology | 2 Comments

Optimal Information Disclosure Levels: Data.gov and "Taleb's Criticism"

I was listening to the audio recording of last Friday’s Scientific Data for Evidence Based Policy and Decision Making symposium at the National Academies, and was struck by the earnest effort on the part of members of the White House to release governmental data to the public. Beth Noveck, Obama’s Deputy Chief Technology Officer for Open Government, frames the effort with a slogan: “Transparency, Participation, and Collaboration.” A plan is being developed by the White House in collaboration with the OMB to implement these three principles via a “massive release of data in open, downloadable, accessible for machine readable formats, across all agencies, not only in the White House,” says Beth. “At the heart of this commitment to transparency is a commitment to open data and open information.”

Vivek Kundra, Chief Information Officer in the White House’s Open Government Initiative, was even more explicit, saying that “the dream here is that you have a grad student, sifting through these datasets at 3 in the morning, who finds, at the intersection of multiple datasets, insight that we may not have seen, or developed a solution that we may not have thought of.”

This is an extraordinary vision. This discussion comes hot on the heels of a debate in Congress regarding the level of information members are willing to release to the public in advance of voting on a bill. Last Wednesday CBS reported, with regard to the health care bill, that “[t]he Senate Finance Committee considered for two hours today a Republican amendment — which was ultimately rejected — that would have required the ‘legislative’ language of the committee’s final bill, along with a cost estimate for the bill, to be posted online for 72 hours before the committee voted on it. Instead, the committee passed a similar amendment, offered by Committee Chair Max Baucus (D-Mont.), to put online the ‘conceptual’ or ‘plain’ language of the bill, along with the cost estimate.” What is remarkable is the sense this gives that somehow the public won’t understand the raw text of the bill (I noticed no compromise position offered that would make both versions available, which seems an obvious solution).

The White House’s efforts have the potential to test this hypothesis: if given more information, will people pull things out of context and promulgate misinformation? The White House is betting that they won’t, and Kundra states that the White House is accompanying dataset release with efforts to provide contextual meta-data for each dataset while safeguarding national security and individual privacy rights.

This sense of limits in openness isn’t unique to governmental issues, and in my research on data and code sharing among scientists I’ve termed the concern “Taleb’s criticism.” In a 2008 essay on The Edge website, Taleb worries about the dangers that can result from people using statistical methodology without a clear understanding of the techniques. An example of concern about Taleb’s criticism appeared on UCSF’s EVA website, a repository of programs for automatic protein structure prediction. The UCSF researchers won’t release their code publicly because, as stated on their website, “We are seriously concerned about the ‘negative’ aspect of the freedom of the Web being that any newcomer can spend a day and hack out a program that predicts 3D structure, put it on the web, and it will be used.” As the congressmen seemed to fear, for these folks openness is scary because people may misuse the information.

It could be argued, and for scientific research should be argued, that an open dialog of an idea’s merits is preferable to no dialog at all, and misinformation can be countered and exposed. Justice Brandeis famously elucidated this point in Whitney v. California (1927), writing that “If there be time to expose through discussion the falsehood and fallacies, to avert the evil by the processes of education, the remedy to be applied is more speech, not enforced silence.” Data.gov is an experiment in context and may bolster trust in the public release of complex information. Speaking of the Data.gov project, Noveck explained that “the notion of making complex information more accessible to people and to make greater sense of that complex information was really at the heart.” This is a very bold move and it will be fascinating to see the outcome.

Crossposted on Yale Law School’s Information Society Project blog.

Posted in Economics, Internet and Democracy, Law, Open Science, Reproducible Research, Scientific Method, Statistics, Technology | 2 Comments