Peanut allergic reaction

I’ve used this blog for my professional interests thinking that my personal life just isn’t all that interesting. I still don’t think my personal life is of broad interest, but I’m going to describe what happened to me after an accidental exposure to peanuts yesterday. I’m motivated for two reasons. One, I couldn’t find much personal discussion of these allergic responses and I think it would be helpful to have more (there was lots of discussion from moms, or potential causes, or badly written pseudo science but very few actual stories). Two, I was supposed to meet a friend last night and had to cancel because of this, and his reaction made me realize that these severe allergic reactions don’t seem to be well understood or accepted in general.

After the jump I’ll go into detail. If you are squeamish don’t read, and/or if you know me professionally you may not want to continue, in order to permit some dignity to persist in our future interactions. But I think the story is important for others who suffer from this. I remember lying there wondering whether each thing that was happening was normal. Turns out, yes, but that’s not easy info to find.


What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine

There’s been an enormous amount of buzz since a study was released this week questioning the methodology in a published paper. The paper under fire is Reinhart and Rogoff’s “Growth in a Time of Debt” and the firing is being done by Herndon, Ash, and Pollin in their article “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Herndon, Ash, and Pollin claim to have found “spreadsheet errors, omission of available data, weighting, and transcription” in the original research which, when corrected, significantly reduce the magnitude of the original findings. These corrections were possible because of openness in economics, and this openness needs to be extended to make all computational publications reproducible.

How did this come about?

In 1986 a study was published in The American Economic Review (the very same journal that published Reinhart and Rogoff’s piece) by Dewald, Thursby, and Anderson called “Replication in Empirical Economics: The Journal of Money, Credit and Banking Project” detailing shocking results: of 152 papers published or to be published in the Journal of Money, Credit and Banking between 1982 and 1984, only 4 could essentially be replicated in their entirety (the authors only received sufficient data and code for 9). The reason the JMCB was selected for study was their novel (at the time) policy of requiring all authors to relinquish data and code, to be made available to other researchers. These stark results sent shockwaves through the economics community and many journals, including the AER, subsequently implemented data and code release requirements. Economists also became more aware of the issue of reproducibility and many do release the data and code associated with their published studies.

Herndon, Ash, and Pollin obtained the spreadsheet Reinhart and Rogoff used by direct request, and RR had made the raw data available for download on a website apparently set up for their collaborations http://www.reinhartandrogoff.com/data/browse-by-topic/topics/9/. The code used to transform the raw data to the published findings was not made available. Herndon et al followed the methodological description in the paper for both the raw data and the working spreadsheet RR supplied and as is well known came to different conclusions (although they were largely able to replicate the published results from the working spreadsheet).

What about peer review?

A reasonable question is how these results could have passed peer review, if as Herndon et al claim there were errors in the spreadsheet and methodological liberties taken such as selective data exclusion and unconventional weighting of summary statistics. The RR article wasn’t peer reviewed, appearing in AER’s Papers and Proceedings issue. But regardless, unlike for a mathematical result, peer review generally doesn’t verify computational results. Reviewers don’t usually check whether the findings in a computational paper can be replicated or whether their derivation actually matches that described in the paper. As the JMCB study showed, not having access to the code and data pretty much makes it impossible to replicate findings.

Proposed Change 1: Reviewers need to have a way to check that any computational results were actually generated as described.

This is typically nontrivial, since having the code and data doesn’t guarantee replication is either possible or achievable without significant effort. I have been working on a not-for-profit project called RunMyCode.org which could help reviewers by providing a certification that the code and data do regenerate the tables and figures in the paper. The site provides a web interface that permits users to regenerate the published results, and download the code and data.

I am puzzled as to why Herndon et al didn’t rely on AER’s stated policy that “[a]s soon as possible after acceptance, authors are expected to send their data, programs, and sufficient details to permit replication, in electronic form, to the AER office.” Nowhere is an exception for their Papers and Proceedings issue listed, and so AER should have both the data and programs needed to replicate the RR paper.

Reproducible Research is a Necessary Standard

I suspect the reason Reinhart and Rogoff’s work was able to be scrutinized at this level is actually because of the culture of code and data sharing in economics. Before the Herndon publication the raw data were freely available on RR’s website, and when requested they supplied their working Excel spreadsheet to Herndon et al. RR have come under criticism because they never released the program that showed the steps they took from the raw data they posted to their published results, and didn’t make their spreadsheet freely available, only the raw data. It’s worth keeping in mind that without their release of the working spreadsheet, it is likely their mistakes would not have been found. Now imagine how many publications don’t have data and code available, and cannot be checked at the level RR’s was, and how many mistakes in the scholarly record just aren’t being caught.

Proposed Change 2: At the time of publication, researchers make enough material openly available (data, programs, narrative) so that other researchers in the field can replicate their work.

What it takes to replicate results is a subjective judgement, but the computer programs, and the data they start from, are a minimum. This doesn’t guarantee the results are correct, but permits others to understand what was done. RR’s article was highly cited and widely reported in the press, and it appears no one had ever bothered to check the results before the Herndon paper. Perhaps they assumed peer review had done so, but whatever the reason, independent researchers must be able to validate and verify published results, and researchers must facilitate this as a routine effort when they publish findings. There may be confidentiality issues or other reasons data may not be openly sharable, but the default needs to be conveniently available data and code, for every publication. I recently co-organized a workshop around this issue and the organizers released a workshop report “Setting the Default to Reproducible.”

Proposed Change 3: Only use research tools that can track the steps taken in generating results. Carry out research using R or Python for example, not Excel.

I also suspect that if Reinhart and Rogoff had been in the habit of making their data and code available so that their results were reproducible, they would have caught such errors themselves before publication. If sharing had been taken seriously perhaps they would have been motivated to use tools more conducive to scientific research such as those that capture all the steps taken, as R or Python can. By using Excel, which was never designed for scientific research, they institutionalized mouse clicks and other untraceable actions into a scientific workflow, which must be avoided since it makes explaining to others (and to oneself) how to replicate the findings next to impossible and too easily introduces inadvertent mistakes.
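To make the contrast with a spreadsheet concrete, here is a minimal Python sketch of a scripted summary statistic. The records are invented for illustration and are not Reinhart and Rogoff’s data, and the function names are hypothetical. It averages growth two ways: weighting every country-year equally, and averaging within each country first and then across countries. The only point is that each exclusion and weighting choice is written down in the source, where a reader can see it and rerun it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (country, year, growth %) records, invented for illustration only.
RECORDS = [
    ("A", 2001, -2.0),
    ("B", 2001, 3.0),
    ("B", 2002, 2.0),
    ("B", 2003, 3.0),
    ("B", 2004, 4.0),
]


def mean_by_country_year(records, exclude=()):
    """Average over all country-years, each observation weighted equally."""
    return mean(g for c, _, g in records if c not in exclude)


def mean_by_country(records, exclude=()):
    """Average each country's years first, then average across countries."""
    per_country = defaultdict(list)
    for c, _, g in records:
        if c not in exclude:
            per_country[c].append(g)
    return mean(mean(values) for values in per_country.values())


if __name__ == "__main__":
    # The two weighting schemes give different answers on the same records.
    print("country-year weighted:", mean_by_country_year(RECORDS))
    print("country weighted:", mean_by_country(RECORDS))
    # An exclusion is an explicit argument, not a forgotten cell range.
    print("country weighted, A excluded:", mean_by_country(RECORDS, exclude=("A",)))
```

On these made-up numbers the two schemes give 2.0 and 0.5, so the weighting choice alone changes the headline figure, and the script records which choice was made.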

Data access going the way of journal article access? Insist on open data

The discussion around open access to published scientific results, the Open Access movement, is well known. The primary cause of the current situation (journal publishers owning copyright on journal articles and therefore charging for access) stems from authors signing their copyright over to the journals. I believe this happened because authors really didn’t realize what they were doing when they signed away ownership over their work, and had they known they would not have done so. I believe another solution would have been used, such as granting the journal a license to publish, like Science’s readily available alternative license. At some level authors were entering into binding legal contracts without an understanding of the implications and without the right counsel.

I am seeing a similar situation arising with respect to data. It is not atypical for a data producing entity, particularly those in the commercial sphere, to require that researchers with access to the data sign a non-disclosure agreement. This seems to be standard for Facebook data, Elsevier data, and many many others. I’m witnessing researchers grabbing their pens and signing, and like in the publication context, feeling themselves powerless to do otherwise. Again, they are without the appropriate counsel. Even the general counsel’s office at their institution typically sees the GC’s role as protecting the institution against liability, rather than the larger concern of protecting the scholar’s work and the integrity of the scholarly record. What happens when research from these protected datasets is published, and questioned? How can others independently verify the findings? They’ll need access to the data.

There are many legitimate reasons such data may not be able to be publicly released, for example protection of subjects’ privacy (see what happened when Harvard released Facebook data from a study). But as scientists we should be mindful of the need for our published findings to be reproducible. Some commercial data do not come with privacy concerns, only concerns from the company that they are still able to sell the data to other commercial entities, and sometimes not even that. Sometimes lawyers simply want an NDA to minimize any risk to the commercial entity that might arise should the data be released. To me, that seems perfectly rational since they are not stewards of scientific knowledge.

It is also perfectly rational for authors publishing findings based on these data to push back as hard as possible to ensure maximum reproducibility and credibility of their results. Many companies share data with scientists because they seek to deepen goodwill and ties with the academic community, or they are interested in the results of the research. As researchers we should condition our acceptance of the data on its release when the findings are published, if there are no privacy concerns associated with the data. If there are privacy concerns I can imagine ensuring we can share the data in a “walled garden” within which other researchers, but not the public, will be able to access the data and verify results. There are a number of solutions that can bridge the gap between open access to data and an access-blocking NDA (e.g. differential privacy) and as scientists the integrity and reproducibility of our work is a core concern that we have responsibility for in this negotiation for data.
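For readers who haven’t met differential privacy, here is a minimal Python sketch of its simplest form, the Laplace mechanism applied to a counting query. The function names and the choice of a counting query are my own illustration, not part of any particular data sharing agreement, and a real deployment would need much more care (privacy budgets across repeated queries, for one). The idea is that the data holder releases a noisy answer rather than the records themselves.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon, rng=None):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67]  # hypothetical records
    print(private_count(ages, lambda a: a >= 40, epsilon=1.0))
```

Smaller values of epsilon add more noise and give stronger privacy; larger values give more accurate answers. Where to set that trade-off is exactly the kind of thing that would have to be negotiated with the data provider.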

A few template data sharing agreements between academic researchers and data producing companies would be very helpful, if anyone feels like taking a crack at drafting them (Creative Commons?). Awareness of the issue is also important, among researchers, publishers, funders, and data producing entities. We cannot unthinkingly default to a legal situation regarding data that is anathema to scientific progress, as we did with access to scholarly publications.

Getting Beyond Marketing: Scan and Tell

I love this idea: http://nomoresopa.com/wp/. It’s an Android app that allows you to scan a product’s barcode and it will tell you whether the company that makes the product supports the Stop Online Piracy Act. What’s really happening here is the ability to get product information at the time of purchase decision. You could, for example, find out what a manufacturer’s parent companies are. Did you know that Cascadian Farm, makers of breakfast cereals and carried by Whole Foods, is owned by General Mills? This information is easily found on the General Mills website but not obvious when you’re looking at the box in the store. This kind of information can now be made readily available to consumers so they can make better and hopefully less biased choices. Love it.

Disrupt science: But not how you’d think

Two recent articles call for an openness revolution in science: one on GigaOM and the other in the Wall Street Journal. But they’ve got it all wrong. These folks are missing that the process of scientific discovery is not, at its core, an open process. It only becomes an open process at the point of publication.

I am not necessarily in favor of greater openness during the process of scientific collaboration. I am however necessarily in favor of openness in communication of the discoveries at the time of publication. Publication is the point at which authors feel their work is ready for public presentation and scrutiny (that traditional publication does not actually give the public access to new scientific knowledge is a tragedy and of course we should have Open Access). We even have a standard for the level of openness required at the time of scientific publication: reproducibility. This concept has been part of the scientific method since it was instigated by Robert Boyle in the 1660’s: communicate sufficient information such that another researcher in the field can replicate your work, given appropriate tools and equipment.

If we’ve already been doing this for hundreds of years what’s the big deal now? Leaving aside the question of whether or not published scientific findings have actually been reproducible over the last few hundred years (for work on this question see e.g. Changing Order: Replication and Induction in Scientific Practice by Harry Collins), science, like many other areas of modern life, is being transformed by the computer. It is now tough to find any empirical scientific research not touched by the computer in some way, from simply storing and analyzing records to the radical transformation of scientific inquiry through massive multicore simulations of physical systems (see e.g. The Dynamics of Plate Tectonics and Mantle Flow: From Local to Global Scales).

This, combined with the facilitation of digital communication we call the Internet, is an enormous opportunity for scientific advancement. Not because we can collaborate or share our work pre-publication as the articles assert (all of which we can do, if we like) but because computational research captures far more of the tacit knowledge involved in replicating a scientific experiment than ever before, making our science potentially more verifiable. The code and data another scientist believes replicate his or her experiments can capture all digital manipulations of the data, which now comprise much of the scientific discovery process. Commonly used techniques, such as the simulations carried out in papers included in sparselab.stanford.edu (see e.g. “Breakdown Point of Model Selection when the Number of Variables Exceeds the Number of Observations” or any of the SparseLab papers), can now be replicated by downloading the short scripts we included with publication (see e.g. Reproducible Research in Computational Harmonic Analysis). This is a far cry from Boyle’s physical experiments with the air pump, and I’d argue one with enormously lower levels of tacit knowledge to communicate that we’re not capitalizing on today.

Computational experiments are complex. Without communicating the data and code that generated the result it is nearly impossible to understand what was done. My thesis advisor famously paraphrased this idea, “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

It is simply wrong to assert that science has not adopted computational tools, as the GigaOM article does when it says “there is another world that makes these industries [traditional media players such as newspapers, magazines and book publishers] look like the most enthusiastic of early adopters [of technological progress]: namely, academic research.” Everywhere you look computation is emerging as central to the scientific method. What needs to happen is a resolution of the credibility crisis in computational science resulting from this enthusiastic technological adoption: we need to establish the routine inclusion of the data and code with publication, such that the results can be conveniently replicated and verified. Many new efforts to help data and code sharing are documented here, from workflow tracking to code portability to unique identifiers that permit the regeneration of results in the cloud.

This is an imperative issue: most of our published computational results today are unverified and unverifiable (I have written on this extensively, as have many others). The emphasis on the process of creative problem solving as the bottleneck in scientific discovery is misplaced. Scientific innovation is inherently creative, and perhaps more collaborative tools that permit openness will encourage greater innovation. Or perhaps it is the case, as Paul Graham has pointed out in another context, that innovative thinking is largely a solitary act. What is clear is that it is of primary importance to establish publication practices that facilitate the replication and verification of results by including data and code, and not to confound these two issues.

Don’t expect computer scientists to be on top of every use that’s found for computers, including scientific investigation

Computational scientists need to understand and assert their computational needs, and see that they are met.

I just read this excellent interview with Donald Knuth, inventor of TeX and the concept of literate programming, as well as author of the famous textbook, The Art of Computer Programming. When asked for comments on (the lack of) software development using multicore processing, he says something very interesting: that multicore technology isn’t that useful, except in a few applications such as “rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc.” This caught my eye because parallel processing is a key advance for data processing. Statistical analysis of data typically executes line by line through the data, making it ideal for multithreaded applications. This isn’t some obscure part of science either: most science carried out today has some element of digital data processing (although of course not always at scales that warrant implementing parallel processing).
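To illustrate why line-by-line statistical work splits up so naturally, here is a minimal Python sketch that computes a mean and variance by handing chunks of the data to separate workers and combining their partial sums. The function names and chunk sizes are my own, chosen for illustration. One caveat: in standard CPython, threads will not actually speed up pure-Python arithmetic like this because of the global interpreter lock, so treat this as a sketch of the structure rather than a performance recipe; a process pool or a compiled numerical library would be the usual route to a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk):
    """Return (count, sum, sum of squares) for one chunk of the data."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def parallel_mean_variance(data, n_workers=4, chunk_size=1000):
    """Compute mean and population variance by combining per-chunk statistics."""
    if not data:
        raise ValueError("data must be non-empty")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each chunk is independent of the others, so workers never need to coordinate.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(partial_sums, chunks))
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / n
    return mean, total_sq / n - mean * mean


if __name__ == "__main__":
    print(parallel_mean_variance(list(range(1, 11)), chunk_size=3))
```

The combining step only needs three numbers per chunk, which is why this kind of computation parallelizes with so little dedicated code. The sum-of-squares formula is kept for brevity and can lose precision on data with a large mean and small spread.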

Knuth then says that “all these applications [that use parallel processing] require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.” As the state of our scientific knowledge changes so does our problem solving ability, requiring modification of code used to generate scientific discovery. If I’m reading him correctly, Knuth seems to think this makes such applications less relevant to mainstream computer science.

The discussion reminded me of comments made at the “Workshop on Algorithms for Modern Massive Datasets” at Stanford in June 2010. Researchers in scientific computation (a specialized subdiscipline of computational science, see the Institute for Computational and Mathematical Engineering at Stanford or UT Austin’s Institute for Computational Engineering and Sciences for examples) were lamenting the direction computer hardware architecture was taking toward facilitating certain particular problems, such as particular techniques for matrix inversion and hot topics in linear algebra.

As scientific discovery transforms into a deeply computational process, we computational scientists must be prepared to partner with computer scientists to develop tools suited to the needs of scientific knowledge creation, or develop these skills ourselves. I’ve written elsewhere on the need to develop software that natively supports scientific ends (especially for workflow sharing; see e.g. http://stodden.net/AMP2011 ) and this applies to hardware as well.

The nature of science in 2051

Right now scientific questions are chosen for study in a largely autocratic way. Typically grants for research on particular questions come from federal funding agencies, and scientists competitively apply with the money going to the chosen researcher via a peer review process.

I suspect, as the tools of online science become increasingly available, the real questions people face in their day to day lives will be more readily answered. If you think about all the things you do and decisions you make in a day, many of them don’t have a strong empirical basis. How you wash the dishes or do laundry, what foods are healthy, what environment to maintain in your house, what common illness remedies work best: who knows? But these types of questions, the ones that occur to you as you go about your daily business, aren’t prioritized in the investigatory model we have now for science. I predict that scientific investigation as a whole, not just the part that is government funded, will move substantially toward providing answers to questions of local importance.

Regulatory steps toward open science and reproducibility: we need a science cloud

This past January Obama signed the America COMPETES Re-authorization Act. It contains two interesting sections that advance the notions of open data and the federal role in supporting online access to scientific archives: 103 and 104, which read in part:

• § 103: “The Director [of the Office of Science and Technology Policy at the White House] shall establish a working group under the National Science and Technology Council with the responsibility to coordinate Federal science agency research and policies related to the dissemination and long-term stewardship of the results of unclassified research, including digital data and peer-reviewed scholarly publications, supported wholly, or in part, by funding from the Federal science agencies.” (emphasis added)

This is a cause for celebration insofar as Congress has recognized that published articles are an incomplete communication of computational scientific knowledge, and the data (and code) must be included as well.

• § 104: Federal Scientific Collections: The Office of Science and Technology Policy “shall develop policies for the management and use of Federal scientific collections to improve the quality, organization, access, including online access, and long-term preservation of such collections for the benefit of the scientific enterprise.” (emphasis added)

I was very happy to see the importance of online access recognized, and hopefully this will include the data and code that underlies published computational results.

One step further in each of these directions: mention code explicitly and create a federally funded cloud not only for data but linked to code and computational results to enable reproducibility.

Generalize clinicaltrials.gov and register research hypotheses before analysis

Stanley Young is Director of Bioinformatics at the National Institute for Statistical Sciences, and gave a talk in 2009 on problems in modern scientific research. For example: 1 in 20 NIH-funded studies actually replicates; closed data and opacity; model selection for significance; multiple comparisons. Here is the link to his talk: Everything Is Dangerous: A Controversy. There are a number of good examples in the talk, and Young anticipates, and is more intellectually coherent than, the New Yorker article The Truth Wears Off, if you were interested in that.
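The multiple comparisons problem is easy to see with a little arithmetic. The Python sketch below assumes the tests are independent and each is run at the usual 5% level; it computes the chance of at least one false positive across all of them. It also shows the Bonferroni threshold, which is one standard textbook remedy that I am using for illustration, not necessarily what Young recommends. The function names are my own.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(n, round(familywise_error_rate(n), 3), bonferroni_threshold(n))
```

With 20 independent tests at the 5% level, the chance of at least one spurious “significant” result is roughly 64%, which is why a study that checks many outcomes and reports only the significant ones is so hard to replicate.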

Idea: Generalize clinicaltrials.gov, where scientists register their hypotheses prior to carrying out their experiment. Why not do this for all hypothesis tests? Have a site where the hypotheses are logged and time stamped before researchers gather the data or carry out the actual hypothesis testing for the project. I’ve heard this idea mentioned occasionally and both Young and Lehrer mention it as well.
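Here is a minimal Python sketch of what a logged, time-stamped registration record might look like. Everything here is hypothetical: the field names, the function names, and the use of a SHA-256 digest are my own assumptions, not how clinicaltrials.gov works. A digest computed on your own machine also doesn’t prove when the record was made; that would need a trusted third party such as the registry site itself. The sketch only shows that later edits to a registered hypothesis can be detected.

```python
import hashlib
import json
from datetime import datetime, timezone


def register_hypothesis(text, author, now=None):
    """Build a time-stamped registration record with a tamper-evident digest."""
    now = now or datetime.now(timezone.utc)
    record = {
        "author": author,
        "hypothesis": text,
        "registered_at": now.isoformat(),
    }
    # Serialize with sorted keys so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["digest"] = hashlib.sha256(payload).hexdigest()
    return record


def verify_registration(record):
    """Check that the record's fields still match its stored digest."""
    body = {k: v for k, v in record.items() if k != "digest"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() == record.get("digest")


if __name__ == "__main__":
    rec = register_hypothesis("Treatment X lowers outcome Y", "A. Researcher")
    print(rec["registered_at"], rec["digest"][:16], verify_registration(rec))
```

If the hypothesis text is changed after the fact, `verify_registration` returns False, which is the property a registry would want to offer reviewers.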

Smart disclosure of gov data

Imagine a cell phone database that includes terms of service, prices, fees, rates, different calling plans, quality of services, coverage maps etc. “Smart disclosure,” as the term is being used in federal government circles, means making data available such that it can be used and analyzed. Part of smart disclosure would mean collecting information from consumers as well, such as user experiences, bills, service complaints. This is the vision Joel Gurin, chief of the FCC’s Consumer and Governmental Affairs Bureau, described at the Open Gov R&D Summit organized by the White House. He notes that right away you run into issues of privacy and proprietary data that still need to be worked out.

He gives two examples of when it has worked. One is healthcare.gov, where the government has collected and presented data but become the intermediary in presenting this data [I took a brief look at this site and don't see where to download data]. Another example is BrightScope: they analyzed government released pension and 401(k) fees to create a ranking product they sell to HR managers so that folks can understand the appropriateness of the fees they pay.

The potential is enormous: imagine openness in FCC data. Gurin asks, how do we let many brightscopes bloom?

Christopher Meyer, vice president for external affairs and information services for the Consumers Union, gives an example of failure through database mismanagement. There was a spike in their dataset of consumer complaints about acceleration problems in Toyota cars. They didn’t look at the data and didn’t notice this before Toyota issued the official recall. They’d like to do better, and have better organization in their data and better tools for issue detection through consumer complaints, with a mechanism to permit the manufacturer to respond early.