Archive for the 'Uncategorized' Category

Mistakes by Piketty are OK (even good?)

In an email conversation I tried to make some points about the criticism Piketty has come under for apparent mistakes in his data. I think the concerns are real but misplaced. Here’s why:

There was a point made in the reproducibility session at DataEDGE this month by Fernando Perez that I think sums up my perspective on this pretty well: making good faith mistakes is human and honest (and ok), but the important point is that we need to be able to verify findings. Piketty seems to have made an enormous contribution (I haven’t read the book yet btw) by collating numerous disparate data sources, and making this data available. I think sometimes folks (like the Financial Times for example) have the idea that if the academy publishes something it is a FACT or a TRUTH – currently there seems to be a cognitive gap in understanding that research publications are contributions to a larger conversation, one that hopes to narrow in on the truth. Feynman has a nice way of expressing this idea:

…as you develop more information in the sciences, it is not that you are finding out the truth, but that you are finding out that this or that is more or less likely.

That is, if we investigate further, we find that the statements of science are not of what is true and what is not true, but statements of what is known to different degrees of certainty: “It is very much more likely that so and so is true than that it is not true;” or “such and such is almost certain but there is still a little bit of doubt;” or – at the other extreme – “well, we really don’t know.” Every one of the concepts of science is on a scale graduated somewhere between, but at neither end of, absolute falsity or absolute truth.

It is necessary, I believe, to accept this idea, not only for science, but also for other things; it is of great value to acknowledge ignorance. It is a fact that when we make decisions in our life we don’t necessarily know that we are making them correctly; we only think that we are doing the best we can – and that is what we should do. [1]

I think viewing Piketty in that light makes his work a terrific contribution, and the fact that there are mistakes (of course there are mistakes) doesn’t detract from his contribution, but just means we have more work to do in understanding the data. This isn’t surprising for such a broad hypothesis as his, and it also isn’t surprising when you consider the complexity of his data collation and analysis. Any little tiny mistake, or even just a different decision, at any point along the line could change the outcome, as appears to be the case. It’s like waiting tables – if we sum up all the little ways a waiter or waitress could lose some tip, it would be easy to lose the entire tip! My hope is that the public discussion (and the scholarly discussion) moves toward an acceptance of mistakes and errors as a natural part of the process and contributes to minimizing them rather than attempting to discredit the scholarship completely. My advisor once wrote a short piece on being a highly cited author, and among other things he said to “leave room for improvement” when publishing since it is “absolutely crucial not to kill a field by doing too good a job in the first outing.” [2] In that light Piketty’s done a great job.

Of course all this changes if there was deliberate data manipulation or omission.

ps. I put together some views on Reinhart and Rogoff here, but imho it’s a red herring in the Piketty discussion, except insofar as both are examples that help flesh out standards and guidelines for data/code release in economics:
http://themonkeycage.org/2013/04/19/what-the-reinhart-rogoff-debacle-really-shows-verifying-empirical-results-needs-to-be-routine/

[1] http://calteches.library.caltech.edu/49/2/Religion.htm

[2] http://www.in-cites.com/scientists/DrDavidDonoho.html

What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine

There’s been an enormous amount of buzz since a study was released this week questioning the methodology of a published paper. The paper under fire is Reinhart and Rogoff’s “Growth in a Time of Debt” and the firing is being done by Herndon, Ash, and Pollin in their article “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff.” Herndon, Ash, and Pollin claim to have found problems of “spreadsheet errors, omission of available data, weighting, and transcription” in the original research which, when corrected, significantly reduce the magnitude of the original findings. These corrections were possible because of openness in economics, and this openness needs to be extended to make all computational publications reproducible.

How did this come about?
Continue reading ‘What the Reinhart & Rogoff Debacle Really Shows: Verifying Empirical Results Needs to be Routine’

Getting Beyond Marketing: Scan and Tell

I love this idea: http://nomoresopa.com/wp/. It’s an Android app that allows you to scan a product’s barcode and tells you whether the company that makes the product supports the Stop Online Piracy Act. What’s really happening here is the ability to get product information at the time of the purchase decision. You could, for example, find out who a manufacturer’s parent companies are. Did you know that Cascadian Farm, a maker of breakfast cereals carried by Whole Foods, is owned by General Mills? This information is easily found on the General Mills website but not obvious when you’re looking at the box in the store. This kind of information can now be made readily available to consumers so they can make better and hopefully less biased choices. Love it.

Disrupt science: But not how you’d think

Two recent articles call for an openness revolution in science: one on GigaOM and the other in the Wall Street Journal. But they’ve got it all wrong. These folks are missing that the process of scientific discovery is not, at its core, an open process. It only becomes an open process at the point of publication.

I am not necessarily in favor of greater openness during the process of scientific collaboration. I am, however, necessarily in favor of openness in the communication of discoveries at the time of publication. Publication is the point at which authors feel their work is ready for public presentation and scrutiny (that traditional publication does not actually give the public access to new scientific knowledge is a tragedy, and of course we should have Open Access). We even have a standard for the level of openness required at the time of scientific publication: reproducibility. This concept has been part of the scientific method since it was instigated by Robert Boyle in the 1660s: communicate sufficient information such that another researcher in the field can replicate your work, given appropriate tools and equipment.

If we’ve already been doing this for hundreds of years what’s the big deal now? Leaving aside the question of whether or not published scientific findings have actually been reproducible over the last few hundred years (for work on this question see e.g. Changing Order: Replication and Induction in Scientific Practice by Harry Collins), science, like many other areas of modern life, is being transformed by the computer. It is now tough to find any empirical scientific research not touched by the computer in some way, from simply storing and analyzing records to the radical transformation of scientific inquiry through massive multicore simulations of physical systems (see e.g. The Dynamics of Plate Tectonics and Mantle Flow: From Local to Global Scales).

This, combined with the facilitation of digital communication we call the Internet, is an enormous opportunity for scientific advancement. Not because we can collaborate or share our work pre-publication as the articles assert – all of which we can do, if we like – but because computational research captures far more of the tacit knowledge involved in replicating a scientific experiment than ever before, making our science potentially more verifiable. The code and data another scientist believes replicate his or her experiments can capture all the digital manipulations of the data, which now comprise much of the scientific discovery process. Commonly used techniques, such as the simulations carried out in papers included in sparselab.stanford.edu (see e.g. “Breakdown Point of Model Selection when the Number of Variables Exceeds the Number of Observations” or any of the SparseLab papers), can now be replicated by downloading the short scripts we included with publication (see e.g. Reproducible Research in Computational Harmonic Analysis). This is a far cry from Boyle’s physical experiments with the air pump, and I’d argue one with enormously less tacit knowledge left uncaptured, an advantage we’re not capitalizing on today.

Computational experiments are complex. Without communicating the data and code that generated the result it is nearly impossible to understand what was done. My thesis advisor famously paraphrased this idea, “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
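To make that dictum concrete, here is a minimal sketch, in Python, of the kind of short companion script it has in mind. The example is hypothetical: the parameters, estimator, and output file name are illustrative, not taken from any of the papers mentioned above. The point is that everything determining a published figure, including the random seed, lives in one runnable file released alongside the paper, so a reader can rerun it, obtain the identical figure, and then modify it to probe the result.

```python
"""Hypothetical companion script for a published figure.

A minimal sketch of the idea in the quote above: everything that
determines the figure (parameters, random seed, analysis steps) lives
in one short, runnable file released alongside the paper.
"""
import numpy as np
import matplotlib

matplotlib.use("Agg")  # render straight to a file; no display needed
import matplotlib.pyplot as plt

# Every parameter that determines the result is stated explicitly.
SEED = 20090726          # fixed seed so reruns give identical output
N_TRIALS = 200           # simulated trials per noise level
NOISE_LEVELS = np.linspace(0.0, 2.0, 21)


def run_trial(noise, rng):
    """One toy 'experiment': estimate a known signal from noisy data."""
    signal = np.ones(50)
    observed = signal + noise * rng.standard_normal(50)
    estimate = observed.mean()          # stand-in for the real estimator
    return abs(estimate - 1.0)          # absolute estimation error


def main():
    rng = np.random.default_rng(SEED)
    mean_error = [
        np.mean([run_trial(noise, rng) for _ in range(N_TRIALS)])
        for noise in NOISE_LEVELS
    ]
    plt.plot(NOISE_LEVELS, mean_error, marker="o")
    plt.xlabel("noise level")
    plt.ylabel("mean absolute error")
    plt.title("Figure 1, regenerated from the released script")
    plt.savefig("figure1.png", dpi=150)


if __name__ == "__main__":
    main()
```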

It is simply wrong to assert that science has not adopted computational tools, as the GigaOM article does when it says “there is another world that makes these industries [traditional media players such as newspapers, magazines and book publishers] look like the most enthusiastic of early adopters [of technological progress]: namely, academic research.” Everywhere you look computation is emerging as central to the scientific method. What needs to happen is a resolution of the credibility crisis in computational science resulting from this enthusiastic technological adoption: we need to establish the routine inclusion of data and code with publication, such that results can be conveniently replicated and verified. Many new efforts to help data and code sharing are documented here, from workflow tracking to code portability to unique identifiers that permit the regeneration of results in the cloud.

This is an imperative issue: most of our published computational results today are unverified and unverifiable (I have written on this extensively, as have many others). The emphasis on the process of creative problem solving as the bottleneck in scientific discovery is misplaced. Scientific innovation is inherently creative – perhaps more collaborative tools that permit openness will encourage greater innovation. Or perhaps it is the case, as Paul Graham has pointed out in another context, that innovative thinking is largely a solitary act. What is clear is that it is of primary importance to establish publication practices that facilitate the replication and verification of results by including data and code, and not to confound these two issues.

Don’t expect computer scientists to be on top of every use that’s found for computers, including scientific investigation

Computational scientists need to understand and assert their computational needs, and see that they are met.

I just read this excellent interview with Donald Knuth, inventor of TeX and of the concept of literate programming, as well as author of the famous textbook, The Art of Computer Programming. When asked for comments on (the lack of) software development using multicore processing, he says something very interesting – that multicore technology isn’t that useful, except in a few applications such as “rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc.” This caught my eye because parallel processing is a key advance for data processing. Statistical analysis typically applies the same operations record by record through the data, with little dependence between records, making it well suited to multithreaded and multicore execution. This isn’t some obscure part of science either – most science carried out today has some element of digital data processing (although of course not always at scales that warrant implementing parallel processing).
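As a concrete, and entirely hypothetical, illustration of why this kind of workload parallelizes so naturally, here is a minimal Python sketch. The per-record summary function and the stand-in dataset are my own inventions, but the pattern, applying an independent computation to each record across however many cores are available, is the one at issue.

```python
"""Sketch: record-by-record statistics parallelize naturally across cores.

Hypothetical example only; the per-record summary and the stand-in
dataset are invented for illustration.
"""
import math
from multiprocessing import Pool


def summarize(record):
    """Per-record computation, independent of every other record."""
    values = record["values"]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return {"id": record["id"], "mean": mean, "std": math.sqrt(var)}


if __name__ == "__main__":
    # Stand-in dataset: in practice these would be rows streamed from disk.
    records = [
        {"id": i, "values": [(i + j) % 7 for j in range(100)]}
        for i in range(10_000)
    ]

    # One worker process per available core; each worker handles
    # batches of records, since no record depends on any other.
    with Pool() as pool:
        results = pool.map(summarize, records, chunksize=500)

    print(results[0])
```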

Knuth then says that “all these applications [that use parallel processing] require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.” As the state of our scientific knowledge changes so does our problem solving ability, requiring modification of code used to generate scientific discovery. If I’m reading him correctly, Knuth seems to think this makes such applications less relevant to mainstream computer science.

The discussion reminded me of comments made at the “Workshop on Algorithms for Modern Massive Datasets” at Stanford in June 2010. Researchers in scientific computation (a specialized subdiscipline of computational science, see the Institute for Computational and Mathematical Engineering at Stanford or UT Austin’s Institute for Computational Engineering and Sciences for examples) were lamenting the direction computer hardware architecture was taking: optimizing for certain problems, such as particular matrix inversion techniques and other hot topics in linear algebra.

As scientific discovery transforms into a deeply computational process, we computational scientists must be prepared to partner with computer scientists to develop tools suited to the needs of scientific knowledge creation, or develop these skills ourselves. I’ve written elsewhere on the need to develop software that natively supports scientific ends (especially for workflow sharing; see e.g. http://stodden.net/AMP2011 ) and this applies to hardware as well.

Smart disclosure of gov data

Imagine a cell phone database that includes terms of service, prices, fees, rates, different calling plans, quality of service, coverage maps, etc. “Smart disclosure,” as the term is being used in federal government circles, refers to making data available in a form that can actually be used and analyzed. Part of smart disclosure would mean collecting information from consumers as well, such as user experiences, bills, and service complaints. This is the vision laid out by Joel Gurin, chief of the FCC’s Consumer and Governmental Affairs Bureau, at the Open Gov R&D Summit organized by the White House. He notes that right away you run into issues of privacy and proprietary data that still need to be worked out.

He gives two examples of where it has worked. One is healthcare.gov: the government has collected and presented data, becoming the intermediary in presenting it [I took a brief look at this site and don’t see where to download the data]. Another example is BrightScope: they analyzed government-released pension and 401(k) fee data to create a ranking product they sell to HR managers, so that folks can understand the appropriateness of the fees they pay.

The potential is enormous: imagine openness in FCC data. Gurin asks, how do we let many BrightScopes bloom?

Christopher Meyer, vice president for external affairs and information services at Consumers Union, gives an example of failure through database mismanagement. There was a spike in their dataset of consumer complaints about acceleration problems in Toyota cars, but they didn’t look at the data and didn’t notice it before Toyota issued the official recall. They’d like to do better: better organization of their data and better tools for detecting issues in consumer complaints, with a mechanism to permit the manufacturer to respond early.

Open Gov and the VA

Peter Levin is CTO of the Department of Veterans Affairs and has a take on open gov tailored to his department: he’s restructuring the IT infrastructure within the VA to facilitate access. For example, the VA just processed its first paperless claim and is reducing claim turnaround time from 165 days to 40 days.

He is also focusing his efforts on emotional paths to engagement rather than numbers and figures. I hope they can provide both, but I see his comments as a reaction to, and a criticism of, open data in general. Levin gives the analogy of the introduction of the telephone: the phone was fundamentally social in nature and hence caught on beyond folks’ expectations, whereas a simple communicator of facts would not have. That encapsulates his vision for tech changes at the VA.

James Hamilton of Northwestern suggests the best way to help reporting on government info and the communication of govt activities would be to improve the implementation of the Freedom of Information Act, in particular for journalists. The aim is to improve govt accountability. He also advocates machine learning techniques, like text analysis, to automatically analyze comments and draw meaning from data in a variety of formats. He believes this software exists and is in use by the govt (even if that is true I am doubtful of how well it works), and a big improvement would be to make this software open source (he references Gary King’s software on text clustering too, which is open and has been repurposed by the AP, for example).

George Strawn from the National Coordination Office (NITRD) notes that there are big problems even combining data within agencies, let alone putting together datasets from disparate sources. He says in his experience agency directors aren’t getting the data they need, data that is theoretically available, to make their decisions.

Open Gov Summit: Aneesh Chopra

I’m here at the National Archives attending the Open Government Research and Development Summit, organized by the Office of Science and Technology Policy in the White House. It’s a series of panel discussions to address questions about the impact and future directions of Obama’s open gov initiative, in particular how to develop a deeper research agenda with the resulting gov data (see the schedule here).

Aneesh Chopra, our country’s first CTO, gave a framing talk in which he listed 5 questions he’d like to have answered through this workshop.

1. big data: how do we strengthen our capacity to understand massive data?
2. new products: what constitutes high value data?
3. open platforms: what are the policy implications of enabling 3rd party apps?
4. international collaboration: what models translate to strengthen democracy internationally?
5. digital norms: what works and what doesn’t work in public engagement?

He hopes the rest of the workshop will not only address these questions but also coalesce around recommendations. Chopra wants to be able to set innovation prizes to move towards solutions to these questions.

A case study in the need for open data and code: Forensic Bioinformatics

Here’s a vid of Keith Baggerly explaining his famous case study of why we need code and data to be made openly available in computational science: http://videolectures.net/cancerbioinformatics2010_baggerly_irrh. This is the work that resulted in the termination of clinical trials at Duke last November and the resignation of Anil Potti. Patients had been assigned to groups and actually given drugs before the trials were stopped. The story is shocking.

It’s also a good example of why traditional publishing doesn’t capture enough detail for reproducibility without the inclusion of data and code. Baggerly’s group at M.D. Anderson made reproducing these results, an effort he has labeled “forensic bioinformatics,” a priority, and they spent an enormous amount of time doing it. We certainly need independent verification of results, but doing so can often require knowledge of the methodology contained only in the code and data. In addition, Donoho et al (earlier version here) make the point that even when findings are independently replicated, open code and data are necessary to understand the reasons for discrepancies in results. In a section of the paper listing and addressing objections we say:

Objection: True Reproducibility Means Reproducibility from First Principles.

Argument: It proves nothing if I point and click and see a bunch of numbers as expected. It only proves something if I start from scratch and build your system and in my implementation I get your results.

Response: If you exactly reproduce my results from scratch, that is quite an achievement! But it proves nothing if your implementation fails to give my results since we won’t know why. The only way we’d ever get to the bottom of such discrepancy is if we both worked reproducibly.

(ps. Audio and slides for a slightly shorter version of Baggerly’s talk here)

Chris Wiggins: Science is social

I had the pleasure of watching my friend and professor of applied physics and applied math Chris Wiggins give an excellent short talk at NYC’s social media week at Google. The video is available here: http://livestre.am/BUDx.

Chris makes the often forgotten point that science is inherently social. A discovery that isn’t publicly communicated, and hence added to our stock of knowledge, isn’t science. He notes reproducibility as a manifestation of this openness in communication. (As another example of openness, Karl Popper suggested that if you’re interested in working in an international community, become a scientist.) Chris showcases many new web-based sharing tools and how they augment our fundamental norms rather than changing them, hence his disagreement with the title of the session, “Research Gone Social,” in the sense that science has always been social.

Ars technica article on reproducibility in science

John Timmer wrote an excellent article called “Keeping computers from ending science’s reproducibility.” I’m quoted in it. Here’s an excellent follow up blog post by Grant Jacobs, “Reproducible Research and computational biology.”

Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave, comments posted here, asked for feedback on implementation issues. The second wave requested input on Features and Technology (our post is here). For the third and final wave on Management, Chris Wiggins, Matt Knepley, and I posted the following comments:

Q1: Compliance. What features does a public access policy need to ensure compliance? Should this vary across agencies?

One size does not fit all research problems across all research communities, and a heavy-handed general release requirement across agencies could result in de jure compliance – release of data and code as per the letter of the law – without the extra effort necessary to create usable data and code that facilitate reproducibility (and extension) of the results. One solution to this barrier would be to require grant applicants to formulate plans for release of the code and data generated through their research proposal, if funded. This creates a natural mechanism by which grantees (and peer reviewers), who best know their own research environments and community norms, contribute complete strategies for release. This would allow federal funding agencies to gather data on needs for release (repositories, further support, etc.); understand which research problem characteristics engender which solutions, and which solutions are most appropriate in which settings; and uncover as-yet unrecognized problems particular researchers may encounter. These data would permit federal funding agencies to craft release requirements that are more sensitive to the barriers researchers face and the demands of their particular research problems, and to implement strategies for enforcing these requirements. This approach also permits researchers to address confidentiality and privacy issues associated with their research.

Examples:

One exemplary precedent from a UK funding agency is the January 2007 “Policy on data management and sharing” (http://www.wellcome.ac.uk/About-us/Policy/Policy-and-position-statements/WTX035043.htm) adopted by the Wellcome Trust (http://www.wellcome.ac.uk/About-us/index.htm), according to which “the Trust will require that the applicants provide a data management and sharing plan as part of their application; and review these data management and sharing plans, including any costs involved in delivering them, as an integral part of the funding decision.” A comparable policy statement by US agencies would be quite useful in clarifying OSTP’s intent regarding the relationship between publicly-supported research and public access to the research products generated by this support.

Continue reading ‘Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf:

Open access to our body of federally funded research, including not only published papers but also any supporting data and code, is imperative, not just for scientific progress but for the integrity of the research itself. We list below nine focus areas and recommendations for action.

Continue reading ‘The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

The Climate Modeling Leak: Code and Data Generating Published Results Must be Open and Facilitate Reproducibility

On November 20, documents including email and code spanning more than a decade were leaked from the Climatic Research Unit (CRU) at the University of East Anglia in the UK.

The Leak Reveals a Failure of Reproducibility of Computational Results

It appears as though the leak came about through a long battle to get the CRU scientists to reveal the code and data associated with published results, and highlights a crack in the scientific method as practiced in computational science. Publishing standards have not yet adapted to the relatively new computational methods used pervasively across scientific research today.

Other branches of science have long-established methods for bringing reproducibility into their practice. Deductive or mathematical results are published only with proofs, and there are long-established standards for an acceptable proof. Empirical science has clear mechanisms for communicating methods with the goal of facilitating replication. Computational methods are a relatively new addition to the scientist’s toolkit, and the scientific community is only just establishing similar standards for verification and reproducibility in this new context. Peer review and journal publishing have generally not yet adapted to the use of computational methods and still operate as suited to the deductive or empirical branches, creating a growing credibility gap in computational science.

Verifying Computational Results without Clear Communication of the Steps Taken is Near-Impossible

The frequent near-impossibility of verification of computational results when reproducibility is not considered a research goal is shown by the miserable travails of “Harry,” a CRU employee with access to their system who was trying to reproduce the temperature results. The leaked documents contain logs of his unsuccessful attempts. It seems reasonable to conclude that CRU’s published results aren’t reproducible if Harry, an insider, was unable to do so after four years.

This example also illustrates why leaving reproducibility to others, beyond a cursory description of methods in the published text, is wholly inadequate for computational science. Harry seems to have had access to the data and code used, and he still couldn’t replicate the results. The merging and preprocessing of data in preparation for modeling and estimation can encompass a very large number of steps, and a change in any one of them could produce different results. The same goes for fitting models or running simulations: parameter settings and function invocation sequences must be communicated, because the final results are the culmination of many decisions, and without this information each small step must be guessed and matched to the original work – a Herculean task. Responding with raw data when questioned about computational results is merely a canard, not intended to seriously facilitate reproducibility.
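One way to avoid that Herculean reconstruction is to have the pipeline itself record each step and its parameters as it runs. The sketch below is hypothetical, and the step names and transformations are stand-ins rather than anything CRU actually did, but it shows the general idea: a small driver that logs every operation, its parameters, and a fingerprint of its output, so the full sequence of decisions travels with the result.

```python
"""Sketch: log every preprocessing step and its parameters as the pipeline runs.

Hypothetical pipeline; the step names and transformations are stand-ins,
not anything from the CRU code.
"""
import hashlib
import json

PROVENANCE = []  # ordered record of what was done, with what settings


def step(name, func, data, **params):
    """Run one pipeline step and append what was done to the provenance log."""
    result = func(data, **params)
    PROVENANCE.append({
        "step": name,
        "params": params,
        "output_sha1": hashlib.sha1(repr(result).encode()).hexdigest(),
    })
    return result


# Stand-in transformations; real ones might merge station records,
# regrid, infill missing values, and so on.
def drop_missing(rows, sentinel):
    return [r for r in rows if sentinel not in r]


def rescale(rows, factor):
    return [[x * factor for x in r] for r in rows]


if __name__ == "__main__":
    raw = [[1.0, 2.0], [3.0, -999.0], [4.0, 5.0]]
    data = step("drop_missing", drop_missing, raw, sentinel=-999.0)
    data = step("rescale", rescale, data, factor=0.1)

    # The provenance file travels with the result, so the full sequence
    # of decisions can be inspected and replayed.
    with open("provenance.json", "w") as f:
        json.dump(PROVENANCE, f, indent=2)
```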

The story of Penn State professor of meteorology Michael Mann‘s famous hockey stick temperature time series estimates is an example where lack of verifiability had important consequences. Release of the code and data used to generate the results in the hockey stick paper likely would have avoided the convening of panels to assess the papers. The hockey stick is a dramatic illustration of global warming and became something of a logo for the U.N.’s Intergovernmental Panel on Climate Change (IPCC). Mann was an author of the 2001 IPCC Assessment report, and was a lead author on the “Copenhagen Diagnosis,” a report released Nov 24 and intended to synthesize the hundreds of research papers about human-induced climate change that have been published since the last assessment by the IPCC two years ago. The report was prepared in advance of the Copenhagen climate summit scheduled for Dec 7-18. Emails between CRU researchers and Mann are included in the leak, which happened right before the release of the Copenhagen Diagnosis (a quick search of the leaked emails for “Mann” returned 489 matches).

These reports are important in part because of their impact on policy, as CBS news reports, “In global warming circles, the CRU wields outsize influence: it claims the world’s largest temperature data set, and its work and mathematical models were incorporated into the United Nations Intergovernmental Panel on Climate Change’s 2007 report. That report, in turn, is what the Environmental Protection Agency acknowledged it “relies on most heavily” when concluding that carbon dioxide emissions endanger public health and should be regulated.”

Discussions of Appropriate Level of Code and Data Disclosure on RealClimate.org, Before and After the CRU Leak

For years researchers had requested the data and programs used to produce Mann’s Hockey Stick result, and those requests were resisted. The repeated requests for code and data culminated in Freedom of Information (FOI) requests, in particular those made by Willis Eschenbach, who tells his story of the requests he made for underlying code and data up until the time of the leak. It appears that a file, FOI2009.zip, was placed on CRU’s FTP server and then comments alerting people to its existence were posted on several key blogs.

The thinking regarding disclosure of code and data in one part of the climate change community is illustrated in this fascinating discussion on the blog RealClimate.org in February. (Thank you to Michael Nielsen for the pointer.) RealClimate.org has 5 primary authors, one of whom is Michael Mann, and the most prolific of them is Gavin Schmidt. In this RealClimate blog post from November 27, Where’s the Data, the position now seems to be very much in favor of data release, but the first comment asks for the steps taken in reconstructing the results as well. This is right – reproducibility of results should be the concern (as argued here for example).

Policy and Public Relations

The Hill‘s Blog Briefing Room reported that Senator Inhofe (R-Okla.) will investigate whether the IPCC “cooked the science to make this thing look as if the science was settled, when all the time of course we knew it was not.” With the current emphasis on evidence-based policy making, Inhofe’s review should recommend code and data release and require reliance on verified scientific results in policy making. The Federal Research Public Access Act should be modified to include reproducibility in publicly funded research.

A dangerous ramification of the leak could be an undermining of public confidence in science and the conduct of scientists. My sense is that making code and data readily available in a way that facilitates reproducibility of results can help keep the focus on the real science rather than on distractions such as potential evasions of FOIA requests, whether data were fudged, or whether scientists acted improperly in squelching dissent or manipulating journal editorial boards. Perhaps data release is becoming an accepted norm, but code release for reproducibility must follow. The issue here is verification and reproducibility, without which it is all but impossible to tell whether the core science done at CRU was correct or not, even for peer reviewing scientists.

Science 2.0: How Tools are Changing Computational Scientific Research

Technology has a history of sweeping through the scientific enterprise: from Vannevar Bush’s first analog PDE calculators at MIT in the 30’s, through the differential analyzers of the 50’s and 60’s, to today’s unfinished transition that will end with computation as absolutely central to science. Computational tools now play not only their traditional supporting role in scientific discovery, but are becoming a mode of discovery in themselves. On July 26 I’ll be talking about changes to the scientific method that computation has brought — does reproducibility matter? is computation creating a third branch of the scientific method? — at Science 2.0 in Toronto. The conference focuses on how the Internet is changing the process of doing science: how we share code and data, and how we use new communication technologies for collaboration and work tracking. Here’s the abstract for my talk and the URL:

How Computational Science is Changing the Scientific Method

As computation becomes more pervasive in scientific research, it seems to have become a mode of discovery in itself, a “third branch” of the scientific method. Greater computation also facilitates transparency in research through the unprecedented ease of communication of the associated code and data, but typically code and data are not made available and we are missing a crucial opportunity to control for error, the central motivation of the scientific method, through reproducibility. In this talk I explore these two changes to the scientific method and present possible ways to bring reproducibility into today’s scientific endeavor. I propose a licensing structure for all components of the research, called the “Reproducible Research Standard”, to align intellectual property law with longstanding communitarian scientific norms and encourage greater error control and verifiability in computational science.

http://softwarecarpentry.wordpress.com/guests/

About

I’m a Postdoctoral Associate in Law and Kauffman Fellow in Law and Innovation at the Information Society Project at Yale Law School. My website is http://www.stodden.net.

My research focus is changes to the scientific method arising from the pervasiveness of computation, specifically reproducibility in computational science.

The banner photograph is Istanbul at sunrise, and was taken by Sami Ben Gharbia.

The Scientific Method

OpinionJournal – Peggy Noonan

Peggy Noonan laments the inability of the scientific community to come together and deliver a solid answer on global warming. The reason why? The scientists have political agendas:

“You would think the world’s greatest scientists could do this, in good faith and with complete honesty and a rigorous desire to discover the truth. And yet they can’t. Because science too, like other great institutions, is poisoned by politics. Scientists have ideologies. They are politicized.”

I disagree. Certainly you can find purported scientists who are willing to subvert the truth to their agenda, in any field. But the fact that the truth has not yet been discovered is not, by itself, evidence of an agenda against the truth. You also need to establish that the questions we are asking about global warming are answerable with today’s knowledge, data, and technology. The answer the scientists might be offering is ‘I don’t know’, and there is nothing necessarily unscientific about that.

It seems to me what the scientific community has been saying is that the problem of global warming is phenomenally complex: the data are massive (many things to measure under all sorts of different circumstances) and future prediction has been near impossible. This is a scientific answer and possibly the best one we will have in the near term.

Graduate Student Unionization – a dead issue?

In spring quarter of last year I was quoted (without my permission or knowledge, incidentally) in an article on graduate student unionization in the Stanford Daily: http://daily.stanford.edu/tempo?page=content&id=17037&repository=0001_article. It’s not clear to me what the fuss is about: as a TA or RA at Stanford you are usually a graduate student with whatever benefits accrue (such as health care or GSC-negotiated pay raises). While more pay would always be nice, a terrific point is made by George Will in http://jewishworldreview.com/cols/will091605.asp: it will be difficult to extract benefits by striking if the services you provide aren’t essential to operations. My own sense as a TA is that our work could be picked up by professors or others in the department for a short term, likely long enough to outlast a strike. Or a slightly less attentive course would be given to the students (this already happens, as the number of TAs per course is not fixed and can vary by how many students are available from year to year). In fact this seems to have been the outcome of student strikes at Yale and Columbia.

So unless we are part of a larger university strike which includes essential services, I don’t think we’d have much traction. It’s also not clear to me the students would be overwhelmingly behind this – in academia much of our research and career is founded on cooperation and reputation, something students are often eager to demonstrate.

Google Earth – too much of a view?

http://www.worldtribune.com/worldtribune/05/front2453620.076388889.html

This article, “Google Earth images compromise secret installations in S. Korea,” partly answers the first question I had when I found out about Google Earth: how are they handling sensitive satellite data?

Other countries have objected, out of national security concerns:
http://www.abc.net.au/news/indepth/featureitems/s1432602.htm
http://www.webpronews.com/insidesearch/insidesearch/wpn-56-20050811GoogleEarthContinuesToRaiseSecurityConcerns.html

Imagery of the White House is already censored. Is the information in Google Earth really easily obtainable by other means, as Google suggests?