Here’s a video of Keith Baggerly explaining his famous case study of why code and data need to be made openly available in computational science: http://videolectures.net/cancerbioinformatics2010_baggerly_irrh. This is the work that led to the termination of clinical trials at Duke last November and the resignation of Anil Potti. Patients had been assigned to groups and actually given drugs before the trials were stopped. The story is shocking.
It’s also a good example of why traditional publishing doesn’t capture enough detail for reproducibility without the inclusion of data and code. Baggerly’s group at M.D. Anderson made reproducing these results, what he has labeled “forensic biostatistics,” a priority, and they spent an enormous amount of time doing so. We certainly need independent verification of results, but verification can often require knowledge of the methodology contained only in the code and data. In addition, Donoho et al (earlier version here) make the point that even when findings are independently replicated, open code and data are necessary to understand the reasons for discrepancies in results. In a section of the paper listing and addressing objections, we say:
Objection: True Reproducibility Means Reproducibility from First Principles.
Argument: It proves nothing if I point and click and see a bunch of numbers as expected. It only proves something if I start from scratch and build your system and in my implementation I get your results.
Response: If you exactly reproduce my results from scratch, that is quite an achievement! But it proves nothing if your implementation fails to give my results, since we won’t know why. The only way we’d ever get to the bottom of such a discrepancy is if we both worked reproducibly.
(P.S. Audio and slides for a slightly shorter version of Baggerly’s talk are available here.)