Monthly Archive for March, 2011

Smart disclosure of gov data

Imagine a cell phone database that includes terms of service, prices, fees, rates, different calling plans, quality of service, coverage maps, and so on. “Smart disclosure,” as the term is being used in federal government circles, refers to making data available in a form that can be readily used and analyzed. Part of smart disclosure would also mean collecting information from consumers, such as user experiences, bills, and service complaints. This is the vision of Joel Gurin, chief of the FCC’s Consumer and Governmental Affairs Bureau, speaking at the Open Gov R&D Summit organized by the White House. He notes that right away you run into issues of privacy and proprietary data that still need to be worked out.
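To make the idea concrete, here is a hypothetical sketch of what one machine-readable disclosure record for a calling plan might look like. The field names and values are invented for illustration; this is not an FCC schema.

    import json

    # Hypothetical "smart disclosure" record for a single calling plan.
    # Field names are invented for illustration, not an official FCC schema.
    plan = {
        "carrier": "ExampleTel",
        "plan_name": "Basic 450",
        "monthly_price_usd": 39.99,
        "activation_fee_usd": 36.00,
        "early_termination_fee_usd": 175.00,
        "included_minutes": 450,
        "overage_rate_usd_per_minute": 0.45,
        "coverage_map_url": "http://example.com/coverage.geojson",
        "terms_of_service_url": "http://example.com/tos",
        # consumer-contributed information could sit alongside the official disclosure
        "consumer_reports": {"complaint_count": 12, "average_reported_bill_usd": 54.20},
    }

    print(json.dumps(plan, indent=2))

The point is that once plans are published in a structured form like this, anyone can compare prices, fees, and coverage across carriers programmatically rather than reading dozens of PDFs.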

Gurin gives two examples of where it has worked. The first is healthcare.gov: the government has collected and presented data, but has also become the intermediary in presenting it [I took a brief look at this site and don’t see where to download the data]. The second is BrightScope: they analyzed government-released pension and 401(k) fee data to create a ranking product they sell to HR managers so that people can understand whether the fees they pay are appropriate.

The potential is enormous: imagine this kind of openness across FCC data. Gurin asks, how do we let many BrightScopes bloom?

Christopher Meyer, vice president for external affairs and information services at Consumers Union, gives an example of failure through database mismanagement. There was a spike in their dataset of consumer complaints about acceleration problems in Toyota cars, but they didn’t look at the data and didn’t notice it before Toyota issued the official recall. They’d like to do better: better organization of their data and better tools for detecting issues in consumer complaints, with a mechanism that lets the manufacturer respond early.
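To illustrate the kind of automated issue detection Meyer is describing, here is a minimal sketch that flags months where complaint counts spike well above a trailing baseline. The counts and threshold are made up; a real system would also need to normalize by sales volume and complaint category.

    # Toy spike detector for monthly complaint counts (data invented for illustration).
    monthly_complaints = {
        "2009-06": 4, "2009-07": 5, "2009-08": 3, "2009-09": 6,
        "2009-10": 5, "2009-11": 4, "2009-12": 19, "2010-01": 27,
    }

    WINDOW = 4       # months of history used as the baseline
    THRESHOLD = 3.0  # flag months more than 3x the baseline average

    months = sorted(monthly_complaints)
    for i in range(WINDOW, len(months)):
        history = [monthly_complaints[m] for m in months[i - WINDOW:i]]
        baseline = sum(history) / WINDOW
        current = monthly_complaints[months[i]]
        if current > THRESHOLD * baseline:
            print("{}: {} complaints (baseline {:.1f}) -- investigate".format(
                months[i], current, baseline))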

Open Gov and the VA

Peter Levin is CTO of the Department of Veterans Affairs and has a take on open gov tailored to his department: he’s restructuring the IT infrastructure within the VA to facilitate access. For example, the VA just processed its first paperless claim and is reducing claim turnaround time from 165 days to 40 days.

He is also focusing his efforts on emotional paths to engagement rather than numbers and figures. I hope they can provide both, but I see his comments as a reaction to, and criticism of, open data in general. Levin gives the analogy of the introduction of the telephone: the phone was fundamentally social in nature and hence caught on beyond anyone’s expectations, whereas a mere conveyor of facts would not have. That encapsulates his vision for tech changes at the VA.

James Hamilton of Northwestern suggests the best way to help reporting on government information and the communication of government activities would be to improve the implementation of the Freedom of Information Act, in particular for journalists, with the aim of improving government accountability. He also advocates machine learning techniques, such as text analysis, to automatically analyze public comments and draw meaning from data in a variety of formats. He believes this software exists and is in use by the government (even if that is true, I am doubtful of how well it works), and a big improvement would be to make this software open source (he also references Gary King’s text clustering software, which is open and has been repurposed by the AP, for example).
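I don’t know what software the government actually uses, and this is not Gary King’s tool, but as a rough sketch of the kind of open-source text clustering Hamilton has in mind, here is what grouping public comments by topic might look like with scikit-learn:

    # Hedged sketch: cluster public comments by topic with TF-IDF and k-means.
    # This is generic scikit-learn usage, not the government's or Gary King's software.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    comments = [
        "The proposed rule will raise fees for small wireless carriers.",
        "Coverage in rural areas is poor and the maps are misleading.",
        "Early termination fees are excessive and should be capped.",
        "Rural broadband coverage needs more investment.",
        "Fee disclosures are buried in the terms of service.",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(comments)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for cluster, comment in sorted(zip(labels, comments)):
        print(cluster, comment)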

George Strawn from the National Coordination Office (NITRD) notes that there are big problems even combining data within agencies, let alone putting together datasets from disparate sources. He says that in his experience agency directors aren’t getting the data they need to make their decisions, even when that data is theoretically available.

Open Gov Summit: Aneesh Chopra

I’m here at the National Archives attending the Open Government Research and Development Summit, organized by the Office of Science and Technology Policy in the White House. It’s a series of panel discussions addressing questions about the impact and future directions of Obama’s open gov initiative, in particular how to develop a deeper research agenda with the resulting government data (see the schedule here).

Aneesh Chopra, our country’s first CTO, gave a framing talk in which he listed 5 questions he’d like to have answered through this workshop.

1. big data: how do we strengthen the capacity to understand massive data?
2. new products: what constitutes high-value data?
3. open platforms: what are the policy implications of enabling third-party apps?
4. international collaboration: what models translate to strengthen democracy internationally?
5. digital norms: what works and what doesn’t work in public engagement?

He hopes the rest of the workshop will not only address these questions but also coalesce around recommendations. Chopra wants to be able to set innovation prizes to move toward solutions to these questions.

A case study in the need for open data and code: Forensic Bioinformatics

Here’s a video of Keith Baggerly explaining his famous case study of why we need code and data to be made openly available in computational science: http://videolectures.net/cancerbioinformatics2010_baggerly_irrh. This is the work that resulted in the termination of clinical trials at Duke last November and the resignation of Anil Potti. Patients had been assigned to groups and actually given drugs before the trials were stopped. The story is shocking.

It’s also a good example of why traditional publishing doesn’t capture enough detail for reproducibility without the inclusion of data and code. Baggerly’s group at M.D. Anderson made reproducing these results, what he has labeled “forensic bioinformatics,” a priority, and they spent an enormous amount of time doing so. We certainly need independent verification of results, but doing so often requires knowledge of the methodology contained only in the code and data. In addition, Donoho et al. (earlier version here) make the point that even when findings are independently replicated, open code and data are necessary to understand the reasons for discrepancies in results. In a section of the paper listing and addressing objections, we say:

Objection: True Reproducibility Means Reproducibility from First Principles.

Argument: It proves nothing if I point and click and see a bunch of numbers as expected. It only proves something if I start from scratch and build your system and in my implementation I get your results.

Response: If you exactly reproduce my results from scratch, that is quite an achievement! But it proves nothing if your implementation fails to give my results, since we won’t know why. The only way we’d ever get to the bottom of such a discrepancy is if we both worked reproducibly.

(ps. Audio and slides for a slightly shorter version of Baggerly’s talk here)

Open peer review of science: a possibility

The Nature journal Molecular Systems Biology published an editorial, “From Bench to Website,” explaining its move to a transparent system of peer review: anonymous referee reports, editorial decisions, and author responses are published alongside the final paper. When this exchange is published, care is taken to preserve the anonymity of reviewers and not to disclose any unpublished results. Authors can also opt out and request that their review information not be published at all.

Here’s an example of the commentary that is being published alongside the final journal article.

Their move follows a similar decision by The EMBO Journal (European Molecular Biology Organization), described in an editorial here, which states that the “transparent editorial process will make the process that led to acceptance of a paper accessible to all, as well as any discussion of merits and issues with the paper.” Their reasoning cites problems in the process of scientific communication, and they point to an example from Martin Raff, published as a letter to the editor called “Painful Publishing” (behind a paywall, apologies). Raff laments the power of anonymous reviewers to demand often unwarranted additional experimentation as a condition of publication: “authors are so keen to publish in these select journals that they are willing to carry out extra, time consuming experiments suggested by referees, even when the results could strengthen the conclusions only marginally. All too often, young scientists spend many months doing such ‘referees’ experiments.’ Their time and effort would frequently be better spent trying to move their project forward rather than sideways. There is also an inherent danger in doing experiments to obtain results that a referee demands to see.”

Rick Trebino, a physics professor at Georgia Tech, penned a note detailing the often incredible steps he went through in trying to publish a scientific comment: “How to Publish a Scientific Comment in 1 2 3 Easy Steps.” It describes deep problems in our scientific discourse today. The recent clinical trials scandal at Duke University is another example of failed scientific communication: many efforts were made to publish correspondence regarding errors in the published papers, correspondence that might have allowed problems in the research to be addressed earlier.

The editorial in Molecular Systems Biology also announces that the journal is joining many others in encouraging authors to upload the data underlying the results in the paper, to be published alongside the final article. They go one step further and provide links from each figure in the paper to its underlying data. They give an example of such linked figures here. My question is how this dovetails with recent efforts by Donoho and Gavish to create a system of universal figure-level identifiers for published results, and with the work of Altman and King on Universal Numerical Fingerprints (UNFs) for data citation.
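To show what a data fingerprint buys you, here is a much-simplified illustration of the underlying idea: normalize the values to a canonical form, then hash them, so a citation can later verify that the dataset it points to is unchanged. This is not the actual UNF algorithm of Altman and King, just a toy version of the concept.

    import hashlib

    def toy_fingerprint(column, digits=7):
        # Round values to a fixed precision so trivial representation differences
        # (0.1 vs 0.10000000001) do not change the fingerprint. The real UNF
        # specification is more involved; this is only a conceptual sketch.
        normalized = "\n".join(format(float(x), ".{}g".format(digits)) for x in column)
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

    print(toy_fingerprint([0.1, 2.5, 3.14159265]))
    print(toy_fingerprint([0.10000000001, 2.5, 3.14159265]))  # same fingerprint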

Chris Wiggins: Science is social

I had the pleasure of watching my friend Chris Wiggins, professor of applied physics and applied mathematics, give an excellent short talk at NYC’s Social Media Week at Google. The video is available here: http://livestre.am/BUDx.

Chris makes the often forgotten point that science is inherently social: if discoveries aren’t publicly communicated, and hence added to our stock of knowledge, they aren’t science. He notes reproducibility as a manifestation of this openness in communication. (As another example of openness, Karl Popper suggested that if you’re interested in working in an international community, you should become a scientist.) Chris showcases many new web-based sharing tools and how they augment our fundamental norms rather than change them, hence his disagreement with the session title “Research Gone Social”: science has always been social.