The discussion around open access to published scientific results, the Open Access movement, is well known. The current situation, in which journal publishers own the copyright on journal articles and therefore charge for access, stems primarily from authors signing their copyright over to the journals. I believe this happened because authors didn’t realize what they were doing when they signed away ownership of their work, and had they known, they would not have done so. I believe another solution would have been used, such as granting the journal a license to publish, e.g. Science’s readily available alternative license. At some level authors were entering into binding legal contracts without an understanding of the implications and without the right counsel.
I am seeing a similar situation arising with respect to data. It is not atypical for data-producing entities, particularly those in the commercial sphere, to require that researchers with access to the data sign a non-disclosure agreement. This seems to be standard for Facebook data, Elsevier data, and many others. I’m witnessing researchers grabbing their pens and signing and, as in the publication context, feeling powerless to do otherwise. Again, they are without the appropriate counsel. Even the general counsel’s office at their institution typically sees its role as protecting the institution against liability, rather than the larger concern of protecting the scholar’s work and the integrity of the scholarly record. What happens when research from these protected datasets is published, and questioned? How can others independently verify the findings? They’ll need access to the data.
There are many legitimate reasons such data cannot be publicly released, for example protection of subjects’ privacy (see what happened when Harvard released Facebook data from a study). But as scientists we should be mindful of the need for our published findings to be reproducible. Some commercial data do not come with privacy concerns, only the company’s concern that it remain able to sell the data to other commercial entities, and sometimes not even that. Sometimes lawyers simply want an NDA to minimize any risk to the commercial entity that might arise should the data be released. To me, that seems perfectly rational, since they are not stewards of scientific knowledge.
It is also perfectly rational for authors publishing findings based on these data to push back as hard as possible to ensure maximum reproducibility and credibility of their results. Many companies share data with scientists because they seek to deepen goodwill and ties with the academic community, or because they are interested in the results of the research. As researchers, we should condition our acceptance of the data on its release when the findings are published, if there are no privacy concerns associated with the data. If there are privacy concerns, I can imagine ensuring we can share the data in a “walled garden” within which other researchers, but not the public, will be able to access the data and verify results. There are a number of solutions that can bridge the gap between open access to data and an access-blocking NDA (e.g. differential privacy). As scientists, we are responsible for the integrity and reproducibility of our work, and that should be a core concern in this negotiation for data.
A few template data-sharing agreements between academic researchers and data-producing companies would be very helpful, if anyone feels like taking a crack at drafting them (Creative Commons?). Awareness of the issue is also important among researchers, publishers, funders, and data-producing entities. We cannot unthinkingly default to a legal situation regarding data that is anathema to scientific progress, as we did with access to scholarly publications.