Data access going the way of journal article access? Insist on open data

The discussion around open access to published scientific results, the Open Access movement, is well known. The primary cause of the current situation (journal publishers owning copyright on journal articles and therefore charging for access) is that authors sign their copyright over to the journals. I believe this happened because authors didn’t realize what they were doing when they signed away ownership of their work, and had they known they would not have done so. I believe another solution would have been used, such as granting the journal a license to publish, as in Science’s readily available alternative license. At some level authors were entering into binding legal contracts without understanding the implications and without the right counsel.

I am seeing a similar situation arising with respect to data. It is not atypical for data-producing entities, particularly those in the commercial sphere, to require that researchers with access to the data sign a non-disclosure agreement. This seems to be standard for Facebook data, Elsevier data, and many, many others. I’m witnessing researchers grabbing their pens and signing and, as in the publication context, feeling themselves powerless to do otherwise. Again, they are without the appropriate counsel. Even the general counsel’s office at their institution typically sees its role as protecting the institution against liability, rather than the larger concern of protecting the scholar’s work and the integrity of the scholarly record. What happens when research from these protected datasets is published, and questioned? How can others independently verify the findings? They’ll need access to the data.

There are many legitimate reasons such data cannot always be publicly released, for example protection of subjects’ privacy (see what happened when Harvard released Facebook data from a study). But as scientists we should be mindful of the need for our published findings to be reproducible. Some commercial data do not come with privacy concerns, only the company’s concern that it remain able to sell the data to other commercial entities, and sometimes not even that. Sometimes lawyers simply want an NDA to minimize any risk to the commercial entity that might arise should the data be released. To me, that seems perfectly rational since they are not stewards of scientific knowledge.

It is also perfectly rational for authors publishing findings based on these data to push back as hard as possible to ensure maximum reproducibility and credibility of their results. Many companies share data with scientists because they seek to deepen goodwill and ties with the academic community, or because they are interested in the results of the research. As researchers we should condition our acceptance of the data on its release when the findings are published, provided there are no privacy concerns associated with the data. If there are privacy concerns, I can imagine sharing the data in a “walled garden” within which other researchers, but not the public, can access the data and verify results. A number of solutions (e.g. differential privacy) can bridge the gap between open access to data and an access-blocking NDA, and as scientists we bear responsibility for the integrity and reproducibility of our work in this negotiation for data.

A few template data-sharing agreements between academic researchers and data-producing companies would be very helpful, if anyone feels like taking a crack at drafting them (Creative Commons?). Awareness of the issue is also important among researchers, publishers, funders, and data-producing entities. We cannot unthinkingly default to a legal situation regarding data that is anathema to scientific progress, as we did with access to scholarly publications.

3 Responses to “Data access going the way of journal article access? Insist on open data”


  • Interesting article, but data holds no copyright. It’s potentially only governed by intellectual property agreements from those funding the research, i.e. research councils, universities, and commercial bodies (among others). Privacy and anonymity are key issues for anything relating to personal data, and are non-trivial to deal with, while data quality and equivalence are also pertinent issues that are often poorly understood but that make reuse challenging. Many researchers work hard to derive data that their academic output and reputations will then depend on, which creates a very difficult atmosphere for making it all open in one go; the threat of meta-researchers stealing the glory without getting their hands dirty is also a real issue. But this is a cultural challenge that must be faced, and it’s the ways of understanding all of this and managing the approach to data (of all colours and textures) that more and more researchers will be interested in. Remember though, it takes more than just data and machines to advance science!

  • Thanks for the thoughtful comment. Whether or not data are copyrightable, researchers could benefit from sharing agreements that push as much as possible toward openness in support of scientific credibility and reproducibility. Agreed that credit and citation for data and code are important and severely lacking.

  • Simply lovely to see your talk today! This is an excellent summary of why open data matters and what it could look like.

    I’m surprised that there aren’t good template agreements already… you’ve inspired me to look for some. Of course in some fields even discoveries aren’t shared until the last possible minute, even though sharing discoveries was one of the first revolutions in science. So we’ve never had uniform adoption of sane practices across the board.
