The following comments were posted in response to the second wave of the OSTP’s call, posted at http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave (comments posted here and on the OSTP site here; scroll to the second-to-last comment) asked for feedback on implementation issues. The second wave requests input on Features and Technology, and Chris Wiggins and I posted the following comments:
We address each of the questions for phase two of OSTP’s forum on public access in turn. The answers generally depend on the community involved and (particularly question 7, asking for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial, however, in (i) providing a centralized repository to access agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).
Agency-funded research output will contain at least a peer-reviewed final paper, and if computational, should also contain data and code ensuring that the work is reproducible (the paper, code, and data together are described as the research “compendium”). It is imperative to provide public access to taxpayer-funded scientific output — not only to the final published paper but also the supporting data and code — for the reproducibility and skepticism fundamental to scientific communication and progress.
We address these eight questions in turn:
1. In what format should published papers be submitted in order to make them easy to find, retrieve, and search and to make it easy for others to link to them?
As a general rule publication formats and standards evolve over time as technologies develop and should not be mandated. Any development of research sharing platforms should take into account the evolving nature of standards and formats, and permit this innovation in an open community-driven way. Likely the easiest format for searching, at present, is that of XML; however, as this is not a publishing standard, a more reasonable intermediate goal is that of annotated PDFs and LaTeX comments which can easily be converted into XML given their rich use of structured environments (e.g., tables, figures, and citations). PDF is largely standard for scientific publications today, but is a proprietary format and should not be regulated as a standard. Proprietary formats, particularly those requiring purchase of specific commercial software, should strongly and unambiguously be discouraged by OSTP.
2. Are there existing digital standards for archiving and interoperability to maximize public benefit?
For manuscripts, there are at least two examples of widely-used standards for archiving. The first is the NIH’s use of PubMed and PubMed Central. PubMed is a list of pointers with unique stable IDs (a.k.a. PMIDs) pointing to the peer-reviewed manuscript’s citation or, if available, online presence. PubMed Central serves as an archive of published, peer-reviewed manuscripts. PubMed couples to the dynamics of both publishing and funding, in that the final requirement the NIH makes of grant recipients is to use the PubMed Central identifier at the end of citations. The use of unique identifiers for papers, as well as for data and code, can encourage the release and hence citation of all forms of research. Such a unique identifier would also indicate compliance with agency open access policies. PubMed also assists in citation by exporting citations in several formats (though, unfortunately, not in BibTeX, the most widely-used format among quantitative and computational scientists).
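As an illustration of how little is needed to carry a stable identifier into BibTeX, here is a minimal sketch. The function name, its parameters, and the choice to place the identifiers in the "note" field are our assumptions for illustration, not part of any PubMed export; the citation data in the usage lines are made-up placeholders, and special characters are not escaped.

```python
def to_bibtex(key, authors, title, journal, year, pmid=None, pmcid=None):
    """Render one article citation as a BibTeX @article entry.

    Hypothetical helper: identifiers are carried in the free-form "note" field.
    """
    fields = [
        ("author", " and ".join(authors)),  # BibTeX separates authors with "and"
        ("title", title),
        ("journal", journal),
        ("year", str(year)),
    ]
    # Keep only the identifiers that were actually supplied.
    ids = [f"{label}: {value}"
           for label, value in (("PMID", pmid), ("PMCID", pmcid)) if value]
    if ids:
        fields.append(("note", ", ".join(ids)))
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@article{{{key},\n{body}\n}}"


# Placeholder citation, not a real paper.
entry = to_bibtex("doe2009", ["Doe, Jane", "Roe, Richard"],
                  "An Example Title", "Journal of Examples", 2009,
                  pmid="12345678")
print(entry)
```

Because the identifier travels with the entry, a reader's reference manager can resolve the citation back to the archived manuscript.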
The second example is http://arXiv.org, which originates from a different set of communities and is used purely for archiving; uploaded manuscripts need not ever be submitted for peer-review. ArXiv entries are given a unique “tag” pointing to the uploaded manuscript. After April 2007, the format was changed to a simple YYMM.NNNN, serving as a date-specific quantitative ID.
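A date-specific identifier of this kind is easy to validate and decompose mechanically. The sketch below accepts only the bare YYMM.NNNN form described above; the function name and the returned dictionary layout are our own choices, and version suffixes or other variants are deliberately not handled.

```python
import re

# Bare YYMM.NNNN form only: two-digit year, month 01-12, four-digit number.
ARXIV_ID = re.compile(r"^(?P<yy>\d{2})(?P<mm>0[1-9]|1[0-2])\.(?P<num>\d{4})$")


def parse_arxiv_id(identifier):
    """Split a YYMM.NNNN identifier into year, month, and sequence number.

    Returns None when the string does not match the YYMM.NNNN form.
    """
    match = ARXIV_ID.match(identifier)
    if match is None:
        return None
    return {
        # Assumption: two-digit years are read as 20YY, since this scheme began in 2007.
        "year": 2000 + int(match["yy"]),
        "month": int(match["mm"]),
        "number": int(match["num"]),
    }


print(parse_arxiv_id("0910.1234"))
```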
Not yet developed is a similar set of IDs for research compendia (defined above as the manuscript, code, and data required for reproducing the work). Tagging of research compendia is an important issue for communicating work, facilitating topical web searches, and aggregating a researcher’s contributions, including their data and code. Development of a standard RDFa vocabulary for HTML tags for agency funded research would enable search for data, code, and research as well as facilitating the transmission of licensing information, authorship, and sources. Enabling search by author would allow a more granular understanding of a researcher’s contributions, beyond citations. This would provide an incentive to release data and code, and give others — such as funders, award committees, and university hiring and promotion committees — access to a more representative assessment of the researcher’s contributions to the community than mere publication-counting. Such a tagging vocabulary could include unique identifiers for data and code, ideally the same as those required for repository deposit as discussed in the previous section, and thus facilitate and encourage their citation.
The leading efforts on these topics include and http://www.openarchives.org/ore/. The issue is not restricted to data however; for computational work the entire research compendium must be incorporated into the semantic structure. A recent talk by one of the authors on this issue, proposing HTML+RDFa tagging for research compendia, is available via http://www.stanford.edu/~vcs/talks/CCTechSummitVCS06262009.pdf.
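To make the tagging idea concrete, here is a minimal sketch that emits an HTML fragment annotated with RDFa attributes (vocab, typeof, property, resource). No standard vocabulary for research compendia exists yet, so the vocabulary URL, the term names (Compendium, title, author, license, paper, data, code), and all example.org identifiers are hypothetical placeholders.

```python
from html import escape

# Hypothetical vocabulary URL, standing in for a standard yet to be developed.
VOCAB = "http://example.org/compendium#"


def compendium_rdfa(compendium_id, title, author, license_url, parts):
    """Return an HTML fragment describing a research compendium with RDFa.

    `parts` maps a role ("paper", "data", "code") to that item's identifier URL.
    """
    lines = [
        f'<div vocab="{escape(VOCAB)}" typeof="Compendium" '
        f'resource="{escape(compendium_id)}">',
        f'  <span property="title">{escape(title)}</span>',
        f'  <span property="author">{escape(author)}</span>',
        f'  <a property="license" href="{escape(license_url)}">license</a>',
    ]
    # One link per component, so data and code are citable in their own right.
    for role, url in parts.items():
        lines.append(
            f'  <a property="{escape(role)}" href="{escape(url)}">{escape(role)}</a>'
        )
    lines.append("</div>")
    return "\n".join(lines)


fragment = compendium_rdfa(
    "http://example.org/c/1",
    "Signals & Noise",
    "Jane Doe",
    "http://example.org/licenses/attribution",
    {"paper": "http://example.org/c/1/paper",
     "data": "http://example.org/c/1/data",
     "code": "http://example.org/c/1/code"},
)
print(fragment)
```

A crawler that understands RDFa could then search by author, license, or component type without any agency-specific scraping logic.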
3. How are these anticipated to change?
Technical challenges ahead will be set, as they have been for the past decades, by the growing sizes of the data files and code bases to be shared. The flexibility of XML (allowing future defined environment tags, for example) has so far kept up with the unpredictable, changing demands of users. We anticipate that such a mark-up language standard, which includes the possibility of defining new environments, is likely the best option for moving forward.
The recent increase in research collaboration and virtual organizations suggests another possible pressure on standards. As scientific research becomes more highly tied to massive computation, for example the NSF’s TeraGrid computing infrastructure, research will tend to proceed through virtual environments allowing intensive collaboration by researchers separated geographically. The sharing of code and data in concurrent use is already happening, in addition to the downstream reuse of code and data by subsequent researchers. These virtual environments are developing standards for sharing that could exert pressure on the evolution of formats and protocols for code, data, and manuscript communication.
4. Are there formats that would be especially useful to researchers wishing to combine datasets or other published results published from various papers in order to conduct comparative studies or meta-analyses?
Formats should emerge from the researching communities (as was the case with the Protein Data Bank (PDB), at http://pdb.org), with encouragement toward HTML+RDFa standards for inclusion of meta-data. Careful consideration should be given to the locus of the digital archiving, however. The creation of multiple, community-specific or agency-specific repositories does not facilitate interdisciplinary communication and thwarts scripted search and API usage; a national research repository should be established to house released agency-funded manuscripts, including supporting digital materials such as data and code, and provide links to research housed elsewhere. Many institutions do not have repositories, nor do they have the resources to maintain them. For computational work, supporting data and code must accompany article release, creating additional demands on a repository. For papers whose results can be replicated from short scripts and small datasets, many computational scientists who do engage in reproducible research are able to host their research compendia (paper, data, and code) on their institutional web-pages or using hosting resources their institution is willing to provide. These individual contributions, however, may not conform to standardized formats that facilitate scripted search, may not display transparent versioning and crucial time-stamping of edits and revisions, and may not be labeled with unique object identifiers as required by the NIH Open Access policy. These desiderata could be implemented in a straightforward manner by a neutral third-party site such as one coordinated among multiple funding agencies (as is the case with PDB). Not all computational research involves small amounts of supplemental data and code, and an inter-agency repository could host very large datasets or complex bodies of code in cases where institutional support is not available to the researcher.
Such a repository could extend the capabilities of http://arXiv.org or PubMed Central for all federally funded research (data, code, and peer-reviewed final manuscripts; perhaps renaming PubMed Central the more representative “PubSci” or “PubCentral”). A centralized repository is especially useful in encouraging researchers to combine datasets and/or code, as opposed to siloing the research by topic area.
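The desiderata named above (versioning, time-stamping, and unique object identifiers) can be sketched in a few lines. This is only an illustration under our own assumptions: the record layout, the `deposit` function, and the identifier scheme (name, version, and a content-hash prefix) are hypothetical and not taken from any existing repository.

```python
import hashlib
from datetime import datetime, timezone


def deposit(history, name, content, when=None):
    """Append a new version record for `name` to `history` and return it.

    `content` is the deposited file as bytes; `when` defaults to the current UTC time.
    """
    when = when or datetime.now(timezone.utc)
    # Version numbers count prior deposits under the same name.
    version = 1 + sum(1 for record in history if record["name"] == name)
    digest = hashlib.sha256(content).hexdigest()
    record = {
        "name": name,
        "version": version,
        "timestamp": when.isoformat(),
        # Hypothetical identifier scheme: name, version, and a content-hash prefix.
        "identifier": f"{name}/v{version}/{digest[:12]}",
        "sha256": digest,
    }
    history.append(record)
    return record


history = []
stamp = datetime(2009, 11, 21, tzinfo=timezone.utc)
first = deposit(history, "analysis.py", b"print(1)\n", when=stamp)
second = deposit(history, "analysis.py", b"print(2)\n", when=stamp)
print(first["identifier"], second["identifier"])
```

Because each record carries a checksum of the deposited bytes, a later reader can confirm that the code or data they downloaded is the version that was cited.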
5. What are the best examples of usability in the private sector (both domestic and international) and what makes them exceptional?
There are few in the private sector, in which there are often disincentives to transparency and interoperability. Successes at standardizing the maintenance and submission of code, for example, can be found in the private sector efforts at http://code.google.com, http://sourceforge.net, and http://github.com, which are actively used by some academic researchers.
In the academic sector, notable examples to be emulated include http://arXiv.org (for manuscripts) and the Protein Data Bank (http://pdb.org; for protein structure data, one specific data type), which has worked since 1971 to solve the complexities of data sharing as well as the loosely-aligned interests of publishers, scientists, and funding agencies. There are many successful examples of data sharing in academic communities, such as Gary King’s Social Science research repository at Harvard, http://TheData.org, or Pat Brown’s Stanford MicroArray Database at http://smd.stanford.edu. Note that the MicroArray community publishes its data with every publication as a routinely accepted requirement; similar standards have been enforced in protein structure since the 1990s (cf. http://www.nature.com/nsmb/wilma/v5n3.892130820.html).
Since the data and code are being shared and reused, licensing agreements in these repositories come to the fore. This is an open and active problem across academia, largely with the goal of securing attribution rights for owners and permitting use and reuse by others, while minimizing or eliminating licensing incompatibilities between different datasets. Licenses must be compatible for different datasets or different programs to be combined.
6. Should those who access papers be given the opportunity to comment or provide feedback?
Online submission is clearly advantageous for the open and democratic sharing of opinion. However, given the very real consequences (including to future funding, careers, and, in the case of such fields as climate and medicine, policy and political decisions), feedback should be moderated, restricted to verified email addresses, and provided via unique IPs.
7. What are the anticipated costs of maintaining publicly accessible libraries of available papers, and how might various public access business models affect these maintenance costs?
Memory and disk space get cheaper with each year, but such a site requires staffing. The answer to this question, however, depends entirely on the scale of the implementation. What is important to note is the principle of Open Access, and such libraries should be considered valuable stewards of our culture, just as the Library of Congress and the National Archives are.
8. By what metrics (e.g. number of articles or visitors) should the Federal government measure success of its public access collections?
As mentioned above, the principle of Open Access recognizes that such collections should be considered valuable stewards of our culture, just as the Library of Congress and the National Archives are. The rewards of making scientific compendia (papers, data, and code) available come not only through views and downloads, but through the acceleration of scientific research, technological development, and an increase in scientific integrity.
Yale Law School, New Haven, CT
Science Commons, Cambridge, MA
Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY
References
These issues were discussed at a roundtable convened by one of the authors on research sharing issues held at Yale Law School on November 21, 2009. The webpage, along with thought pieces and research materials, is located at http://www.stanford.edu/~vcs/Conferences/RoundtableNov212209/.