<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Victoria Stodden &#187; Software</title>
	<atom:link href="http://blog.stodden.net/category/software/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.stodden.net</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Sun, 16 May 2010 02:13:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Open Data Dead on Arrival</title>
		<link>http://blog.stodden.net/2010/02/03/open-data-dead-on-arrival/</link>
		<comments>http://blog.stodden.net/2010/02/03/open-data-dead-on-arrival/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 04:17:27 +0000</pubDate>
		<dc:creator>vcs</dc:creator>
				<category><![CDATA[Developing world]]></category>
		<category><![CDATA[Intellectual Property]]></category>
		<category><![CDATA[Open Science]]></category>
		<category><![CDATA[Reproducible Research]]></category>
		<category><![CDATA[Scientific Method]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.stodden.net/?p=174</guid>
		<description><![CDATA[In 1984 Karl Popper wrote a private letter to an inquirer he didn&#8217;t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that: &#8220;Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his [...]]]></description>
			<content:encoded><![CDATA[<p>In 1984 Karl Popper wrote a private letter to an inquirer he didn&#8217;t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that:</p>
<blockquote><p>
&#8220;Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his fellow men (or &#8216;to society&#8217;) to represent the results of his study as simply, clearly and modestly as he can. The worst thing that intellectuals can do &#8212; the cardinal sin &#8212; is to try to set themselves up as great prophets vis-a-vis their fellow men and to impress them with puzzling philosophies. Anyone who cannot speak simply and clearly should say nothing and continue to work until he can do so.&#8221;
</p></blockquote>
<p>Aside from the offensive sexism in referring to intellectuals as males, there is another way this imperative should be updated for intellectualism today. The movement to make data available online is picking up momentum &#8212; as it should &#8212; and open code is following suit (see <a href=http://mloss.org/>http://mloss.org</a> for example). But data should not be confused with facts, and applying the simple communication that Popper refers to beyond the written or spoken word is the only way open data will produce dividends. It isn&#8217;t enough to post raw data, or undocumented code. Data and code should be considered part of intellectual communication, and made as simple as possible for &#8220;fellow men&#8221; to understand. Just as knowledge of adequate English vocabulary is assumed in the nonquantitative communication Popper refers to, certain basic coding and data knowledge can be assumed as well. This means the same thing as it does in the literary case: the elimination of extraneous information and obfuscating terminology. No need to bury interested parties in an Enron-like shower of bits. It also means using a format for digital communication that is conducive to reuse, such as a flat text file or another non-proprietary format; PDF files, for example, cannot be considered acceptable for either data or code. Facilitating reproducibility must be the gold standard for data and code release.</p>
<h4>And who are these &#8220;fellow men&#8221;?</h4>
<p>Well, fellow men and women that is, but back to the issue. Much of the history of scientific communication has dealt with the question of demarcation of the appropriate group to whom the reasoning behind the findings would be communicated, the definition of the scientific community. Clearly, communication of very technical and specialized results to a layman would take intellectuals&#8217; time away from doing what they do best, being intellectual. On the other hand some investment in explanation is essential for establishing a finding as an accepted fact &#8212; assuring others that sufficient error has been controlled for and eliminated in the process of scientific discovery. These others ought to be able to verify results, find mistakes, and hopefully build on the results (or the gaps in the theory) and thereby further our understanding. So there is a tradeoff. Hence the establishment of the Royal Society for example as a body with the primary purpose of discussing scientific experiments and results. Couple this with Newton&#8217;s surprise, or even irritation, at having to explain results he put forth to the Society in his one and only journal publication in their journal Philosophical Transactions (he called the various clarifications tedious, sought to withdraw from the Royal Society, and subsequently never published another journal paper; see the last chapter of <a href=http://mitpress.mit.edu/catalog/item/default.asp?tid=10611&#038;ttype=2>The Access Principle</a>). <b>There is a mini-revolution afoot that has escaped the spotlight of attention on open data, open code, and open scientific literature. That is, the fact that the intent is to open to the public.</b> Not open to peers, or appropriately vetted scientists, or selected ivory tower mates, but to anyone. Never before has the standard for communication been &#8220;everyone,&#8221; in fact quite the opposite. 
Efforts had traditionally been expended narrowing and selecting the community privileged enough to participate in scientific discourse.</p>
<h4>So what does public openness mean for science?</h4>
<p>Recall the leaked files from the University of East Anglia&#8217;s Climatic Research Unit last November. Much of the information revealed concerned scientifically suspect (and ethically dubious) attempts not to reveal data and methods underlying published results. Although that tack seems to have <a href=http://www.realclimate.org/index.php/archives/2009/12/please-show-us-your-code/>softened now</a>, some initial responses defended the climate scientists&#8217; right to be closed with regard to their methods due to the possibility of &#8220;<a href=http://sgillies.net/blog/970/ddos-on-climate-science/>denial of service attacks</a>&#8221; &#8211; the ripping apart of methodology (recall all science is wrong, an asymptotic progression toward truth at best) not with the intent of finding meaningful errors that halt the acceptance of findings as facts, but merely to tie up the climate scientists so they cannot attend to real research. This is the same tradeoff as described above. An interpretation of this situation cannot be made without the complicating realization that peer review &#8212; the review process that vets articles for publication &#8212; doesn&#8217;t check computational results but largely operates as if the papers are expounding results from the pre-computational scientific age. The outcome, if computational methodologies are able to remain closed from view, is that they are directly vetted nowhere. Hardly an acceptable basis for establishing facts. My own view is that data and code must be communicated publicly with attention paid to Popper&#8217;s admonition: as simply and clearly as possible, such that the results can be replicated. Not participating in dialog with those insufficiently knowledgeable to engage will become part of our scientific norms; in fact this is enshrined in the structure of our scientific societies of old. Others can take up those ends of the discussion, on blogs, in digital forums. 
But public openness is important not just because taxpayers have a right to what they paid for (perhaps they do, but this quickly falls apart since not all the public are technically taxpayers and that seems a wholly unjust way of deciding who shall have access to scientific knowledge and who not, clearly we mean society), but because of the increasing inclusiveness of the scientific endeavor. How do we determine who is qualified to find errors in our scientific work? We don&#8217;t. Real problems will get noticed regardless of with whom they originate, many eyes making all bugs shallow. And I expect peer review for journal publishing to incorporate computational evaluation as well.</p>
<h4>Where does this leave all the open data?</h4>
<p>Unused, unless efforts are expended to communicate the meaning of the data, and to maximize the usability of the code. Data is not synonymous with facts &#8211; methods for understanding data, and turning its contents into facts, are embedded within the documentation and code. Take for granted that users understand the coding language or basic scientific computing functions, but clearly and modestly explain the novel contributions. Facilitate reproducibility. Without this, data may be open, but will remain de facto in the ivory tower.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stodden.net/2010/02/03/open-data-dead-on-arrival/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Ars technica article on reproducibility in science</title>
		<link>http://blog.stodden.net/2010/01/26/ars-technica-article-on-reproducibility-in-science/</link>
		<comments>http://blog.stodden.net/2010/01/26/ars-technica-article-on-reproducibility-in-science/#comments</comments>
		<pubDate>Tue, 26 Jan 2010 19:58:27 +0000</pubDate>
		<dc:creator>vcs</dc:creator>
				<category><![CDATA[Open Science]]></category>
		<category><![CDATA[Reproducible Research]]></category>
		<category><![CDATA[Scientific Method]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.stodden.net/2010/01/26/ars-technica-article-on-reproducibility-in-science/</guid>
		<description><![CDATA[John Timmer wrote an excellent article called &#8220;Keeping computers from ending science&#8217;s reproducibility.&#8221; I&#8217;m quoted in it. Here&#8217;s an excellent follow up blog post by Grant Jacobs, &#8220;Reproducible Research and computational biology.&#8221;]]></description>
			<content:encoded><![CDATA[<p>John Timmer wrote an excellent article called &#8220;<a href=http://arstechnica.com/science/news/2010/01/keeping-computers-from-ending-sciences-reproducibility.ars>Keeping computers from ending science&#8217;s reproducibility</a>.&#8221; I&#8217;m quoted in it. Here&#8217;s an excellent follow up blog post by Grant Jacobs, &#8220;<a href=http://sciblogs.co.nz/code-for-life/2010/01/24/reproducible-research-and-computational-biology/>Reproducible Research and computational biology</a>.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stodden.net/2010/01/26/ars-technica-article-on-reproducibility-in-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Code Repository for Machine Learning: mloss.org</title>
		<link>http://blog.stodden.net/2010/01/26/code-repository-for-machine-learning/</link>
		<comments>http://blog.stodden.net/2010/01/26/code-repository-for-machine-learning/#comments</comments>
		<pubDate>Tue, 26 Jan 2010 18:25:37 +0000</pubDate>
		<dc:creator>vcs</dc:creator>
				<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Open Science]]></category>
		<category><![CDATA[Reproducible Research]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://blog.stodden.net/?p=168</guid>
		<description><![CDATA[The folks at mloss.org &#8212; Machine Learning Open Source Software &#8212; invited a blog post on my roundtable on data and code sharing, held at Yale Law School last November. mloss.org&#8217;s philosophy is stated as: &#8220;Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At [...]]]></description>
			<content:encoded><![CDATA[<p>The folks at mloss.org &#8212; Machine Learning Open Source Software &#8212; invited a blog post on my roundtable on data and code sharing, held at Yale Law School last November. mloss.org&#8217;s <a href=http://mloss.org/about/>philosophy</a> is stated as:</p>
<p>&#8220;Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for a wide range of applications. Inspired by similar efforts in bioinformatics (BOSC) or statistics (useR), our aim is to build a forum for open source software in machine learning.&#8221;</p>
<p>The site is excellent and worth a visit. The guest blog <a href=http://www.columbia.edu/~chw2>Chris Wiggins</a> and I wrote starts:</p>
<p>&#8220;As pointed out by the authors of the mloss position paper [1] in 2007, &#8220;reproducibility of experimental results is a cornerstone of science.&#8221; Just as in machine learning, researchers in many computational fields (or in which computation has only recently played a major role) are struggling to reconcile our expectation of reproducibility in science with the reality of ever-growing computational complexity and opacity. [2-12]</p>
<p>In an effort to address these questions from researchers not only from statistical science but from a variety of disciplines, and to discuss possible solutions with representatives from publishing, funding, and legal scholars expert in appropriate licensing for open access, <a href=http://www.law.yale.edu/intellectuallife/informationsocietyproject.htm>Yale Information Society Project</a> Fellow Victoria Stodden convened a <a href=http://www.stanford.edu/~vcs/Conferences/RoundtableNov212009/>roundtable</a> on the topic on November 21, 2009. Attendees included statistical scientists such as <a href=http://gentleman.fhcrc.org/>Robert Gentleman</a> (co-developer of R) and <a href=http://www-stat.stanford.edu/~donoho>David Donoho</a>, among others.&#8221;</p>
<p>Keep reading at <a href=http://mloss.org/community/blog/2010/jan/26/data-and-code-sharing-roundtable/>http://mloss.org/community/blog/2010/jan/26/data-and-code-sharing-roundtable/</a>. We made an effort to reference work in other fields regarding reproducibility in computational science.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stodden.net/2010/01/26/code-repository-for-machine-learning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government</title>
		<link>http://blog.stodden.net/2009/12/28/post-2-the-ostp%e2%80%99s-call-for-comments-regarding-public-access-policies-for-science-and-technology-funding-agencies-across-the-federal-government/</link>
		<comments>http://blog.stodden.net/2009/12/28/post-2-the-ostp%e2%80%99s-call-for-comments-regarding-public-access-policies-for-science-and-technology-funding-agencies-across-the-federal-government/#comments</comments>
		<pubDate>Tue, 29 Dec 2009 02:54:38 +0000</pubDate>
		<dc:creator>vcs</dc:creator>
				<category><![CDATA[Intellectual Property]]></category>
		<category><![CDATA[Law]]></category>
		<category><![CDATA[OSTP]]></category>
		<category><![CDATA[Open Science]]></category>
		<category><![CDATA[Reproducible Research]]></category>
		<category><![CDATA[Scientific Method]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.stodden.net/?p=144</guid>
		<description><![CDATA[The following comments were posted in response to the second wave of the OSTP&#8217;s call as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave, comments posted here and on the OSTP site here (scroll to the second last comment), asked for feedback on implementation issues. The second wave requests input on Features and Technology and Chris Wiggins [...]]]></description>
			<content:encoded><![CDATA[<p>The following comments were posted in response to the second wave of the OSTP&#8217;s call as posted here: <a href=http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf>http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf</a>. The first wave, comments posted <a href=http://blog.stodden.net/2009/12/21/the-ostps-call-for-comments-regarding-public-access-policies-for-science-and-technology-funding-agencies-across-the-federal-government>here</a> and on the OSTP site <a href=http://blog.ostp.gov/2009/12/10/policy-forum-on-public-access-to-federally-funded-research-implementation>here</a> (scroll to the second last comment), asked for feedback on implementation issues. The <a href=http://blog.ostp.gov/2009/12/21/policy-forum-on-public-access-to-federally-funded-research-features-and-technology>second wave requests input on Features and Technology</a> and Chris Wiggins and I posted the following comments:</p>
<p>We address each of the questions for phase two of OSTP&#8217;s forum on public access in turn. The answers generally depend on the community involved and (particularly question 7, asking for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial however in (i) providing a centralized repository to access agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).</p>
<p><span id="more-144"></span><br />
Agency-funded research output will contain at least a peer-reviewed final paper, and if computational, should also contain data and code ensuring that the work is reproducible (the paper, code, and data together are described as the research &#8220;compendium&#8221;). It is imperative to provide public access to taxpayer-funded scientific output &#8212; not only to the final published paper but also the supporting data and code &#8212; for the reproducibility and skepticism fundamental to scientific communication and progress.</p>
<p>
<p>
We address these eight questions in turn:</p>
<p>
<p>
<i>1. In what format should published papers be submitted in order to make them easy to find, retrieve, and search and to make it easy for others to link to them?</i></p>
<p>
<p>
As a general rule publication formats and standards evolve over time as technologies develop and should not be mandated. Any development of research sharing platforms should take into account the evolving nature of standards and formats, and permit this innovation in an open community-driven way. Likely the easiest format for searching, at present, is that of XML; however, as this is not a publishing standard, a more reasonable intermediate goal is that of annotated PDFs and LaTeX comments which can easily be converted into XML given their rich use of structured environments (e.g., tables, figures, and citations). PDF is largely standard for scientific publications today, but is a proprietary format and should not be regulated as a standard. Proprietary formats, particularly those requiring purchase of specific commercial software, should strongly and unambiguously be discouraged by OSTP.</p>
<p>
<p>
<i>2. Are there existing digital standards for archiving and interoperability to maximize public benefit?</i></p>
<p>
<p>
For manuscripts, there are at least two examples of widely-used standards for archiving. The first is the NIH&#8217;s use of PubMed and PubMed Central. PubMed is a list of pointers with unique stable IDs (a.k.a. PMIDs) pointing to the peer-reviewed manuscript&#8217;s citation or, if available, online presence. PubMed Central serves as an archive of published, peer-reviewed manuscripts. PubMed couples to the dynamics of both publishing and funding, in that the final requirement the NIH makes of grant recipients is to use the PubMed Central identifier at the end of citations. The use of unique identifiers of papers, as well as of data and code, can encourage the release and hence citation of all forms of research. PubMed also assists in citation by exporting citations in several formats (though, unfortunately, not in BibTeX, the most widely-used format among quantitative and computational scientists). Such a unique identifier would also indicate compliance with agency open access policies.</p>
<p>
<p>
The second example is <a href=http://arXiv.org>http://arXiv.org</a>, which originates from a different set of communities and is used purely for archiving; uploaded manuscripts need not ever be submitted for peer-review. ArXiv entries are given a unique &#8220;tag&#8221; pointing to the uploaded manuscript. After April 2007, the format was changed to a simple YYMM.NNNN, serving as a date-specific quantitative ID.</p>
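<p>As a minimal sketch of how such a date-specific ID lends itself to scripted use, the following Python snippet checks and splits an identifier of the YYMM.NNNN form. The function name parse_arxiv_id and the reading of YY as 20YY are assumptions made for illustration; older arXiv identifiers in other formats are simply rejected.</p>

```python
import re

# New-style identifier as described in the text: YYMM.NNNN
# (two-digit year, two-digit month, four-digit sequence number).
ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4})")


def parse_arxiv_id(identifier):
    """Return (year, month, number) for a YYMM.NNNN identifier, or None."""
    match = ARXIV_ID.fullmatch(identifier)
    if match is None:
        return None
    yy, mm, number = (int(part) for part in match.groups())
    if mm not in range(1, 13):
        return None
    # Assumption: YY is read as 20YY, since this scheme dates from 2007.
    return 2000 + yy, mm, number


print(parse_arxiv_id("0704.0001"))
```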
<p>
<p>
Not yet developed is a similar set of IDs for research compendia (defined above as the manuscript, code, and data required for reproducing the work). Tagging of research compendia is an important issue for communicating work, facilitating topical web searches, and aggregating a researcher&#8217;s contributions, including their data and code. Development of a standard RDFa vocabulary for HTML tags for agency funded research would enable search for data, code, and research as well as facilitating the transmission of licensing information, authorship, and sources. Enabling search by author would allow a more granular understanding of a researcher&#8217;s contributions, beyond citations. This would provide an incentive to release data and code, and give others &#8212; such as funders, award committees, and university hiring and promotion committees &#8212; access to a more representative assessment of the researcher&#8217;s contributions to the community than mere publication-counting.  Such a tagging vocabulary could include unique identifiers for data and code, ideally the same as those required for repository deposit as discussed in the previous section, and thus facilitate and encourage their citation.</p>
<p>
<p>
The leading efforts on these topics include <a href=http://www.datacite.org/>http://www.datacite.org/</a> and <a href=http://www.openarchives.org/ore>http://www.openarchives.org/ore/</a>. The issue is not restricted to data however; for computational work the entire research compendium must be incorporated into the semantic structure. A recent talk by one of the authors on this issue, proposing HTML+RDFa tagging for research compendia, is available via <a href=http://www.stanford.edu/~vcs/talks/CCTechSummitVCS06262009.pdf>http://www.stanford.edu/~vcs/talks/CCTechSummitVCS06262009.pdf</a>.</p>
<p>
<p>
<i>3. How are these anticipated to change?</i></p>
<p>
<p>
Technical challenges ahead will be set, as they have been for the past decades, by the growing sizes of the data files and code bases to be shared. The flexibility of XML (allowing future defined environment tags, for example) has so far kept up with the unpredictable changing demands of users. We anticipate that such a mark-up language standard, which includes the possibility of defining new environments, is likely the best option for moving forward.</p>
<p>
<p>
The recent increase in research collaboration and virtual organizations suggests another possible pressure on standards. As scientific research becomes more highly tied to massive computation, for example the NSF&#8217;s <a href=http://www.teragrid.org>TeraGrid</a> computing infrastructure, research will tend to proceed through virtual environments allowing intensive collaboration by researchers separated geographically. The sharing of code and data in concurrent use is already happening, in addition to the downstream reuse of code and data by subsequent researchers. These virtual environments are developing standards for sharing that could exert pressure on the evolution of formats and protocols for code, data, and manuscript communication.</p>
<p>
<p>
<i>4. Are there formats that would be especially useful to researchers wishing to combine datasets or other published results published from various papers in order to conduct comparative studies or meta-analyses?</i></p>
<p>
<p>
Formats should emerge from the researching communities (as was the case with the Protein Data Bank (PDB), at <a href=http://pdb.org>http://pdb.org</a>), with encouragement toward HTML+RDFa standards for inclusion of meta-data. Careful consideration should be given to the locus of the digital archiving however. The creation of multiple, community-specific or agency-specific repositories does not facilitate interdisciplinary communication and thwarts scripted search and API usage; a national research repository should be established to house released agency funded manuscripts including supporting digital materials such as data and code, and provide links to research housed elsewhere. Many institutions do not have repositories, nor do they have the resources to maintain them. For computational work, supporting data and code must accompany article release, creating additional demands on a repository. For papers whose results can be replicated from short scripts and small datasets, many computational scientists who do engage in reproducible research are able to host their research compendia (paper, data, and code) on their institutional web-pages or using hosting resources their institution is willing to provide. These individual contributions, however, may not conform to standardized formats that facilitate scripted search, may not display transparent versioning and crucial time-stamping of edits and revisions, and may not be labeled with unique object identifiers as required by the NIH Open Access policy. These desiderata could be implemented in a straightforward manner by a neutral third-party site such as one coordinated among multiple funding agencies (as is the case with PDB). Not all computational research involves small amounts of supplemental data and code and an inter-agency repository could host very large datasets or complex bodies of code in cases where institutional support is not available to the researcher. 
Such a repository could extend the capabilities of http://arXiv.org or PubMed Central for all federally funded research (data, code, and peer-reviewed final manuscripts; perhaps renaming PubMed Central the more representative &#8220;PubSci&#8221; or &#8220;PubCentral&#8221;). A centralized repository is especially useful in encouraging researchers to combine datasets and/or code, as opposed to siloing the research by topic area.</p>
<p>
<p>
<i>5. What are the best examples of usability in the private sector (both domestic and international) and what makes them exceptional?</i></p>
<p>
<p>
There are few in the private sector, in which there are often disincentives to transparency and interoperability. Successes at standardizing the maintenance and submission of code, for example, can be found in the private sector efforts at <a href=http://code.google.com>http://code.google.com</a>, <a href=http://sourceforge.net>http://sourceforge.net</a>, and <a href=http://github.com>http://github.com</a>, which are actively used by some academic researchers.</p>
<p>
<p>
In the academic sector, notable examples to be emulated include <a href=http://arXiv.org>http://arXiv.org</a> (for manuscripts) and the Protein Data Bank (<a href=http://pdb.org>http://pdb.org</a>; for protein structure data, one specific data type), which has worked since 1971 to solve the complexities of data sharing as well as the loosely-aligned interests of publishers, scientists, and funding agencies. There are many successful examples of data sharing in academic communities, such as Gary King&#8217;s Social Science research repository at Harvard, <a href=http://TheData.org>http://TheData.org</a>, or Pat Brown&#8217;s Stanford MicroArray Database at <a href=http://smd.stanford.edu>http://smd.stanford.edu</a>. Note that the MicroArray community publishes their data with every publication as a routinely accepted requirement; similar standards have been enforced in protein structure since the 1990s (cf. <a href=http://www.nature.com/nsmb/wilma/v5n3.892130820.html>http://www.nature.com/nsmb/wilma/v5n3.892130820.html</a>).</p>
<p>
<p>
Since the data and code are being shared and reused, licensing agreements in these repositories come to the fore. This is an open and active problem across academia, largely with the goal of securing attribution rights for owners and permitting use and reuse by others, while minimizing or eliminating licensing incompatibilities between different datasets. Licenses must be compatible for different datasets or different programs to be combined.</p>
<p>
<p>
<i>6. Should those who access papers be given the opportunity to comment or provide feedback?</i></p>
<p>
<p>
Online submission is clearly advantageous for the open and democratic sharing of opinion. However, given the very real consequences (including to future funding, careers, and, in the case of such fields as climate and medicine, policy and political decisions), feedback should be moderated, restricted to verified email addresses, and provided via unique IPs.</p>
<p>
<p>
<i>7. What are the anticipated costs of maintaining publicly accessible libraries of available papers, and how might various public access business models affect these maintenance costs?</i></p>
<p>
<p>
Memory and disk space get cheaper with each year, but such a site requires staffing. The answer to this question, however, depends entirely on the scale of the implementation. What is important to note is the principle of Open Access, and such libraries should be considered valuable stewards of our culture just as the Library of Congress and the National Archives are.</p>
<p>
<p>
<i>8. By what metrics (e.g. number of articles or visitors) should the Federal government measure success of its public access collections?</i></p>
<p>
<p>
As mentioned above, the principle of Open Access recognizes that such collections should be considered valuable stewards of our culture just as the Library of Congress and the National Archives are. Rewards to the availability of scientific compendia &#8212; papers, data, and code &#8212; come not only through views and downloads, but through the acceleration of scientific research, technological development, and an increase in scientific integrity.</p>
<p>
<p>
Victoria Stodden<br />
Yale Law School, New Haven, CT<br />
Science Commons, Cambridge, MA</p>
<p>http://www.stanford.edu/~vcs</p>
<p>
<p>
Chris Wiggins<br />
Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY</p>
<p>http://www.columbia.edu/~chw2</p>
<p>
<p>
<b>References</b> These issues were discussed at a roundtable convened by one of the authors on research sharing issues held at Yale Law School on November 21, 2009.  The webpage, along with thought pieces and research materials, is located at <a href=http://www.stanford.edu/~vcs/Conferences/RoundtableNov212009>http://www.stanford.edu/~vcs/Conferences/RoundtableNov212009/</a>.</p>
<p>
Crossposted at <a href="http://blog.ostp.gov/2009/12/21/policy-forum-on-public-access-to-federally-funded-research-features-and-technology">http://blog.ostp.gov/2009/12/21/policy-forum-on-public-access-to-federally-funded-research-features-and-technology/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stodden.net/2009/12/28/post-2-the-ostp%e2%80%99s-call-for-comments-regarding-public-access-policies-for-science-and-technology-funding-agencies-across-the-federal-government/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Climate Modeling Leak: Code and Data Generating Published Results Must be Open and Facilitate Reproducibility</title>
		<link>http://blog.stodden.net/2009/11/30/the-climate-modeling-leak-code-and-data-generating-published-results-must-be-open-and-facilitate-reproducibility/</link>
		<comments>http://blog.stodden.net/2009/11/30/the-climate-modeling-leak-code-and-data-generating-published-results-must-be-open-and-facilitate-reproducibility/#comments</comments>
		<pubDate>Mon, 30 Nov 2009 15:15:29 +0000</pubDate>
		<dc:creator>vcs</dc:creator>
				<category><![CDATA[Law]]></category>
		<category><![CDATA[Open Science]]></category>
		<category><![CDATA[Reproducible Research]]></category>
		<category><![CDATA[Scientific Method]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.stodden.net/?p=114</guid>
		<description><![CDATA[On November 20 documents including email and code spanning more than a decade were leaked from the Climatic Research Unit (CRU) at the University of East Anglia in the UK. The Leak Reveals a Failure of Reproducibility of Computational Results It appears as though the leak came about through a long battle to get the CRU [...]]]></description>
			<content:encoded><![CDATA[<p>On November 20 <a href="http://blogs.sciencemag.org/scienceinsider/2009/11/climate-hack-sc.html">documents including email and code spanning more than a decade were leaked</a> from the <a href="http://www.cru.uea.ac.uk/"><del datetime="2009-12-29T02:39:44+00:00">Computing</del> Climatic Research Unit (CRU) at the University of East Anglia</a> in the UK.</p>
<h5>The Leak Reveals a Failure of Reproducibility of Computational Results</h5>
<p>It appears as though the leak came about through a long battle to get the CRU scientists to reveal the code and data associated with published results, and highlights a crack in the scientific method as practiced in computational science. Publishing standards have not yet adapted to the relatively new computational methods used pervasively across scientific research today.</p>
<p>Other branches of science have long-established methods to bring reproducibility into their practice. Deductive or mathematical results are published only with proofs, and there are long-established standards for an acceptable proof. Empirical science has clear mechanisms for communicating methods, with the goal of facilitating replication. Computational methods are a relatively new addition to a scientist&#8217;s toolkit, and the scientific community is only just establishing similar standards for verification and reproducibility in this new context. Peer review and journal publishing have generally not yet adapted to the use of computational methods and still operate in ways suited to the deductive or empirical branches, creating a growing credibility gap in computational science.</p>
<p>The key point emerging from the leak of the CRU docs is that without the code and data it is all but impossible to tell whether the research is right or wrong, and this community&#8217;s lack of awareness of reproducibility and blustery demeanor do not inspire confidence in their production of reliable knowledge. This leak and the ensuing embarrassment would not have happened if code and data that permit reproducibility had been released alongside the published results. When mature, computational science will produce <em>routinely verifiable results</em>.</p>
<h5>Verifying Computational Results without Clear Communication of the Steps Taken is Near-Impossible</h5>
<p>The frequent near-impossibility of verifying computational results when reproducibility is not considered a research goal is shown by the miserable travails of &#8220;Harry,&#8221; a CRU employee with access to their system who was trying to reproduce the temperature results. The leaked documents contain <a href="http://pajamasmedia.com/blog/climategate-computer-codes-are-the-real-story/">logs of his unsuccessful attempts</a>. It seems reasonable to conclude that CRU&#8217;s published results aren&#8217;t reproducible if Harry, an insider, was unable to reproduce them after four years.</p>
<p>This example also illustrates why a decision to leave reproducibility to others, beyond a cursory description of methods in the published text, is wholly inadequate for computational science. Harry seems to have had access to the data and code used, and still he couldn&#8217;t replicate the results. The merging and preprocessing of data in preparation for modeling and estimation encompasses a potentially very large number of steps, and a change in any one could produce different results. Likewise, when fitting models or running simulations, parameter settings and function invocation sequences must be communicated, because the final results are the culmination of many decisions; without this information, a replicator must guess every small step so that it matches the original work &#8211; a Herculean task. Responding with raw data alone when questioned about computational results is merely a canard, not intended to seriously facilitate reproducibility.</p>
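<p>To make concrete what communicating &#8220;parameter settings and function invocation sequences&#8221; might look like, here is a minimal sketch in Python. It is not CRU&#8217;s code or any established tool: the <code>ProvenanceLog</code> class, the two preprocessing functions, and the sample numbers are all hypothetical, chosen only to show how each step, its parameters, and a digest of its output could be recorded and later replayed by someone else.</p>

```python
import hashlib
import json


def fingerprint(values):
    """Return a short SHA-256 digest of a list of numbers."""
    payload = json.dumps(values, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class ProvenanceLog:
    """Hypothetical helper that records every step applied to the data, in order."""

    def __init__(self):
        self.steps = []

    def run(self, func, data, **params):
        """Apply func to data and record its name, parameters, and output digest."""
        result = func(data, **params)
        self.steps.append({
            "function": func.__name__,
            "params": params,
            "output_digest": fingerprint(result),
        })
        return result

    def replay(self, registry, data):
        """Re-run the recorded steps and check each output against its digest."""
        for step in self.steps:
            data = registry[step["function"]](data, **step["params"])
            if fingerprint(data) != step["output_digest"]:
                raise ValueError("mismatch at step " + step["function"])
        return data


# Hypothetical preprocessing steps and made-up values, for illustration only.
def drop_missing(series, sentinel):
    return [x for x in series if x != sentinel]


def to_anomalies(series, baseline):
    return [round(x - baseline, 3) for x in series]


raw = [14.1, -999.0, 14.6, 15.0]
log = ProvenanceLog()
cleaned = log.run(drop_missing, raw, sentinel=-999.0)
anomalies = log.run(to_anomalies, cleaned, baseline=14.0)

# The log is the part that would be published alongside the data and code.
print(json.dumps(log.steps, indent=2))
```

<p>Publishing a log of this kind alongside the raw data and code would let a reader call <code>replay</code> with the same functions and learn at which step, if any, their results diverge from the original, rather than guessing at the sequence as Harry had to.</p>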
<p>The story of Penn State professor of meteorology <a href="http://www.meteo.psu.edu/~mann/Mann/">Michael Mann</a>&#8217;s famous hockey stick temperature time series estimates is an example where lack of verifiability had important consequences. In <del datetime="2009-12-29T02:39:44+00:00">February</del> 2005 <a href="http://www.greenworldtrust.org.uk/Science/Social/HS%20evidence.htm">two panels examined the integrity of his work and debunked the results</a>, largely from work done by <a href="http://www4.stat.ncsu.edu/~bloomfld/">Peter Bloomfield</a>, a statistics professor at North Carolina State University, and <a href="http://www.galaxy.gmu.edu/stats/faculty/wegman.html">Ed Wegman</a>, statistics professor at George Mason University. (See also <a href="http://www.ncpa.org/pub/ba478/">this site</a> for further explanation of statistical errors.) Release of the code and data used to generate the results in the hockey stick paper likely would have caught the errors earlier, avoided the convening of the panels to assess the papers, and prevented the widespread promulgation of incorrect science. The hockey stick is a dramatic illustration of global warming and became something of a logo for the U.N.&#8217;s <a href="http://www.ipcc.ch/">Intergovernmental Panel on Climate Change</a> (IPCC). Mann was an author of the 2001 IPCC Assessment report, and was a lead author on the <a href="http://www.copenhagendiagnosis.com/">&#8220;Copenhagen Diagnosis,&#8221;</a> a report released Nov 24 and intended to synthesize the hundreds of research papers about human-induced climate change that have been published since the last assessment by the IPCC two years ago. The report was prepared in advance of the Copenhagen climate summit scheduled for Dec 7-18.
Emails between CRU researchers and Mann <a>are included in the leak</a>, which happened right before the release of the Copenhagen Diagnosis (a quick <a href="http://www.eastangliaemails.com/search.php">search of the leaked emails</a> for &#8220;Mann&#8221; provided 489 matches).</p>
<p>These reports are important in part because of their impact on policy. As <a href="http://www.cbsnews.com/blogs/2009/11/24/taking_liberties/entry5761180.shtml?tag=contentMain;contentBody">CBS News</a> reports, &#8220;In global warming circles, the CRU wields outsize influence: it claims the world&#8217;s largest temperature data set, and its work and mathematical models were incorporated into the United Nations Intergovernmental Panel on Climate Change&#8217;s <a href="http://www.ipcc.ch/publications_and_data/publications_and_data_reports.htm#1">2007 report</a>. That report, in turn, is what the Environmental Protection Agency <a href="http://epa.gov/climatechange/endangerment/downloads/EPA-HQ-OAR-2009-0171-0001.pdf">acknowledged</a> it &#8220;relies on most heavily&#8221; when <a href="http://www.cbsnews.com/stories/2009/04/17/national/main4952104.shtml">concluding</a> that carbon dioxide emissions endanger public health and should be regulated.&#8221;</p>
<h5>Discussions of Appropriate Level of Code and Data Disclosure on RealClimate.org, Before and After the CRU Leak</h5>
<p>For years researchers had requested the data and programs used to produce Mann&#8217;s Hockey Stick result, and met resistance. The repeated requests for code and data culminated in Freedom of Information (FOI) requests, in particular those made by Willis Eschenbach, who tells <a href="http://omniclimate.wordpress.com/2009/11/24/willis-vs-the-cru-a-history-of-foi-evasion/">his story</a> of requests he made for underlying code and data up until the time of the leak. It appears that a file, FOI2009.zip, <a href="http://wattsupwiththat.com/2009/11/23/the-crutape-letters%C2%AE-an-alternate-explanation/#more-13003">was placed on CRU&#8217;s FTP server and then comments alerting people to its existence were posted on several key blogs</a>.</p>
<p>The thinking regarding disclosure of code and data in one part of the climate change community is illustrated in <a href="http://www.realclimate.org/index.php/archives/2009/02/antarctic-warming-is-robust/langswitch_lang/aw/">this fascinating discussion</a> on the blog <a href="http://www.realclimate.org/">RealClimate.org</a> in February. (Thank you to <a href="http://michaelnielsen.org/blog/">Michael Nielsen</a> for <a href="http://michaelnielsen.org/blog/biweekly-links-for-02092009/">the pointer</a>.) RealClimate.org has <a href="http://www.realclimate.org/index.php/archives/category/extras/contributor-bios/">5 primary authors, one of whom is Michael Mann</a>, and its primary author is Gavin Schmidt, who was described earlier this year as one of the &#8220;<a href="http://www.guardian.co.uk/commentisfree/cifamerica/2009/feb/06/antarctic-warming-climate-change">computer jockeys for Nasa&#8217;s James Hansen, the world&#8217;s loudest climate alarmist</a>.&#8221; In a RealClimate blog post from November 27, <a href="http://www.realclimate.org/index.php/archives/2009/11/wheres-the-data/">Where&#8217;s the Data</a>, the position now seems to be very much in favor of data release, but the first comment asks for the steps taken in reconstructing the results as well. This is right &#8211; reproducibility of results should be the concern, but it does not yet appear to be taken seriously (as also argued <a href="http://nextbigfuture.com/2009/11/open-and-transparent-data-needed-for.html">here</a>).</p>
<h5>Policy and Public Relations</h5>
<p><a href="http://thehill.com/">The Hill</a>&#8217;s <a href="http://thehill.com/blogs/blog-briefing-room">Blog Briefing Room</a> reported that Senator Inhofe (R-Okla.) will investigate <a href="http://thehill.com/blogs/blog-briefing-room/news/69141-inhofe-to-call-for-hearing-into-cru-un-climate-change-research">whether the IPCC &#8220;cooked the science to make this thing look as if the science was settled, when all the time of course we knew it was not.&#8221;</a> With the current emphasis on evidence-based policy making, Inhofe&#8217;s review should recommend code and data release and require reliance on verified scientific results in policy making. The <a href="http://thomas.loc.gov/cgi-bin/query/z?c111:S.1373:">Federal Research Public Access Act</a> should be modified to include reproducibility in publicly funded research.</p>
<p>A dangerous ramification of the leak could be an undermining of public confidence in science and the conduct of scientists. My sense is that had this climate modeling community made its code and data readily available in a way that facilitated reproducibility of results, not only would they have avoided this embarrassment but the discourse would have been about scientific methods and results rather than <a href="http://pajamasmedia.com/blog/three-things-you-absolutely-must-know-about-climategate/">potential evasions of FOIA requests</a>, whether or not data were fudged, or whether <a href="http://www.usatoday.com/tech/news/2009-11-30-warming30_ST_N.htm">scientists acted improperly in squelching dissent</a> or <a href="http://pajamasmedia.com/blog/climategate-violating-the-social-contract-of-science/">manipulating journal editorial boards</a>. Perhaps data release is becoming an accepted norm, but code release for reproducibility must follow. The issue here is verification and reproducibility, without which it is all but impossible to tell whether the core science done at CRU was correct or not, even for peer reviewing scientists.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stodden.net/2009/11/30/the-climate-modeling-leak-code-and-data-generating-published-results-must-be-open-and-facilitate-reproducibility/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Software and Intellectual Lock-in in Science</title>
		<link>http://blog.stodden.net/2009/11/14/software-and-intellectual-lock-in-in-science/</link>
		<comments>http://blog.stodden.net/2009/11/14/software-and-intellectual-lock-in-in-science/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 18:34:00 +0000</pubDate>
		<dc:creator>vcs</dc:creator>
				<category><![CDATA[Open Science]]></category>
		<category><![CDATA[Reproducible Research]]></category>
		<category><![CDATA[Scientific Method]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.stodden.net/?p=108</guid>
		<description><![CDATA[In a recent discussion with a friend, a hypothesis occurred to me: that increased levels of computation in scientific research could cause greater intellectual lock-in to particular ideas. Examining how ideas change in scientific thinking isn&#8217;t new. Thomas Kuhn for example caused a revolution himself in how scientific progress is understood with his 1962 book [...]]]></description>
			<content:encoded><![CDATA[<p>In a recent discussion with a friend, a hypothesis occurred to me: that increased levels of computation in scientific research could cause greater intellectual lock-in to particular ideas.</p>
<p>Examining how ideas change in scientific thinking isn&#8217;t new. Thomas Kuhn, for example, caused a revolution himself in how scientific progress is understood with his 1962 book <a href="http://www.amazon.com/Structure-Scientific-Revolutions-Thomas-Kuhn/dp/0226458083/ref=sr_1_3?ie=UTF8&amp;s=books&amp;qid=1258052947&amp;sr=8-3">The Structure of Scientific Revolutions</a>. The notion of technological lock-in isn&#8217;t new either; see, for example, Paul David&#8217;s examination of how we ended up with the non-optimal QWERTY keyboard (<a href="http://www.utdallas.edu/~liebowit/knowledge_goods/david1985aer.htm">&#8220;Clio and the Economics of QWERTY,&#8221; AER, 75(2), 1985</a>) or Brian Arthur&#8217;s &#8220;Competing Technologies and Lock-in by Historical Events: The Dynamics of Allocation Under Increasing Returns&#8221; (Economic Journal, 99, 1989).</p>
<p>Computer-based methods are relatively new to scientific research, and are reaching even the most seemingly uncomputational edges of the humanities, like English literature and archaeology. Did Shakespeare really write all the plays attributed to him? Let&#8217;s see if <a href="http://wordhoard.northwestern.edu/userman/scripting-example.html">word distributions by play are significantly different</a>. Or can we <a href="http://www.pbs.org/wgbh/nova/ubar/tools/">use signal processing to &#8220;see&#8221; artifacts without unearthing them</a>, thereby <a href="http://www.newscientist.com/article/dn4430-chemists-stop-flaky-fate-of-terracotta-warriors.html">preserving artifact features</a>?</p>
<p>Software has the property of encapsulating ideas and methods for scientific problem solving. Software also has a second property, brittleness: it breaks before it bends. Computing hardware has grown steadily in capability, speed, reliability, and capacity, but as Jaron Lanier describes in <a href="http://www.edge.org/3rd_culture/lanier03/lanier_index.html">his essay on The Edge</a>, trends in software are &#8220;a macabre parody of Moore&#8217;s Law&#8221; and the &#8220;moment programs grow beyond smallness, their brittleness becomes the most prominent feature, and software engineering becomes Sisyphean.&#8221; My concern is that as ideas become increasingly manifest as code, with all the scientific advancement that can imply, it becomes more difficult to adapt, modify, and change the underlying scientific approaches. We become, as scientists, more locked into particular methods for solving scientific questions and particular ways of thinking.</p>
<p>For example, what happens when an approach to solving a problem is encoded in software and becomes a standard tool? Many such tools exist, and are vital to research &#8211; just look at the list at <a href="http://salilab.org/">Andrej Sali&#8217;s highly regarded lab</a> at UCSF, or the <a href="http://cran.r-project.org/web/packages/">statistical packages in the widely used language R</a>, for example. <a href="http://www.in-cites.com/scientists/DrDavidDonoho.html">David Donoho laments the now widespread use of test cases</a> he released online to illustrate his methods for particular types of data, &#8220;I have seen numerous papers and conference presentations referring to &#8220;Blocks,&#8221; &#8220;Bumps,&#8221; &#8220;HeaviSine,&#8221; and &#8220;Doppler&#8221; as standards of a sort (this is a practice I object to but am powerless to stop; I wish people would develop new test cases which are more appropriate to illustrate the methodology they are developing).&#8221; Code and ideas should be reused and built upon, but at what point does the cost of recoding outweigh the scientific cost of not improving the method? In fact, perhaps counterintuitively, it&#8217;s hardware that is routinely upgraded and replaced, not the seemingly ephemeral software.</p>
<p>In his essay Lanier argues that the brittle state of software today results from metaphors used by the first computer scientists &#8211; electronic communications devices that sent signals on a wire. It&#8217;s an example of intellectual lock-in itself that&#8217;s become hardened in how we encode ideas as machine instructions now.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.stodden.net/2009/11/14/software-and-intellectual-lock-in-in-science/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
