Archive for the 'Technology' Category

Regulatory steps toward open science and reproducibility: we need a science cloud

This past January Obama signed the America COMPETES Re-authorization Act. It contains two interesting sections that advance the notions of open data and the federal role in supporting online access to scientific archives: 103 and 104, which read in part:

• § 103: “The Director [of the Office of Science and Technology Policy at the White House] shall establish a working group under the National Science and Technology Council with the responsibility to coordinate Federal science agency research and policies related to the dissemination and long-term stewardship of the results of unclassified research, including digital data and peer-reviewed scholarly publications, supported wholly, or in part, by funding from the Federal science agencies.” (emphasis added)

This is a cause for celebration insofar as Congress has recognized that published articles are an incomplete communication of computational scientific knowledge, and the data (and code) must be included as well.

• § 104: Federal Scientific Collections: The Office of Science and Technology Policy “shall develop policies for the management and use of Federal scientific collections to improve the quality, organization, access, including online access, and long-term preservation of such collections for the benefit of the scientific enterprise.” (emphasis added)

I was very happy to see the importance of online access recognized, and hopefully this will include the data and code that underlies published computational results.

One step further in each of these directions: mention code explicitly, and create a federally funded cloud not only for data but one linked to code and computational results, to enable reproducibility.

Open peer review of science: a possibility

The Nature journal Molecular Systems Biology published an editorial “From Bench to Website” explaining their move to a transparent system of peer review. Anonymous referee reports, editorial decisions, and author responses are published alongside the final published paper. When this exchange is published, care is taken to preserve anonymity of reviewers and to not disclose any unpublished results. Authors also have the ability to opt out and request their review information not be published at all.

Here’s an example of the commentary that is being published alongside the final journal article.

Their move follows on a similar decision taken by The EMBO Journal (European Molecular Biology Organization) as described in an editorial here where they state that the “transparent editorial process will make the process that led to acceptance of a paper accessible to all, as well as any discussion of merits and issues with the paper.” Their reasoning cites problems in the process of scientific communication and they give an example by Martin Raff which was published as a letter to the editor called “Painful Publishing” (behind a paywall, apologies). Raff laments the power of the anonymous reviewers to demand often unwarranted additional experimentation as a condition of publication: “authors are so keen to publish in these select journals that they are willing to carry out extra, time consuming experiments suggested by referees, even when the results could strengthen the conclusions only marginally. All too often, young scientists spend many months doing such ‘referees’ experiments.’ Their time and effort would frequently be better spent trying to move their project forward rather than sideways. There is also an inherent danger in doing experiments to obtain results that a referee demands to see.”

Rick Trebino, physics professor at Georgia Tech, penned a note detailing the often incredible steps he went through in trying to publish a scientific comment: “How to Publish a Scientific Comment in 1 2 3 Easy Steps.” It describes deep problems in our scientific discourse today. The recent clinical trials scandal at Duke University is another example of failed scientific communication. Many attempts were made to publish correspondence pointing out errors in the papers; had they succeeded, problems in the research might have been addressed earlier.

The editorial in Molecular Systems Biology also announces that the journal is joining many others in adopting a policy of encouraging the upload of the data that underlies results in the paper to be published alongside the final article. They go one step further and provide links from the figure in the paper to its underlying data. They give an example of such linked figures here. My question is how this dovetails with recent efforts by Donoho and Gavish to create a system of universal figure-level identifiers for published results, and the work of Altman and King to design Universal Numerical Fingerprints (UNFs) for data citation.
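The intuition behind such data fingerprints can be sketched in a few lines. This is not the actual UNF algorithm (which has a precise normalization and encoding specification), just a toy illustration of the idea: round values to a fixed number of significant digits, canonicalize the text form, and hash, so that two datasets that are identical up to storage format or sub-precision noise receive the same citation string.

```python
import base64
import hashlib

def toy_fingerprint(values, digits=7):
    """Toy data fingerprint in the spirit of a UNF: round each number to
    a fixed number of significant digits, canonicalize the text form,
    and hash the result. NOT the real UNF algorithm."""
    canonical = "\n".join("%.*e" % (digits - 1, v) for v in values)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest)[:22].decode("ascii")

# Two datasets that agree to 7 significant digits get the same fingerprint:
a = toy_fingerprint([1.0, 2.5, 3.14159265])
b = toy_fingerprint([1.00000001, 2.5, 3.14159265])
print(a == b)  # True: identical within the rounding precision
```

A figure-level identifier could then cite the fingerprint of the data behind each figure, making it checkable that a reproduction used the same inputs.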

Science and Video: a roadmap

Once again I find myself in the position of having collected slides from talks, and having audio from the sessions. I need a simple way to pin these together so they form a coherent narrative, and I need a common sharing platform. We don’t really have to see the speaker to understand the message, but we need the audio to play in tandem with the slides, with the slides changing at the correct points. Some of the files are quite large: slide decks can be over 100MB, and right now the audio file I have is 139MB (SlideShare has size limits that don’t accommodate this).
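As a sketch of what such a pinning tool might do (all file names and timings here are hypothetical), one could note the slide-change times while listening to the audio, then generate an ffmpeg concat file that holds each slide image for the right duration; ffmpeg can then mux the slides against the audio track.

```python
def write_concat_file(slides, total_seconds, path="slides.txt"):
    """Write an ffmpeg concat-demuxer file from (image, start_second)
    pairs, deriving each slide's duration from the next change point."""
    lines = []
    next_starts = [start for _, start in slides[1:]] + [total_seconds]
    for (image, start), next_start in zip(slides, next_starts):
        lines.append("file '%s'" % image)
        lines.append("duration %d" % (next_start - start))
    text = "\n".join(lines) + "\n"
    with open(path, "w") as f:
        f.write(text)
    return text

# Hypothetical slide-change times noted while listening to the audio:
deck = [("slide01.png", 0), ("slide02.png", 45), ("slide03.png", 130)]
write_concat_file(deck, total_seconds=300)
# Then, roughly:
#   ffmpeg -f concat -safe 0 -i slides.txt -i talk.mp3 \
#          -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest talk.mp4
```

The resulting video would be shareable on any ordinary video platform, sidestepping the slide-hosting size limits.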

I’m writing because I feel these messages are important and need to be available to a wider audience. This is our culture, our heritage, our technology, our scientific knowledge, and our shared understanding. These presentations should be available not just on principled open access grounds; it is imperative that other scientists hear these messages as well, amplifying scientific communication.

At a bar the other night a friend and I came up with the idea of S-SPAN: a C-SPAN for science. Talks and conferences could be filmed and shared widely on an internet platform. Of course such platforms exist, and some even target scientific talks, but the content also needs to be marshalled and directed onto the website. Some of the best material I’ve ever seen has floated into the ether.

So, I make an open call for these two tasks: a simple tool to pin together slides and audio (and slides and video), and an effort to collate video from scientific conference talks (filming them where none exists), all onto a common distribution platform. S-SPAN could start as raw and underproduced as C-SPAN, but I am sure it would develop from there.

I’m looking at you, YouTube.

My Symposium at the AAAS Annual Meeting: The Digitization of Science

Yesterday I held a symposium at the AAAS Annual Meeting in Washington DC, called “The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer,” that was intended to bring attention to how massive computation is changing the practice of science, particularly the lack of reproducibility of published computational scientific results. The fact is, most computational scientific results published today are unverified and unverifiable. I’ve created a page for the event here, with links to slide decks and abstracts. I couldn’t have asked for a better symposium, thanks to the wonderful speakers.

The first speaker was Keith A. Baggerly, who (now famously) tried to verify published results in Nature Medicine and uncovered a series of errors that led to the termination of clinical trials at Duke that were based on the original findings, and the resignation of one of the investigators (his slides). I then spoke about policies for realigning the IP framework scientists operate under with their longstanding norms, to permit sharing of code and data (my slides). Fernando Perez described how computational scientists can learn not only about code sharing, quality control, and project management from the open source software community, but how that community has in fact developed what is in effect a deeply successful system of peer review for code. Code is verified line by line before being incorporated into the project, and there are software tools to enable communication between reviewer and submitter, down to the line of code (his slides).

Michael Reich then presented GenePattern, an OS-independent tool developed with Microsoft for creating data analysis pipelines and incorporating them into a Word document. Once in the document, tools exist to click and recreate the figure from the pipeline and examine what’s been done to the data. Robert Gentleman advocated the entire research paper as the unit of reproducibility, and David Donoho presented a method for assigning a unique identifier to figures within a paper, which creates a link for each figure and permits its independent reproduction (the slides). The final speaker was Mark Liberman, who showed how the human language technology community has developed a system of open data and code in its efforts to reduce errors in machine understanding of language (his slides). All the talks pushed on delineations of science from non-science, and the theme was probably best encapsulated by a quote Mark introduced from John Pierce, a Bell Labs executive in 1969: “To sell suckers, one uses deceit and offers glamor.”

There was some informal feedback, with a prominent person saying that this session was “one of the most amazing set of presentations I have attended in recent memory.” Have a look at all the slides and abstracts, including links and extended abstracts.

Update: Here are some other blog posts on the symposium: Mark Liberman’s blog and Fernando Perez’s blog.

Letter Re Software and Scientific Publications – Nature

Mark Gerstein and I penned a reaction to two pieces published in Nature News last October, “Publish your computer code: it is good enough,” by Nick Barnes and “Computational Science…. Error” by Zeeya Merali. Nature declined to publish our note and so here it is.

Dear Editor,

We have read with great interest the recent pieces in Nature about the importance of computer codes associated with scientific manuscripts. As participants in the Yale roundtable mentioned in one of the pieces, we agree that these codes must be constructed robustly and distributed widely. However, we disagree with an implicit assertion, that the computer codes are a component separate from the actual publication of scientific findings, often neglected in preference to the manuscript text in the race to publish. More and more, the key research results in papers are not fully contained within the small amount of manuscript text allotted to them. That is, the crucial aspects of many Nature papers are often sophisticated computer codes, and these cannot be separated from the prose narrative communicating the results of computational science. If the computer code associated with a manuscript were laid out according to accepted software standards, made openly available, and looked over as thoroughly by the journal as the text in the figure legends, many of the issues alluded to in the two pieces would simply disappear overnight.

The approach taken by the journal Biostatistics serves as an exemplar: code and data are submitted to a designated “reproducibility editor” who tries to replicate the results. If he or she succeeds, the first page of the article is kitemarked “R” (for reproducible) and the code and data made available as part of the publication. We propose that high-quality journals such as Nature not only have editors and reviewers that focus on the prose of a manuscript but also “computational editors” that look over computer codes and verify results. Moreover, many of the points made here in relation to computer codes apply equally well to large datasets that underlie experimental manuscripts. These are often organized, formatted, and deposited into databases as an afterthought. Thus, one could also imagine a “data editor” who would look after these aspects of a manuscript. All in all, we have to come to the realization that current scientific papers are more complicated than just a few thousand words of narrative text and a couple of figures, and we need to update journals to handle this reality.

Yours sincerely,

Mark Gerstein (1,2,3)
Victoria Stodden (4)

(1) Program in Computational Biology and Bioinformatics,
(2) Department of Molecular Biophysics and Biochemistry, and
(3) Department of Computer Science,
Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520

(4) Department of Statistics, Columbia University, 1255 Amsterdam Ave, New York, NY 10027

Startups Awash in Data: Quantitative Thinkers Needed

We know Unix logs everything, which makes web-based data collection easy; in fact, it is almost difficult not to do. As a result internet startups often find themselves gathering enormous amounts of data: site use patterns, click-streams, user demographics and preference functions, purchase histories… Many of these companies know they are sitting on a goldmine, but how do they extract the relevant information from these scads of data? More precisely, how do they predict user behavior and preferences better?

Statisticians, particularly through machine learning, have been working on this problem for a long time. Since I arrived in New York City from Silicon Valley I’ve observed an enormous amount of quantitative talent here, at least in part due to the influence of the finance industry. But these quantitative skills are precisely what’s needed to make sense of the data collected by startups, and here it looks like NYC has an edge over Silicon Valley. Friends Evan Korth, Hilary Mason, and Chris Wiggins (two professors and a former professor) are building bridges to connect these two worlds. Their primary effort, HackNY, is a summer program pairing quantitatively talented students with startups in need of them. (Wiggins’ mantra is to “get the kids off the street” by giving them alternatives to entering the finance profession.)

The New York startup scene is distinguishing itself from Silicon Valley by efforts to make direct use of the abundance of quantitative skills available here. Hilary and Chris created an excellent guideline for data-driven analysis in the startup context, “A Taxonomy of Data Science:” Obtain, Scrub, Explore, Model, and iNterpret. These data are often measuring phenomena in new ways, using novel data structures, and providing new opportunities for innovative data research and model building. Lots of data, lots of skill – great for statisticians and folks with an interest in learning from data, as well as for those collecting the data.
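The Obtain-Scrub-Explore-Model-iNterpret loop can be sketched end to end in a few lines. The clickstream records and numbers below are entirely invented, and the "model" is just a least-squares slope; the point is the shape of the workflow, not any particular analysis.

```python
# Obtain: hypothetical clickstream records (all names and numbers invented).
raw = [
    {"user": "u1", "visits": 3, "purchases": 1},
    {"user": "u2", "visits": 10, "purchases": 4},
    {"user": "u3", "visits": None, "purchases": 0},  # bad record
    {"user": "u4", "visits": 7, "purchases": 2},
]

# Scrub: drop records with missing fields.
clean = [r for r in raw if r["visits"] is not None]

# Explore: a simple summary statistic.
mean_visits = sum(r["visits"] for r in clean) / len(clean)

# Model: least-squares slope of purchases on visits.
xs = [r["visits"] for r in clean]
ys = [r["purchases"] for r in clean]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)

# iNterpret: roughly, extra purchases per additional visit.
print(round(mean_visits, 2), round(slope, 3))
```

In practice each stage is its own project (scraping, cleaning pipelines, visualization, real statistical models), but the taxonomy gives the work a common vocabulary.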

Open Data Dead on Arrival

In 1984 Karl Popper wrote a private letter to an inquirer he didn’t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that:

“Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his fellow men (or ‘to society’) to represent the results of his study as simply, clearly and modestly as he can. The worst thing that intellectuals can do — the cardinal sin — is to try to set themselves up as great prophets vis-a-vis their fellow men and to impress them with puzzling philosophies. Anyone who cannot speak simply and clearly should say nothing and continue to work until he can do so.”

Aside from the offensive sexism in referring to intellectuals as males, there is another way this imperative should be updated for intellectualism today. The movement to make data available online is picking up momentum — as it should — and open code is following suit (see for example). But data should not be confused with facts, and applying the simple communication Popper refers to beyond the written or spoken word is the only way open data will produce dividends. It isn’t enough to post raw data or undocumented code. Data and code should be considered part of intellectual communication, and made as simple as possible for “fellow men” to understand. Just as knowledge of adequate English vocabulary is assumed in the nonquantitative communication Popper refers to, certain basic coding and data knowledge can be assumed as well. This means the same thing as it does in the literary case: the elimination of extraneous information and obfuscating terminology. There is no need to bury interested parties in an Enron-like shower of bits. It also means using a format for digital communication that is conducive to reuse, such as a flat text file or another non-proprietary format; PDF files, for example, cannot be considered acceptable for either data or code. Facilitating reproducibility must be the gold standard for data and code release.

And who are these “fellow men”?

Well, fellow men and women that is, but back to the issue. Much of the history of scientific communication has dealt with the question of demarcating the appropriate group to whom the reasoning behind findings would be communicated: the definition of the scientific community. Clearly, communicating very technical and specialized results to a layman would take intellectuals’ time away from doing what they do best, being intellectual. On the other hand, some investment in explanation is essential for establishing a finding as an accepted fact, assuring others that sufficient error has been controlled for and eliminated in the process of scientific discovery. These others ought to be able to verify results, find mistakes, and hopefully build on the results (or the gaps in the theory) and thereby further our understanding. So there is a tradeoff. Hence the establishment of the Royal Society, for example, as a body with the primary purpose of discussing scientific experiments and results. Couple this with Newton’s surprise, even irritation, at having to explain the results he put forth to the Society in his one and only journal publication, in their journal Philosophical Transactions: he called the various clarifications tedious, sought to withdraw from the Royal Society, and never published another journal paper. (See the last chapter of The Access Principle.) There is a mini-revolution underfoot that has escaped the spotlight of attention on open data, open code, and open scientific literature: the fact that the intent is to open to the public. Not open to peers, or appropriately vetted scientists, or selected ivory tower mates, but to anyone. Never before has the standard for communication been “everyone”; in fact, quite the opposite. Efforts had traditionally been expended narrowing and selecting the community privileged enough to participate in scientific discourse.

So what does public openness mean for science?

Recall the leaked files from the University of East Anglia’s Climatic Research Unit last November. Much of the information revealed concerned scientifically suspect (and ethically dubious) attempts not to reveal data and methods underlying published results. Although that tack seems to have softened now, some initial responses defended the climate scientists’ right to be closed with regard to their methods due to the possibility of “denial of service attacks”: the ripping apart of methodology (recall all science is wrong, an asymptotic progression toward the truth at best) not with the intent of finding meaningful errors that halt the acceptance of findings as facts, but merely to tie up the climate scientists so they cannot attend to real research. This is the same tradeoff as described above. An interpretation of this situation cannot be made without the complicating realization that peer review — the review process that vets articles for publication — doesn’t check computational results but largely operates as if papers were expounding results from the pre-computational scientific age. The outcome, if computational methodologies are able to remain closed from view, is that they are directly vetted nowhere. Hardly an acceptable basis for establishing facts. My own view is that data and code must be communicated publicly, with attention paid to Popper’s admonition: as simply and clearly as possible, such that the results can be replicated. Declining to participate in dialog with those insufficiently knowledgeable to engage will remain part of our scientific norms; in fact, this is enshrined in the structure of our scientific societies of old. Others can take up those ends of the discussion, on blogs and in digital forums.
But public openness is important not just because taxpayers have a right to what they paid for (perhaps they do, but this rationale quickly falls apart, since not all of the public are technically taxpayers, and that seems a wholly unjust way of deciding who shall have access to scientific knowledge and who shall not; clearly we mean society), but because of the increasing inclusiveness of the scientific endeavor. How do we determine who is qualified to find errors in our scientific work? We don’t. Real problems will get noticed regardless of with whom they originate, many eyes making all bugs shallow. And I expect peer review for journal publishing to incorporate computational evaluation as well.

Where does this leave all the open data?

Unused, unless efforts are expended to communicate the meaning of the data and to maximize the usability of the code. Data is not synonymous with facts: methods for understanding data, and for turning its contents into facts, are embedded within the documentation and code. Take for granted that users understand the coding language and basic scientific computing functions, but clearly and modestly explain the novel contributions. Facilitate reproducibility. Without this, data may be open but will remain de facto in the ivory tower.

Ars technica article on reproducibility in science

John Timmer wrote an excellent article called “Keeping computers from ending science’s reproducibility.” I’m quoted in it. Here’s an excellent follow up blog post by Grant Jacobs, “Reproducible Research and computational biology.”

My answer to the Edge Annual Question 2010: How is the Internet Changing the Way You Think?

At the end of every year editors at my favorite website The Edge ask intellectuals to answer a thought-provoking question. This year it was “How is the internet changing the way you think?” My answer is posted here.

Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the second wave of the OSTP’s call, as posted here. The first wave (comments posted here, and on the OSTP site here; scroll to the second-to-last comment) asked for feedback on implementation issues. The second wave requests input on Features and Technology, and Chris Wiggins and I posted the following comments:

We address each of the questions for phase two of OSTP’s forum on public access in turn. The answers generally depend on the community involved and (particularly for question 7, asking for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial, however, in (i) providing a centralized repository for access to agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).

Continue reading ‘Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

The Climate Modeling Leak: Code and Data Generating Published Results Must be Open and Facilitate Reproducibility

On November 20, documents including email and code spanning more than a decade were leaked from the Climatic Research Unit (CRU) at the University of East Anglia in the UK.

The Leak Reveals a Failure of Reproducibility of Computational Results

It appears as though the leak came about through a long battle to get the CRU scientists to reveal the code and data associated with published results, and highlights a crack in the scientific method as practiced in computational science. Publishing standards have not yet adapted to the relatively new computational methods used pervasively across scientific research today.

Other branches of science have long-established methods to bring reproducibility into their practice. Deductive or mathematical results are published only with proofs, and there are long established standards for an acceptable proof. Empirical science contains clear mechanisms for communication of methods with the goal of facilitation of replication. Computational methods are a relatively new addition to a scientist’s toolkit, and the scientific community is only just establishing similar standards for verification and reproducibility in this new context. Peer review and journal publishing have generally not yet adapted to the use of computational methods and still operate as suitable for the deductive or empirical branches, creating a growing credibility gap in computational science.

The key point emerging from the leak of the CRU docs is that without the code and data it is all but impossible to tell whether the research is right or wrong, and this community’s lack of awareness of reproducibility and blustery demeanor does not inspire confidence in their production of reliable knowledge. This leak and the ensuing embarrassment would not have happened if code and data that permit reproducibility had been released alongside the published results. When mature, computational science will produce routinely verifiable results.

Verifying Computational Results without Clear Communication of the Steps Taken is Near-Impossible

The frequent near-impossibility of verifying computational results when reproducibility is not considered a research goal is shown by the miserable travails of “Harry,” a CRU employee with access to their system who was trying to reproduce the temperature results. The leaked documents contain logs of his unsuccessful attempts. It seems reasonable to conclude that CRU’s published results aren’t reproducible if Harry, an insider, was unable to reproduce them after four years of trying.

This example also illustrates why leaving reproducibility to others, beyond a cursory description of methods in the published text, is wholly inadequate for computational science. Harry seems to have had access to the data and code used, and he still couldn’t replicate the results. The merging and preprocessing of data in preparation for modeling and estimation can encompass a very large number of steps, and a change in any one could produce different results. The same holds when fitting models or running simulations: parameter settings and function invocation sequences must be communicated, because the final results are the culmination of many decisions, and without this information each small step must be made to match the original work – a Herculean task. Responding with raw data when questioned about computational results is merely a canard, not intended to seriously facilitate reproducibility.
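The bookkeeping being argued for here is mundane. A sketch of the idea (not any particular group's pipeline; the parameter names and input below are invented) is a small manifest recorded alongside each processing step, capturing the exact parameter settings, a hash of the input data, and the software version, so that someone in Harry's position can check each step against the original run rather than guess at it.

```python
import hashlib
import json
import sys

def run_manifest(params, input_bytes):
    """Record what is needed to re-run one analysis step: the exact
    parameter settings, a hash of the input data, and the interpreter
    version. A sketch of reproducibility bookkeeping, nothing more."""
    return json.dumps(
        {
            "params": params,
            "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
            "python": sys.version.split()[0],
        },
        sort_keys=True,
        indent=2,
    )

# Hypothetical preprocessing step: grid resolution, smoothing window, seed.
manifest = run_manifest(
    {"grid_deg": 5.0, "smoothing_years": 21, "seed": 42},
    input_bytes=b"station_id,year,temp\n...",
)
print(manifest)
```

Published alongside the code and data, a chain of such manifests turns "match the original work" from a Herculean guessing game into a checklist.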

The story of Penn State professor of meteorology Michael Mann’s famous hockey stick temperature time series estimates is an example where lack of verifiability had important consequences. In February 2005 two panels examined the integrity of his work and debunked the results, largely from work done by Peter Bloomfield, a statistics professor at North Carolina State University, and Ed Wegman, statistics professor at George Mason University. (See also this site for further explanation of statistical errors.) Release of the code and data used to generate the results in the hockey stick paper likely would have caught the errors earlier, avoided the convening of the panels to assess the papers, and prevented the widespread promulgation of incorrect science. The hockey stick is a dramatic illustration of global warming and became something of a logo for the U.N.’s Intergovernmental Panel on Climate Change (IPCC). Mann was an author of the 2001 IPCC Assessment report, and was a lead author on the “Copenhagen Diagnosis,” a report released Nov 24 and intended to synthesize the hundreds of research papers about human-induced climate change that have been published since the last assessment by the IPCC two years ago. The report was prepared in advance of the Copenhagen climate summit scheduled for Dec 7-18. Emails between CRU researchers and Mann are included in the leak, which happened right before the release of the Copenhagen Diagnosis (a quick search of the leaked emails for “Mann” provided 489 matches).

These reports are important in part because of their impact on policy. As CBS News reports, “In global warming circles, the CRU wields outsize influence: it claims the world’s largest temperature data set, and its work and mathematical models were incorporated into the United Nations Intergovernmental Panel on Climate Change’s 2007 report. That report, in turn, is what the Environmental Protection Agency acknowledged it ‘relies on most heavily’ when concluding that carbon dioxide emissions endanger public health and should be regulated.”

Discussions of Appropriate Level of Code and Data Disclosure on RealClimate, Before and After the CRU Leak

For years researchers had requested the data and programs used to produce Mann’s Hockey Stick result, and were rebuffed. The repeated requests for code and data culminated in Freedom of Information (FOI) requests, in particular those made by Willis Eschenbach, who tells the story of the requests he made for underlying code and data up until the time of the leak. It appears that a file was placed on CRU’s FTP server, and comments alerting people to its existence were then posted on several key blogs.

The thinking regarding disclosure of code and data in one part of the climate change community is illustrated in this fascinating discussion on the RealClimate blog in February. (Thank you to Michael Nielsen for the pointer.) RealClimate has 5 primary authors, one of whom is Michael Mann, and its primary author is Gavin Schmidt, who was described earlier this year as one of the “computer jockeys for Nasa’s James Hansen, the world’s loudest climate alarmist.” In this RealClimate blog post from November 27, Where’s the Data, the position now seems to be very much in favor of data release, but the first comment asks for the steps taken in reconstructing the results as well. This is right – reproducibility of results should be the concern, but it does not yet appear to be taken seriously (as also argued here).

Policy and Public Relations

The Hill‘s Blog Briefing Room reported that Senator Inhofe (R-Okla.) will investigate whether the IPCC “cooked the science to make this thing look as if the science was settled, when all the time of course we knew it was not.” With the current emphasis on evidence-based policy making, Inhofe’s review should recommend code and data release and require reliance on verified scientific results in policy making. The Federal Research Public Access Act should be modified to include reproducibility in publicly funded research.

A dangerous ramification from the leak could be an undermining of public confidence in science and the conduct of scientists. My sense is that had this climate modeling community made its code and data readily available in a way that facilitated reproducibility of results, not only would they have avoided this embarrassment but the discourse would have been about scientific methods and results rather than potential evasions of FOIA requests, whether or not data were fudged, or scientists acted improperly in squelching dissent or manipulating journal editorial boards. Perhaps data release is becoming an accepted norm, but code release for reproducibility must follow. The issue here is verification and reproducibility, without which it is all but impossible to tell whether the core science done at CRU was correct or not, even for peer reviewing scientists.

Software and Intellectual Lock-in in Science

In a recent discussion with a friend, a hypothesis occurred to me: that increased levels of computation in scientific research could cause greater intellectual lock-in to particular ideas.

Examining how ideas change in scientific thinking isn’t new. Thomas Kuhn for example caused a revolution himself in how scientific progress is understood with his 1962 book The Structure of Scientific Revolutions. The notion of technological lock-in isn’t new either, see for example Paul David’s examination of how we ended up with the non-optimal QWERTY keyboard (“Clio and the Economics of QWERTY,” AER, 75(2), 1985) or Brian Arthur’s “Competing Technologies and Lock-in by Historical Events: The Dynamics of Allocation Under Increasing Returns” (Economic Journal, 99, 1989).

Computer-based methods are relatively new to scientific research, and are reaching even the most seemingly uncomputational edges of the humanities, like English literature and archaeology. Did Shakespeare really write all the plays attributed to him? Let’s see if word distributions by play are significantly different. Or can we use signal processing to “see” artifacts without unearthing them, thereby preserving their features?
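
As a toy illustration of the word-distribution idea (the function and the snippets of text below are invented for this example; a real stylometric analysis would use full plays and a proper significance test), one could compare the relative word frequencies of two texts with a chi-squared-style distance:

```python
from collections import Counter

def chi2_distance(text_a, text_b, top_n=20):
    """Chi-squared-style distance between two texts' relative
    word-frequency distributions, over the top_n most common words
    in the combined corpus. Larger values suggest the texts draw
    on different word distributions."""
    words_a, words_b = text_a.lower().split(), text_b.lower().split()
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # vocab words come from the combined counts, so fa + fb > 0 below
    vocab = [w for w, _ in (counts_a + counts_b).most_common(top_n)]
    dist = 0.0
    for w in vocab:
        fa = counts_a[w] / len(words_a)  # relative frequency in text A
        fb = counts_b[w] / len(words_b)  # relative frequency in text B
        dist += (fa - fb) ** 2 / (fa + fb)
    return dist

play_a = "to be or not to be that is the question"
play_b = "to be or not to be that is the question"
play_c = "all the world is a stage and all the men and women merely players"

same = chi2_distance(play_a, play_b)  # identical distributions
diff = chi2_distance(play_a, play_c)  # different distributions
```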

Software has the property of encapsulating ideas and methods for scientific problem solving. Software also has a second property: brittleness – it breaks before it bends. Computing hardware has grown steadily in capability, speed, reliability, and capacity, but as Jaron Lanier describes in his essay on The Edge, trends in software are “a macabre parody of Moore’s Law” and the “moment programs grow beyond smallness, their brittleness becomes the most prominent feature, and software engineering becomes Sisyphean.” My concern is that as ideas become increasingly manifest as code, with all the scientific advancement that can imply, it becomes more difficult to adapt, modify, and change the underlying scientific approaches. We become, as scientists, more locked into particular methods for solving scientific questions and particular ways of thinking.

For example, what happens when an approach to solving a problem is encoded in software and becomes a standard tool? Many such tools exist, and are vital to research – just look at the list at Andrej Sali’s highly regarded lab at UCSF, or the statistical packages in the widely used language R, for example. David Donoho laments the now widespread use of test cases he released online to illustrate his methods for particular types of data: “I have seen numerous papers and conference presentations referring to ‘Blocks,’ ‘Bumps,’ ‘HeaviSine,’ and ‘Doppler’ as standards of a sort (this is a practice I object to but am powerless to stop; I wish people would develop new test cases which are more appropriate to illustrate the methodology they are developing).” Code and ideas should be reused and built upon, but at what point does the cost of recoding outweigh the scientific cost of not improving the method? In fact, perhaps counterintuitively, it’s hardware that is routinely upgraded and replaced, not the seemingly ephemeral software.
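
Those test signals are simple closed-form functions. If I remember the definition correctly, the “Doppler” signal is sqrt(t(1-t))·sin(2π(1+ε)/(t+ε)) with ε = 0.05; here is a hypothetical reimplementation (my own sketch, not WaveLab’s code) that samples it on a grid:

```python
import math

def doppler(n=1024, eps=0.05):
    """Sample the 'Doppler' test signal on n equispaced points in (0, 1):
    sqrt(t(1-t)) * sin(2*pi*(1+eps)/(t+eps)), an oscillation whose
    frequency increases as t approaches 0."""
    grid = ((i + 0.5) / n for i in range(n))
    return [math.sqrt(t * (1 - t)) * math.sin(2 * math.pi * (1 + eps) / (t + eps))
            for t in grid]

signal = doppler()  # 1024 samples; the sqrt envelope keeps them in [-0.5, 0.5]
```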

In his essay Lanier argues that the brittle state of software today results from the metaphors used by the first computer scientists – electronic communications devices that sent signals on a wire. That metaphor is an example of intellectual lock-in itself, one that has become hardened into how we encode ideas as machine instructions today.

My Interview with ITConversations on Reproducible Research

On September 30, I was interviewed by Jon Udell from ITConversations in his Interviews with Innovators series, on Reproducibility of Computational Science.

Here’s the blurb: “If you’re a writer, a musician, or an artist, you can use Creative Commons licenses to share your digital works. But how can scientists license their work for sharing? In this conversation, Victoria Stodden — a fellow with Science Commons — explains to host Jon Udell why scientific output is different and how Science Commons aims to help scientists share it freely.”

Optimal Information Disclosure Levels and "Taleb's Criticism"

I was listening to the audio recording of last Friday’s “Scientific Data for Evidence Based Policy and Decision Making” symposium at the National Academies, and was struck by the earnest effort on the part of members of the Whitehouse to release governmental data to the public. Beth Noveck, Obama’s Deputy Chief Technology Officer for Open Government, frames the effort with a slogan, “Transparency, Participation, and Collaboration.” A plan is being developed by the Whitehouse in collaboration with the OMB to implement these three principles via a “massive release of data in open, downloadable, accessible for machine readable formats, across all agencies, not only in the Whitehouse,” says Beth. “At the heart of this commitment to transparency is a commitment to open data and open information.”

Vivek Kundra, Chief Information Officer in the Whitehouse’s Open Government Initiative, was even more explicit – saying that “the dream here is that you have a grad student, sifting through these datasets at 3 in the morning, who finds, at the intersection of multiple datasets, insight that we may not have seen, or developed a solution that we may not have thought of.”

This is an extraordinary vision. This discussion comes hot on the heels of a debate in Congress regarding the level of information it is willing to release to the public in advance of voting on a bill. Last Wednesday CBS reported, with regard to the health care bill, that “[t]he Senate Finance Committee considered for two hours today a Republican amendment — which was ultimately rejected — that would have required the “legislative” language of the committee’s final bill, along with a cost estimate for the bill, to be posted online for 72 hours before the committee voted on it. Instead, the committee passed a similar amendment, offered by Committee Chair Max Baucus (D-Mont.), to put online the “conceptual” or “plain” language of the bill, along with the cost estimate.” What is remarkable is the sense this gives that somehow the public won’t understand the raw text of the bill (I noticed no compromise position offered that would make both versions available, which seems an obvious solution).

The Whitehouse’s efforts have the potential to test this hypothesis: if given more information will people pull things out of context and promulgate misinformation? The Whitehouse is betting that they won’t, and Kundra does state the Whitehouse is accompanying dataset release with efforts to provide contextual meta-data for each dataset while safeguarding national security and individual privacy rights.

This sense of limits in openness isn’t unique to governmental issues, and in my research on data and code sharing among scientists I’ve termed the concern “Taleb’s Criticism.” In a 2008 essay on The Edge website, Taleb worries about the dangers that can result from people using statistical methodology without having a clear understanding of the techniques. An example of concern about Taleb’s Criticism appeared on UCSF’s EVA website, a repository of programs for automatic protein structure prediction. The UCSF researchers won’t release their code publicly because, as stated on their website, “We are seriously concerned about the ‘negative’ aspect of the freedom of the Web being that any newcomer can spend a day and hack out a program that predicts 3D structure, put it on the web, and it will be used.” Like the congressmen, these researchers find openness scary because people may misuse the information.

It could be argued, and for scientific research should be argued, that an open dialog of an idea’s merits is preferable to no dialog at all, and misinformation can be countered and exposed. Justice Brandeis famously elucidated this point in Whitney v. California (1927), writing that “If there be time to expose through discussion the falsehood and fallacies, to avert the evil by the processes of education, the remedy to be applied is more speech, not enforced silence.” The Whitehouse’s data release effort is an experiment in context and may bolster trust in the public release of complex information. Speaking of the project, Noveck explained that “the notion of making complex information more accessible to people and to make greater sense of that complex information was really at the heart.” This is a very bold move and it will be fascinating to see the outcome.

Crossposted on Yale Law School’s Information Society Project blog.

What's New at Science Foo Camp 2009

SciFoo is a wonderful annual gathering of thinkers about science. It’s an unconference and people who choose to speak do so. Here’s my reaction to a couple of these talks.

In Pete Worden’s discussion of modeling future climate change, I wondered about the reliability of simulation results. Worden conceded that there are several models making the same predictions he showed, and that they can give wildly opposing results. We need to develop the machinery to quantify error in simulation models just as we routinely do for conventional statistical modeling: simulation is often the only empirical tool we have for guiding policy responses to some of our most pressing issues.
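
The simplest version of such machinery is an ensemble: rerun a stochastic simulation many times and report the spread of the outcomes as a crude uncertainty estimate, much as we attach standard errors to conventional statistical estimates. A minimal sketch (the “model” below is an invented stand-in, not any real climate model):

```python
import random
import statistics

def ensemble_spread(model, n_runs=200, seed=0):
    """Run a stochastic simulation n_runs times and report the
    ensemble mean and standard deviation of its outputs."""
    rng = random.Random(seed)  # seeded so the ensemble is reproducible
    outcomes = [model(rng) for _ in range(n_runs)]
    return statistics.mean(outcomes), statistics.stdev(outcomes)

# Toy stand-in: projected warming of 2.0 degrees plus parameter noise.
def toy_model(rng):
    return 2.0 + rng.gauss(0, 0.5)

mean, spread = ensemble_spread(toy_model)
```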

But the newest I saw was Bob Metcalfe’s call for us to imagine what to do with the coming overabundance of energy. Metcalfe likened solving energy scarcity to the early days of Internet development: because of the generative design of Internet technology, we now have things that were unimagined in the early discussions, such as YouTube and online video. According to Metcalfe, we need to envision our future as including a “squanderable abundance” of energy, and use Internet lessons such as standardization and distribution of power sources to get there, rather than building for energy conservation.

Cross posted on The Edge.

Bill Gates to Development Researchers: Create and Share Statistics

I was recently in Doha, Qatar, presenting my research on global communication technology use and democratic tendency at ICTD09. I spoke right before the keynote, Bill Gates, whose main point was that when you engage in a goal-oriented activity, such as development, progress can only be made when you measure the impact of your efforts.

Gates paints a positive picture, measured by deaths before age 5. In the 1880s, he says, about 30% of children died before their 5th birthday in most countries; annual childhood deaths gradually fell to 20 million in 1960 and then 10 million in 2006. Gates attributes this to rising income levels (40% of the decrease) and medical innovation such as vaccines (60% of the decrease).

This is an example of Gates’ mantra: you can only improve what you can measure. For example, an outbreak of measles tells you your vaccine system isn’t functioning. In his example about childhood deaths, he says we are getting somewhere here because we are measuring the value for money spent on the problem.

Gates thinks the wealthy in the world need to be exposed to these problems ideally through intermingling, or since that is unlikely to happen, through statistics and data visualization. Collect data, then communicate it. In short, Gates advocates creating statistics through measuring development efforts, and changing the world by exposing people to these data.

Wolfram|Alpha Demoed at Harvard: Limits on Human Understanding?

Yesterday Stephen Wolfram gave the first demo of Wolfram|Alpha, coming in May, which he modestly describes as a system to make our stock of human knowledge computable. It includes not just facts, but also our algorithmic knowledge. He says, “Given all the methods, models, and equations that have been created from science and analysis – take all that stuff and package it so that we can walk up to a website and ask it a question and have it generate the knowledge that we want. … like interacting with an expert.”

It’s ambitious, but so are Wolfram’s previous projects: Mathematica and MathWorld. I remember relying on MathWorld as a grad student – it was excellent, and I remember when it suddenly disappeared as its content was being published as a book. In 2002 he published A New Kind of Science, arguing that all processes, including thought, can be viewed as computations and that a simple set of rules can describe a complex system. This thinking is clearly evident in Wolfram|Alpha, and here are some key examples.
Continue reading ‘Wolfram|Alpha Demoed at Harvard: Limits on Human Understanding?’

Stuart Shieber and the Future of Open Access Publishing

Back in February Harvard adopted a mandate requiring its faculty members to make their research papers available within a year of publication. Stuart Shieber is a computer science professor at Harvard and was responsible for proposing the policy. He has since been named director of Harvard’s new Office for Scholarly Communication.

On November 12 Shieber gave a talk entitled “The Future of Open Access — and How to Stop It” to give an update on where things stand after the adoption of the open access mandate. Open access isn’t just something that makes sense from an ethical standpoint: Shieber points out that (for-profit) journal subscription costs have risen out of proportion with inflation and out of proportion with the costs of nonprofit journals. He notes that the cost per published page in a commercial journal is six times that of the nonprofits. With the current library budget cuts, open access — meaning both access to articles directly on the web and shifting subscriptions away from for-profit journals — is something that appears financially unavoidable.

Here’s the business model for an Open Access (OA) journal: authors pay a fee upfront in order for their paper to be published. Then the issue of the journal appears on the web (possibly also in print) without an access fee. Conversely, traditional for-profit publishing doesn’t charge the author to publish, but keeps the journal closed and charges subscription fees for access.

Shieber recaps Harvard’s policy:

1. The faculty member grants permission to the University to make the article available through an OA repository.

2. There is a waiver for articles: a faculty member can opt out of the OA mandate at his or her sole discretion. For example, if you have a prior agreement with a publisher you can abide by it.

3. The author themselves deposits the article in the repository.

Shieber notes that the policy is also valuable because it allows Harvard to make a collective statement of principle, to systematically provide metadata about articles, to clarify the rights accruing to each article, to facilitate the article deposit process, and to negotiate collectively; in addition, having the mandate be opt out rather than opt in might increase rights retention at the author level.

So the concern Shieber set up in his talk is whether standards for research quality and peer review will be weakened. Here’s how the dystopian argument runs:

1. all universities enact OA policies
2. all articles become OA
3. libraries cancel subscriptions
4. prices go up on remaining journals
5. these remaining journals can’t recoup their costs
6. publishers can’t adapt their business model
7. so the journals and the logistics of peer review they provide, disappear

Shieber counters this argument: steps 1 through 5 are good, because journals will start to feel some competitive pressure. What would be bad is if publishers cannot change their way of doing business. Shieber thinks that even if this is so, it will have the effect of pushing us towards OA journals, which provide the same services, including peer review, as the traditional commercial journals.

But does the process of getting there cause a race to the bottom? The argument goes like this: since OA journals are paid by the number of articles published, they will just publish everything, thereby destroying standards. Shieber argues this won’t happen because there is price discrimination among journals – authors will pay more to publish in the more prestigious ones. For example, PLOS charges about $3k per article, Biomed Central about $1000, and Scientific Publishers International $96. Shieber also argues that Harvard should have a fund to support faculty who wish to publish in an OA journal and have no other way to pay the fee.

This seems to imply that researchers with sufficient grant funding or falling under his proposed Harvard publication fee subsidy, would then be immune to the fee pressure and simply submit to the most prestigious journal and work their way down the chain until their paper is accepted. This also means that editors/reviewers decide what constitutes the best scientific articles by determining acceptance.

But is democratic representation in science a goal of OA? Missing from Shieber’s described market for scientific publications is any kind of feedback from the readers. The content of these journals, and the determination of prestige, is defined solely by the editors and reviewers. Maybe this is a good thing. But maybe there’s an opportunity to open this up by allowing readers a voice in the market. This could be done through ads or a very tiny fee on articles – both would give OA publishers an incentive to respond to the preferences of the readers. Perhaps OA journals should be commercial in the sense of profit-maximizing: they might have a reason to listen to readers and might be more effective at maximizing their prestige level.

This vision of OA publishing still effectively excludes researchers who are unable to secure grants or are not affiliated with a university that offers a publication subsidy. The dream behind OA publishing is that everyone can read the articles, but to fully engage in the intellectual debate quality research must still find its way into print, and at the appropriate level of prestige, regardless of the affiliation of the researcher. This is the other side of OA that is very important for researchers from the developing world or thinkers whose research is not mainstream (see, for example, Garrett Lisi a high impact researcher who is unaffiliated with an institution).

The OA publishing model Shieber describes is a clear step forward from the current model where journals are only accessible by affiliates of universities who have paid the subscription fees. It might be worth continuing to move toward an OA system where, not only can anyone access publications, but any quality research is capable of being published, regardless of the author’s affiliation and wealth. To get around the financial constraints one approach might be to allow journals to fund themselves through ads, or provide subsidies to certain researchers. This also opens up the idea of who decides what is quality research.

Craig Newmark: "no vision, but I know how to keep things simple, and I can listen some"

Craig Newmark was visiting the Berkman Center today and he explained how founding CraigsList brought him to his current role as community organizer. But these are really the same, he says.

In 1994, Craig was working at Charles Schwab where he evangelized the net – figuring that this was the future of business for these types of firms. He showed people usenet newsgroups and The Well and he noticed people helping each other in very generous ways. He wanted to give back, so he started a cc list for events in early 1995. He credits part of his success to the timing of this launch – the early dot com boom. People were always influential, suggesting new categories, for example. He was using pine for this, and by mid 1995 he had 240 email addresses and pine started to break. He was going to call it SFevents, but people around him suggested CraigsList because it was a brand, and the list was more than events.

So he wrote some code to turn these emails into html and became a web publisher. At the end of 1997 three events happened: CraigsList reached one million page views per month (a billion in August 2004, now heading toward 13 billion per month); Microsoft Sidewalk approached him to run banner ads and he said no because he didn’t need the money; and he was approached with the idea of having some of the site run on a volunteer basis. He went for volunteer help, but in 1998 it didn’t work well since he wasn’t providing strong leadership. At the end of 1998 people approached him to fix this, so in 1999 he incorporated and hired Jim Buckmaster, who continued the tradition of incorporating volunteer suggestions for the site and maintained the simple design. Also in 1999 he decided to charge for job ads and to charge real estate agents (only apartment brokers in NYC, who requested this to eliminate the perceived need to post and repost).

He has generalized his approach to “nerd values”: take care of yourself enough to live comfortably; after that you can start to focus on changing things.

After 2000 there was slow continuous progress, like the addition of more cities. He also says they made the mistake of anonymizing all email by default. The idea was to protect against spammers, but people requested the choice, because there is personal branding in email. He notes conflicting feedback can be tough to deal with. For example, people feel strongly about “backyard breeders” of pets, and there was bickering that crossed into criminal harassment. He says this kind of thing is hard to deal with emotionally.

So why was CraigsList so successful? He claims it is their business model… and a culture of trust. Bad guys are a tiny percentage of the population and people look out for each other – for example, via the flagging mechanism (a post is removed automatically if many people flag it). How did they build this culture of trust? Craig says it was by acting on shared values from the beginning, i.e. the golden rule, especially in customer service, and live and let live, and being forgiving and giving breaks. They are still trying to listen to people, although novel suggestions are rare – the biggest decisions are which new cities to include.
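
A minimal sketch of how such a flagging mechanism might work (the threshold of five distinct flaggers is my invention for the example; CraigsList’s actual rules aren’t described in the talk):

```python
def apply_flags(posts, flags, threshold=5):
    """Keep only posts flagged by fewer than `threshold` distinct users.

    posts: dict of post_id -> text
    flags: dict of post_id -> set of user ids who flagged it
    """
    return {pid: text for pid, text in posts.items()
            if len(flags.get(pid, set())) < threshold}

posts = {1: "apartment for rent", 2: "obvious spam"}
flags = {2: {"a", "b", "c", "d", "e"}}  # five distinct users flag post 2
surviving = apply_flags(posts, flags)   # post 2 is removed automatically
```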

He still runs pine as the primary email tool. He says it keeps down RSI because it minimizes point and click.

Newmark sees himself as a community grassroots organizer: organizing people in mundane ways. He has capitalized on this to help in other ways beyond CraigsList. He doesn’t see anything about CraigsList as philanthropic, but he wants to extend this approach to help with the future of the media. For example, face to face communication doesn’t scale on the Internet, but democracy is best facilitated through in person communication, so Craig sees the Internet as a great facilitator of face to face communication. He believes 2009 is the new 1787! This is about accountability and transparency: exposing everything the government is doing in order to sanitize it.

One more piece of advice from Craig: socialize more than he did as an undergrad – he says he got a better education than he needed and would have been better off spending more time socializing.

Crossposted on Berkman’s I&D Blog

Sunstein speaks on extremism

Cass Sunstein, Professor at Harvard Law School, is speaking today on Extremism: Politics and Law. Related to this topic, he is the author of Nudge, Republic 2.0, and Infotopia. He discussed Republic 2.0 with Henry Farrell on this diavlog, which touches on the theme of extremism in discourse and the web’s role in facilitating polarization of political views (notably, Farrell gives a good counterfactual to Sunstein’s claims, and Sunstein ends up agreeing with him).

Sunstein is in the midst of writing a new book on extremism and this talk is a teaser. He gives us a quote from Churchill: “Fanatics are people who can’t change their minds and will not change the subject.” Political scientist Hardin says he agrees with the first clause epistemologically but the second clause is wrong because they *cannot* change the subject. Sunstein says extremism in multiple domains (The Whitehouse, company boards, unions) results from group polarization.

He thinks the concept of group polarization should replace the notion of group think in all fields. Group polarization involves both information exchange and reputation. His thesis is that like-minded people talking with other like-minded people tend to move to more extreme positions upon discussion – partly because of the new information and partly because of the pressure from peer viewpoints.

His empirical work on this began with his Colorado study. He and his coauthors recorded the private views on 3 issues (climate change, same sex marriage, and race conscious affirmative action) for citizens in Boulder and for citizens in Colorado Springs. Boulder is liberal, so they screened people to ensure liberalness: if they liked Cheney they were excused from the test. They asked the same Cheney question in Colorado Springs and excused those who didn’t like him. Then he interviewed participants to determine their private views after deliberation, as well as having each group come to a consensus.
Continue reading ‘Sunstein speaks on extremism’

A2K3: Communication Rights as a Framework for Global Connectivity

In the last A2K3 panel, entitled The Global Public Sphere: Media and Communication Rights, Seán Ó Siochrú made some striking statements based on his experience building local communication networks in undeveloped areas of LDCs. He states that the global public sphere is currently a myth, and what we have now is elites promoting their self-interest. He criticizes the very notion of the global public sphere – he wants a more dynamic and broader term that reflects the deeper issues involved in bringing about such a sphere. He prefers to frame this issue in terms of communication rights. By this he means the right to seek and receive ideas, generate ideas and opinions of one’s own, speak these ideas, have a right to be heard, and a right to have others listen. These last two rights Ó Siochrú dismisses as trivial, but I don’t see that they are. Each creates a demand on others’ time that I don’t see how to effectuate within the framework of respect for individual autonomy Balkin elucidated in his keynote address and discussed in my recent blog post and on the A2K blog.

Ó Siochrú also makes an interesting point: if we are really interested in facilitating communication and connection between and by people who have little connectivity today, we are best to concentrate on technologies such as radio, email, mobile phones, television, or whatever works at the local level. He eschews blogs, and the internet generally, as the least accessible, least affordable, and least usable.

Legal Barriers to Open Science: my SciFoo talk

I had an amazing time participating at Science Foo Camp this year. This is a unique conference: there are 200 invitees comprising some of the most innovative thinkers about science today. Most are scientists but not all – there are publishers, science reporters, scientific entrepreneurs, writers on science, and so on. I met old friends there and found many amazing new ones.

One thing that I was glad to see was the level of interest in Open Science. Some of the top thinkers in this area were there and I’d guess at least half the participants are highly motivated by this problem. There were sessions on reporting negative results, the future of the scientific method, and reproducibility in science. I organized a session with Michael Nielsen on overcoming barriers in open science. I spoke about the legal barriers, and O’Reilly Media has made the talk available here.

I have papers forthcoming on this topic you can find on my website.

A2K3: Technological Standards are Public Policy

Laura DeNardis, executive director of Yale Law School’s Information Society Project, spoke during the A2K3 panel on Technologies for Access. She makes the point that many of our technological standards are being made behind closed doors by private, largely unaccountable parties such as ICANN, ISO, the ITU, and other standards bodies. She advocates the concept of Open Standards, which she defines in a three-fold way: open in development, open in implementation, and open in usage. DeNardis worries that without such protections in place stakeholders can be subject to a standard they were not a party to, and this can affect nations in ways that might not be beneficial to them, particularly in areas such as civil rights, and especially for less developed countries. In fact, Nnenna Nwakanma in the audience comments that even when countries appear to be involved, their delegations are often comprised of private companies and are not qualified. For example, she says that only three countries in Africa have people with the requisite technical expertise on such state standards councils, and that the involvement process is far from transparent. DeNardis also mentions the Dynamic Coalition on Open Standards, designed to preserve the open architecture of the internet; the Yale ISP is involved in its advocacy at the Internet Governance Forum. DeNardis powerfully points out that standards are very much public policy, as much as the regulation we typically think of as public policy.

A2K3: Tim Hubbard on Open Science

In the first panel at A2K3 on the history, impact, and future of the global A2K movement, Tim Hubbard, a genetics researcher, laments that scientists tend to carry out their work in a closed way and thus very little data is released. In fact he claims that biologists used to deliberately mess up images so that they could not be reproduced! But apparently journals are more demanding now and this problem has largely been corrected (for example, Nature’s 2006 standards on image fraud). He says that openness in science needs to happen before publication, the traditional time when scientists release their work. But this is a tough problem. Data must be released in such a way that others can understand and use it. This parallels the argument made in the opening remarks about the value of net neutrality as preserving an innovation platform: in order for data to be used it must be open in the sense that it permits further innovation. He says we now have Open Genome Data, but privacy issues are pertinent: even summaries of the data can be backsolved to identify individuals. He asks for better encryption algorithms to protect privacy. In the meantime he proposes two other solutions. We could just stop worrying about the privacy of our genetic data, just as we don’t hide our race or gender. Failing that, he wants to mine the UK’s National Health Service’s patient records through an “honest broker”: an intermediary that runs the programs and scripts researchers submit. The data are hidden from the researcher and only accessed through the intermediary. Another problem this solves is the sheer size of the released data, which can prevent interested people from moving or analyzing it. This has broad implications, as Hubbard points out – the government could access its CCTV video recordings to find drivers who’ve let their insurance lapse, but not track other possibly privacy violating aspects of drivers’ visible presence on the road.
Hubbard is touching on what might be the most important part of the Access to Knowledge movement – how to make the access meaningful without destroying incentives to be open.
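
Hubbard’s honest broker can be sketched in a few lines (the class, the records, and the minimum group size below are all assumptions for illustration): the researcher never touches the records, and only aggregates computed over sufficiently large groups are released.

```python
import statistics

class HonestBroker:
    """Mediate access to a sensitive dataset: researchers submit
    queries, the broker runs them, and only aggregates over groups
    above a disclosure threshold are ever returned."""

    MIN_GROUP = 10  # assumed minimum group size for release

    def __init__(self, records):
        self._records = records  # never exposed to the researcher

    def query_mean(self, field, predicate=lambda r: True):
        matched = [r[field] for r in self._records if predicate(r)]
        if len(matched) < self.MIN_GROUP:
            raise PermissionError("group too small to release")
        return statistics.mean(matched)

# Hypothetical patient records: ages 20 through 49.
records = [{"age": 20 + i} for i in range(30)]
broker = HonestBroker(records)
avg_age = broker.query_mean("age")  # allowed: 30 records >= MIN_GROUP
```

A query whose predicate matches fewer than MIN_GROUP records would raise instead of leaking a near-individual statistic.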

Book Review: The Cathedral and the Bazaar by Eric Raymond

I can’t believe I haven’t read this book until now since it intersects two areas of deep interest to me: technology (specifically programming) and freedom. Essentially the book celebrates liberty as a natural mode for creativity and productivity, with open source software as an example. Raymond has two further findings: that openness doesn’t necessarily imply a loss of property rights, and selfish motives are pervasive (and not evil).

When does open source work? And how about computational science?

Raymond’s biggest contribution is that he gives a wonderful analysis of the conditions that contribute to the success of the open approach, quoting from page 146 of the revised edition:

1. Reliability/stability/scalability are critical.
2. Correctness of design and implementation cannot readily be verified by means other than independent peer review.
3. The software is critical to the user’s control of his/her business.
4. The software establishes or enables a common computing and communication infrastructure.
5. Key methods (or functional equivalents of them) are part of common engineering knowledge.

I’m struck by how computational research seems to fit. Adapting Raymond’s list in this light:

1. Reproducibility of research is critical.
2. Correctness of methodology and results cannot readily be verified by means other than independent peer review.
3. The research is critical to academic careers.
4. Computational research may lead to common platforms since by its nature code is created, but this is not necessarily the case.
5. Key methods, such as the scientific method, are common research knowledge.

1, 2, and 3 seem fairly straightforward and a great fit for computational science. Computational research doesn’t tend to establish a common computing and communication infrastructure, although it can: for example, David Donoho’s WaveLab, or SparseLab, my collaborative work with him and others. We were aiming not only to create a platform and vehicle for reproducible research but also to create software tools and common examples for researchers in the field. But at the moment this approach isn’t typical. In point 5, I think what Raymond means is that there is a common culture of how to solve a problem. I think computational scientists have this through their agreement on the scientific method for inquiry. Methodology in the computational sciences is undergoing rapid transformations – it is a very young field (for example, see my paper).

I think open source and computational research differ in their conception of openness. Implicitly I’ve been assuming opening to other computational researchers. But I can imagine a world that’s closer to the open source mechanism where people can participate based on their skill and interest alone, rather than school or group affiliation or similar types of insider knowledge. Fully open peer review and reproducible research would make these last criteria less important and go a long way to accruing the benefits that open source has seen.

Raymond notes that music and books are not like software in that they don’t need to be debugged and maintained. I think computational scientific research can bridge this in a way those areas don’t – the search for scientific understanding requires cooperation and continual effort that builds on the work that has come before. Plus, there is something objective we’re trying to describe when we do computational research, so we have a common goal.

Property Rights – essential to a productive open system

The other theme that runs through the book is Raymond’s observation that openness doesn’t necessarily mean a loss of property rights, in fact they may be essential. He goes to great lengths to detail the community mores that enforce typical property right manifestations: attribution for work, relative valuation of different types of work, and boundary establishment for responsibility within projects for example. He draws a clever parallel between physical property rights as embodied in English common law (encoded in the American legal system) and the similarly self-evolved property rights of the open source world. John Locke codified the Anglo-American common law methods of acquiring ownership of land (page 76):

1. homesteading: mixing one’s labor with the unowned land and defending one’s title.
2. transfer of title.
3. adverse possession: owned land can be homesteaded and property rights acquired if the original owner does not defend his claim to the land.

A version of this operates in the hacker community with regard to open source projects. By contributing to an open source project you mix your labor in Lockean fashion and gain part of the project’s reputation return. The parallel between the real and virtual worlds is interesting – and the fact that physical property rights appear to be generalizable and important for conflict avoidance in open source systems. Raymond also notes that these property rights customs are strictly enforced in the open source world through moral suasion and the threat of ostracization.

The Role of Selfishness in Open Cultures

Raymond uses the open source world as an example of the pervasiveness of selfish motives in human behavior, stating that as a culture we tend to have a blind spot to how altruism is in fact “a form of ego satisfaction for the altruist” (p. 53). This is an important point in this debate because the idea of open source can be conflated with a diminution of property rights and a move toward a less capitalist system, i.e., people behaving altruistically toward each other rather than according to market strictures. Raymond eviscerates this notion by noting that altruism isn’t selfless, and that the open source world benefits by linking the selfishness of hackers and their need for self-actualization to difficult ends that can only be achieved through sustained cooperation. Appeals to reputation and ego boosting seem to do the trick in this sphere, and Raymond attributes Linux’s success in part to Linus Torvalds’ genius in creating an efficient market in ego boosting – turned, of course, to the end of OS development.

Vacations or "Vacations" :)

I’m here at the Global Voices Summit in Budapest and I just listened to a panel on Rising Voices, a group within Global Voices dedicated to supporting the efforts of people traditionally underrepresented in citizen media. (See their trailer here.) At the end of the panel, the question was asked, ‘how can we help?’ The answer was perhaps surprising: although money is always welcome, what is needed is skills. Specifically, people with web design or IT skills can come and stay with a blogging community for a week or two and teach people how to do things like design a web page, display their wares online, and generally support them in computer use. So it occurred to me that I know many people for whom travel and learning are very important, who are skilled in IT, and who would find enormous satisfaction in having a purpose to their travel. I can put you in touch with people who might appreciate your skills, or you can reach Rising Voices directly. Another group that’s similar in spirit and might be able to facilitate this is Geek Corps.

Internet and Cell Phone Use in the Middle East

When people talk about the Internet and Democracy, especially in the context of the Middle East, I wonder just how pervasive the Internet really is in these countries. I made a quick plot for Middle Eastern countries using data I downloaded from the International Telecommunication Union:

The US is the blue line on top, for reference. UAE is approaching American levels of internet use, and Iran has skyrocketed since 2001 and is now the 3rd or 4th most wired country in the group. There seems to be a cluster of countries that, while adopting the internet, are doing so slowly: Saudi Arabia, Egypt, Syria, Oman, the Sudan, and Yemen, although Saudi Arabia and Syria seem to have been accelerating since 2005.

I made a comparable plot of cellphone use per 100 inhabitants for these same countries, also from data provided by the ITU:

In this graph the United States is in the middle of the pack and growing steadily, but definitely not matching the recent subscription growth rates in the UAE, Qatar, Bahrain, Saudi Arabia, and Oman (data for 2007 for Israel is not yet available). For most countries, cell phone subscriptions are more than three times as prevalent as internet users. Interestingly, the group of countries with low internet use also has low cell phone use, but unlike for the internet, their cell phone subscription rates all began accelerating in 2005.

So what does this mean? All countries in the Middle East are adopting cell phones more quickly than the internet, with the interesting exception of Iran. (I don’t know why the growth rate of internet use in Iran is so high; perhaps blogging has caught on more there. Although it doesn’t address this question directly, the Iranian blogosphere itself is analyzed in the Berkman Internet & Democracy paper Mapping Iran’s Online Public: Politics and Culture in the Persian Blogosphere.) Syria, the Sudan, Yemen, and Iran have grown most quickly in both internet use and cell phone subscription. In 6 countries there is more than one cell phone subscription per person – conversely, the highest rate of internet use (other than the US) is 50% in the UAE, with the other countries in approximately two clusters of about 30% and about 10% each. With the rates of growth on the side of cell phones, I doubt we’ll see their pervasiveness relative to internet use change in the next few years; in fact the gap will probably widen.
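The users-per-100-inhabitants metric behind both plots is simple to derive from the ITU’s raw counts; a minimal sketch with made-up illustrative numbers (not the actual ITU data):

```python
# Illustrative (made-up) figures for one year: internet users and population,
# both in millions. The ITU reports the derived rate per 100 inhabitants.
raw = {
    "UAE":   {"users": 2.3,  "pop": 4.4},
    "Iran":  {"users": 18.0, "pop": 71.0},
    "Yemen": {"users": 0.3,  "pop": 22.0},
}

def per_100(users_m, pop_m):
    """Users per 100 inhabitants, the y-axis of the plots above."""
    return 100.0 * users_m / pop_m

rates = {country: round(per_100(d["users"], d["pop"]), 1)
         for country, d in raw.items()}
# e.g. rates["UAE"] -> 52.3
```

Computing this series for each country and year gives the time series that get plotted (e.g. with matplotlib), one line per country.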

Cross-posted at I&D Blog

Lessig stars at the Stanford FCC hearing

After Comcast admitted to stuffing seats at the FCC hearing at Harvard Law School February 24th, the FCC decided another hearing was necessary. They chose to hold it at Stanford April 17 and I’m watching the FCC’s videocast of the event, which is oddly appropriate, since the focus of the hearing is video on the internet.

After an introduction by Stanford Law School Dean Larry Kramer, FCC Chairman Martin explained that every invited ISP – Comcast, Verizon, Time Warner, and AT&T – declined to attend this hearing, with the exception of Lariat Networks of Laramie, Wyoming. Comcast has stated it is working with an industry consortium on a Consumer Bill of Rights. The hearing begins with each of the FCC commissioners making a statement, then proceeds through panels, and then opens to questions.

Commissioner Copps states that a free internet is a requirement for the type of growth we’ve seen from Silicon Valley. If network operators consolidate their control, which is more likely with fewer network operators, they’ll prevent inventors from bringing their innovations to consumers and make investing more risky. So Copps wants to eliminate and punish discrimination.

Indicating how huge this issue has become, Commissioner Adelstein states that 45,000 filings were submitted to the FCC docket for this hearing, and the vast majority of them came from private citizens. He warns that the recent consolidation across internet providers, from the backbone to the largest service providers, will lead to more FCC regulation. He advocates greater competition in the broadband marketplace, since 90% of it is dominated by cable and telephone companies. This gives the companies who control the “last mile” (the stretch from the backbone to the consumer’s computer) the ability to discriminate over packets that reach end users. He’s concerned about allegations like Verizon’s refusal to deliver NARAL’s text messages and AT&T’s censoring of Pearl Jam online. He would like a 5th principle on the FCC policy statement to address this, as well as enforcement and compliance. Broadband providers should declare in clear, plain English what their policies are.

Commissioner Tate applauds the industry-wide effort to create a bill of rights for P2P users and ISPs. She has a strong preference for industry based collaborative solutions over direct regulation.

Commissioner McDowell wants to ensure that the FCC takes the anticompetitive allegations, such as the text messaging one, seriously. Comcast is alleged to have manipulated packet allocation of video – Comcast both provides video and runs the pipes for competitors’ video, so it appears to discriminate against BitTorrent for anticompetitive reasons, not just for traffic management. McDowell, like Commissioner Tate, would like to see the industry develop its own solutions to these problems, such as what might come from the industry consortium Comcast is involved in, and says “engineers should solve engineering problems not politicians.”

Chairman Martin states the four principles the FCC adopted in August 2005 in their internet policy statement (“Powell’s Four Freedoms”).

1. Consumers are entitled to access the lawful Internet content of their choice;
2. Consumers are entitled to run applications and services of their choice, subject to the needs of law enforcement;
3. Consumers are entitled to connect their choice of legal devices that do not harm the network; and
4. Consumers are entitled to competition among network providers, application and service providers, and content providers.

Larry Lessig, Professor at Stanford Law School, is the first speaker on the first panel.

Lessig reminds us that companies are out to make a profit and we shouldn’t trust them with public policy. The architecture of the internet has given us openness, transparency, and freedom, and in a market with few firms, those firms can manipulate this architecture to weaken competition. It is important to note that the original openness of the internet has given us an enormous amount of economic growth – he likens the process to the electricity grid: it is transparent and open and anyone can do anything on it, as long as you know the protocols. It doesn’t ask whether the TV you plug in is a Panasonic or a Sony, and doesn’t allocate electricity based on that information. He argues that any departure from this model requires a very strong demonstration that the proposed change will advance economic growth and that competition will continue.

We can’t just wait and see, says Lessig – witness the text messaging and BitTorrent problems we have already. He reiterates the argument that venture capitalists need stability about the vision of the future in order to invest. Thus the FCC needs to make a clear policy statement that net neutrality is a core principle of the internet infrastructure. In fact, Lessig says, the failure of the FCC to create a clear policy about this is the reason for the hearing today. So the FCC needs to regulate things it understands, but is that sufficient to assure that what happens at the network level doesn’t destroy neutrality? Lessig gives two examples of such regulation, calling them “Powell’s Four Freedoms Plus.”

Plus 1) The zero-price regulation: this is built into Representative Markey’s proposed bill: if data of one type are prioritized, all data of that type must be prioritized without a surcharge. Lessig is against this: it blocks productive discrimination, and so slows the spread of broadband and thus growth. For example, iFilm wants fast pipes while email doesn’t need that speed, so these services can be prioritized differently; but iFilm’s competitors shouldn’t find themselves subject to different discrimination practices by the provider.

Plus 2) Zero discrimination surcharge rules. A discrimination surcharge occurs when a provider says Google pays x but iFilm pays 2x. Lessig explains this is a problem because it creates an incentive for a destructive business model: the provider can inflate the premium price by maintaining scarcity in ordinary network provision. This rule does allow for nondiscriminatory tiered pricing, i.e., a surcharge for video where everyone pays the same price for that video privilege. Lessig’s advice is that the FCC should start here, with the target of making broadband a commodity like wheat – a market characterized by fundamental competition in the provision of the commodity, which drives the price down.
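The distinction Lessig draws between tiered pricing and a discrimination surcharge can be made concrete with a toy check (my construction for illustration, not anything Lessig proposed): model a tariff as a map from (sender, traffic class) to price, and call it nondiscriminatory when every sender pays the same price within a class, even though classes may be priced differently.

```python
# Toy model: a tariff maps (sender, traffic_class) -> price per unit.
# Allowed: prices that differ by class (video costs more than email).
# Disallowed: two senders paying different prices for the same class.
def is_nondiscriminatory(tariff):
    prices_by_class = {}
    for (sender, tclass), price in tariff.items():
        prices_by_class.setdefault(tclass, set()).add(price)
    # Nondiscriminatory iff each class has exactly one price for everyone.
    return all(len(prices) == 1 for prices in prices_by_class.values())

# Tiered but uniform within each tier: fine under Lessig's rule.
tiered_ok = is_nondiscriminatory({
    ("Google", "video"): 2, ("iFilm", "video"): 2, ("anyone", "email"): 1,
})
# "Google pays x but iFilm pays 2x" for the same class: a surcharge.
surcharge_bad = is_nondiscriminatory({
    ("Google", "video"): 1, ("iFilm", "video"): 2,
})
```

The point of the toy is only that the rule keys on traffic class, not sender identity.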

The role of net neutrality in FCC regulation: Lessig thinks net neutrality should be a very central principle – a heavy weight, but not an absolute bar. This means that countervailing notions that don’t compromise the incentive to produce open networks are OK.

When asked how the commission should respond to claims that customers get less broadband than they pay for, Lessig says “the most outrageous thing about this story is you can’t get the facts straight.” He says that if there were penalties for a company that misrepresents what’s going on during an investigation, there would be more clarity right now.

Lessig explains that even if there were sufficient competition, this is not enough to ensure net neutrality. He cites Barbara van Schewick, an assistant professor at Stanford Law School, co-director (with Lessig) of its Center for Internet and Society, and an upcoming panelist.

Van Schewick claims that markets won’t solve the problem of content discrimination on the internet. Consumers need in-depth and standardized disclosure, and even this is not enough because there are market failures. Providers have the incentive to block applications that use lots of bandwidth but don’t translate into higher profits. This harms application innovation, aside from discouraging investment, since the blocking behavior is unpredictable. Network providers need to manage networks in a nondiscriminatory way.

Robb Topolski, a panelist and software quality engineer, says tests he has done show that Comcast was blocking packets at 1:45am rather than at times of congestion as they claim. Topolski also notes that there is a general complaint form provided by the FCC, but no one knows about it. He also notes that routers manage network traffic on their own – it may not be optimal, but it would be better than waiting on the provider industry to self-regulate. Interestingly, consumers seem to be testing networks themselves, and tools are even appearing to monitor cellphone use by consumers (see the company started by panelist Jason Devitt).

Crossposted on I&D Blog

David Weinberger: How new technologies and behaviors are changing the news

David Weinberger is a fellow and colleague of mine at the Berkman Center and is at Berkman’s Media Re:public Forum discussing the difference the web is making to journalism: “what’s different about the web when it comes to media and journalism?”

Weinberger is concerned with how we frame this question. He prefers ‘ecosystem’ rather than ‘virtue of discomfort’ since this gets at the complexity and interdependence in online journalism. But the ecosystem analogy is too apt, too comforting, and too all-encompassing, so he pushes further. He doesn’t like the ‘pro-amateur’ analogy since it focuses too much on money as the key difference among web actors, and yet somehow seems to understate the vast disparity in money and funding. The idea of thinking of news as creating a better-informed citizenry, so that we get a better democracy, doesn’t go far enough – Weinberger notes that people read the news for more reasons than this.

So he settles on ‘abundance’ as a frame, due to the fact that control doesn’t scale – something currently being addressed with online media. “Abundance of crap is scary but abundance of good stuff is terrifying!” The key question is how to deal with this. We are no longer in a battle over the front page, since other ways of getting information are becoming more salient. For example, Weinberger notes that “every tag is a front page” and email recommendations often become our front page. He sees this translating into a battle over metadata – the front page is metadata, authority is metadata – and we are no longer struggling over content creation. So we create new tools to handle metadata, in order to show each other what matters and how it matters: tools such as social networks and the semantic web. All these tools unsettle knowledge and meaning (knowledge and meaning that has not been obvious but was always there).

Crossposted on I&D Blog

Robert Suro: Defining the qualities of information our democracy needs

Robert Suro is a professor of journalism at USC and spoke today at Berkman’s Media Re:public Forum. His talk concerns journalism’s role in democratic processes, and he draws two distinctions in how we think about journalism that often get conflated. The first: journalism is a business but also a social actor. He points out that when mainstream media’s profitability declines, we shouldn’t make the mistake of assuming its impact in the democratic arena declines as well.

He also has trouble with the term “participatory media” and draws a distinction between the study of who is participating and what means they use (his definition of participatory media) and “journalism of participation” which evaluates the media in terms of a social actor – the object is effective democratic governance. He is worried these two concepts get confused and people can mistakenly equate the act of participating in the media, for example adding comments to a web site, with effective participation in the democratic process.

The result of this distinction is that if you want to assess participatory media in terms of social impact, you have to study not only who the participants are and what they produce, but also whether this activity is engendering civic engagement that makes democracy more representative and government more effective.

Suro notes that this isn’t new: he hypothesizes that journalism doesn’t change often, but when it does it is a big change, and we’re in the middle of just such a change right now. As an example of a previous change he gives the debate between two editors who were interested in the creation of civil society, one supported by Jefferson and Madison and the other by Hamilton and Adams. Both were partisan in what they said and in who funded them, and both were committed to democracy but understood the role of the state differently, a split that fed the creation of the first American political parties. Although both would be fired as editors today, there is a long history of journalism shaping democratic outcomes, and the fundamental role of journalism in a democratic society is subject to change. We should study the ongoing redefinition and try to understand causality and impact.

Suro also thinks the Lippmann/Dewey argument, about whether the goal of journalism should be to produce highly informed elites or to mobilize the masses and create informed debate, is alive and well. He suggests we have always produced a mix of these outcomes and will inevitably continue to do so, but now we have to address the mix of journalistic processes. He thinks the right way to look at this is to assess what outcomes they produce in terms of quality of leadership. Suro also touches on Cass Sunstein’s polarization concern, that it will produce less effective governance: we need to understand how a mix of new and old media can create a megaphone that artificially amplifies a voice that might not be the most effective.

Crossposted on I&D Blog

Richard Sambrook at the Media Re:public Forum

I’m at Berkman’s Media Re:public Forum and Richard Sambrook, director of Global News at the BBC, is giving the first talk. He is something of a technological visionary, and his primary concern is with how technology is affecting the ability of not only traditional media but anyone to set the international news agenda.

The model that news stories may break on the blogs and travel to mainstream media seems incomplete to Sambrook, and he hopes to use the news audience to develop the agenda in an interactive way through network journalism. An example he gives is how the BBC puts their NewsNight show’s agenda online in the morning and invites people to comment on the choice of stories and the angles they are taking on them. But this seems quite small and, as Ethan Zuckerman points out in a question, not much of a change in paradigm: Zuckerman laments that mainstream media is trying to involve the public on its own terms and in its own way, through site-hosted comments, while being quite closed about sharing its content. Sambrook attributes this to the slowness of cultural change at organizations like the BBC, and says it is changing: for example, BBC video can now be hosted on any site. Sambrook is also worried that they just can’t seem to find the audience – the right people to engage with in various areas. He notes that the top ten sites (Google, Yahoo, Wikipedia, Fox News, etc.) control 1 billion eyeballs. He doesn’t think current business models are sustainable, and perhaps energy should be directed toward a different metric than eyeballs to more accurately measure engagement and be able to monetize it.

Sambrook notes that across mainstream media it is well understood that the future of news is online, but there are cultural legacies within mainstream media, and even where there aren’t, solutions to new problems aren’t obvious. Sambrook gives the example of the BBC’s river boat trip through Bangladesh. They experimented with several ways of reaching potentially interested audiences: Twitter, Google Maps to track the boat, images on Flickr, radio, and traditional news. They had 26 followers on Twitter and 50,000 on Flickr but millions on the radio. This highlights the difficulty news outlets are having reaching their audience – the methods chosen are key, and how to choose them is not obvious.

Sambrook says that he sees an upcoming tipping point for the data-driven web, or semantic web, in news applications. For Sambrook, this manifests as an improvement in the personalization of news. He mentions the BBC’s dashboard tool – a way to pull content from all over the BBC’s website to suit your interests and tastes. He is also concerned about the tension with agenda setting: “who is the curator of the kind of news you are interested in?” This also brings to mind Cass Sunstein’s polarization critique of the internet, especially for news delivered online – that we will only seek out news that fundamentally agrees with our own opinions and create echo chambers in which we never hear opposing thinking and thus open discourse and debate becomes stultified. He seems to see the future as communication within communities and he frames the problem as finding the right community and getting them involved in an effective way.

Crossposted at I&D Blog

Reducing Election Violence Cheaply – eVoting?

I can’t help but notice the violence surrounding the recent elections in Kenya, Pakistan, Zimbabwe (where I still have family) and many other places. To the extent that the problem is citizen mistrust of the voting process, this seems like an effective place to direct aid resources and energy. Why not fund, with the host country’s cooperation, open source election machines similar to those used in Australia? The Australian approach allows people to inspect the machine’s software if unsatisfied about the machine’s ability to count votes. Each machine is linked to a server via a secure local network so that information is not transmitted openly and a printout of the vote could be made and deposited in a ballot box to verify the electronic results if necessary.

Ethan Zuckerman suggested to me that one way to potentially keep the cost low would be to use SMS and have the machine send back periodic vote tallies throughout the voting period. This way there is no need to set up network infrastructure, since a cellphone system capable of handling this kind of traffic already exists across most countries. Secure SMS is an available technology, and it might be straightforward to ensure secure transmission of vote tallies. The average cost of a voting machine in the US is $3,000, and the Australian ones cost about $750 each. Australia used 80 machines for its capital territory around Canberra, which has about 325,000 people – approximately 4,000 people per machine. So in Zimbabwe, for example, with a population of about 13 million, that ratio implies roughly 3,250 machines. Even at as much as $3,000 per machine, that’s still under $10 million. I expect that in many of the countries that would benefit from such a system, including Zimbabwe, deployment would cover more rural areas than Canberra and more machines would be necessary, but this back-of-the-envelope sketch makes it seem reasonably inexpensive and technically feasible.
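The back-of-the-envelope arithmetic can be sketched directly, using the figures above (Canberra’s roughly 4,000 people per machine, and the high-end US price of $3,000; these are rough planning numbers, not vendor quotes):

```python
# Scale a hypothetical eVoting rollout from Canberra's deployment:
# ~325,000 people served by 80 machines, i.e. roughly 4,000 people per machine.
def machines_needed(population, people_per_machine=4_000):
    # Ceiling division: a fractional machine still means one more physical unit.
    return -(-population // people_per_machine)

def rollout_cost(population, price_per_machine=3_000):
    # Use the high-end US price of $3,000; Australian machines ran ~$750.
    return machines_needed(population) * price_per_machine

# Example: a country of 10 million people at Canberra's density of coverage.
machines = machines_needed(10_000_000)  # 2,500 machines
dollars = rollout_cost(10_000_000)      # $7.5 million at the US price
```

Denser or more rural deployments would shift the people-per-machine ratio, which is the number this whole estimate hinges on.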

Of course, this will only quell violence in so far as it is based in the perception of an unfair voting system. If the violence is thuggery bent on subverting fair electoral results, or garnering attention, then voting machines won’t stop it, although the transparency of this system might make it harder to promulgate an inflammatory mindset of corruption.

Crossposted on I&D Blog