Archive for the 'Statistics' Category

The nature of science in 2051

Right now scientific questions are chosen for study in a largely autocratic way. Grants for research on particular questions typically come from federal funding agencies; scientists apply competitively, and the money goes to the chosen researchers via a peer review process.

I suspect that, as the tools of online science become increasingly available, the real questions people face in their day-to-day lives will be more readily answered. If you think about all the things you do and decisions you make in a day, many of them don't have a strong empirical basis. How you wash the dishes or do laundry, what foods are healthy, what environment to maintain in your house, what common illness remedies work best: these types of questions, the ones that occur to you as you go about your daily business, aren't prioritized in the investigatory model we have now for science. I predict that scientific investigation as a whole, not just the part that is government funded, will move substantially toward providing answers to questions of local importance.

Generalize clinicaltrials.gov and register research hypotheses before analysis

Stanley Young, Director of Bioinformatics at the National Institute of Statistical Sciences, gave a talk in 2009 on problems in modern scientific research: for example, only about 1 in 20 NIH-funded studies actually replicates, data are closed and methods opaque, models are selected for significance, and multiple comparisons go uncorrected. Here is the link to his talk: Everything Is Dangerous: A Controversy. There are a number of good examples in the talk, and Young anticipates the arguments in, and is more intellectually coherent than, the New Yorker article The Truth Wears Off, if you were interested in that.

Idea: generalize clinicaltrials.gov, where scientists register their hypotheses prior to carrying out their experiments. Why not do this for all hypothesis tests? Have a site where hypotheses are logged and time-stamped before researchers gather the data or carry out the actual hypothesis tests for the project. I've heard this idea mentioned occasionally, and both Young and Lehrer mention it as well.
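To make the idea concrete, here is a minimal sketch (purely hypothetical, not an existing service) of what a registry entry could record: a cryptographic hash of the pre-specified hypothesis plus a timestamp, created before any data are collected, so the registered hypothesis can later be verified.

```python
# Hypothetical sketch of a hypothesis-registry entry: commit to the exact
# hypothesis text with a hash and a timestamp before data collection begins.
import hashlib
import json
from datetime import datetime, timezone

def register_hypothesis(text: str, investigator: str) -> dict:
    """Return a time-stamped record committing to the hypothesis text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "investigator": investigator,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "hypothesis_sha256": digest,  # proves this exact wording existed at this time
    }

# Example entry; the hypothesis wording and name are invented for illustration.
record = register_hypothesis(
    "H1: the treatment group has lower mean systolic blood pressure than control.",
    "J. Researcher",
)
print(json.dumps(record, indent=2))
```

One nice property of registering only a hash is that researchers could commit to a hypothesis without revealing it until the analysis is complete.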

My Symposium at the AAAS Annual Meeting: The Digitization of Science

Yesterday I held a symposium at the AAAS Annual Meeting in Washington DC, called “The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer,” that was intended to bring attention to how massive computation is changing the practice of science, particularly the lack of reproducibility of published computational scientific results. The fact is, most computational scientific results published today are unverified and unverifiable. I’ve created a page for the event here, with links to slide decks and abstracts. I couldn’t have asked for a better symposium, thanks to the wonderful speakers.

The first speaker was Keith A. Baggerly, who (now famously) tried to verify published results in Nature Medicine and uncovered a series of errors that led to the termination of clinical trials at Duke that were based on the original findings, and to the resignation of one of the investigators (his slides). I then spoke about policies for realigning the IP framework scientists operate under with their longstanding norms, to permit sharing of code and data (my slides). Fernando Perez described how computational scientists can learn about code sharing, quality control, and project management from the open source software community, which has in fact developed what is in effect a deeply successful system of peer review for code. Code is verified line by line before being incorporated into the project, and there are software tools to enable communication between reviewer and submitter, down to the line of code (his slides).

Michael Reich then presented GenePattern, an OS-independent tool developed with Microsoft for creating data analysis pipelines and incorporating them into a Word document. Once in the document, tools exist to click and recreate the figure from the pipeline and examine what's been done to the data. Robert Gentleman advocated the entire research paper as the unit of reproducibility, and David Donoho presented a method for assigning a unique identifier to figures within the paper, which creates a link for each figure and permits its independent reproduction (the slides). The final speaker was Mark Liberman, who showed how the human language technology community has developed a system of open data and code in its efforts to reduce errors in machine understanding of language (his slides). All the talks pushed on delineations of science from non-science, and the theme was probably best encapsulated by a quote Mark introduced from John Pierce, a Bell Labs executive, in 1969: "To sell suckers, one uses deceit and offers glamor."

There was some informal feedback, with a prominent person saying that this session was “one of the most amazing set of presentations I have attended in recent memory.” Have a look at all the slides and abstracts, including links and extended abstracts.

Update: Here are some other blog posts on the symposium: Mark Liberman’s blog and Fernando Perez’s blog.

Startups Awash in Data: Quantitative Thinkers Needed

Unix logs everything, which makes web-based data collection easy; in fact, it is almost difficult not to do. As a result, internet startups often find themselves gathering enormous amounts of data: site use patterns, click-streams, user demographics and preference functions, purchase histories… Many of these companies know they are sitting on a goldmine, but how do they extract the relevant information from these scads of data? More precisely, how do they predict user behavior and preferences better?

Statisticians, particularly through machine learning, have been working on this problem for a long time. Since arriving in New York City from Silicon Valley I've observed an enormous amount of quantitative talent here, at least in part due to the influence of the finance industry. But these quantitative skills are precisely what's needed to make sense of the data collected by startups, and here it looks like NYC has an edge over Silicon Valley. Friends Evan Korth, Hilary Mason, and Chris Wiggins (two professors and a former professor) are building bridges to connect these two worlds. Their primary effort, HackNY, is a summer program linking students with quantitative talent to startups in need. (Wiggins' mantra is to "get the kids off the street" by giving them alternatives to entering the finance profession.)

The New York startup scene is distinguishing itself from Silicon Valley by efforts to make direct use of the abundance of quantitative skills available here. Hilary and Chris created an excellent guideline for data-driven analysis in the startup context, "A Taxonomy of Data Science": Obtain, Scrub, Explore, Model, and iNterpret. These data often measure phenomena in new ways, use novel data structures, and provide new opportunities for innovative data research and model building. Lots of data, lots of skill – great for statisticians and folks with an interest in learning from data, as well as for those collecting the data.
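To make the taxonomy concrete, here is a minimal sketch of one Obtain-Scrub-Explore-Model-iNterpret pass over click-stream data using pandas and scikit-learn; the file name and column names ("clicks", "visits", "purchased") are hypothetical placeholders, not any particular startup's schema.

```python
# Minimal OSEMN sketch: Obtain, Scrub, Explore, Model, iNterpret.
# File name and columns are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Obtain: load the raw click-stream export.
df = pd.read_csv("user_events.csv")

# Scrub: drop rows with missing values and obvious bot traffic.
df = df.dropna()
df = df[df["clicks"] < 10_000]

# Explore: simple summaries before any modeling.
print(df.describe())
print(df.groupby("purchased")["clicks"].mean())

# Model: predict purchase from usage features.
X, y = df[["clicks", "visits"]], df["purchased"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# iNterpret: held-out accuracy and coefficients as a sanity check.
print("test accuracy:", model.score(X_test, y_test))
print(dict(zip(X.columns, model.coef_[0])))
```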

Open Data Dead on Arrival

In 1984 Karl Popper wrote a private letter to an inquirer he didn’t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that:

“Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his fellow men (or ‘to society’) to represent the results of his study as simply, clearly and modestly as he can. The worst thing that intellectuals can do — the cardinal sin — is to try to set themselves up as great prophets vis-a-vis their fellow men and to impress them with puzzling philosophies. Anyone who cannot speak simply and clearly should say nothing and continue to work until he can do so.”

Aside from the offensive sexism in referring to intellectuals as males, there is another way this imperative should be updated for intellectualism today. The movement to make data available online is picking up momentum, as it should, and open code is following suit (see http://mloss.org for example). But data should not be confused with facts, and applying the simple communication Popper refers to beyond the written or spoken word is the only way open data will produce dividends. It isn't enough to post raw data or undocumented code. Data and code should be considered part of intellectual communication, and made as simple as possible for "fellow men" to understand. Just as knowledge of adequate English vocabulary is assumed in the nonquantitative communication Popper refers to, certain basic coding and data knowledge can be assumed as well. This means the same thing as it does in the literary case: the elimination of extraneous information and obfuscating terminology. There is no need to bury interested parties in an Enron-like shower of bits. It also means using a format for digital communication that is conducive to reuse, such as a flat text file or another non-proprietary format; PDF files, for example, cannot be considered acceptable for either data or code. Facilitating reproducibility must be the gold standard for data and code release.
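As a small illustration of what a reuse-friendly release might look like (the variable names and values here are invented), the data go in a flat CSV and the explanation goes in a plain-text README rather than a proprietary or PDF container:

```python
# Sketch: release data as a flat, non-proprietary CSV plus a plain-text README.
# Column names and values are hypothetical placeholders.
import csv

rows = [
    {"station_id": "A01", "year": 2008, "mean_temp_c": 11.2},
    {"station_id": "A02", "year": 2008, "mean_temp_c": 9.7},
]

with open("temperatures.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["station_id", "year", "mean_temp_c"])
    writer.writeheader()
    writer.writerows(rows)

with open("README.txt", "w") as f:
    f.write(
        "temperatures.csv: annual mean temperature (degrees C) per station.\n"
        "Columns: station_id, year, mean_temp_c. Missing values are left blank.\n"
        "Generated by process_raw.py (hypothetical script name; see repository).\n"
    )
```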

And who are these “fellow men”?

Well, fellow men and women that is, but back to the issue. Much of the history of scientific communication has dealt with demarcating the appropriate group to whom the reasoning behind findings would be communicated: the definition of the scientific community. Clearly, communication of very technical and specialized results to a layman would take intellectuals' time away from doing what they do best, being intellectual. On the other hand, some investment in explanation is essential for establishing a finding as an accepted fact, assuring others that sufficient error has been controlled for and eliminated in the process of scientific discovery. These others ought to be able to verify results, find mistakes, and hopefully build on the results (or the gaps in the theory) and thereby further our understanding. So there is a tradeoff. Hence the establishment of the Royal Society, for example, as a body with the primary purpose of discussing scientific experiments and results. Couple this with Newton's surprise, even irritation, at having to explain results he put forth to the Society in his one and only paper in their journal Philosophical Transactions (he called the various clarifications tedious, sought to withdraw from the Royal Society, and never published another journal paper; see the last chapter of The Access Principle). There is a mini-revolution underfoot that has escaped the spotlight of attention on open data, open code, and open scientific literature: the fact that the intent is to open to the public. Not open to peers, or appropriately vetted scientists, or selected ivory tower mates, but to anyone. Never before has the standard for communication been "everyone"; in fact, quite the opposite. Efforts had traditionally been expended narrowing and selecting the community privileged enough to participate in scientific discourse.

So what does public openness mean for science?

Recall the leaked files from the University of East Anglia's Climatic Research Unit last November. Much of the information revealed concerned scientifically suspect (and ethically dubious) attempts not to reveal the data and methods underlying published results. Although that tack seems to have softened now, some initial responses defended the climate scientists' right to be closed with regard to their methods due to the possibility of "denial of service attacks": the ripping apart of methodology (recall all science is wrong, an asymptotic progression toward truth at best) not with the intent of finding meaningful errors that halt the acceptance of findings as facts, but merely to tie up the climate scientists so they cannot attend to real research. This is the same tradeoff described above. An interpretation of this situation cannot be made without the complicating realization that peer review, the process that vets articles for publication, doesn't check computational results but largely operates as if papers were expounding results from the pre-computational scientific age. The outcome, if computational methodologies are allowed to remain closed from view, is that they are directly vetted nowhere. Hardly an acceptable basis for establishing facts. My own view is that data and code must be communicated publicly with attention paid to Popper's admonition: as simply and clearly as possible, such that the results can be replicated. Not participating in dialog with those insufficiently knowledgeable to engage will remain part of our scientific norms; in fact this is enshrined in the structure of our scientific societies of old. Others can take up those ends of the discussion, on blogs and in digital forums. But public openness is important not just because taxpayers have a right to what they paid for (perhaps they do, but this quickly falls apart since not all of the public are technically taxpayers, and that seems a wholly unjust way of deciding who shall have access to scientific knowledge and who shall not; clearly we mean society), but because of the increasing inclusiveness of the scientific endeavor. How do we determine who is qualified to find errors in our scientific work? We don't. Real problems will get noticed regardless of with whom they originate, many eyes making all bugs shallow. And I expect peer review for journal publishing to incorporate computational evaluation as well.

Where does this leave all the open data?

Unused, unless efforts are expended to communicate the meaning of the data and to maximize the usability of the code. Data is not synonymous with facts: the methods for understanding the data, and for turning its contents into facts, are embedded in the documentation and code. Take for granted that users understand the coding language or basic scientific computing functions, but clearly and modestly explain the novel contributions. Facilitate reproducibility. Without this, the data may be open, but will remain de facto in the ivory tower.

Ars technica article on reproducibility in science

John Timmer wrote an excellent article called “Keeping computers from ending science’s reproducibility.” I’m quoted in it. Here’s an excellent follow up blog post by Grant Jacobs, “Reproducible Research and computational biology.”

My answer to the Edge Annual Question 2010: How is the Internet Changing the Way You Think?

At the end of every year editors at my favorite website The Edge ask intellectuals to answer a thought-provoking question. This year it was “How is the internet changing the way you think?” My answer is posted here:
http://www.edge.org/q2010/q10_15.html#stodden

Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the second wave of the OSTP's call, as posted here: http://www.ostp.gov/galleries/default-file/RFI%20Final%20for%20FR.pdf. The first wave, with comments posted here and on the OSTP site here (scroll to the second-to-last comment), asked for feedback on implementation issues. The second wave requests input on Features and Technology, and Chris Wiggins and I posted the following comments:

We address each of the questions for phase two of OSTP's forum on public access in turn. The answers generally depend on the community involved and (particularly for question 7, asking for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial, however, in (i) providing a centralized repository for access to agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).

Continue reading ‘Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

The Climate Modeling Leak: Code and Data Generating Published Results Must be Open and Facilitate Reproducibility

On November 20 documents including emails and code spanning more than a decade were leaked from the Climatic Research Unit (CRU) at the University of East Anglia in the UK.

The Leak Reveals a Failure of Reproducibility of Computational Results

It appears as though the leak came about through a long battle to get the CRU scientists to reveal the code and data associated with published results, and highlights a crack in the scientific method as practiced in computational science. Publishing standards have not yet adapted to the relatively new computational methods used pervasively across scientific research today.

Other branches of science have long-established methods to bring reproducibility into their practice. Deductive or mathematical results are published only with proofs, and there are long established standards for an acceptable proof. Empirical science contains clear mechanisms for communication of methods with the goal of facilitation of replication. Computational methods are a relatively new addition to a scientist’s toolkit, and the scientific community is only just establishing similar standards for verification and reproducibility in this new context. Peer review and journal publishing have generally not yet adapted to the use of computational methods and still operate as suitable for the deductive or empirical branches, creating a growing credibility gap in computational science.

Verifying Computational Results without Clear Communication of the Steps Taken is Near-Impossible

The frequent near-impossibility of verifying computational results when reproducibility is not considered a research goal is shown by the miserable travails of "Harry," a CRU employee with access to their systems who was trying to reproduce the temperature results. The leaked documents contain logs of his unsuccessful attempts. It seems reasonable to conclude that CRU's published results aren't reproducible if Harry, an insider, was unable to reproduce them after four years.

This example also illustrates why leaving reproducibility to others, beyond a cursory description of methods in the published text, is wholly inadequate for computational science. Harry seems to have had access to the data and code used, and he still couldn't replicate the results. The merging and preprocessing of data in preparation for modeling and estimation can encompass a very large number of steps, and a change in any one of them could produce different results. Likewise, when fitting models or running simulations, parameter settings and function invocation sequences must be communicated, because the final results are the culmination of many decisions; without this information a would-be replicator must guess each small step and match the original work exactly – a Herculean task. Responding with raw data when questioned about computational results is merely a canard, not intended to seriously facilitate reproducibility.
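A minimal sketch of the kind of record that could accompany a computational result so that each step can be rerun rather than guessed; the parameter names and script names below are invented for illustration, not CRU's actual pipeline.

```python
# Sketch: save the exact parameters, random seed, software versions, and
# invocation order alongside a result. All names here are hypothetical.
import json
import platform
import random

import numpy as np

params = {
    "smoothing_window_years": 5,
    "baseline_period": [1961, 1990],
    "drop_station_if_missing_frac_above": 0.2,
    "random_seed": 20091120,
}

random.seed(params["random_seed"])
np.random.seed(params["random_seed"])

manifest = {
    "parameters": params,
    "python_version": platform.python_version(),
    "numpy_version": np.__version__,
    "invocation_order": ["merge_raw.py", "gridding.py", "anomaly_series.py"],  # hypothetical scripts
}

with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```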

The story of Penn State professor of meteorology Michael Mann's famous hockey stick temperature time series estimates is an example where lack of verifiability had important consequences. Release of the code and data used to generate the results in the hockey stick paper would likely have avoided the convening of panels to assess the papers. The hockey stick is a dramatic illustration of global warming and became something of a logo for the U.N.'s Intergovernmental Panel on Climate Change (IPCC). Mann was an author of the 2001 IPCC Assessment Report, and was a lead author on the "Copenhagen Diagnosis," a report released Nov 24 and intended to synthesize the hundreds of research papers about human-induced climate change that have been published since the last IPCC assessment two years ago. The report was prepared in advance of the Copenhagen climate summit scheduled for Dec 7-18. Emails between CRU researchers and Mann are included in the leak, which happened right before the release of the Copenhagen Diagnosis (a quick search of the leaked emails for "Mann" produced 489 matches).

These reports are important in part because of their impact on policy. As CBS News reports, "In global warming circles, the CRU wields outsize influence: it claims the world's largest temperature data set, and its work and mathematical models were incorporated into the United Nations Intergovernmental Panel on Climate Change's 2007 report. That report, in turn, is what the Environmental Protection Agency acknowledged it 'relies on most heavily' when concluding that carbon dioxide emissions endanger public health and should be regulated."

Discussions of Appropriate Level of Code and Data Disclosure on RealClimate.org, Before and After the CRU Leak

For years researchers had requested the data and programs used to produce Mann's hockey stick result, and the requests were resisted. The repeated requests for code and data culminated in Freedom of Information (FOI) requests, in particular those made by Willis Eschenbach, who tells the story of the requests he made for underlying code and data up until the time of the leak. It appears that a file, FOI2009.zip, was placed on CRU's FTP server and then comments alerting people to its existence were posted on several key blogs.

The thinking regarding disclosure of code and data in one part of the climate change community is illustrated in this fascinating discussion on the blog RealClimate.org in February. (Thank you to Michael Nielsen for the pointer.) RealClimate.org has five primary authors, one of whom is Michael Mann; its principal author is Gavin Schmidt. In the RealClimate blog post from November 27, Where's the Data, the position now seems to be very much in favor of data release, but the first comment asks for the steps taken in reconstructing the results as well. This is right – reproducibility of results should be the concern (as argued here, for example).

Policy and Public Relations

The Hill's Blog Briefing Room reported that Senator Inhofe (R-Okla.) will investigate whether the IPCC "cooked the science to make this thing look as if the science was settled, when all the time of course we knew it was not." With the current emphasis on evidence-based policy making, Inhofe's review should recommend code and data release and require reliance on verified scientific results in policy making. The Federal Research Public Access Act should be modified to include reproducibility requirements for publicly funded research.

A dangerous ramification of the leak could be an undermining of public confidence in science and the conduct of scientists. My sense is that making code and data readily available, in a way that facilitates reproducibility of results, can help keep the focus on the real science rather than on distractions such as potential evasions of FOIA requests, whether or not data were fudged, or whether scientists acted improperly in squelching dissent or manipulating journal editorial boards. Perhaps data release is becoming an accepted norm, but code release for reproducibility must follow. The issue here is verification and reproducibility, without which it is all but impossible to tell whether the core science done at CRU was correct or not, even for peer reviewing scientists.

My Interview with ITConversations on Reproducible Research

On September 30, I was interviewed by Jon Udell from ITConversations.org in his Interviews with Innovators series, on Reproducibility of Computational Science.

Here’s the blurb: “If you’re a writer, a musician, or an artist, you can use Creative Commons licenses to share your digital works. But how can scientists license their work for sharing? In this conversation, Victoria Stodden — a fellow with Science Commons — explains to host Jon Udell why scientific output is different and how Science Commons aims to help scientists share it freely.”

Optimal Information Disclosure Levels: Data.gov and "Taleb's Criticism"

I was listening to the audio recording of last Friday's "Scientific Data for Evidence Based Policy and Decision Making" symposium at the National Academies, and was struck by the earnest effort on the part of members of the White House staff to release governmental data to the public. Beth Noveck, Obama's Deputy Chief Technology Officer for Open Government, frames the effort with a slogan: "Transparency, Participation, and Collaboration." A plan is being developed by the White House in collaboration with the OMB to implement these three principles via a "massive release of data in open, downloadable, accessible for machine readable formats, across all agencies, not only in the White House," says Beth. "At the heart of this commitment to transparency is a commitment to open data and open information."

Vivek Kundra, Chief Information Officer in the White House's Open Government Initiative, was even more explicit, saying that "the dream here is that you have a grad student, sifting through these datasets at 3 in the morning, who finds, at the intersection of multiple datasets, insight that we may not have seen, or develops a solution that we may not have thought of."

This is an extraordinary vision. The discussion comes hot on the heels of a debate in Congress regarding the level of information members are willing to release to the public in advance of voting on a bill. Last Wednesday CBS reported, with regard to the health care bill, that "[t]he Senate Finance Committee considered for two hours today a Republican amendment — which was ultimately rejected — that would have required the 'legislative' language of the committee's final bill, along with a cost estimate for the bill, to be posted online for 72 hours before the committee voted on it. Instead, the committee passed a similar amendment, offered by Committee Chair Max Baucus (D-Mont.), to put online the 'conceptual' or 'plain' language of the bill, along with the cost estimate." What is remarkable is the sense this gives that somehow the public won't understand the raw text of the bill (I noticed no compromise position offered that would make both versions available, which seems an obvious solution).

The White House's efforts have the potential to test this hypothesis: if given more information, will people pull things out of context and promulgate misinformation? The White House is betting that they won't, and Kundra states that the White House is accompanying dataset release with efforts to provide contextual metadata for each dataset while safeguarding national security and individual privacy rights.

This sense of limits on openness isn't unique to governmental issues; in my research on data and code sharing among scientists I've termed the concern "Taleb's criticism." In a 2008 essay on The Edge website, Taleb worries about the dangers that can result from people using statistical methodology without a clear understanding of the techniques. An example of concern about Taleb's criticism appeared on UCSF's EVA website, a repository of programs for automatic protein structure prediction. The UCSF researchers won't release their code publicly because, as stated on their website, "We are seriously concerned about the 'negative' aspect of the freedom of the Web being that any newcomer can spend a day and hack out a program that predicts 3D structure, put it on the web, and it will be used." As the congressmen seemed to fear, for these folks openness is scary because people may misuse the information.

It could be argued, and for scientific research should be argued, that an open dialog of an idea’s merits is preferable to no dialog at all, and misinformation can be countered and exposed. Justice Brandeis famously elucidated this point in Whitney v. California (1927), writing that “If there be time to expose through discussion the falsehood and fallacies, to avert the evil by the processes of education, the remedy to be applied is more speech, not enforced silence.” Data.gov is an experiment in context and may bolster trust in the public release of complex information. Speaking of the Data.gov project, Noveck explained that “the notion of making complex information more accessible to people and to make greater sense of that complex information was really at the heart.” This is a very bold move and it will be fascinating to see the outcome.

Crossposted on Yale Law School’s Information Society Project blog.

What's New at Science Foo Camp 2009

SciFoo is a wonderful annual gathering of thinkers about science. It’s an unconference and people who choose to speak do so. Here’s my reaction to a couple of these talks.

In Pete Worden's discussion of modeling future climate change, I wondered about the reliability of simulation results. Worden conceded that several models make the same kinds of predictions he showed, and that they can give wildly opposing results. We need to develop the machinery to quantify error in simulation models just as we routinely do in conventional statistical modeling: simulation is often the only empirical tool we have for guiding policy responses to some of our most pressing issues.
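As a toy illustration of what that machinery could look like (the "model" below is a stand-in with made-up parameters, not any real climate model), one can run an ensemble of simulations over an uncertain parameter and report the spread of outcomes rather than a single trajectory:

```python
# Toy ensemble sketch: perturb an uncertain parameter, rerun the simulation,
# and summarize the spread of outcomes. The model and numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

def toy_simulation(sensitivity: float, years: int = 100) -> float:
    """Stand-in model: projected change after `years` given a sensitivity parameter."""
    forcing = 0.03  # arbitrary units per year
    return sensitivity * forcing * years

# Ensemble over the uncertain sensitivity parameter.
sensitivities = rng.normal(loc=1.0, scale=0.3, size=1000)
outcomes = np.array([toy_simulation(s) for s in sensitivities])

print(f"median projected change: {np.median(outcomes):.2f}")
print(f"90% interval: [{np.percentile(outcomes, 5):.2f}, {np.percentile(outcomes, 95):.2f}]")
```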

But the newest idea I heard was Bob Metcalfe's call for us to imagine what to do with a coming overabundance of energy. Metcalfe likened solving energy scarcity to the early days of Internet development: because of the generative design of Internet technology, we now have things that were unimagined in the early discussions, such as YouTube and online video. According to Metcalfe, we need to envision our future as including a "squanderable abundance" of energy, and use Internet lessons such as standardization and distribution of power sources to get there, rather than building for energy conservation.

Cross posted on The Edge.

Bill Gates to Development Researchers: Create and Share Statistics

I was recently in Doha, Qatar, presenting my research on global communication technology use and democratic tendency at ICTD09. I spoke right before the keynote, Bill Gates, whose main point was that when you engage in a goal-oriented activity, such as development, progress can only be made when you measure the impact of your efforts.

Gates paints a positive picture, measured by deaths before age 5. In the 1880s, he says, about 30% of children died before their fifth birthday in most countries; the annual number of child deaths gradually fell to 20 million in 1960 and then 10 million in 2006. Gates attributes the decline to rising income levels (40% of the decrease) and medical innovation such as vaccines (60% of the decrease).

This is an example of Gates’ mantra: you can only improve what you can measure. For example, an outbreak of measles tells you your vaccine system isn’t functioning. In his example about childhood deaths, he says we are getting somewhere here because we are measuring the value for money spent on the problem.

Gates thinks the wealthy in the world need to be exposed to these problems ideally through intermingling, or since that is unlikely to happen, through statistics and data visualization. Collect data, then communicate it. In short, Gates advocates creating statistics through measuring development efforts, and changing the world by exposing people to these data.

Wolfram|Alpha Demoed at Harvard: Limits on Human Understanding?

Yesterday Stephen Wolfram gave the first demo of Wolfram|Alpha, coming in May, which he modestly describes as a system to make our stock of human knowledge computable. It includes not just facts, but also our algorithmic knowledge. He says, "Given all the methods, models, and equations that have been created from science and analysis – take all that stuff and package it so that we can walk up to a website and ask it a question and have it generate the knowledge that we want. … like interacting with an expert."

It's ambitious, but so are Wolfram's previous projects: Mathematica and MathWorld. I remember relying on MathWorld as a grad student – it was excellent, and I remember when it suddenly disappeared as the content was to be published as a book. In 2002 he published A New Kind of Science, arguing that all processes, including thought, can be viewed as computations, and that a simple set of rules can describe a complex system. This thinking is clearly evident in Wolfram|Alpha, and here are some key examples.
Continue reading ‘Wolfram|Alpha Demoed at Harvard: Limits on Human Understanding?’

Sunstein speaks on extremism

Cass Sunstein, Professor at Harvard Law School, is speaking today on Extremism: Politics and Law. Related to this topic, he is the author of Nudge, Republic.com 2.0, and Infotopia. He discussed Republic.com 2.0 with Henry Farrell on this bloggingheads.tv diavlog, which touches on the theme of extremism in discourse and the web's role in facilitating the polarization of political views (notably, Farrell gives a good counterfactual to Sunstein's claims, and Sunstein ends up agreeing with him).

Sunstein is in the midst of writing a new book on extremism and this talk is a teaser. He gives us a quote from Churchill: "Fanatics are people who can't change their minds and will not change the subject." The political scientist Hardin says he agrees with the first clause epistemologically, but that the second clause is wrong because fanatics *cannot* change the subject. Sunstein says extremism in multiple domains (the White House, company boards, unions) results from group polarization.

He thinks the concept of group polarization should replace the notion of groupthink in all fields. Group polarization involves both information exchange and reputation. His thesis is that like-minded people talking with other like-minded people tend to move to more extreme positions upon discussion – partly because of the new information and partly because of the pressure from peer viewpoints.

His empirical work on this began with his Colorado study. He and his coauthors recorded the private views on three issues (climate change, same-sex marriage, and race-conscious affirmative action) of citizens in Boulder and citizens in Colorado Springs. Boulder is liberal, so they screened people to ensure liberalness: if they liked Cheney they were excused from the study. They asked the same Cheney question in Colorado Springs, and if participants didn't like him they were excused. He then interviewed participants to determine their private views after deliberation, as well as recording the consensus each group reached.
Continue reading ‘Sunstein speaks on extremism’

Legal Barriers to Open Science: my SciFoo talk

I had an amazing time participating at Science Foo Camp this year. This is a unique conference: there are 200 invitees comprising some of the most innovative thinkers about science today. Most are scientists but not all – there are publishers, science reporters, scientific entrepreneurs, writers on science, and so on. I met old friends there and found many amazing new ones.

One thing that I was glad to see was the level of interest in Open Science. Some of the top thinkers in this area were there and I'd guess at least half the participants are highly motivated by this problem. There were sessions on reporting negative results, the future of the scientific method, and reproducibility in science. I organized a session with Michael Nielsen on overcoming barriers in open science. I spoke about the legal barriers, and O'Reilly Media has made the talk available here.

I have papers forthcoming on this topic you can find on my website.

A2K3 Kaltura Award

I am honored and humbled to win the A2K3 Kaltura prize for best paper. Peter Suber posts about it here and gives the abstract. His post also includes a link to a draft of the paper, which can also be found here: Enabling Reproducible Research: Open Licensing For Scientific Innovation. I’d love comments and feedback although please be aware that since the paper is forthcoming in the International Journal of Communications Law and Policy it will very likely undergo changes. Thank you to Kaltura.com and the entire A2K3 committee. I’m very happy to be here in Geneva and enjoying every minute. 🙂

Internet and Cell Phone Use in the Middle East

When people talk about the Internet and Democracy, especially in the context of the Middle East, I wonder just how pervasive the Internet really is in these countries. I made a quick plot for Middle Eastern countries from data I downloaded from the International Telecommunication Union:

The US is the blue line on top, for reference. The UAE is approaching American levels of internet use, and Iran has skyrocketed since 2001 and is now the 3rd or 4th most wired country of those plotted. There seems to be a cluster of countries that, while adopting, are doing so slowly: Saudi Arabia, Egypt, Syria, Oman, the Sudan, and Yemen, although Saudi Arabia and Syria appear to be accelerating since 2005.
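For anyone who wants to reproduce this kind of figure, here is a minimal sketch using pandas and matplotlib; the CSV file name and column layout are assumptions, since the actual ITU export format may differ.

```python
# Sketch: plot internet users per 100 inhabitants over time for several
# countries from an ITU indicator export. File name and column names are
# hypothetical; the real ITU download format may differ.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("itu_internet_users_per_100.csv")  # columns: country, year, users_per_100

countries = ["United States", "United Arab Emirates", "Iran", "Saudi Arabia", "Egypt", "Yemen"]
for country in countries:
    sub = df[df["country"] == country].sort_values("year")
    plt.plot(sub["year"], sub["users_per_100"], label=country)

plt.xlabel("Year")
plt.ylabel("Internet users per 100 inhabitants")
plt.legend()
plt.show()
```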

I made a comparable plot of cellphone use per 100 inhabitants for these same countries, also from data provided by the ITU:

In this graph the United States is in the middle of the pack and growing steadily, but definitely not matching the recent subscription growth rates in the UAE, Qatar, Bahrain, Saudi Arabia, and Oman (data for 2007 for Israel are not yet available). For most countries, cell phone subscriptions are more than three times as prevalent as Internet users. Interestingly, the group of countries with low internet use also has low cell phone use, but unlike for the internet, their cell phone subscription rates all began accelerating in 2005.

So what does this mean? All countries in the Middle East are adopting cell phones more quickly than the internet, with the interesting exception of Iran (I don't know why the growth rate of internet use in Iran is so high; perhaps blogging has caught on more there. Although it doesn't address this question directly, the Iranian blogosphere itself is analyzed in the Berkman Internet & Democracy paper Mapping Iran's Online Public: Politics and Culture in the Persian Blogosphere). Syria, the Sudan, Yemen, and Iran have grown most quickly in both internet use and cell phone subscriptions. In six countries there is more than one cell phone subscription per person; conversely, the highest rate of internet use (other than the US) is 50% in the UAE, with the other countries falling into roughly two clusters at about 30% and about 10%. With the rates of growth on the side of cell phones, I doubt we'll see their pervasiveness relative to internet use change in the next few years; in fact, the gap will probably widen.

Cross-posted at I&D Blog

John Kelly: Parsing the Political Blogosphere

John Kelly is a doctoral student at Columbia's School of Communications and a startup founder (Morningside Analytics), and he also does collaborative work with Berkman. He's speaking at Berkman's Media Re:public Forum.

Kelly says he takes an ecosystem approach to studying the blogosphere: he objects to dividing research on society into cases and variables because society is an interconnected whole. This isn't right: there are basic statistical methods that use variables and cases and are designed specifically to take interconnections into account. What he is doing with the research he presents today is using a graphical tool to present descriptions of the blogosphere.

Kelly shows a map of the entire blogosphere and of the outlinks from the blogosphere. Every dot is a blog, and any blogs that link to each other are pulled together, so the map itself looks like clusters and neighborhoods of blogs. The plot appears somewhat clustered, but there is an enormous amount of interlinking (my apologies for not posting pictures – I don't think this talk is online). The outlinks map shows links from blogs to other sites: the New York Times is most frequently linked to and is thus the largest dot on the outlinks map.
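The intuition behind such a map is a force-directed (spring) layout over the blog-to-blog link graph, which pulls linked blogs together into visible neighborhoods. Here is a toy sketch with an invented edge list; this is not Kelly's tool or data.

```python
# Sketch: a tiny blog link graph drawn with a force-directed (spring) layout,
# which pulls linked nodes together into clusters. The edge list is made up.
import matplotlib.pyplot as plt
import networkx as nx

edges = [
    ("blogA", "blogB"), ("blogB", "blogC"), ("blogC", "blogA"),  # one neighborhood
    ("blogD", "blogE"), ("blogE", "blogF"),                      # another neighborhood
    ("blogC", "blogD"),                                          # a bridge between them
]

G = nx.DiGraph(edges)
pos = nx.spring_layout(G, seed=42)  # force-directed layout
nx.draw_networkx(G, pos, node_size=300, arrowsize=10)
plt.axis("off")
plt.show()
```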

Kelly compares maps for five different language blogospheres: English, Persian, Russian, Arabic, and the Scandinavian languages. Russian has very separate clusters, and the other languages get progressively more interconnected. In the Persian example, Kelly has found distinct clusters of expat bloggers, poetry bloggers, and religiously conservative bloggers concerned about the 12th Imam, as well as clusters of modern and moderately traditional religious and political bloggers. Kelly suggests this is a more disparate and discourse-oriented picture than we might have thought.

In the American blogosphere, Kelly notes that bloggers tend to link overwhelmingly to other blogs that are philosophically aligned with their own. He shows an interesting plot of the Obama, Clinton, and McCain blogospheres' linking patterns to other sites such as think tanks and particular YouTube videos.

Kelly also maps a URL's salience: mainstream media articles peak quickly and are sometimes overtaken by responses, but Wikipedia articles keep getting consistent hits over time.

The last plot he shows is a great one of the blogs of the people attending this conference (and their organizations): there are five big dots representing how much people have blogged about them, and the five big dots are mainstream media sites. Filtering those out leaves GlobalVoices as the blog people mainly link to.

Crossposted on I&D Blog

Book Review: "Development as Freedom" by Amartya Sen

What is a developed country? According to Sen, development should be measured by how much freedom a country has since without freedom people cannot make the choices that allow them to help themselves and others. He defines freedom as an interdependent bundle of:

1) political freedom and civil rights,
2) economic freedom including opportunities to get credit,
3) social opportunities: arrangements for health care, education, and other social services,
4) transparency guarantees, by which Sen means interactions with others, including the government, are characterized by a mutual understanding of what is offered and what to expect,
5) protective security, in which Sen includes unemployment benefits, famine and emergency relief, and general safety nets.

Respect for Local Decisions

By defining the level of development by how much freedom a country has, Sen largely sidesteps a value judgment of what specifically it means to be a developed country – this isn't the usual laundry list of Western institutions. It's a bold statement – he gives the example of a hypothetical community deciding whether to disband its current traditions and increase lifespans. Sen states he would leave it up to the community, and if they decide on shorter lifespans, in the full-freedom environment he imagines, this is perfectly consistent with the action of a fully developed country (although Sen doesn't think anyone should have to choose between life and death – this is the reason for freedom 3). This is also an example of the inherent interrelatedness of Sen's five freedoms – the community requires political freedom to discuss the issues, come to a conclusion, and have it seen as legitimate, along with the social opportunities and education needed for people to engage in such a discussion.

Crucial Interrelatedness of the Freedoms

Sen is quite adamant that these five freedoms be implemented together, and he makes an explicit case against the "Lee Thesis" – that economic growth must be secured in a developing country before other rights (such as political and civil rights) are granted. This is an important question among developing countries that see Singapore's success as the model to follow. Sen notes that it is an unsettled empirical question whether or not authoritarian regimes produce greater economic growth, but he argues two points: that people's welfare can be addressed best through a more democratic system (for which he sees education, health, and security as requisite), since people are able to bring their needs to the fore; and that democratic accountability provides incentives for leaders to deal with issues of broad impact such as famines or natural catastrophes. His main example of the second point is that there has never been a famine under a democratic regime – it is not clear to me that this isn't due to reasons other than the incentives of elected leaders (such as greater economic liberty), but whether or not there is a correlation is something the data can tell. Sen notes that democracies provide protective security and transparency (freedoms four and five), and that these are mechanisms through which to avert things like the Asian currency crisis of 1997. Democratic governments also have issues with transparency, but this seems to me an example of how democracy avoids really bad decisions even though it might not make the optimal choices. Danny Hillis explained why this is the case in his article How Democracy Works.

Choosing not to Choose (Revisited)

Sen reasons that since no tradition of suppressing individual communication exists, this freedom is not open to removal via community consensus. Sen also seems to assume that people won't vote away their right to vote. He doesn't deal with this possibility explicitly, but it is what Lee Kuan Yew was afraid of: communists gaining power and being able to implement an authoritarian communist regime. Sen's book was written in 1999 and doesn't mention Islam or development in the Middle Eastern context, so he never grapples with issues like the rise of Shari'a law in developing countries such as Somalia. I blogged about the paradox of voting out democracy in Choosing not to Choose, in the context of the proposed repeal of the ban on headscarves in Turkish universities and the removal of the Union of Islamic Courts (UIC) in Somalia in 2006. I suspect Sen's prescription in Turkey would be to let the local government decide on the legality of headscarves in universities (thus the ban would be repealed), and in Somalia to implement all five forms of freedom and thus explicitly reject an authority like the UIC.

The Internet

Sen doesn’t mention the internet but what is fascinating is that communication technologies are accelerating the adoption of at least some of Sen’s 5 freedoms, particularly where the internet is creating a new mechanism for free speech and political liberty that is nontrivial for governments to control. The internet seems poised to grant such rights directly, and can indirectly bring improvements to positive rights such as education and transparency (see for example MAPLight.org and The Transparent Federal Budget Project). Effective mechanisms for voices to be heard and issues to be raised are implicit in Sen’s analysis.

What Exactly is Sen Suggesting We Measure?

Sen subjects his proposed path to development, immediately maximizing freedoms 1 through 5, to some empirical scrutiny throughout the text, but he doesn't touch on exactly how to measure how far freedom has progressed. He suggests longevity, health care, and education are important factors, and I assume he would include freedom of speech, openness of the media, security, and government corruption metrics, but these are notoriously hard to define and measure (and measuring longevity actually runs counter to Sen's example of the hypothetical community above… but Sen strongly rejects the argument that local culture can permit abridgment of any of his five freedoms, particularly the notion that some cultures are simply suited to authoritarian rule). The World Bank compiles statistical measurements of the rule of law, corruption, freedom of speech, and other factors that get close to some of the components in Sen's definition of freedom. This also opens the question of what is appropriate to measure when defining freedom, and whether it is possible to have meaningful metrics for concepts like the rule of law or democracy.

Sen eschews two common ways of thinking about development: 1) that aid goes to passive recipients and 2) that increasing wealth is the primary means by which development occurs. His motivation seems to come from a deep respect for subjective valuation: the individual’s autonomy and responsibility in decision making.

Crossposted on I&D Blog

Patrick Ball in NYT Magazine

Recently the Berkman Internet and Democracy group hosted a conference on Digital Activism in Istanbul. One of the attendees was Patrick Ball, Chief Scientist and Director of the Human Rights program at Benetech. His work was the focus of a story in the February 17 edition of the New York Times Magazine. The article, called The Forensic Humanitarian, describes the use of probability theory, such as capture-recapture, in the difficult problem of estimating death counts. It summarizes Ball's work on estimating murders in Colombia and whether or not they are decreasing over time, as the official numbers seem to indicate. By examining multiple recordings of the same events and interpreting them as samples from the same population, Ball can estimate the total number of murders, whether reported or not.
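The simplest version of this idea is the two-list Lincoln-Petersen estimator: if two independent lists record n1 and n2 deaths and share m deaths in common, the total (reported and unreported) is estimated as n1*n2/m. Here is a minimal sketch with made-up illustrative numbers, not Ball's data:

```python
# Minimal sketch of two-list capture-recapture (Lincoln-Petersen estimator).
# The numbers below are made up for illustration; they are not Ball's data.

def lincoln_petersen(n1: int, n2: int, overlap: int) -> float:
    """Estimate total population size from two overlapping lists.

    n1, n2  -- number of deaths recorded on each list
    overlap -- number of deaths appearing on both lists
    """
    if overlap == 0:
        raise ValueError("Estimator is undefined when the lists do not overlap.")
    return n1 * n2 / overlap

# Hypothetical example: two NGOs independently document killings in a region.
list_a, list_b, both = 420, 310, 95
estimated_total = lincoln_petersen(list_a, list_b, both)
print(f"Estimated total killings: {estimated_total:.0f}")  # ~1371, versus 635 unique reports
```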

Crossposted on I&D Blog