War Data: Visualization of Afghanistan Hotspots Using Wikileaks Data

How would you take a data set released by Wikileaks and visualize it to see activity in Afghanistan over time? As part of this week’s theme of war data, I present a visualization based on leaked war data.

Mike Dewar, Drew Conway, John Myles White, and Harlan Harris used R code (the scripts, etc., are available on GitHub) to generate the animation below from the Wikileaks Afghanistan War Logs.

(Note: You may access overview data via The Guardian’s DataBlog, whose data journalists provided their own analyses of the leaked data.)


Google Releases World Map of Government Censorship Requests

How would you display a somewhat abstract term like “censorship” to your users and the rest of the world?

Earlier this week, Google released the latest version of their censorship map. Via the BBC: “the new map and tools follows on from that and allows users to click an individual country to see how many removal requests were fully or partially complied with, as well as which Google services were affected.”

A Google employee released this statement on their blog:

Like other technology and communications companies, we regularly receive requests from government agencies around the world to remove content from our services, or provide information about users of our services and products. This map shows the number of requests that we received in six-month blocks with certain limitations.

We’re still learning the best way to collect and present this information. We’ll continue to improve this tool and fine-tune the types of data we display.

You may see the map via Google. (Sorry, I couldn’t embed it into this post.)

The Humanities Take on Data Mining via Google Books

The Humanities are “Going Google”, according to Marc Parry of The Chronicle, in a piece he wrote a few weeks ago.

The gist of the article is that some Humanities scholars are very interested in data mining the texts scanned in for the Google Books Project.

Why do they want to use Big Data mining techniques to scan through entire corpuses of novels from a particular period? “The data are important because scholars can use these macro trends to pinpoint evolutionary mutants like Sir Walter Scott”, one scholar noted.

Some critics rightfully ask, what will this tell us that we don’t already know?

Their answer is that computers won’t destroy interpretation. They’ll ground it in a new type of evidence.

Still, sitting in his darkened office, Mr. Moretti is humble enough to admit those “cellars of culture” could contain nothing but duller, blander, stupider examples of what we already know. He throws up his hands. “It’s an interesting moment of truth for me,” he says.

(I think this is a backhanded critique of “research” in general, so I had a good laugh when I read this paragraph.)

Other takeaways — Google Books was not built for data mining, it was built to create content to sell ads against. It was built with the intention that each book will be read, one at a time, not data mined. The interfaces aren’t there for this kind of mining, and the metadata is poor to say the least. (Then again, metadata is generally inadequate; this problem is so “known” I won’t provide a citation!)

What do you think are the moral, legal, and scholarly implications (if any) of Google turning over thousands of scanned books to a handful of scholarly institutions, such as Stanford, for data mining?

A Short History of Scientific Information Services

In the following videos, the producer traces the history of scientific communication from verbal/in-person, to letters, and then to printed journals. The producer describes the work of ISI and the company’s founder, Eugene Garfield. Journals grew from a handful to thousands. This led to classification and indexing in order to find relevant journal articles via print. In the early 1960s, ISI digitized this indexing and classification system in order to aid in finding the required material. Only a small portion of the literature is actually important enough to be cited often; thus, citation indexing was born.

(For those of you who are unfamiliar with citation indexing, and may be wondering why it is important — among many reasons…the founders of Google applied citation indexing to web links to create PageRank. They were not the first to apply citation indexing to web links, but they were among the first to figure out an entire business model around it by mining and selling the user generated data.)

This is one video that has been split into three parts for ease of viewing online. I found them interesting to watch. The videos were made, as far as I can tell, in the early 1970s, and they are infomercials for ISI. I have embedded the three parts below. I found the first one to be more fun to watch than the latter two. Those, however, are interesting from a recent-information-services-history perspective.

Part 1

Part 2

Part 3

Thanks, L.S., for the links.

Google Books Settlement Paths Forward Diagram Released

Jonathan Band has developed a diagram charting the possible outcomes of the Google Books settlement. Designed by Tricia Donovan, the diagram is called “GBS March Madness: Paths Forward for the Google Books Settlement” and has been released by the American Library Association (ALA), the Association of College and Research Libraries (ACRL), and the Association of Research Libraries (ARL) as a poke at the NCAA’s March Madness basketball tournament. I have posted the full text of the press release below.

Library Copyright Alliance Releases Diagram Charting Many Ways Forward For Google Books Settlement

WASHINGTON DC—The American Library Association (ALA), the Association of Research Libraries (ARL), and the Association of College and Research Libraries (ACRL) announce the release of “GBS March Madness: Paths Forward for the Google Books Settlement.” This diagram, developed by Jonathan Band, explores the many possible routes and outcomes of the Google Books Settlement, including avenues into the litigation and appeals process.

Now that the fairness hearing on the Google Books Settlement has occurred, it is up to Judge Chin to decide whether the amended settlement agreement (ASA), submitted to the Court by Google, the Authors Guild, and the Association of American Publishers, is “fair, reasonable, and adequate.” As the diagram shows, however, Judge Chin’s decision is only the next step in a very complex legal proceeding that could take a dozen more turns before reaching resolution. Despite the complexity of the diagram, it does not reflect every possible twist in the case, nor does it address the substantive reasons why a certain outcome may occur or the impact of Congressional intervention through legislation. As Band states, “the precise way forward is more difficult to predict than the NCAA tournament. And although the next step in the GBS saga may occur this March, many more NCAA tournaments will come and go before the buzzer sounds on this dispute.”

To view the diagram, please visit: http://www.librarycopyrightalliance.org/bm~doc/gbs-march-madness-diagram-final.pdf

I had a good look over the Google Books settlement March Madness diagram. I am fascinated, but not surprised, at how complicated this process will continue to be for all of the parties involved. Overall, I like the layout and design, and the consistent use of color and direction. I think different software might have given it a more professional look, but the developer and designer managed to cram a lot of information into a small amount of space. Since I am neither a lawyer nor a copyright expert, I cannot vet the content. This diagram does not take into account what litigation has already occurred.

If you were to re-create this diagram, keeping the same content, would you make any changes to the Information Design aspects? What do you think of the litigation process going forward?

[Via Brandon Butler at ARL.]

Gaiman’s “MirrorMask” Library Cleverness

This past week I watched Neil Gaiman’s “MirrorMask”. The book-as-film chronicles the dream of a teenager whose mother has become ill and is undergoing surgery. In the scene below, the teenager, Helena, goes to the library with her New Best Friend, Valentine, to find clues to a missing charm. They arrive via flying books (you can see the flying books in the first few seconds of the video). The books fly back to the library if the reader insults them. The two characters insulted some Very Large Books, both to Get Out of a Predicament and to be Taken to the Library to find clues to the missing charm.

I love this scene. The reference librarian is funny, the library has an interesting design, and the fact that the books are alive and that Helena and Valentine have to use nets to “catch” them is cute. I was also amused by the idea of books molting due to depression because they weren’t chosen to be read. I also wished I could have a copy of the Really Useful Book for myself!

I thought Gaiman showed great creativity and fun in creating new ways to store, access, and retrieve written, “analog” information.

Do you have any information retrieval favorites from fiction and/or film?

The Public Domain Manifesto

“The Public Domain Manifesto” has been released by COMMUNIA, the European Thematic Network on the digital public domain.

If you would like to show your support for this cause, after you have read “The Public Domain Manifesto”, you may sign it. You may choose whether or not you would like your signature displayed online. Below, I have copied The Preamble verbatim. The full text of “The Public Domain Manifesto” is available at publicdomainmanifesto.org.

Preamble

“Le livre, comme livre, appartient à l’auteur, mais comme pensée, il appartient—le mot n’est pas trop vaste—au genre humain. Toutes les intelligences y ont droit. Si l’un des deux droits, le droit de l’écrivain et le droit de l’esprit humain, devait être sacrifié, ce serait, certes, le droit de l’écrivain, car l’intérêt public est notre préoccupation unique, et tous, je le déclare, doivent passer avant nous.” (Victor Hugo, Discours d’ouverture du Congrès littéraire international de 1878, 1878)

“Our markets, our democracy, our science, our traditions of free speech, and our art all depend more heavily on a Public Domain of freely available material than they do on the informational material that is covered by property rights. The Public Domain is not some gummy residue left behind when all the good stuff has been covered by property law. The Public Domain is the place we quarry the building blocks of our culture. It is, in fact, the majority of our culture.” (James Boyle, The Public Domain, p.40f, 2008)

The public domain, as we understand it, is the wealth of information that is free from the barriers to access or reuse usually associated with copyright protection, either because it is free from any copyright protection or because the right holders have decided to remove these barriers. It is the basis of our self-understanding as expressed by our shared knowledge and culture. It is the raw material from which new knowledge is derived and new cultural works are created. The Public Domain acts as a protective mechanism that ensures that this raw material is available at its cost of reproduction – close to zero – and that all members of society can build upon it. Having a healthy and thriving Public Domain is essential to the social and economic well-being of our societies. The Public Domain plays a capital role in the fields of education, science, cultural heritage and public sector information. A healthy and thriving Public Domain is one of the prerequisites for ensuring that the principles of Article 27 (1) of the Universal Declaration of Human Rights (‘Everyone has the right freely to participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits.’) can be enjoyed by everyone around the world.

The digital networked information society has brought the issue of the Public Domain to the foreground of copyright discussions. In order to preserve and strengthen the Public Domain we need a robust and up-to-date understanding of the nature and role of this essential resource. This Public Domain Manifesto defines the Public Domain and outlines the necessary principles and guidelines for a healthy Public Domain at the beginning of the 21st century. The Public Domain is considered here in its relation to copyright law, to the exclusion of other intellectual property rights (like patents and trademarks), and where copyright law is to be understood in its broadest sense to include economic and moral rights under copyright and related rights (inclusive of neighboring rights and database rights). In the remainder of this document copyright is therefore used as a catch-all term for these rights. Moreover, the term ‘works’ includes all subject-matter protected by copyright so defined, thus including databases, performances and recordings. Likewise, the term ‘authors’ includes photographers, producers, broadcasters, painters and performers.

So, you may ask, what does this have to do with managing your data? It is about whether or not you have access to the works of others, and whether or not they have access to your work. It is about managing who uses and does not use your data/information/product, as well as how, when, and for how long. Ideally, maintaining a “Public Domain” contributes to the cultural output of a society because the producers receive some copyright protection, but their creative output is not copyrighted in perpetuity. This opens the products of their mind to continued use and reuse by later generations.

[Via Nat T.]

What is Your Digital Fingerprint?

As a result of Data Privacy Day last week, I have spent the past few days poking around online to see what data about myself I could discover that I didn’t know existed. Before I try to tame others’ data, perhaps I should try taming my own?

I searched under various versions of my name. Now, I admit to engaging in “ego searches” before, but I have never gone through every major search engine and examined every page of the results. Most of it was boring, to be honest. I’m just not that interesting. The links were about this conference, that conference, this old CV, some old presentation. However, some other information I found associated with my name was interesting to me, and it was all new (and news) to me.

For example, I discovered that “someone” had taken a programming assignment from a course I had taken 8 years ago and put the homework online on an assignment sharing site. My name and the course number were still on it, and I was able to compare it to the original assignment. I immediately wrote to the company that owned the site, and they did remove the assignment. I also increased security measures on the public html directory provided by my graduate program.

I discovered that someone had stored my master’s paper in a repository in…Argentina. I expected downloads of my master’s paper for personal use. I did not expect it to be stored in a repository without my consent. I found that one to be a bit odd, but I left it alone. I also didn’t realize that Google tracks what I watch on YouTube via iGoogle, if I am logged into my account. I can, however, delete most information about me that Google stores. (Please see the Google Privacy Center for more information on how to view your account-related Google data.)

I also read “‘I’ve Got Nothing to Hide’ and Other Misunderstandings of Privacy” by Daniel J. Solove. In this article, he argues that whether or not you have something to hide isn’t the point. Privacy isn’t about whether or not you have something to hide; it is about what is and isn’t someone else’s business. It is about the balance of power. It is about knowing what the government or a corporation is storing about you, having the right to opt out, and having the right to change any erroneous information. Even if data is anonymized and machine-analyzed, what business is it of the government, corporations, or organizations to hold this data in the first place? Can this information that has been gathered about you without your knowledge or consent be held against you at some future point?

For example, today I learned that the government mandates genetic testing of all newborns. Because it is the law, parents are not required to provide consent before the testing is performed. Did you know that some states will hold copies of your baby’s DNA indefinitely? In Minnesota, the DNA is stored attached to identifying information in the event the child goes missing and/or dies. Some states do allow you to opt out, and will destroy the genetic material upon request. Your DNA is very personal. I don’t mind the testing of babies, but I do mind if the DNA is stored by any organization for an indefinite period. In one instance, a baby tested positive for cystic fibrosis, and this result will be stored in her records with the insurance company, because the cost of the test was covered by insurance. (Note: the parents stated they would have paid out of pocket for the testing, if they’d known about the testing requirement beforehand, in order to avoid this black mark on their child’s health insurance record.) Will this information be held against this child down the road? What if other tests are developed, for manic-depression or other disorders? Will the indefinitely stored results of these tests prevent these babies from getting health insurance or a job in the future?

Michelle G. Hough wrote a fascinating article entitled “Keeping It to Ourselves: Technology, Privacy, and the Loss of Reserve”. The author defined reserve as the “ability to control what information about us is disclosed, and what is not”. She cites a previous study by Sweeney which found, using 1990 census data, that with only the combination of zip code, birth date, and gender, 87% of the U.S. population could be identified. If you combine that data set with a third-party “anonymized” data set that contains related information, you could identify the users in the third-party data set. The conclusion? We need to think and talk more about privacy, reserve, and how much of those we are willing to lose in exchange for the advantages technological innovation brings us.
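To make the linking idea concrete, here is a minimal Python sketch of how a named data set and an “anonymized” one might be joined on zip code, birth date, and gender. Every record, field name, and function name below is invented for illustration; this is not Sweeney’s method or data, only a toy version of the general idea under the assumption that both data sets carry those three fields.

```python
from collections import Counter

# Hypothetical named records (for example, a public roll). All values are invented.
public_roll = [
    {"name": "A. Example", "zip": "27514", "birth": "1970-03-02", "gender": "F"},
    {"name": "B. Sample", "zip": "27514", "birth": "1981-11-19", "gender": "M"},
    {"name": "C. Placeholder", "zip": "27516", "birth": "1970-03-02", "gender": "F"},
    {"name": "D. Invented", "zip": "27516", "birth": "1970-03-02", "gender": "F"},
]

# Hypothetical "anonymized" records: no names, but the same three fields remain.
anonymized = [
    {"zip": "27514", "birth": "1970-03-02", "gender": "F", "sensitive": "value-1"},
    {"zip": "27516", "birth": "1970-03-02", "gender": "F", "sensitive": "value-2"},
]


def quasi_id(record):
    """Return the (zip, birth date, gender) combination for a record."""
    return (record["zip"], record["birth"], record["gender"])


def reidentify(named, unnamed):
    """Pair each unnamed record with a name when its combination is unique in `named`."""
    counts = Counter(quasi_id(r) for r in named)
    unique_names = {quasi_id(r): r["name"] for r in named if counts[quasi_id(r)] == 1}
    return [(unique_names[quasi_id(r)], r) for r in unnamed if quasi_id(r) in unique_names]


def unique_share(named):
    """Fraction of named records whose combination appears exactly once."""
    counts = Counter(quasi_id(r) for r in named)
    return sum(1 for r in named if counts[quasi_id(r)] == 1) / len(named)


matches = reidentify(public_roll, anonymized)
```

In this toy data, only the first “anonymized” row can be tied to a name, because two people in the named list share the second row’s combination. The `unique_share` helper is the toy counterpart of the 87% figure: the more people whose combination is unique, the more rows a join like this can put a name to.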

Anonymized data is not as “anonymous” as one might desire. The Electronic Frontier Foundation estimates that in order to identify one individual in the entire population of the planet, you need only 32.6 bits of information. The organization is conducting an experiment to determine how unique browser configurations are, and whether or not effective online tracking can be accomplished by corporations, organizations, and/or the government. The experiment is a project called Panopticlick. I went to the web site and let the software test my browser configuration.

I learned that I have 19 bits of identifying information in my browser fingerprint. However, my browser fingerprint does appear to be unique among the 572,016 browsers tested so far.
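For readers wondering where figures like these come from, the arithmetic is a base-2 logarithm: singling out one item among N equally likely items takes log2(N) bits. The short Python sketch below shows the calculation. The world population value is my own assumption (roughly 6.5 billion), not a number taken from the EFF, and the function name is made up for this example.

```python
import math


def identifying_bits(group_size):
    """Bits of information needed to single out one member of a group of this size."""
    return math.log2(group_size)


# Assumed world population of about 6.5 billion (an assumption for illustration).
ASSUMED_WORLD_POPULATION = 6_500_000_000

# Number of browsers tested so far, as reported above.
BROWSERS_TESTED = 572_016

bits_for_planet = identifying_bits(ASSUMED_WORLD_POPULATION)
bits_for_sample = identifying_bits(BROWSERS_TESTED)

print(f"One person on the planet: about {bits_for_planet:.1f} bits")
print(f"One browser in the sample: about {bits_for_sample:.1f} bits")
```

Under that population assumption the first result lands close to the 32.6 bits quoted above, and the second comes out at a little over 19 bits. That is why a fingerprint carrying about 19 bits is enough to stand alone in a sample of this size, while still falling well short of what it would take to single out one person on the planet.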

I encourage you to poke around online and check your digital fingerprint. This was a time-consuming exercise for me, but an enlightening one.

Beginning a Series — Reviews of Open Data Sites

I will be reviewing English-language, government-sponsored open data sites as an off-shoot of my doctoral work. I will begin with the “key” government sites compiled by the authors of The Guardian’s DataBlog as one of their inaugural posts.

Last week I reviewed data.gov.uk, so while I may add a bit more detail to my initial review in a second post, I will not completely re-review it. The sites I will review in the upcoming weeks are:

I will also do some searching of my own and see what else I can locate that is an English-language, government-sponsored Open Data web site. However, if any of you know of any sites that I do not have listed above, please do send them to me! (And, “thanks!” in advance.)

So…what do I mean by “review”? I plan to examine the number and types of data sets made available, policies for use and re-use, “other” policies, and the overall “look & feel” and usability of the site(s). I will also discuss “anything else” I find interesting.

Is there something in particular you’d like me to add to my review criteria?

HM Government Opens Up Government Data to the Public

The British Government has released data sets to the public for use in either the public or private sectors at data.gov.uk.

Previously, the governments of the United States, Australia, and New Zealand had created data sites for use by the public, including commercial use. The primary idea behind the release of these data sets is that publicly funded data ought to be made available to the public for free for re-use. The site creators hope that individuals and businesses will use the data creatively to add economic value and generate new services. Sir Tim Berners-Lee and Professor Nigel Shadbolt led the project in the UK.

The Guardian has posted a video interview with Berners-Lee and Shadbolt. Shadbolt gave an example of one re-use of this data by the public: an online route-planning tool that helps cyclists avoid areas where cyclists have the most accidents. Both project leaders discuss how the project developed, why they wanted to put government data online, why the data was released for free, and their hopes for data re-use.

The Open Data Principles the creators state on the site are as follows:

  • Public data will be published in reusable, machine-readable form
  • Public data will be available and easy to find through a single easy to use online access point (http://www.data.gov.uk/)
  • Public data will be published using open standards and following the recommendations of the World Wide Web Consortium
  • Any ‘raw’ dataset will be re-presented in linked data form
  • More public data will be released under an open licence which enables free reuse, including commercial reuse
  • Data underlying the Government’s own websites will be published in reusable form for others to use
  • Personal, classified, commercially sensitive and third-party data will continue to be protected.

Currently, the site is set up for users to run basic searches on just under 150 data sets. There are around 20 applications listed for use. I browsed through the available data sets. The available topics begin with 2008 Injury Road Traffic Collisions in Northern Ireland and end with a Youth Cohort Study & Longitudinal Study of Young People in England.

I look forward to following this project, seeing what data is added, and what re-uses of the data are made. I have not attempted to use any of the data sets, so I cannot report on any success or problems I have had with using them. If you have used or do use any of these data sets or applications, please let me know.

[Thanks, Jennifer M.]