The Humanities Take on Data Mining via Google Books

The Humanities are “Going Google”, according to Marc Parry of The Chronicle, in a piece he wrote a few weeks ago.

The gist of the article is that some Humanities scholars are very interested in data mining the texts scanned in for the Google Books Project.

Why do they want to use Big Data mining techniques to scan through entire corpora of novels from a particular period? “The data are important because scholars can use these macro trends to pinpoint evolutionary mutants like Sir Walter Scott”, one scholar noted.

Some critics rightly ask: what will this tell us that we don’t already know?

The scholars’ answer is that computers won’t destroy interpretation; they’ll ground it in a new type of evidence.

Still, sitting in his darkened office, Mr. Moretti is humble enough to admit those “cellars of culture” could contain nothing but duller, blander, stupider examples of what we already know. He throws up his hands. “It’s an interesting moment of truth for me,” he says.

(I think this is a backhanded critique of “research” in general, so I had a good laugh when I read this paragraph.)

Other takeaways: Google Books was not built for data mining; it was built to create content to sell ads against. It was built with the intention that each book would be read one at a time, not mined for data. The interfaces for this kind of mining aren’t there, and the metadata is poor, to say the least. (Then again, metadata is generally inadequate; this problem is so “known” I won’t provide a citation!)

What do you think are the moral, legal, and scholarly implications (if any) of Google turning over thousands of scanned books to a handful of scholarly institutions, such as Stanford, for data mining?

Please let me know what you think.