A few months ago, Robert Lee Hotz wrote an article in the Science Journal section of the Wall Street Journal in which he discussed how the data deluge is swamping scientists and researchers.
The author addressed the particular issue of how curators can store data so that other scientists, now and in the future, can access it to authenticate and reproduce discoveries. Unlike in the past, there is no paper trail, as collaborations use the entire range of web technologies from Facebook to Twitter and back around again. The software and hardware used to run the experiments quickly become obsolete, so future scientists will not be able to easily reproduce the experiments. (Reminder for you non-Scientists: the reproducibility of experiments is a basic tenet of the Scientific Method.)
Then there is the sheer scale of the produced data: astrophysicist Alexander Szalay states that more data has been collected in the past year alone than in the entire history of science. That problem is only going to grow worse, as new experiments produce even greater amounts of data. The Large Hadron Collider at CERN is expected to produce “15 million gigabytes of data annually — enough to fill more than 1.7 million DVDs every year”, while the Large Synoptic Survey Telescope will image the sky and regularly record “more than 30,000 gigabytes every night”.
The problem is forcing historians to become scientists, and scientists to become archivists and curators.
Currently, there is no solution to the problem. Our ability to generate, capture, and record data far outstrips our ability to preserve it over the long term. One part of the problem is media degradation. Most physical storage systems (tape, CD, DVD, flash drive, etc.) last about a decade. Researchers in Japan and California have each developed physical media: the former say theirs will allow data to last for centuries, and the latter say theirs will last a billion years (!).
(I’m a bit skeptical of those claims. How does one “prove” in Current Time that one’s physical media will last centuries or for a billion years? Caveat: I have not read the research, so that is a blanket opinion on my part.)
One of the most valuable sections of the article is the reading list. I’ve read several of these reports in their entirety. If the recent cold weather and/or disaster in Haiti has kept you awake at night, these reports might help you get to sleep.
But seriously, there’s some good information in here for those of you who care about data preservation and would like a starting point for research and reflection.
The RAND Corp. says that researchers creating important digital data sets show an alarming lack of concern about preserving them, in Digital Preservation: The Uncertain Future of Saving the Past.
The Library of Congress is leading the National Digital Information Infrastructure and Preservation Program. Last year, the library added 80 terabytes of data to its digital archives.
The Blue Ribbon Task Force on Sustainable Digital Preservation and Access is studying ways to cover the long-term cost of access to the ever-growing amount of digital information in the public interest.
At the U.S. National Academies, the Board on Research Data and Information is studying Permanent Access to Scientific Data: Preservation and Archiving of Scientific and Technical Data.
The National Geological and Geophysical Data Preservation Program aims to preserve collections of materials and data for future scientific research and educational activities.
The Data Preservation Alliance for the Social Sciences is working to identify and save opinion polls, voting records, large-scale surveys on family growth and income, and many other social science studies.
High-energy physics experiments acquire huge datasets that may not be superseded by new and better measurements for decades or centuries. Researchers recently conferred on Data Preservation and Long-term Analysis in High-Energy Physics.
I was pleased to see the need for data management and curation addressed in the article. Yes, the article is high-level, but it is a short newspaper article aimed at a general audience, not data curators. It should be high-level. I thought it was well written.
[Thanks, Sarah R.]