N.B. I also write for the Communications of the ACM (CACM). The following essay recently appeared on the CACM blog.
How recently have you mounted a 9-track open reel tape, hoping to access the irreplaceable data that was the foundation of your first research paper? At this point, you may not even remember if it was 800, 1600 or 6250 bits per inch (bpi), EBCDIC or ASCII, blocked or unblocked. You are not that old, you say? What about your 5.25" or 3.5" floppy disks or DAT archive? Odds are you haven't accessed the data because you can't without seeking the services of a conversion company that specializes in data retrieval from obsolete media.
Have you ever been involved in a research project, either individually or as part of a multi-institutional team, that produced data intended for broader community use? If so, then you probably placed it on the project web site, perhaps with the research software needed to decode and process the data. A decade later, is the data still accessible and does the software even compile or execute on current systems?
Personally, I still have some 9-track computer tapes, a punched card deck and a paper tape in an office desk drawer, saved for both reasons of pedagogy and nostalgia. I also have some data analytics tools originally designed for workstations now found in the Computer History Museum.
These examples may seem quaint, and perhaps they are, but each of us has some variant of this data obsolescence and inaccessibility experience. They are the analog (pun intended) of our previous consumer media experiences. After all, have you played any of your 45s, 8-track tapes, or cassettes lately?
These are the symptoms of three bigger issues: the rapid obsolescence of specific storage technologies, the explosive growth of research data across all scientific and engineering disciplines, and the even more difficult task of sustaining data access past the end of research projects.
The first is a natural consequence of technological change, one that we have collectively managed across the history of modern digital computing. The second, "big data," is a hot topic of research and business innovation, with new tools and techniques appearing to extract insights from large volumes of unstructured or ill-structured data. The third is the hardest: the social and economic challenges around research data preservation are profound and not yet resolved.
In the U.S., the Office of Science and Technology Policy (OSTP) on behalf of the National Science and Technology Council (NSTC) recently issued a request for information (RFI) on Public Access to Digital Data Resulting from Federally Funded Scientific Research. In addition, the National Science Foundation has instituted a requirement that research proposals include a data management plan, which notes that "Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants."
Several issues are convolved in our desire and expectations for data sharing. One is advancing discovery and innovation via the free flow of information, replicating and expanding experiments and sharing data from increasingly expensive national and international scientific instrumentation. This is central to the scientific process, though it brings the cultural disparity that exists across disciplines to the fore – astronomy, biology and computing are culturally quite different.
The second is the need for multidisciplinary sharing and cross-domain fertilization. Increasingly, new insights emerge from fusing and analyzing data drawn from diverse sources. Such integration places a premium on metadata schemas, well documented data formats and service access protocols. In turn, these require standards and coordination, both within and across disciplines.
The third is the distinction between research, which produces data, and data preservation, documentation and dissemination. In my experience, the skills and expertise, as well as reward metrics, are distinctly different for the two activities. This is an oblique way of saying that researchers will generally optimize for research advancement over data preservation when forced to choose between the two. This is only natural, given our current reward structure.
Research and data preservation also differ markedly in their timescales, for data preservation and dissemination services often require decadal planning, with associated infrastructure and professional staffing, rather than the 3-5 year funding for principal investigators, graduate students and post-doctoral research associates that is typical of research grants and contracts.
Finally, data preservation and dissemination can be expensive, with costs rivaling or exceeding the initial research investment. Quite clearly, not everything can or should be preserved in perpetuity, but predicting the future value of data is both difficult and perilous.
This suggests that we need economic and social processes that more rigorously assess the present and future value of data. They might include combinations of commercial, fee-based models, where researchers and organizations vote with their funds for access to and retention of certain data (i.e., a cloud-based research data marketplace), government funded and managed repositories where key data is retained (e.g., NIH's GenBank), or distributed but interconnected archives funded by multiple agencies and governments (e.g., the Worldwide LHC Computing Grid).
What is clear is that the dramatic growth of research data, the collaborative and competitive nature of international science and engineering research, expectations for economic returns from research investments and disciplinary differences all make this a pressing and difficult problem. Our current, ad hoc approaches are inadequate and not sustainable.