Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories

Appeared in ACM Transactions on Storage 8(2).

Abstract

The scope of archival systems is expanding beyond cheap tertiary storage: scientific and medical data is increasingly digital, and the public has a growing desire to digitally record their personal histories. Driven by the increase in cost efficiency of hard drives and the rise of the Internet, content archives have become a means of providing the public with fast, cheap access to long-term data. Unfortunately, designers of purpose-built archival systems are forced either to rely on workload behavior obtained from a narrow, anachronistic view of archives as simply cheap tertiary storage, or to extrapolate from marginally related enterprise workload data and traditional library access patterns.
To close this knowledge gap and provide relevant input for the design of effective long-term data storage systems, we studied the workload behavior of several systems within this expanded archival storage space. Our study examined several scientific and historical archives, covering a mixture of purposes, media types, and access models—that is, public versus private. Our findings show that, for more traditional private scientific archival storage, files have become larger, but update rates have remained largely unchanged. However, in the public content archives we observed, we saw behavior that diverges from the traditional “write-once, read-maybe” behavior of tertiary storage. Our study shows that the majority of such data is modified—sometimes unnecessarily—relatively frequently, and that indexing services such as Google and internal data management processes may routinely access large portions of an archive, accounting for most of the accesses. Based on these observations, we identify areas for improving the efficiency and performance of archival storage systems.

Publication date:
May 2012

Authors:
Ian Adams
Mark W. Storer
Ethan L. Miller

Projects:
Archival Storage
Tracing and Benchmarking

Bibtex entry

@article{adams-tos12,
  author       = {Ian Adams and Mark W. Storer and Ethan L. Miller},
  title        = {Analysis of Workload Behavior in Scientific and Historical Long-Term Data Repositories},
  journal      = {ACM Transactions on Storage},
  volume       = {8},
  number       = {2},
  month        = may,
  year         = {2012},
}