Archival Storage

We have several active and past projects in archival storage, all of which contribute to building more efficient, reliable, and secure long-term storage systems. In addition, we maintain a wiki page with links to resources on archival storage systems.

  • Archival Workload Studies: We have produced several detailed studies of archival storage user behavior and system evolution. Our studies provide relevant, up-to-date observations on archival system usage patterns to guide and validate future archival storage designs. Key results include weakening the oft-quoted "Write-Once, Read-Maybe" assumption and identifying that the vast majority of archival traffic comes from purely automated sources.
  • Improving Trace Analysis: Our experience analyzing long-term traces has highlighted shortcomings in current tracing and analysis techniques. We are using that experience to design new techniques and "best practices" to improve future traces and analyses, such as combining traces with metadata snapshots to improve understanding of system state over time, and techniques for distinguishing logger failures from full system crashes when activity rates appear unusually low.
  • Economic Modeling of Long-Term Storage: Understanding the economics of long-term preservation is necessary because of tremendous data growth, the slowdown in storage density growth, uncertainty in financial market conditions, and the increasing need for data preservation. Current business models rely on continuous storage density growth and hence a declining cost per byte. Given the slowdown in storage density growth, there is a need to reconsider using disks for long-term preservation. Despite their low upfront cost, disks are expensive in the long term because of their high operational costs. It is time to look for alternative technologies that are more cost-effective than disk in the long term.
  • Secure and Searchable Long-Term Storage: As humanity generates ever-increasing amounts of data that must be stored for decades, we must both protect the data from disclosure and allow users to find information. Since long-term storage can potentially be compromised by a single site or person, we distribute data across multiple archive sites, using techniques derived from POTSHARDS. We are investigating techniques that allow this data to be searched without revealing search terms, or even significant correlation between documents, to archive managers. This provides the level of privacy necessary for long-term storage of medical records, sensitive corporate and government data, and personal information such as video and photos.
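The economic argument above (low upfront cost for disk, but high operational cost over time, and a dependence on falling cost per byte) can be made concrete with a small present-value calculation. The sketch below is only an illustration and not the model used in this project: the function name `endowment`, all of its parameters, and the example numbers are hypothetical. It assumes media is repurchased at a fixed interval, media price per TB falls at a constant annual rate, operating cost per TB stays flat, and future costs are discounted at a fixed interest rate.

```python
def endowment(capacity_tb, media_cost_per_tb, annual_opex_per_tb,
              cost_decline_rate, interest_rate,
              replace_every_years, horizon_years):
    """Up-front sum needed to pay for storing capacity_tb for horizon_years.

    All parameters are illustrative assumptions, not measured values.
    """
    total = 0.0
    for year in range(horizon_years):
        # Money set aside today grows with interest, so later costs are discounted.
        discount = (1.0 + interest_rate) ** year
        # Media price per TB after `year` years of constant annual decline.
        price = media_cost_per_tb * (1.0 - cost_decline_rate) ** year
        # Operating cost (power, cooling, staff) is paid every year.
        cost = annual_opex_per_tb * capacity_tb
        # Media is bought in year 0 and replaced every `replace_every_years`.
        if year % replace_every_years == 0:
            cost += price * capacity_tb
        total += cost / discount
    return total


if __name__ == "__main__":
    # Hypothetical numbers: same hardware, fast versus slow price decline.
    fast = endowment(100, 30.0, 10.0, 0.30, 0.03, 5, 50)
    slow = endowment(100, 30.0, 10.0, 0.10, 0.03, 5, 50)
    print(f"fast decline: {fast:.0f}  slow decline: {slow:.0f}")
```

Under these assumptions, a slower price decline raises the required up-front sum, and the flat operating cost term accumulates every year regardless of media price, which is one way to see why operational cost can dominate over long horizons.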


  • Archival Workload Studies: We have recently completed and published several studies of both private and public historical and scientific archives, and are looking towards analysis of a newer dataset obtained from the US Library of Congress.
  • Improving Trace Analysis: In this project we are in the midst of initial proof-of-concept simulations and analysis, creating artificial snapshots and workloads to better understand the strengths and limitations of our proposed techniques.
  • Economic Modeling of Long-Term Storage: Our recent work answered two questions: What is the effect of technological changes on the endowment? What are the economic impacts of technology choices, over time, on long-term preservation? We are now working on further questions: 1) If the endowment runs out, do we compromise on reliability or redundancy? 2) What are the chances of data loss? 3) Should access be on-demand or on-premise? 4) What is the best device to use, given a certain capacity-to-bandwidth ratio?
  • Secure and Searchable Long-Term Storage: We are working towards publishing our initial work on Percival, a framework that leverages pre-indexing, keyed hashing, and Bloom filters to enable blinded searching, which prevents the archive from learning which terms are being queried.
  • Past Projects: The following are projects we have worked on in the past:
    Logan: a management system to scalably grow, maintain, and evolve a heterogeneous archival storage system.
    Computation-Storage Trade-off: using provenance to reduce storage overhead by storing intermediate and initial inputs and recomputing a dataset on demand.
    Pergamum: long-term evolvable storage built from intelligent network-attached bricks with both disk and NVRAM such as flash.
    Deep Store: building more efficient archival storage using deduplication to take advantage of intra-file and inter-file redundancy.
    POTSHARDS: long-term secure storage, which allows the secure preservation of data for decades without relying upon traditional encryption to prevent information leakage.
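The combination of pre-indexing, keyed hashing, and Bloom filters mentioned for blinded searching can be illustrated with a short sketch. This is not Percival's actual design, which is not described here. It is a minimal example assuming that the client holds a secret key, builds one Bloom filter per document before upload, and later sends only bit positions to the archive. The names `term_positions`, `build_filter`, and `archive_search`, and the constants `FILTER_BITS` and `NUM_HASHES`, are hypothetical.

```python
import hashlib
import hmac

FILTER_BITS = 1024  # size of each per-document Bloom filter (assumed)
NUM_HASHES = 4      # bit positions set per term (assumed)


def term_positions(key: bytes, term: str) -> list[int]:
    """Client side: map a term to Bloom filter positions with a keyed hash."""
    positions = []
    for i in range(NUM_HASHES):
        message = f"{i}:{term.lower()}".encode()
        digest = hmac.new(key, message, hashlib.sha256).digest()
        positions.append(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return positions


def build_filter(key: bytes, terms: list[str]) -> int:
    """Client side: pre-index a document's terms into one filter (an int bitset)."""
    bits = 0
    for term in terms:
        for pos in term_positions(key, term):
            bits |= 1 << pos
    return bits


def archive_search(filters: dict[str, int], positions: list[int]) -> list[str]:
    """Archive side: sees only filters and positions, never the key or the term."""
    return sorted(doc for doc, bits in filters.items()
                  if all((bits >> pos) & 1 for pos in positions))


if __name__ == "__main__":
    key = b"client-secret-key"  # hypothetical key, never sent to the archive
    filters = {
        "doc-a": build_filter(key, ["archive", "tape", "endowment"]),
        "doc-b": build_filter(key, ["disk", "flash"]),
    }
    print(archive_search(filters, term_positions(key, "tape")))
```

Because Bloom filters can return false positives, the client would still need to fetch and verify candidate documents. The sketch also uses the same positions for a term in every document, so it does not hide correlations between documents; addressing that would require additional techniques.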


Last modified 24 Oct 2017