Protecting Against Rare Event Failures in Archival Systems

Published as Storage Systems Research Center Technical Report UCSC-SSRC-09-03. Preliminary version of a paper that appeared in MASCOTS 2009.


Digital archives are growing rapidly, necessitating stronger reliability measures than RAID to avoid data loss from device failure. Mirroring, a popular solution, is too expensive over time. We present a compromise solution that uses multi-level redundancy coding to reduce the probability of data loss from multiple simultaneous device failures. This approach handles small-scale failures of one or two devices efficiently while still allowing the system to survive rare-event, larger-scale failures of four or more devices. In our approach, each disk is split into a set of fixed-size disklets, which are used to construct reliability stripes. To protect against rare-event failures, reliability stripes are grouped into larger "uber-groups," each of which has a corresponding "uber-parity." Uber-parity is used only to recover data when disk failures overwhelm the redundancy in a single reliability stripe. Uber-parity can be stored on a variety of devices, such as NV-RAM and always-on disks, to offset write bottlenecks while still keeping the number of active devices low. Our calculations of failure probabilities found that the addition of uber-groups allowed the system to absorb many more disk failures without data loss. Through discrete event simulation, we found that adding uber-groups negatively impacts performance only when these groups need to be used for a rebuild. Since rebuilds using uber-parity occur very rarely, they minimally impact system performance over time. Finally, we showed that robustness against rare events can be achieved for under 5% of total system cost.
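The two-level idea in the abstract can be illustrated with a minimal sketch. It is not the coding scheme from the report, which this page does not specify. It assumes, purely for illustration, that each reliability stripe carries a single XOR parity (the stripes described above tolerate more failures than that) and that the uber-parity holds one XOR disklet per stripe position, computed across all stripes in the uber-group. The names `xor_blocks`, `build_uber_group`, and `recover` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_uber_group(stripes):
    """Compute both redundancy levels for one uber-group.

    stripes: list of stripes, each a list of k data disklets (bytes).
    Returns (stripe_parities, uber_parity).
    """
    # Level 1 (assumed single XOR parity): one parity disklet per stripe.
    stripe_parities = [xor_blocks(stripe) for stripe in stripes]
    # Level 2 (assumed layout): uber-parity disklet j covers position j
    # of every stripe in the group.
    uber_parity = [xor_blocks(column) for column in zip(*stripes)]
    return stripe_parities, uber_parity


def recover(stripes, stripe_parities, uber_parity, i, lost):
    """Rebuild the disklets at positions `lost` in stripe i.

    Lost disklets are represented as None in stripes[i]. Returns a dict
    mapping each lost position to its reconstructed bytes.
    """
    stripe = stripes[i]
    if len(lost) == 1:
        # Common case: the stripe's own parity is enough.
        j = lost[0]
        survivors = [d for pos, d in enumerate(stripe) if pos != j]
        return {j: xor_blocks(survivors + [stripe_parities[i]])}
    # Rare case: the stripe is overwhelmed, so fall back to uber-parity.
    # This sketch assumes the other stripes are intact at each lost position.
    rebuilt = {}
    for j in lost:
        others = [s[j] for row, s in enumerate(stripes) if row != i]
        rebuilt[j] = xor_blocks(others + [uber_parity[j]])
    return rebuilt
```

In this sketch a single lost disklet is rebuilt by reading only its own stripe, while a double loss forces a read across every stripe in the uber-group. That matches the trade-off described above: the uber-parity path is expensive but is exercised only when a stripe's own redundancy is exhausted.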

Publication date:
April 2009

Avani Wildani
Thomas Schwarz
Ethan L. Miller
Darrell D. E. Long

Archival Storage
Reliable Storage

Bibtex entry

  author       = {Avani Wildani and Thomas Schwarz and Ethan L. Miller and Darrell D. E. Long},
  title        = {Protecting Against Rare Event Failures in Archival Systems},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-09-03},
  month        = apr,
  year         = {2009},
Last modified 31 May 2009