Protecting Against Rare Event Failures in Archival Systems

Appeared in Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009).

Abstract

Digital archives are growing rapidly, necessitating stronger reliability measures than RAID to avoid data loss from device failure. Mirroring, a popular solution, is too expensive over time. We present a compromise solution that uses multi-level redundancy coding to reduce the probability of data loss from multiple simultaneous device failures. This approach handles small-scale failures of one or two devices efficiently while still allowing the system to survive rare-event, larger-scale failures of four or more devices.

In our approach, each disk is split into a set of fixed-size disklets, which are used to construct reliability stripes. To protect against rare-event failures, reliability stripes are grouped into larger super-groups, each of which has a corresponding super-parity; super-parity is used to recover data only when disk failures overwhelm the redundancy in a single reliability stripe. Super-parity can be stored on a variety of devices, such as NV-RAM and always-on disks, to offset write bottlenecks while still keeping the number of active devices low.

Our calculations of failure probabilities show that adding super-parity allows our system to absorb many more disk failures without data loss. Through discrete-event simulation, we find that adding super-groups has a significant impact on mean time to data loss and that rebuilds are slow but not unmanageable. Finally, we show that robustness against rare events can be achieved for a fraction of total system cost.
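The two-level layout described in the abstract (disklets grouped into reliability stripes, stripes grouped into super-groups with super-parity) can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a single XOR parity per stripe and one XOR super-parity block per disklet position across the stripes of a super-group, and the names xor_blocks, build_super_group, and recover are hypothetical. The paper's actual redundancy codes may differ.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings (toy stand-in for a parity code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_super_group(stripes):
    """Compute both redundancy levels for a super-group.

    stripes: list of stripes, each a list of equal-length data disklets (bytes).
    Assumption: every stripe holds the same number of disklets.
    Returns (stripe_parities, super_parity).
    """
    # Level 1: one XOR parity per reliability stripe.
    stripe_parities = [xor_blocks(stripe) for stripe in stripes]
    # Level 2: super_parity[j] is the XOR of disklet j from every stripe.
    super_parity = [xor_blocks(column) for column in zip(*stripes)]
    return stripe_parities, super_parity


def recover(stripes, stripe_parities, super_parity, s, lost):
    """Rebuild the lost disklet positions of stripe s and return the repaired stripe.

    A single loss is repaired from the stripe's own parity. Multiple losses
    overwhelm that parity, so each lost disklet is rebuilt from super-parity
    and the same position in the other stripes (assumed intact here).
    """
    stripe = list(stripes[s])
    if len(lost) == 1:
        j = lost[0]
        survivors = [d for i, d in enumerate(stripe) if i != j]
        stripe[j] = xor_blocks(survivors + [stripe_parities[s]])
    else:
        for j in lost:
            peers = [other[j] for t, other in enumerate(stripes) if t != s]
            stripe[j] = xor_blocks(peers + [super_parity[j]])
    return stripe
```

In this sketch, super-parity is read only on the multi-failure path, which mirrors the abstract's point that it is consulted only when failures exceed what a single reliability stripe can absorb.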

Publication date:
September 2009

Authors:
Avani Wildani
Thomas Schwarz
Ethan L. Miller
Darrell D. E. Long

Projects:
Archival Storage
Reliable Storage

Available media

Full paper text: PDF

BibTeX entry

@inproceedings{wildani-mascots09,
  author       = {Avani Wildani and Thomas Schwarz and Ethan L. Miller and Darrell D. E. Long},
  title        = {Protecting Against Rare Event Failures in Archival Systems},
  booktitle    = {Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009)},
  month        = sep,
  year         = {2009},
}