Making Sense of File Systems Through Provenance and Rich Metadata

Published as Storage Systems Research Center Technical Report UCSC-SSRC-12-01.


Modern high-end computing systems store hundreds of petabytes of data and hold billions of files, as many files as the internet held only a few years ago. Even modern personal computers store numbers of files that would have been massive for the largest mainframe computers of 40 years ago. The quantity of data in modern computing has long since overwhelmed anyone’s ability to manage it manually, and the 40-year-old tools currently in use for finding and managing files are reaching the limits of scale. In an environment like this, secure, effective, and efficient search algorithms and automatic file management become a necessity, not a nicety. Our proposal addresses the question of how users can quickly find and manage files without burdening the file system with expensive brute-force searches or requiring the user to become an expert in query languages. We propose a number of algorithms to improve file management in a large-scale scientific computing environment. By collecting new metadata, including file system provenance, we propose to provide new ranking algorithms that are efficient and effective on large multi-user file systems. We intend to reduce the burden of file naming by allowing the system to generate expressive, unique file names on the fly; we have identified a statistical property of data that is likely to select meaningful attributes for file names. Because security is a concern on many large scientific computing systems, we intend to analyze the security properties of the proposed ranking algorithms and demonstrate how our ranking algorithm degrades gracefully from the ideal ranking when applied in a setting with restrictive security permissions. We will validate our results using real-world scientific data and provide statistical analyses of rich metadata and provenance from this data. Finally, we will validate our ranking and naming algorithms through a series of in situ user studies.
Modern data management must be automatic and scalable, allowing users and file systems to focus on what each does best. By exploiting patterns of human behavior, the system can provide faster searches and more interpretable interfaces to the file system. Data growth is not expected to level off anytime soon, and file systems must be ready to handle the load.

Publication date: March 2012

Aleatha Parker-Wood
Darrell D. E. Long
Ethan L. Miller
Margo Seltzer
Daniel Tunkelang

Scalable File System Indexing
Dynamic Non-Hierarchical File Systems
HECURA: Scalable Data Management

Bibtex entry

  author       = {Aleatha Parker-Wood and Darrell D. E. Long and Ethan L. Miller and Margo Seltzer and Daniel Tunkelang},
  title        = {Making Sense of File Systems Through Provenance and Rich Metadata},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-12-01},
  month        = mar,
  year         = {2012},
Last modified 24 May 2019