Scalable File System Indexing @ SSRC

This project is no longer active. Information is still available below.

As the number and variety of files stored and accessed by users increases dramatically, existing file system structures have begun to fail as a mechanism for managing the information contained in those files. Many applications, such as email clients, multimedia management applications, and desktop search engines, have been forced to develop their own richer metadata infrastructures. While effective, these solutions are generally non-standard, non-portable, and potentially non-scalable. These issues suggest that search, indexing, and information retrieval are becoming increasingly important areas for file and storage systems. In conjunction with faculty and students specializing in information retrieval at the UC Santa Cruz Department for Information Systems and Technology Management, we are developing system architectures that address these issues and scale to billions of files.

Status

Our current areas of focus are scalable indexing architectures for storage systems, improved file system namespaces, and the incorporation into file systems of concepts from databases and information retrieval, such as ranked search and more intelligent indexes. We particularly emphasize queries over extended and user-supplied metadata, such as scientific metadata and document metadata.

We are exploring new file system designs where search is first-class functionality rather than an afterthought. The current approach of using a search index in addition to the file system's index requires two large and separate index structures to be maintained. This separation forces users and applications to access and update two structures when using their data. Our file system designs take a new approach to internal file system structures, layouts, and logging, one that is optimized for search. Our new design can improve search performance, allow data layouts based on how files are queried, and improve efficiency by reducing the number of index structures that must be maintained. We are in the process of implementing these concepts in the Ceph distributed file system.
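
To make the single-index idea concrete, here is a minimal Python sketch in which one index answers both directory listings and attribute queries. It is only an illustration under assumptions of our own: the class `SearchableNamespace`, its methods, and the `parent` attribute are hypothetical names, and nothing here reflects the actual Ceph implementation.

```python
from collections import defaultdict


class SearchableNamespace:
    """Toy metadata store in which one index serves lookups and search.

    Hypothetical illustration only; updates and deletes are not handled.
    """

    def __init__(self):
        self.records = {}                  # path -> attribute dict
        self.index = defaultdict(set)      # (attribute, value) -> set of paths

    def create(self, path, **attrs):
        # The parent directory is stored as just another indexed attribute.
        parent = path.rsplit("/", 1)[0] or "/"
        attrs = {"parent": parent, **attrs}
        self.records[path] = attrs
        for item in attrs.items():
            self.index[item].add(path)

    def query(self, **criteria):
        # Intersect the posting sets of every requested (attribute, value) pair.
        sets = [self.index.get(item, set()) for item in criteria.items()]
        if not sets:
            return sorted(self.records)
        return sorted(set.intersection(*sets))

    def listdir(self, directory):
        # A directory listing is simply a query on the "parent" attribute.
        return self.query(parent=directory)
```

In this sketch, `listdir("/data")` and `query(owner="alice", kind="csv")` read the same structure, so there is only one index to keep up to date.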

In addition, we are actively researching effective ways of partitioning metadata into indexes. A partitioned metadata index can rule out irrelevant files and quickly focus on files that are more likely to match the search criteria. By integrating partitioning with security criteria, we have been able to create a highly scalable design for metadata search. This integration allows us to eliminate files that the querier cannot view without ever loading their indexes from storage. We have implemented and tested this system. Our results are available in the proceedings of the 26th IEEE Symposium on Massive Storage Systems and Technologies.
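
The following Python sketch shows the general idea of partitioning an index by access rights so that a query only touches partitions the querier may read. It is a simplified illustration, not the published system: the class `PartitionedIndex`, the use of a single reader group as the partition key, and the `scanned` counter are all assumptions made for this example.

```python
from collections import defaultdict


class PartitionedIndex:
    """Toy metadata index partitioned by the group allowed to read each file.

    Hypothetical illustration; real permission models are richer than one group.
    """

    def __init__(self):
        # reader group -> list of (path, metadata dict)
        self.partitions = defaultdict(list)

    def add(self, path, reader_group, **metadata):
        self.partitions[reader_group].append((path, metadata))

    def search(self, user_groups, **criteria):
        """Return (matching paths, number of partitions actually scanned)."""
        hits = []
        scanned = 0
        for group in user_groups:
            if group not in self.partitions:
                continue
            # Partitions for other groups are never read at all.
            scanned += 1
            for path, meta in self.partitions[group]:
                if all(meta.get(k) == v for k, v in criteria.items()):
                    hits.append(path)
        return sorted(hits), scanned
```

A querier who belongs only to one group scans one partition here, no matter how many other partitions exist, which is the property the paragraph above describes.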

Beyond new mechanisms for indexing and metadata management, we are investigating new indexing policies that are driven by user demand. These policies scale an index's granularity and availability to strike a balance between CPU and storage utilization.
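
One possible demand-driven policy, sketched below in Python, is to answer queries on an attribute by scanning (spending CPU) until that attribute has been queried often enough to justify building and storing an index for it. This is an assumed example policy, not one taken from our work; the class `DemandDrivenIndex` and its `threshold` parameter are hypothetical.

```python
from collections import Counter, defaultdict


class DemandDrivenIndex:
    """Builds a per-attribute index only after enough queries ask for it.

    Hypothetical policy for illustration; `threshold` is an assumed knob.
    """

    def __init__(self, files, threshold=3):
        self.files = files              # path -> metadata dict
        self.threshold = threshold
        self.query_counts = Counter()   # attribute -> number of queries seen
        self.indexes = {}               # attribute -> {value: set of paths}

    def _build(self, attr):
        idx = defaultdict(set)
        for path, meta in self.files.items():
            if attr in meta:
                idx[meta[attr]].add(path)
        self.indexes[attr] = idx

    def search(self, attr, value):
        self.query_counts[attr] += 1
        popular = self.query_counts[attr] >= self.threshold
        if attr not in self.indexes and popular:
            self._build(attr)           # pay storage once demand is proven
        if attr in self.indexes:
            return sorted(self.indexes[attr].get(value, ()))
        # Rarely queried attribute: fall back to a full scan.
        return sorted(p for p, m in self.files.items() if m.get(attr) == value)
```

Raising `threshold` in this sketch favors storage savings at the cost of more scans, and lowering it does the opposite.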

Our previous work in this area includes a collaboration with NetApp. Our metadata search index, Spyglass, leverages characteristics unique to storage systems, such as data distributions and hierarchical namespaces, in new search and indexing algorithms. Its search performance can exceed that of basic DBMS-based solutions by up to four orders of magnitude, and the design allows time-traveling queries over versioned metadata and can efficiently re-crawl very large file systems.
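
As a rough illustration of how a hierarchical namespace can be exploited for search, the Python sketch below groups metadata by subtree and keeps a small summary per group so that queries can skip groups that cannot match. This page does not describe Spyglass's actual algorithms, so everything here is an assumption: the functions `build_subtree_index` and `search`, the `depth` parameter, and the choice of file extension as the summarized attribute are hypothetical.

```python
from collections import defaultdict


def _within(path, root):
    """True if `path` equals `root` or lies beneath it."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def build_subtree_index(files, depth=2):
    """Group file metadata by the first `depth` directory components.

    `files` maps an absolute path to a metadata dict. Each partition stores
    its entries plus a summary: the set of extensions it contains.
    """
    partitions = defaultdict(lambda: {"entries": [], "exts": set()})
    for path, meta in files.items():
        dirs = path.strip("/").split("/")[:-1]
        key = "/" + "/".join(dirs[:depth])
        partitions[key]["entries"].append((path, meta))
        partitions[key]["exts"].add(meta.get("ext"))
    return partitions


def search(partitions, ext, under="/"):
    """Return (matching paths, number of partitions scanned)."""
    hits, scanned = [], 0
    for key, part in partitions.items():
        # Skip subtrees that do not overlap the requested directory.
        if not (_within(key, under) or _within(under, key)):
            continue
        # Skip subtrees whose summary rules out the requested extension.
        if ext not in part["exts"]:
            continue
        scanned += 1
        hits.extend(p for p, m in part["entries"]
                    if m.get("ext") == ext and _within(p, under))
    return sorted(hits), scanned
```

The sketch relies on the observation that files with similar metadata tend to cluster in the same part of the namespace, so many subtrees can be ruled out from their summaries alone.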

We have also worked on a file system query language, QUASAR, that gives users powerful semantic access to stored data. QUASAR allows semantic file system views and directories to be created, which provide more meaningful data representations. Inter-file relationships, such as provenance, can be expressed and searched through links. To aid browsing, we are investigating applying faceted search to QUASAR. Faceted search uses rich key-value metadata to allow users to interactively navigate the search space and can allow interfaces to be automatically personalized for each user.
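
The core step of faceted navigation over key-value metadata can be sketched in a few lines of Python: given the facets a user has already selected, return the matching files and count the values of every remaining facet so the interface can offer them as next choices. This is a generic illustration, not QUASAR syntax or our implementation; the function `facet_counts` and the sample attribute names are hypothetical.

```python
from collections import Counter


def facet_counts(files, selected):
    """Return (matching paths, {facet: Counter of values}) for a selection.

    `files` maps a path to a flat key-value metadata dict; `selected` maps
    facet names to the values the user has already chosen.
    """
    matches = {
        path: meta for path, meta in files.items()
        if all(meta.get(k) == v for k, v in selected.items())
    }
    counts = {}
    for meta in matches.values():
        for key, value in meta.items():
            if key in selected:
                continue  # already fixed by the user, nothing left to choose
            counts.setdefault(key, Counter())[value] += 1
    return sorted(matches), counts
```

Each time the user picks another value, the function is called again with a larger `selected` dict, which narrows the result set and refreshes the counts shown for the remaining facets.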

Publications

Date		Publication
Jun 10, 2014		Aleatha Parker-Wood, Darrell D. E. Long, Ethan L. Miller, Philippe Rigaux, Andy Isaacson, A File By Any Other Name: Managing File Names with Metadata, Proceedings of the 7th International Systems and Storage Conference (SYSTOR '14), June 2014. [Scalable File System Indexing]
Jun 30, 2013		Aleatha Parker-Wood, Brian Madden, Michael McThrow, Darrell D. E. Long, Ian Adams, Avani Wildani, Examining Extended and Scientific Metadata for Scalable Index Designs, Proceedings of the 6th International Systems and Storage Conference (SYSTOR 2013), June 2013. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
Jan 28, 2013		Thomas Schwarz, Ignacio Corderi, Darrell D. E. Long, Jehan-François Pâris, Simple, Exact Placement of Data in Containers, Proceedings of the International Conference on Computing, Networking and Communications (ICNC), January 2013. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
Dec 14, 2012		Aleatha Parker-Wood, Brian Madden, Michael McThrow, Darrell D. E. Long, Examining Extended and Scientific Metadata for Scalable Index Designs, Technical Report UCSC-SSRC-12-07, December 2012. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [Ultra-Large Scale Storage]
Oct 29, 2012		Yulai Xie, Kiran-Kumar Muniswamy-Reddy, Dan Feng, Yan Li, Darrell D. E. Long, Zhipeng Tan, Lei Chen, A Hybrid Approach for Efficient Provenance Storage, The 21st ACM Conference on Information and Knowledge Management (CIKM), October 2012. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [Ultra-Large Scale Storage]
Mar 2, 2012		Aleatha Parker-Wood, Darrell D. E. Long, Ethan L. Miller, Margo Seltzer, Daniel Tunkelang, Making Sense of File Systems Through Provenance and Rich Metadata, Technical Report UCSC-SSRC-12-01, March 2012. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [HECURA: Scalable Data Management]
Nov 6, 2011		Stephanie Jones, Christina Strong, Aleatha Parker-Wood, Alexandra Holloway, Darrell D. E. Long, Easing the Burdens of HPC File Management, Proceedings of the 6th Parallel Data Storage Workshop (PDSW '11), November 2011. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
Sep 12, 2011		Christina Strong, Stephanie Jones, Aleatha Parker-Wood, Alexandra Holloway, Darrell D. E. Long, Los Alamos National Laboratory Interviews, Technical Report UCSC-SSRC-11-06, September 2011. [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
Jun 20, 2011		Stephanie Jones, Christina Strong, Darrell D. E. Long, Ethan L. Miller, Tracking Emigrant Data via Transient Provenance, Proceedings of the 3rd USENIX Workshop on the Theory and Practice of Provenance (TaPP '11), June 2011. [Secure File and Storage Systems] [Scalable File System Indexing] [Dynamic Non-Hierarchical File Systems]
May 6, 2010		Aleatha Parker-Wood, Christina Strong, Ethan L. Miller, Darrell D. E. Long, Security Aware Partitioning for Efficient File System Search, 26th IEEE Symposium on Massive Storage Systems and Technologies: Research Track (MSST 2010), May 2010. [Scalable File System Indexing] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage] [Prediction and Grouping]
Dec 10, 2009		Andrew Leung, Organizing, Indexing, and Searching Large-Scale File Systems, Technical Report UCSC-SSRC-09-09, December 2009. [Scalable File System Indexing] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
Nov 13, 2009		Andrew Leung, Ian Adams, Ethan L. Miller, Magellan: A Searchable Metadata Architecture for Large-Scale File Systems, Technical Report UCSC-SSRC-09-07, November 2009. [Scalable File System Indexing] [HECURA: Scalable Data Management] [Ultra-Large Scale Storage]
Oct 8, 2009		Andrew Leung, Aleatha Parker-Wood, Ethan L. Miller, Copernicus: A Scalable, High-Performance Semantic File System, Technical Report UCSC-SSRC-09-06, October 2009. [Scalable File System Indexing] [Ultra-Large Scale Storage]
Jun 1, 2009		Andrew Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, Ethan L. Miller, Spyglass: Metadata Search for Large-Scale Storage Systems, ;login: — The USENIX Magazine 34(3), June 2009. [Scalable File System Indexing] [Ultra-Large Scale Storage]
Feb 24, 2009		Andrew Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, Ethan L. Miller, Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems, Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09), February 2009. [Scalable File System Indexing] [Ultra-Large Scale Storage]
Nov 17, 2008		Andrew Leung, Ethan L. Miller, Scalable Full-Text Search for Petascale File Systems, Proceedings of the 2008 Petascale Data Storage Workshop (PDSW 08), November 2008. [Scalable File System Indexing] [Ultra-Large Scale Storage]
Oct 5, 2008		Sasha Ames, Carlos Maltzahn, Ethan L. Miller, Quasar: A Scalable Naming Language for Very Large File Collections, Technical Report UCSC-SSRC-08-04, October 2008. [Scalable File System Indexing]
Sep 22, 2008		David Pease, Darrell D. E. Long, Future File Systems, Proceedings of Computing with Massive and Persistent Data, September 2008. [Scalable File System Indexing]
Sep 15, 2008		Sasha Ames, Carlos Maltzahn, Ethan L. Miller, QUASAR: Interaction with File Systems Using a Query and Naming Language, Technical Report UCSC-SSRC-08-03, September 2008. [Scalable File System Indexing]
Jul 14, 2008		Andrew Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, Ethan L. Miller, High-Performance Metadata Indexing and Search in Petascale Data Storage Systems, Proceedings of the SciDAC 2008 Conference, July 2008. [Scalable File System Indexing]
Jun 25, 2008		Andrew Leung, Shankar Pasupathy, Garth Goodson, Ethan L. Miller, Measurement and Analysis of Large-Scale Network File System Workloads, Proceedings of the 2008 USENIX Technical Conference, June 2008. [Scalable File System Indexing] [Tracing and Benchmarking] [Ultra-Large Scale Storage]
May 21, 2008		Andrew Leung, Minglong Shao, Timothy Bisson, Shankar Pasupathy, Ethan L. Miller, Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems, Technical Report UCSC-SSRC-08-01, May 2008. [Scalable File System Indexing] [Ultra-Large Scale Storage]
Apr 22, 2008		Jonathan Koren, Yi Zhang, Xue Liu, Personalized Interactive Faceted Search, Proceedings of the 17th International Conference on the World Wide Web (WWW 2008), April 2008. [Scalable File System Indexing]
Nov 11, 2007		Jonathan Koren, Yi Zhang, Sasha Ames, Andrew Leung, Carlos Maltzahn, Ethan L. Miller, Searching and Navigating Petabyte Scale File Systems Based on Facets, Proceedings of the 2007 ACM Petascale Data Storage Workshop (PDSW 07), November 2007. [Scalable File System Indexing]
Aug 12, 2007		Deepavali Bhagwat, Kave Eshghi, Pankaj Mehra, Content-based Document Routing and Index Partitioning for Scalable Similarity-based Searches in a Large Corpus, Proceedings of the 13th ACM SIGKDD international conference on Knowledge Discovery and Data Mining (KDD '07), August 2007, pages 105-112. [Archival Storage] [Scalable File System Indexing] [Deduplication]
Jul 23, 2007		Yi Zhang, Jonathan Koren, Efficient Bayesian Hierarchical User Modeling for Recommendation Systems, Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07), July 2007, pages 47-54. [Scalable File System Indexing]
Mar 25, 2007		Carlos Maltzahn, Nikhil Bobb, Mark W. Storer, Damian Eads, Scott A. Brandt, Ethan L. Miller, Graffiti: A Framework for Testing Collaborative Distributed Metadata, Proceedings in Informatics 21, March 2007, pages 97–111. [Storage Class Memories] [Scalable File System Indexing]
Jan 23, 2007		Mark W. Storer, Graffiti Server - Design and Implementation, Technical Report UCSC-SSRC-07-02, January 2007. [Storage Class Memories] [Scalable File System Indexing]
May 1, 2003		Karthik Thirumalai, Jehan-François Pâris, Darrell D. E. Long, Tabbycat: an Inexpensive Scalable Server for Video-on-Demand, Proceedings of the IEEE 2003 International Conference on Communications, May 2003. [Scalable File System Indexing]
Apr 1, 2003		Gary Whittle, Jehan-François Pâris, Ahmed Amer, Darrell D. E. Long, Randal Burns, Using Multiple Predictors to Improve the Accuracy of File Access Predictions, Proceedings of the Twentieth Symposium on Mass Storage Systems, April 2003. [Scalable File System Indexing]
Jun 1, 2002		Tsozen Yeh, Darrell D. E. Long, Scott A. Brandt, Increasing Predictive Accuracy by Prefetching Multiple Program and User Specific Files, Proceedings of the Sixteenth Annual International Symposium on High Performance Computing Systems and Applications, June 2002. [Scalable File System Indexing]
May 1, 2001		Tsozen Yeh, Darrell D. E. Long, Scott A. Brandt, Conserving Battery Energy through Making Fewer Incorrect File Predictions, Proceedings of the IEEE Workshop on Power Management for Real-Time and Embedded Systems, May 2001. [Scalable File System Indexing]
Apr 1, 2001		Randal Burns, Robert M. Rees, Darrell D. E. Long, An Analytical Study of Opportunistic Lease Renewal, Proceedings of the Twenty-first International Conference on Distributed Computing Systems, April 2001. [Scalable File System Indexing]
Apr 1, 2001		Steven W. Carter, Jehan-François Pâris, Saurabh Mohan, Darrell D. E. Long, A Dynamic Heuristic Broadcasting Protocol for Video-on-Demand, Proceedings of the Twenty-first International Conference on Distributed Computing Systems, April 2001. [Scalable File System Indexing]
Jun 1, 2000		Randal Burns, Robert M. Rees, Darrell D. E. Long, Consistency and Locking for Distributing Updates to Web Servers Using a File System, Proceedings of Performance and Architecture of Web Servers, June 2000. [Scalable File System Indexing]
May 1, 2000		Randal Burns, Robert M. Rees, Darrell D. E. Long, Safe Caching in a Distributed File System for Network Attached Storage, Proceedings of the International Parallel and Distributed Processing Symposium, May 2000. [Scalable File System Indexing]
Feb 1, 2000		Randal Burns, Robert M. Rees, Darrell D. E. Long, Semi-Preemptible Locks for a Distributed File System, Proceedings of the International Performance Conference on Computers and Communication, February 2000. [Scalable File System Indexing]
Feb 1, 1999		Ted Haining, Darrell D. E. Long, Management Policies for Non-volatile Write Caches, Proceedings of the International Performance Conference on Computers and Communication, February 1999. [Scalable File System Indexing]
Dec 1, 1998		Timothy Gibson, Ethan L. Miller, Darrell D. E. Long, Long-term File Activity and Inter-reference Patterns, Proceedings of the Computer Measurement Group Conference, December 1998. [Scalable File System Indexing]
Jul 1, 1996		Benjamin C. Reed, Darrell D. E. Long, Analysis of Caching Algorithms for Distributed File Systems, Operating Systems Review, July 1996. [Scalable File System Indexing]
Jan 1, 1994		Ivan Fellner, Ivan Racko, Milos Racek, Karol Fabian, Darrell D. E. Long, A Comparison of Two Implementations of the Token Ring Priority Function, International Journal in Computer Simulation, January 1994. [Scalable File System Indexing]
Jul 1, 1991		Richard Golding, Darrell D. E. Long, Accessing Replicated Data in a Large-Scale Distributed System, International Journal in Computer Simulation 1(4), July 1991, pages 347-372. [Scalable File System Indexing] [HECURA: Scalable Data Management]
Mar 1, 1988		Jehan-François Pâris, Darrell D. E. Long, Alexander Glockner, A Realistic Evaluation of Consistency Algorithms for Replicated Files, Proceedings of the Twenty-first Annual Simulation Symposium, March 1988. [Scalable File System Indexing]
Sep 1, 1987		John L. Carroll, Darrell D. E. Long, Jehan-François Pâris, Block-Level Consistency of Replicated Files, Proceedings of the International Conference on Distributed Computing Systems, September 1987. [Scalable File System Indexing]
Mar 1, 1987		Darrell D. E. Long, Jehan-François Pâris, On Improving the Availability of Replicated Files, Proceedings of the Symposium on Reliability in Distributed Software and Database Systems, March 1987. [Scalable File System Indexing]

Last modified 19 Oct 2020