Efficient Provenance Management via Clustering and Hybrid Storage in Big Data Environments

Appeared in IEEE Transactions on Big Data.


Provenance is a type of metadata that records the creation and transformation of data objects. It has been applied in a wide
variety of areas such as security, search, and experimental documentation. However, provenance data is typically voluminous and
grows rapidly, which hinders its effective extraction and application. This paper proposes an efficient
provenance management system based on clustering and hybrid storage. Specifically, we propose a Provenance-Based Label Propagation
Algorithm that can regularize and cluster large numbers of irregular provenance records. We then use separate physical storage
media, such as SSD and HDD, to store hot and cold data, and implement a hot/cold scheduling scheme that
updates and schedules data between them automatically. In addition, we implement a feedback mechanism that locates and compresses
rarely used cold data according to query requests. Experiments show that the system significantly improves
provenance query performance with a small run-time overhead.

Publication date:
March 2019

Die Hu
Dan Feng
Yulai Xie
Gongming Xu
Xinrui Gu
Darrell D. E. Long


BibTeX entry

  author       = {Die Hu and Dan Feng and Yulai Xie and Gongming Xu and Xinrui Gu and Darrell D. E. Long},
  title        = {Efficient Provenance Management via Clustering and Hybrid Storage in Big Data Environments},
  journal      = {IEEE Transactions on Big Data},
  volume       = {},
  month        = mar,
  year         = {2019},
Last modified 15 Jul 2020