Deduplication Optimization

1. Frequency-based Chunking for Deduplication
Chunking-based data deduplication (dedupe) has become a prevalent technique for supporting many data-driven applications, since it effectively reduces the data footprint on disk, allowing the same physical capacity to accommodate a much larger logical data size. Content-based chunking, a stateless chunking algorithm, partitions a long byte stream into a sequence of smaller data chunks and removes the duplicates. However, because of its inherent randomness, content-based chunking may suffer from high performance variability and offers no performance guarantee. Moreover, content-based chunking does not consider the occurrence frequencies of data chunks while partitioning, even though frequent data chunks have a far-reaching impact on dedupe performance. Intuitively, if a data chunk occurs k times in the byte stream, its degree of redundancy is high and k-1 copies of the chunk can be eliminated; on the other hand, if a data chunk appears only once in the byte stream, no gain (space saving) is obtained.
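To make the stateless scheme concrete, the following is a minimal sketch of content-based chunking. It uses a simple polynomial rolling hash as a stand-in for a Rabin fingerprint; the window size, bit mask, and min/max chunk bounds are illustrative assumptions, not the parameters of any particular system.

```python
def content_based_chunking(data: bytes, window=48, mask=(1 << 13) - 1,
                           min_size=2048, max_size=65536):
    """Stateless content-based chunking: a fingerprint of the last
    `window` bytes is maintained as the stream is scanned, and a chunk
    boundary is declared wherever the fingerprint matches a fixed bit
    pattern (subject to min/max chunk-size bounds)."""
    base, mod = 257, (1 << 31) - 1          # simple polynomial hash
    power = pow(base, window - 1, mod)      # for removing the oldest byte
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= window:                     # slide the window forward
            h = (h - data[i - window] * power) % mod
        h = (h * base + b) % mod
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):                   # trailing partial chunk
        chunks.append(data[start:])
    return chunks
```

Because boundaries depend only on local content, inserting bytes early in the stream shifts at most a few nearby chunk boundaries, which is what lets identical regions deduplicate; but chunk sizes are random, which is the variability discussed above.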

2. Data Characterization Effects on Deduplication
Data deduplication is a data-dependent process whose performance metrics are determined by the input data as well as by the algorithms and techniques used. While algorithmic complexity and technical overheads can be quantified, it has so far been impossible to quantify how much the data content itself affects system deduplication performance. This study statistically analyzes how different data sets affect deduplication metrics such as compression, read/write throughput, and deletion overhead. Through this analysis we hope to characterize data sets by their effect on the metrics of interest. Based on these statistics, we hope to provide the data deduplication community with a standardized set of workloads for system evaluation.
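For reference, the compression-related metrics above reduce to a few simple counts over the chunked stream. The sketch below (names and the choice of SHA-1 fingerprints are illustrative assumptions) computes the logical size, the stored size after duplicate elimination, and the resulting space saving.

```python
import hashlib

def dedupe_metrics(chunks):
    """Space-saving statistics for a chunked byte stream: only the
    first occurrence of each distinct chunk is counted as stored."""
    seen = set()
    stored = 0
    total = sum(len(c) for c in chunks)
    for c in chunks:
        fp = hashlib.sha1(c).digest()       # chunk fingerprint
        if fp not in seen:
            seen.add(fp)
            stored += len(c)                # first copy must be stored
    return {
        "logical_bytes": total,
        "stored_bytes": stored,
        "distinct_chunks": len(seen),
        "space_saving": 1 - stored / total if total else 0.0,
    }
```

Throughput and deletion overhead, by contrast, depend on how fingerprints and chunks are laid out on disk, which is why they are measured on running systems rather than derived from counts alone.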


1. Frequency-based Chunking
We proposed a novel chunking algorithm, Frequency-based Chunking, which produces chunks with relatively high occurrence frequencies. In extensive experiments comparing our scheme against an existing content-based chunking scheme, our approach achieves significantly better results with respect to space saving, the number of distinct chunks, and the average chunk size. We are continuing to investigate issues such as running-time performance and scalability across various data sets.
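One way to realize the frequency-based idea is a two-pass scheme: first count how often candidate chunks occur, then keep frequent candidates as chunks of their own and merge infrequent neighbors. The sketch below is a hypothetical simplification (fixed-size candidates, a `min_count` threshold), not the exact algorithm proposed in this work.

```python
import hashlib
from collections import Counter

def frequency_based_chunking(data: bytes, size=4096, min_count=2):
    """Hypothetical two-pass sketch: pass 1 counts fixed-size candidate
    chunks by fingerprint; pass 2 emits a candidate as its own chunk
    only if it occurs at least `min_count` times, merging runs of
    infrequent candidates into one larger chunk."""
    candidates = [data[i:i + size] for i in range(0, len(data), size)]
    counts = Counter(hashlib.sha1(c).digest() for c in candidates)
    chunks, pending = [], b""
    for c in candidates:
        if counts[hashlib.sha1(c).digest()] >= min_count:
            if pending:                 # flush merged infrequent run
                chunks.append(pending)
                pending = b""
            chunks.append(c)            # frequent chunk kept whole
        else:
            pending += c                # infrequent: merge with neighbors
    if pending:
        chunks.append(pending)
    return chunks
```

The effect is that chunks likely to recur stay small and dedupe well, while one-off data is coalesced into fewer, larger chunks, reducing the number of distinct chunks the index must track.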

2. Data Characterization
So far, we have quantified how characteristics of the original file structure, such as file size and text versus binary content, affect the compression and throughput of the data deduplication process. We are currently testing how various backup policies also affect this process. We were able to show that the amount of change from backup to backup is not the dominant data characteristic for throughput or for the deletion overhead of the system: both the locality of the data and the hot/cold characteristics of the data segments must be considered. To this end, we have applied a machine learning technique to see whether characteristics of the data can be learned and used to predict future patterns. On the single test set we have, it shows significant improvement over previous approaches that consider only the amount of change.


Last modified 30 Oct 2009