Deduplication for Virtual Machine Disk Images

Published as Storage Systems Research Center Technical Report UCSC-SSRC-10-01.

Abstract

Virtual machines are becoming widely used in both desktop and server environments to efficiently provide many logically separate execution environments while reducing the need for physical machines. While this approach utilizes free physical CPU resources, it still consumes large amounts of storage because each virtual machine (VM) instance requires its own multi-gigabyte disk image. Moreover, existing systems do not support ad hoc block sharing between disk images, instead relying on techniques such as overlays to build multiple VMs from a single “base” image.

Deduplication is a commonly used technique in archival storage systems and virtualization architectures. The concept is similar to data compression: deduplication finds identical instances of data blocks in a storage repository and removes all such instances but one. Indexing is also employed to enable high-performance global identity detection. In an archival storage system, deduplication is an ideal approach to save disk I/O as well as storage space. In virtual machine host centers, deduplication lets homogeneous operating systems share not only file system data but also common memory pages. Furthermore, by identifying duplicate data within the spatial locality of each chunk, extra space savings can be achieved without much extra time invested.
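The idea of keeping one copy of each identical block and locating it through an index can be illustrated with a minimal sketch. It is not the report's implementation: the function names `dedup_store` and `restore`, the 4096-byte chunk size, and the use of SHA-1 as the chunk identifier are all assumptions chosen for illustration.

```python
import hashlib


def dedup_store(images, chunk_size=4096):
    """Split each image into fixed-size chunks and keep one copy per distinct chunk.

    Hypothetical helper: chunk_size and SHA-1 are illustrative assumptions.
    Returns (index, recipes), where index maps digest -> chunk bytes and
    recipes holds, per image, the ordered list of digests needed to rebuild it.
    """
    index = {}
    recipes = []
    for data in images:
        recipe = []
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            # Store the chunk only the first time this digest is seen.
            index.setdefault(digest, chunk)
            recipe.append(digest)
        recipes.append(recipe)
    return index, recipes


def restore(index, recipe):
    """Rebuild an image by concatenating its chunks in recipe order."""
    return b"".join(index[d] for d in recipe)


if __name__ == "__main__":
    # Two toy "disk images" that share their first block.
    img_a = b"A" * 4096 + b"B" * 4096
    img_b = b"A" * 4096 + b"C" * 4096
    index, recipes = dedup_store([img_a, img_b])
    stored = sum(len(c) for c in index.values())
    print(f"raw bytes: {len(img_a) + len(img_b)}, stored bytes: {stored}")
```

In this toy case the two images contain four blocks in total but only three distinct ones, so the shared block is stored once and both images can still be rebuilt from their recipes.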

To test the effectiveness of deduplication, we conducted extensive evaluations on different sets of virtual machine disk images with different chunking strategies. Our experiments found that the amount of stored data grows very slowly after the first few virtual disk images if only the locale or software configuration is changed, with the rate of compression suffering when different versions of an operating system or different operating systems are included. We also show that fixed-length chunks work well, achieving nearly the same compression rate as variable-length chunks. We further show that simply identifying zero-filled blocks, even in ready-to-use virtual machine disk images available online, can provide significant savings in storage. Finally, we propose an approach to incorporate delta encoding into regular deduplication as a post-processing step. Experimental results indicate that, for certain virtual machines, delta encoding yields as much space savings as deduplication, while the extra time consumption is low.
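The zero-filled block observation is simple to picture in code. The sketch below counts how many fixed-length blocks of an image consist only of zero bytes; the function name `zero_block_stats` and the 4096-byte block size are assumptions for illustration, not values taken from the report.

```python
def zero_block_stats(data, chunk_size=4096):
    """Count fixed-length blocks that contain only zero bytes.

    Hypothetical helper: chunk_size is an illustrative assumption.
    Returns (zero_blocks, total_blocks). A short trailing block is
    counted as zero-filled if all of its bytes are zero.
    """
    zero = bytes(chunk_size)  # reference block of chunk_size zero bytes
    zero_blocks = 0
    total_blocks = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        total_blocks += 1
        if chunk == zero[:len(chunk)]:
            zero_blocks += 1
    return zero_blocks, total_blocks


if __name__ == "__main__":
    # Toy image: two empty blocks followed by one block of data.
    image = bytes(8192) + b"x" * 4096
    zeros, total = zero_block_stats(image)
    print(f"{zeros} of {total} blocks are zero-filled")
```

A store that recognizes such blocks needs to keep no data for them at all, only a marker in the image's block list, which is why this check can save space even before any hash-based comparison is done.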

Publication date:
September 2010

Authors:
Keren Jin

Projects:
Deduplication

Available media

Full paper text: PDF

BibTeX entry

@techreport{jin-tr10,
  author       = {Keren Jin},
  title        = {Deduplication for Virtual Machine Disk Images},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-10-01},
  month        = sep,
  year         = {2010},
}
Last modified 28 May 2019