
Avoiding the Disk Bottleneck in the Data Domain Deduplication File System

Benjamin Zhu, Data Domain, Inc.

Kai Li, Data Domain, Inc. and Princeton University

Hugo Patterson, Data Domain, Inc.

Abstract

Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment.

This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.

1 Introduction

The massive storage requirements for data protection have presented a serious problem for data centers. Typically, data centers perform a weekly full backup of all the data on their primary storage systems to secondary storage devices where they keep these backups for weeks to months. In addition, they may perform daily incremental backups that copy only the data which has changed since the last backup. The frequency, type and retention of backups vary for different kinds of data, but it is common for the secondary storage to hold 10 to 20 times more data than the primary storage. For disaster recovery, additional offsite copies may double the secondary storage capacity needed. If the data is transferred offsite over a wide area network, the network bandwidth requirement can be enormous.

Given the data protection use case, there are two main requirements for a secondary storage system storing backup data. The first is low cost so that storing backups and moving copies offsite does not end up costing significantly more than storing the primary data. The second is high performance so that backups can complete in a timely fashion. In many cases, backups must complete overnight so the load of performing backups does not interfere with normal daytime usage.

The traditional solution has been to use tape libraries as secondary storage devices and to transfer physical tapes for disaster recovery. Tape cartridges cost a small fraction of disk storage systems and they have good sequential transfer rates in the neighborhood of 100 MB/sec. But, managing cartridges is a manual process that is expensive and error prone. It is quite common for restores to fail because a tape cartridge cannot be located or has been damaged during handling. Further, random access performance, needed for data restores, is extremely poor. Disk-based storage systems and network replication would be much preferred if they were affordable.

During the past few years, disk-based, “deduplication” storage systems have been introduced for data protection [QD02, MCM01, KDLT04, Dat05, JDT05]. Such systems compress data by removing duplicate data across files and often across all the data in a storage system. Some implementations achieve a 20:1 compression ratio (total data size divided by physical space used) for 3 months of backup data using the daily-incremental and weekly-full backup policy. By substantially reducing the footprint of versioned data, deduplication can make the costs of storage on disk and tape comparable and make replicating data over a WAN to a remote site for disaster recovery practical.


The specific deduplication approach varies among system vendors. Certainly the different approaches vary in how effectively they reduce data. But, the goal of this paper is not to investigate how to get the greatest data reduction, but rather how to do deduplication at high speed in order to meet the performance requirement for secondary storage used for data protection.

The most widely used deduplication method for secondary storage, which we call Identical Segment Deduplication, breaks a data file or stream into contiguous segments and eliminates duplicate copies of identical segments. Several emerging commercial systems have used this approach.

The focus of this paper is to show how to implement a high-throughput Identical Segment Deduplication storage system at low system cost. The key performance challenge is finding duplicate segments. Given a segment size of 8 KB and a performance target of 100 MB/sec, a deduplication system must process approximately 12,000 segments per second.

An in-memory index of all segment fingerprints could easily achieve this performance, but the size of the index would limit system size and increase system cost. Consider a segment size of 8 KB and a segment fingerprint size of 20 bytes. Supporting 8 TB worth of unique segments would require 20 GB just to store the fingerprints.

An alternative approach is to maintain an on-disk index of segment fingerprints and use a cache to accelerate segment index accesses. Unfortunately, a traditional cache would not be effective for this workload. Since fingerprint values are random, there is no spatial locality in the segment index accesses. Moreover, because the backup workload streams large data sets through the system, there is very little temporal locality. Most segments are referenced just once every week during the full backup of one particular system. Reference-based caching algorithms such as LRU do not work well for such workloads. The Venti system, for example, implemented such a cache [QD02]. Its combination of index and block caches only improves its write throughput by about 16% (from 5.6 MB/sec to 6.5 MB/sec) even with 8 parallel disk index lookups. The primary reason is its low cache hit ratios. With low cache hit ratios, most index lookups require disk operations. If each index lookup requires a disk access which may take 10 msec and 8 disks are used for index lookups in parallel, the write throughput will be about 6.4 MB/sec, roughly corresponding to Venti's throughput of less than 6.5 MB/sec with 8 drives. While Venti's performance may be adequate for the archival usage of a small workgroup, it is a far cry from the throughput goal of deduplicating at 100 MB/sec to compete with high-end tape libraries. Achieving 100 MB/sec would require 125 disks doing index lookups in parallel! This would increase the system cost of deduplication storage to an unattainable level.

Our key idea is to use a combination of three methods to reduce the need for on-disk index lookups during the deduplication process. We present in detail each of the three techniques used in the production Data Domain deduplication file system. The first is to use a Bloom filter, which we call a Summary Vector, as the summary data structure to test if a data segment is new to the system. It avoids wasted lookups for segments that do not exist in the index. The second is to store data segments and their fingerprints in the same order that they occur in a data file or stream. Such Stream-Informed Segment Layout (SISL) creates spatial locality for segment and fingerprint accesses. The third, called Locality Preserved Caching, takes advantage of the segment layout to fetch and cache groups of segment fingerprints that are likely to be accessed together. A single disk access can result in many cache hits and thus avoid many on-disk index lookups.

Our evaluation shows that these techniques are effective in removing the disk bottleneck in an Identical Segment Deduplication storage system. For a system running on a server with two dual-core CPUs and one shelf of 15 drives, these techniques can eliminate about 99% of index lookups for variable-length segments with an average size of about 8 KB. We show that the system indeed delivers high throughput: achieving over 100 MB/sec for single-stream write and read performance, and over 210 MB/sec for multi-stream write performance. This is an order-of-magnitude throughput improvement over the parallel indexing techniques presented in the Venti system.

The rest of the paper is organized as follows. Section 2 presents challenges and observations in designing a deduplication storage system for data protection. Section 3 describes the software architecture of the production Data Domain deduplication file system. Section 4 presents our methods for avoiding the disk bottleneck. Section 5 shows our experimental results. Section 6 gives an overview of the related work, and Section 7 draws conclusions.
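As a quick sanity check before moving on, the sizing figures quoted in this introduction work out as follows. This is a minimal sketch; the 8 KB segment size, 20-byte fingerprint, 100 MB/sec target, and 8 TB of unique segments are the values stated above.

```python
# Back-of-the-envelope check of the sizing figures in this introduction.
SEGMENT_SIZE = 8 * 1024            # 8 KB average segment
FINGERPRINT_SIZE = 20              # bytes per SHA-1 segment fingerprint
TARGET_THROUGHPUT = 100 * 10**6    # 100 MB/sec deduplication target
UNIQUE_DATA = 8 * 10**12           # 8 TB of unique segments

segments_per_sec = TARGET_THROUGHPUT / SEGMENT_SIZE
index_size_bytes = (UNIQUE_DATA / SEGMENT_SIZE) * FINGERPRINT_SIZE

print(f"segments to process per second: {segments_per_sec:,.0f}")            # ~12,000
print(f"in-memory fingerprint index: {index_size_bytes / 10**9:.0f} GB")      # ~20 GB
```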

2 Challenges and Observations

2.1 Variable vs. Fixed Length Segments

An Identical Segment Deduplication system could choose to use either fixed length segments or variable length segments created in a content dependent manner. Fixed length segments are the same as the fixed-size blocks of many non-deduplication file systems. For the purposes of this discussion, extents that are multiples of


sequence of client data bytes and has intrinsic and client-settable attributes or metadata. An object may be a conventional file, a backup image of an entire volume, or a tape cartridge.

To write a range of bytes into an object, Content Store performs several operations.

Anchoring partitions the byte range into variable-length segments in a content dependent manner [Man93, BDH94].

Segment fingerprinting computes the SHA-1 hash and generates the segment descriptor based on it. Each segment descriptor contains per-segment information of at least fingerprint and size.

Segment mapping builds the tree of segments that records the mapping between object byte ranges and segment descriptors. The goal is to represent a data object using references to deduplicated segments.

To read a range of bytes in an object, Content Store traverses the tree of segments created by the segment mapping operation above to obtain the segment descriptors for the relevant segments. It fetches the segments from Segment Store and returns the requested byte range to the client.

3.2 Segment Store

Segment Store is essentially a database of segments keyed by their segment descriptors. To support writes, it accepts segments with their segment descriptors and stores them. To support reads, it fetches segments designated by their segment descriptors.

To write a data segment, Segment Store performs several operations.

Segment filtering determines if a segment is a duplicate. This is the key operation to deduplicate segments and may trigger disk I/Os, thus its overhead can significantly impact throughput performance.

Container packing adds segments to be stored to a container, which is the unit of storage in the system. The packing operation also compresses segment data using a variation of the Ziv-Lempel algorithm. A container, when fully packed, is appended to the Container Manager.

Segment indexing updates the segment index that maps segment descriptors to the container holding the segment, after the container has been appended to the Container Manager.

To read a data segment, Segment Store performs the following operations.

Segment lookup finds the container storing the requested segment. This operation may trigger disk I/Os to look in the on-disk index, thus it is throughput sensitive.

Container retrieval reads the relevant portion of the indicated container by invoking the Container Manager.

Container unpacking decompresses the retrieved portion of the container and returns the requested data segment.

Figure 2: Containers are self-describing, immutable units of storage several megabytes in size. All segments are stored in containers.

3.3 Container Manager

The Container Manager provides a storage container log abstraction, not a block abstraction, to Segment Store. Containers, shown in Figure 2, are self-describing in that a metadata section includes the segment descriptors for the stored segments. They are immutable in that new containers can be appended and old containers deleted, but containers cannot be modified once written. When Segment Store appends a container, the Container Manager returns a container ID which is unique over the life of the system.

The Container Manager is responsible for allocating, deallocating, reading, writing and reliably storing containers. It supports reads of the metadata section or a portion of the data section, but it only supports appends of whole containers. If a container is not full but needs to be written to disk, it is padded out to its full size. Container Manager is built on top of standard block storage. Advanced techniques such as Software RAID-6, continuous data scrubbing, container verification, and end-to-end data checks are applied to ensure a high level of data integrity and reliability.

The container abstraction offers several benefits.
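To make the container log abstraction concrete, the following is a minimal sketch of a Container Manager in Python; the 4 MB container size, the descriptor fields, and the method names are illustrative assumptions, not the production on-disk format or API. The property the sketch preserves is that containers are written only as whole, immutable units, while the small metadata section can be read on its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import hashlib

CONTAINER_SIZE = 4 * 1024 * 1024   # illustrative container size (a few megabytes)

@dataclass
class SegmentDescriptor:
    fingerprint: bytes   # SHA-1 of the segment contents
    size: int            # segment size in bytes

@dataclass
class Container:
    """Self-describing, immutable unit of storage: a metadata section plus a data section."""
    container_id: int
    metadata: List[SegmentDescriptor] = field(default_factory=list)
    data: bytes = b""

class ContainerManager:
    """Append-only container log; containers are never modified once written."""
    def __init__(self) -> None:
        self._log: Dict[int, Container] = {}
        self._next_id = 0

    def append(self, metadata: List[SegmentDescriptor], data: bytes) -> int:
        """Append a whole container (padded to full size) and return its unique ID."""
        padded = data.ljust(CONTAINER_SIZE, b"\0")
        cid = self._next_id
        self._next_id += 1
        self._log[cid] = Container(cid, list(metadata), padded)
        return cid

    def read_metadata(self, cid: int) -> List[SegmentDescriptor]:
        """Read only the small metadata section of a container."""
        return self._log[cid].metadata

    def read_data(self, cid: int) -> bytes:
        """Read the data section of a container."""
        return self._log[cid].data

def make_descriptor(segment: bytes) -> SegmentDescriptor:
    return SegmentDescriptor(hashlib.sha1(segment).digest(), len(segment))
```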


Summary Vector to disk. To recover, the system loads the most recent checkpoint of the Summary Vector and then processes the containers appended to the container log since the checkpoint, adding the contained segments to the Summary Vector.

Although several variations of Bloom filters have been proposed during the past few years [BM05], we have chosen the basic Bloom Filter for simplicity and efficient implementation.
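As an illustration, here is a minimal sketch of a basic Bloom filter used as a Summary Vector; the bit-array size, number of hash functions, and hashing scheme are illustrative assumptions rather than the production parameters. A negative answer means a segment is definitely new, so the write path can skip the on-disk index lookup entirely; a positive answer only means the segment may be stored.

```python
import hashlib

class SummaryVector:
    """Basic Bloom filter: may report false positives, never false negatives."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions from the segment fingerprint (illustrative scheme).
        for i in range(self.num_hashes):
            h = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, fingerprint: bytes) -> bool:
        # False: the segment is definitely new. True: the segment may already be stored.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))
```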

4.2 Stream-Informed Segment Layout

We use Stream-Informed Segment Layout (SISL) to create spatial locality for both segment data and segment descriptors and to enable Locality Preserved Caching as described in the next section. A stream here is just the sequence of bytes that make up a backup image stored in a Content Store object.

Our main observation is that in backup applications, segments tend to reappear in the same or very similar sequences with other segments. Consider a 1 MB file with a hundred or more segments. Every time that file is backed up, the same sequence of a hundred segments will appear. If the file is modified slightly, there will be some new segments, but the rest will appear in the same order. When new data contains a duplicate segment x, there is a high probability that other segments in its locale are duplicates of the neighbors of x. We call this property segment duplicate locality. SISL is designed to preserve this locality.

Content Store and Segment Store support a stream abstraction that segregates the segments created for different objects, preserves the logical ordering of segments within the Content Store object, and dedicates containers to hold segments for a single stream in their logical order. The metadata sections of these containers store segment descriptors in their logical order. Multiple streams can be written to Segment Store in parallel, but the stream abstraction prevents the segments for the different streams from being jumbled together in a container.

The design decision to make the deduplication storage system stream aware is a significant distinction from other systems such as Venti.

When an object is opened for writing, Content Store opens a corresponding stream with Segment Store, which in turn assigns a container to the stream. Content Store writes ordered batches of segments for the object to the stream. Segment Store packs the new segments into the data section of the dedicated container, performs a variation of Ziv-Lempel compression on the data section, and writes segment descriptors into the metadata section of the container. When the container fills up, Segment Store appends it with the Container Manager and starts a new container for the stream. Because multiple streams can write to Segment Store in parallel, there may be multiple open containers, one for each active stream.

The end result is Stream-Informed Segment Layout, or SISL, because for a data stream, new data segments are stored together in the data sections, and their segment descriptors are stored together in the metadata section. SISL offers many benefits.

When multiple segments of the same data stream are written to a container together, many fewer disk I/Os are needed to reconstruct the stream, which helps the system achieve high read throughput.

Descriptors and compressed data of adjacent new segments in the same stream are packed linearly in the metadata and data sections, respectively, in the same container. This packing captures duplicate locality for future streams resembling this stream, and enables Locality Preserved Caching to work effectively.

The metadata section is stored separately from the data section, and is generally much smaller than the data section. For example, a container size of 4 MB, an average segment size of 8 KB, and a Ziv-Lempel compression ratio of 2 yield about 1K segments in a container, and require a metadata section of only about 64 KB at a segment descriptor size of 64 bytes. The small granularity of container metadata section reads allows Locality Preserved Caching to operate in a highly efficient manner: the fingerprints of 1K segments can be cached using a single small disk I/O. This contrasts with the old way of one on-disk index lookup per segment.

These advantages make SISL an effective mechanism for deduplicating multiple-stream fine-grained data segments. Packing containers in a stream aware fashion distinguishes our system from Venti and many other systems.
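As a quick check of the container metadata arithmetic above, the following sketch uses the figures quoted in the text: a 4 MB container, 8 KB average segments, a local compression ratio of 2, and 64-byte descriptors.

```python
CONTAINER_SIZE = 4 * 1024 * 1024     # 4 MB container
AVG_SEGMENT = 8 * 1024               # 8 KB average segment
LOCAL_COMPRESSION = 2                # Ziv-Lempel ratio of about 2
DESCRIPTOR_SIZE = 64                 # bytes per segment descriptor

segments_per_container = CONTAINER_SIZE // (AVG_SEGMENT // LOCAL_COMPRESSION)
metadata_section_bytes = segments_per_container * DESCRIPTOR_SIZE

print(segments_per_container)             # ~1K segments per container
print(metadata_section_bytes // 1024)     # ~64 KB metadata section
```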

4.3 Locality Preserved Caching

We use Locality Preserved Caching (LPC) to accelerate the process of identifying duplicate segments.

A traditional cache does not work well for caching fingerprints, hashes, or descriptors for duplicate detection because fingerprints are essentially random. Since it is difficult to predict the index location of the next segment without going through the actual index access, the miss ratio of a traditional cache will be extremely high.

We apply LPC to take advantage of segment duplicate locality so that if a segment is a duplicate, the base segment is highly likely to be cached already. LPC is achieved by combining the container abstraction with a segment cache, as discussed next.

LPC implements a segment cache to cache likely base segment descriptors for future duplicate segments. The segment cache maps a segment fingerprint to its corresponding container ID. Our main idea is to maintain the segment cache by groups of fingerprints. On a miss, LPC fetches the entire metadata section of a container, inserts all fingerprints in the metadata section into the cache, and removes all fingerprints of an old metadata section from the cache together. This method preserves the locality of the fingerprints of a container in the cache.

The operations for the segment cache are:

Init(): Initialize the segment cache.

Insert(container): Iterate through all segment descriptors in the container metadata section, and insert each descriptor and container ID into the segment cache.

Remove(container): Iterate through all segment descriptors in the container metadata section, and remove each descriptor and container ID from the segment cache.

Lookup(fingerprint): Find the corresponding container ID for the specified fingerprint.

Descriptors of all segments in a container are added to or removed from the segment cache at once. Segment caching is typically triggered by a duplicate segment that misses in the segment cache and requires a lookup in the segment index. As a side effect of finding the corresponding container ID in the segment index, we prefetch all segment descriptors in this container into the segment cache. We call this Locality Preserved Caching. The intuition is that base segments in this container are likely to be checked against future duplicate segments, based on segment duplicate locality. Our results on real world data have validated this intuition overwhelmingly.

We have implemented the segment cache using a hash table. When the segment cache is full, containers that are ineffective in accelerating segment filtering are leading candidates for replacement from the segment cache. A reasonable cache replacement policy is Least-Recently-Used (LRU) on cached containers.

4.4 Accelerated Segment Filtering

We have combined all three techniques above in the segment filtering phase of our implementation. We aim to keep segment index lookups to a minimum in segment filtering. For segments that cannot be resolved by the Summary Vector and LPC, we resort to looking up the segment in the segment index. We have two goals for this retrieval:

Making the retrieval a relatively rare occurrence.

Ensuring that whenever the retrieval is made, it benefits segment filtering of future segments in the locale.

For an incoming segment for write, the algorithm does the following:

Check whether the segment is in the segment cache. If it is in the cache, the incoming segment is a duplicate.

If it is not in the segment cache, check the Summary Vector. If it is not in the Summary Vector, the segment is new; write the new segment into the current container.

If it is in the Summary Vector, look up the segment index for its container ID. If it is in the index, the incoming segment is a duplicate; insert the metadata section of that container into the segment cache, removing the metadata section of the least recently used container first if the segment cache is full. If it is not in the segment index, the segment is new; write the new segment into the current container.
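The filtering path described above can be summarized in a short sketch. The class and method names, the container-granularity LRU policy, and the reuse of the hypothetical SummaryVector and ContainerManager sketches from earlier sections are illustrative assumptions, not the production implementation.

```python
from collections import OrderedDict

class SegmentIndex:
    """Stand-in for the on-disk segment index: fingerprint -> container ID."""
    def __init__(self) -> None:
        self.entries = {}

    def lookup(self, fingerprint):
        # Models the expensive on-disk index access.
        return self.entries.get(fingerprint)

    def insert(self, fingerprint, container_id) -> None:
        self.entries[fingerprint] = container_id

class SegmentCache:
    """LPC segment cache: fingerprints enter and leave one whole container at a time."""
    def __init__(self, max_containers: int = 1024) -> None:
        self.max_containers = max_containers
        self.containers = OrderedDict()   # container ID -> list of fingerprints (LRU order)
        self.fingerprints = {}            # fingerprint -> container ID

    def lookup(self, fingerprint):
        return self.fingerprints.get(fingerprint)

    def insert_container(self, container_id, descriptors) -> None:
        if container_id in self.containers:
            self.containers.move_to_end(container_id)
            return
        if len(self.containers) >= self.max_containers:
            _, evicted = self.containers.popitem(last=False)   # evict the LRU container
            for fp in evicted:
                self.fingerprints.pop(fp, None)
        fps = [d.fingerprint for d in descriptors]
        self.containers[container_id] = fps
        for fp in fps:
            self.fingerprints[fp] = container_id

def filter_segment(fp, segment_cache, summary_vector, segment_index, container_manager):
    """Return True if the segment with fingerprint fp is a duplicate."""
    if segment_cache.lookup(fp) is not None:        # 1. segment cache hit: duplicate
        return True
    if not summary_vector.may_contain(fp):          # 2. definitely new: skip the index entirely
        return False
    container_id = segment_index.lookup(fp)         # 3. rare on-disk index lookup
    if container_id is None:                        #    Summary Vector false positive: new segment
        return False
    # Duplicate: prefetch the container's whole metadata section (Locality Preserved Caching).
    segment_cache.insert_container(container_id, container_manager.read_metadata(container_id))
    return True
```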

5 Experimental Results

We would like to answer the following questions:

How well does the deduplication storage system work with real world datasets?

How effective are the three techniques in terms of reducing disk I/O operations?

What throughput can a deduplication storage system using these techniques achieve?

For the first question, we report our results with real world data from two customer data centers. For the next two questions, we conducted experiments with several internal datasets. Our experiments use a Data Domain DD580 deduplication storage system as an NFS v3 server [PJSS*94]. This deduplication system features two-socket dual-core CPUs running at 3 GHz, a total of 8 GB system memory, 2 gigabit NIC cards, and a 15-drive disk subsystem running software RAID-6 with one spare drive. We use 1 and 4 backup client computers running the NFS v3 client for sending data.

5.1 Results with Real World Data

The system described in this paper has been used at over 1,000 data centers. The following paragraphs report the deduplication results from two data centers, generated from the auto-support mechanism of the system.


Figure 4: Logical/Physical Capacities at Data Center A.

Figure 5: Compression Ratios at Data Center A.

Data center A backs up structured database data over the course of 31 days during the initial deployment of a deduplication system. The backup policy is to do daily full backups, where each full backup produces over 600 GB at steady state. There are two exceptions:

During the initial seeding phase (until the 6th day in this example), different data or different types of data are rolled into the backup set, as backup administrators figure out how they want to use the deduplication system. A low rate of duplicate segment identification and elimination is typically associated with the seeding phase.

There are certain days (the 18th day in this example) when no backup is generated.

Figure 4 shows the logical capacity (the amount of data from the user or backup application perspective) and the physical capacity (the amount of data stored on disk media) of the system over time at data center A.

At the end of the 31st day, the data center has backed up about 16.9 TB, and the corresponding physical capacity is less than 440 GB, reaching a total compression ratio of 38.54 to 1.

Figure 5 shows the daily global compression ratio (the daily rate of data reduction due to duplicate segment elimination), the daily local compression ratio (the daily rate of data reduction due to Ziv-Lempel style compression on new segments), the cumulative global compression ratio (the cumulative ratio of data reduction due to duplicate segment elimination), and the cumulative total compression ratio (the cumulative ratio of data reduction due to duplicate segment elimination and Ziv-Lempel style compression on new segments) over time.

At the end of the 31st day, the cumulative global compression ratio reaches 22.53 to 1, and the cumulative total compression ratio reaches 38.54 to 1.

The daily global compression ratios change quite a bit over time, whereas the daily local compression ratios are quite stable. Table 1 summarizes the minimum, maximum, average, and standard deviation of both daily global and daily local compression ratios, excluding seeding (the first 6) days and the no-backup (18th) day.

                            Min     Max     Average   Standard deviation
Daily global compression    10.05   74.31   40.63     13.73
Daily local compression     1.58    1.97    1.78      0.09

Table 1: Statistics on Daily Global and Daily Local Compression Ratios at Data Center A

Data center B backs up a mixture of structured database and unstructured file system data over the course of 48 days during the initial deployment of a deduplication system, using both full and incremental backups. Similar to that in data center A, seeding lasts until the 6th day, and there are a few days without backups (the 8th, 12th-14th, and 35th days). Outside these days, the maximum daily logical backup size is about 2.1 TB, and the smallest size is about 50 GB.

Figure 6 shows the logical capacity and the physical capacity of the system over time at data center B.

At the end of the 48th day, the logical capacity reaches about 41.4 TB, and the corresponding physical capacity is about 3.0 TB. The total compression ratio is 13.71 to 1.

Figure 7 shows the daily global compression ratio, daily local compression ratio, cumulative global compression ratio, and cumulative total compression ratio over time.

At the end of the 48th day, cumulative global compression reaches 6.85, while cumulative total compression reaches 13.71.


Figure 6: Logical/Physical Capacities at Data Center B.

Figure 7: Compression Ratios at Data Center B.

Table 2 summarizes the minimum, maximum, average, and standard deviation of both daily global and daily local compression ratios, excluding seeding and days without backup.

                            Min    Max     Average   Standard deviation
Daily global compression    5.09   45.16   13.92     9.08
Daily local compression     1.40   4.13    2.33      0.57

Table 2: Statistics on Daily Global and Daily Local Compression Ratios at Data Center B

The two sets of results show that the deduplication storage system works well with the real world datasets. As expected, both cumulative global and cumulative total compression ratios increase as the system holds more backup data.

During seeding, duplicate segment elimination tends to be ineffective, because most segments are new. After seeding, despite the large variation in the actual number, duplicate segment elimination becomes extremely effective. Independent of seeding, Ziv-Lempel style compression is relatively stable, giving a reduction of about 2 over time. The real world observations on the applicability of duplicate segment elimination during seeding and after seeding are particularly relevant in evaluating our techniques to reduce disk accesses below.

5.2 I/O Savings with Summary Vector and Locality Preserved Caching

To determine the effectiveness of the Summary Vector and Locality Preserved Caching, we examine the savings in disk reads to find duplicate segments using a Summary Vector and Locality Preserved Caching.

We use two internal datasets for our experiment. One is a daily full backup of a company-wide Exchange information store over a 135-day period. The other is the weekly full and daily incremental backup of an Engineering department over a 100-day period. Table 3 summarizes key attributes of these two datasets.

                                                       Exchange data   Engineering data
Logical capacity (TB)                                  2.76            2.54
Physical capacity after deduplicating segments (TB)    0.49            0.50
Global compression                                     5.69            5.04
Physical capacity after local compression (TB)         0.22            0.261
Local compression                                      2.17            1.93
Total compression                                      12.36           9.75

Table 3: Capacities and Compression Ratios on Exchange and Engineering Datasets

These internal datasets are generated from production usage (albeit internal). We also observe that the various compression ratios produced by the internal datasets are relatively similar to those of the real world examples examined in Section 5.1. We believe these internal datasets are reasonable proxies of real world deployments.

Each of the backup datasets is sent to the deduplicating storage system with a single backup stream. With respect to the deduplication storage system, we measure the number of disk reads for segment index lookups and locality prefetches needed to find duplicates during write for four cases:

(1) with neither Summary Vector nor Locality Preserved Caching;

(2) with Summary Vector only;

(3) with Locality Preserved Caching only; and


(4) with both Summary Vector and Locality Preserved Caching.

The results are shown in Table 4.

Clearly, the Summary Vector and Locality Preserved Caching combined have produced an astounding reduction in disk reads. The Summary Vector alone reduces about 16.5% and 18.6% of the index lookup disk I/Os for the Exchange and Engineering data respectively. Locality Preserved Caching alone reduces about 82.4% and 81% of the index lookup disk I/Os for the Exchange and Engineering data respectively. Together they are able to reduce the index lookup disk I/Os by 98.94% and 99.6% respectively.

In general, the Summary Vector is very effective for new data, and Locality Preserved Caching is highly effective for little or moderately changed data. For backup data, the first full backup (seeding equivalent) does not have as many duplicate data segments as subsequent full backups. As a result, the Summary Vector is effective to avoid disk I/Os for the index lookups during the first full backup, whereas Locality Preserved Caching is highly beneficial for subsequent full backups. This result also suggests that these two datasets exhibit good duplicate locality.

5.3 Throughput

To determine the throughput of the deduplication storage system, we used a synthetic dataset driven by client computers. The synthetic dataset was developed to model backup data from multiple backup cycles and multiple backup streams, where each backup stream can be generated on the same or a different client computer. The dataset is made up of synthetic data generated on the fly from one or more backup streams. Each backup stream is made up of an ordered series of synthetic data versions, where each successive version ("generation") is a somewhat modified copy of the preceding generation in the series. The generation-to-generation modifications include data reordering, deletion of existing data, and addition of new data. Single-client backup over time is simulated when synthetic data generations from a backup stream are written to the deduplication storage system in generation order, where significant amounts of data are unchanged day-to-day or week-to-week, but where small changes continually accumulate. Multi-client backup over time is simulated when synthetic data generations from multiple streams are written to the deduplication system in parallel, each stream in generation order.

There are two main advantages of using the synthetic dataset. The first is that various compression ratios can be built into the synthetic model, so usages approximating various real world deployments can be tested easily in house. The second is that one can use relatively inexpensive client computers to generate an arbitrarily large amount of synthetic data in memory without disk I/Os and write it in one stream to the deduplication system at more than 100 MB/sec. Multiple cheap client computers can combine in multiple streams to saturate the intake of the deduplication system in a switched network environment. We find it both much more costly and technically challenging to accomplish the same feat using traditional backup software, high-end client computers attached to primary storage arrays as backup clients, and high-end servers as media/backup servers.

In our experiments, we choose an average generation (daily equivalent) global compression ratio of 30, and an average generation (daily equivalent) local compression ratio of 2 to 1 for each backup stream. These compression numbers seem possible given the real world examples in Section 5.1. We measure throughput for one backup stream using one client computer and 4 backup streams using two client computers, for write and read, over 10 generations of the backup datasets. The results are shown in Figures 8 and 9.
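The generation-based synthetic workload can be illustrated with a small sketch; the mutation fractions and segment counts below are arbitrary illustrative parameters, not the ones used in our experiments.

```python
import os
import random

def next_generation(segments, reorder_frac=0.01, delete_frac=0.01, add_frac=0.02,
                    segment_size=8 * 1024, rng=None):
    """Produce the next backup generation: mostly unchanged data, plus a little
    reordering, deletion of existing data, and addition of new data."""
    rng = rng or random.Random(0)
    gen = list(segments)
    for _ in range(int(len(gen) * reorder_frac)):        # reorder a small fraction
        i, j = rng.randrange(len(gen)), rng.randrange(len(gen))
        gen[i], gen[j] = gen[j], gen[i]
    for _ in range(int(len(gen) * delete_frac)):          # delete a small fraction
        gen.pop(rng.randrange(len(gen)))
    for _ in range(int(len(segments) * add_frac)):        # add some brand-new data
        gen.insert(rng.randrange(len(gen) + 1), os.urandom(segment_size))
    return gen

# Generation 0 is entirely new data; each later generation is a slightly modified copy.
generation = [os.urandom(8 * 1024) for _ in range(1000)]
for _day in range(10):
    generation = next_generation(generation)
```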

                                        Exchange data              Engineering data
                                        # disk I/Os    % of total  # disk I/Os    % of total
No Summary Vector and
no Locality Preserved Caching           328,613,503    100.00%     318,236,712    100.00%
Summary Vector only                     274,364,788    83.49%      259,135,171    81.43%
Locality Preserved Caching only         57,725,844     17.57%      60,358,875     18.97%
Summary Vector and
Locality Preserved Caching              3,477,129      1.06%       1,257,316      0.40%

Table 4: Index and locality reads. This table shows the number of disk reads to perform index lookups or fetches from the container metadata for the four combinations: with and without the Summary Vector, and with and without Locality Preserved Caching. Without either the Summary Vector or Locality Preserved Caching, there is an index read for every segment. The Summary Vector avoids these reads for most new segments. Locality Preserved Caching avoids index lookups for duplicate segments at the cost of an extra read to fetch a group of segment fingerprints from the container metadata for every cache miss for which the segment is found in the index.
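As a quick check, the index lookup reductions quoted in the text follow directly from the raw disk I/O counts in Table 4 (a sketch using the table's numbers).

```python
# Raw index-lookup disk I/O counts from Table 4.
baselines = {"Exchange": 328_613_503, "Engineering": 318_236_712}
counts = {
    "Exchange":    {"Summary Vector only": 274_364_788,
                    "Locality Preserved Caching only": 57_725_844,
                    "Summary Vector and LPC": 3_477_129},
    "Engineering": {"Summary Vector only": 259_135_171,
                    "Locality Preserved Caching only": 60_358_875,
                    "Summary Vector and LPC": 1_257_316},
}

for dataset, cases in counts.items():
    base = baselines[dataset]
    for case, n in cases.items():
        reduction = 100 * (1 - n / base)
        print(f"{dataset}: {case} removes {reduction:.2f}% of index lookup disk I/Os")
```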


Figure 8: Write Throughput of Single Backup Client and 4 Backup Clients.

Figure 9: Read Throughput of Single Backup Client and 4 Backup Clients.

The deduplication system delivers high write throughput results for both cases. In the single stream case, the system achieves write throughput of 110 MB/sec for generation 0 and over 113 MB/sec for generations 1 through 9. In the 4 stream case, the system achieves write throughput of 139 MB/sec for generation 0 and a sustained 217 MB/sec for generations 1 through 9.

Write throughput for generation 0 is lower because all segments are new and require Ziv-Lempel style compression by the CPU of the deduplication system.

The system delivers high read throughput results for the single stream case. Throughout all generations, the system achieves over 100 MB/sec read throughput. For the 4 stream case, the read throughput is 211 MB/sec for generation 0, 192 MB/sec for generation 1, 165 MB/sec for generation 2, and stays at around 140 MB/sec for later generations. The main reason for the decrease of read throughput in the later generations is that later generations have more duplicate data segments than the first few. However, the read throughput stays at about 140 MB/sec for later generations because of Stream-Informed Segment Layout and Locality Preserved Caching.

Note that write throughput has historically been valued more than read throughput for the backup use case, since backup has to complete within a specified backup window time period and it is a much more frequent event than restore. Read throughput is still very important, especially in the case of whole system restores.

5.4 Discussion

The techniques presented in this paper are general methods to improve the throughput performance of deduplication storage systems. Although our system divides a data stream into content-based segments, these methods can also apply to systems using fixed aligned segments, such as Venti.

As a side note, we have compared the compression ratios of a system segmenting data streams by contents (about 8 KB on average) with another system using fixed aligned 8 KB segments on the Engineering and Exchange backup datasets. We found that the fixed alignment approach gets basically no global compression (global compression: 1.01) for the Engineering data, whereas the system with content-based segmentation gets a lot of global compression (6.39:1). The main reason for the difference is that the backup software creates the backup dataset without realigning data at file boundaries. For the Exchange backup dataset, where the backup software aligns data at individual mailboxes, the global compression difference is smaller (6.61:1 vs. 10.28:1), though there is still a significant gap.

Fragmentation will become more severe for long term retention, and can reduce the effectiveness of Locality Preserved Caching. We have investigated mechanisms to reduce fragmentation and sustain high write and read throughput, but these mechanisms are beyond the scope of this paper.

6 Related Work

Much work on deduplication focused on basic methods and compression ratios, not on high throughput.

Early deduplication storage systems use file-level hashing to detect duplicate files and reclaim their storage space [ABCC*02, TKSK*03, KDLT04]. Since such


systems also use file hashes to address files, some call such systems content addressed storage, or CAS. Since their deduplication is at the file level, such systems can achieve only limited global compression.

Venti removes duplicate fixed-size data blocks by comparing their secure hashes [QD02]. It uses a large on-disk index with a straightforward index cache to look up fingerprints. Since fingerprints have no locality, their index cache is not effective. When using 8 disks to look up fingerprints in parallel, its throughput is still limited to less than 7 MB/sec. Venti used a container abstraction to lay out data on disks, but was stream agnostic, and did not apply Stream-Informed Segment Layout.

To tolerate shifted contents, modern deduplication systems remove redundancies at variable-size data blocks divided based on their contents. Manber described a method to determine anchor points of a large file when certain bits of rolling fingerprints are zeros [Man93] and showed that Rabin fingerprints [Rab81, Bro93] can be computed efficiently. Brin et al. [BDH94] described several ways to divide a file into content-based data segments and use such segments to detect duplicates in digital documents. Removing duplication at the content-based data segment level has been applied to network protocols and applications [SW00, SCPC*02, RLB03, MCK04] and has reduced network traffic for distributed file systems [MCM01, JDT05]. Kulkarni et al. evaluated the compression efficiency between an identity-based (fingerprint comparison of variable-length segments) approach and a delta-compression approach [KDLT04]. These studies have not addressed deduplication throughput issues.

The idea of using a Bloom filter [Blo70] to implement the Summary Vector is inspired by the summary data structure for the proxy cache in [FCAB98]. Their work also provided an analysis of the false positive rate. In addition, Broder and Mitzenmacher wrote an excellent survey on network applications of Bloom filters [BM05]. The TAPER system used a Bloom filter to detect duplicates instead of detecting if a segment is new [JDT05]. It did not investigate throughput issues.

7 Conclusions

This paper presents a set of techniques to substantially reduce disk I/Os in high-throughput deduplication storage systems.

We have shown that the Summary Vector can reduce disk index lookups by about 17% and Locality Preserved Caching can reduce disk index lookups by over 80%, but the combined caching techniques can reduce disk index lookups by about 99%.

Stream-Informed Segment Layout is an effective abstraction to preserve spatial locality and enable Locality Preserved Caching.

Our experiments show that the combination of these techniques can achieve over 210 MB/sec for 4 write data streams and over 140 MB/sec for 4 read data streams on a storage server with two dual-core processors and one shelf of 15 drives.

These techniques are general methods to improve the throughput performance of deduplication storage systems. Our techniques for minimizing disk I/Os to achieve good deduplication performance match well with the industry trend of building many-core processors. With quad-core CPUs already available, and eight-core CPUs just around the corner, it will be a relatively short time before a large-scale deduplication storage system shows up with 400 to 800 MB/sec throughput with a modest amount of physical memory.

8 References

[ABCC*02] A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. P. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of USENIX Operating Systems Design and Implementation (OSDI), December 2002.

[BDH94] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. 1994; also in Proceedings of ACM SIGMOD, 1995.

[Blo70] Burton H. Bloom. Space/time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7): 422-426, 1970.

[BM05] Andrei Z. Broder and Michael Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 2005.

[Dat05] Data Domain. Data Domain Appliance Series: High-Speed Inline Deduplication Storage, 2005.

[FCAB98] Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. In Proceedings of ACM SIGCOMM '98, Vancouver, Canada, 1998.

[JDT05] N. Jain, M. Dahlin, and R. Tewari. TAPER: Tiered Approach for Eliminating Redundancy in Replica Synchronization. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2005.

[KDLT04] P. Kulkarni, F. Douglis, J. D. LaVoie, and J. M. Tracey. Redundancy Elimination Within Large Collections of Files. In Proceedings of the USENIX Annual Technical Conference, pages 59-72, 2004.


[Man93] Udi Manber. Finding Similar Files in a Large File System. Technical Report TR 93-33, Department of Computer Science, University of Arizona, October 1993; also in Proceedings of the USENIX Winter 1994 Technical Conference, pages 17-21, 1994.

[MCK04] J. C. Mogul, Y.-M. Chan, and T. Kelly. Design, implementation, and evaluation of duplicate transfer detection in HTTP. In Proceedings of Network Systems Design and Implementation, 2004.

[MCM01] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A Low-bandwidth Network File System. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, Banff, Canada, October 2001.

[NIST95] National Institute of Standards and Technology, FIPS 180-1. Secure Hash Standard. US Department of Commerce, April 1995.

[PJSS*94] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz. NFS Version 3 Design and Implementation. In Proceedings of the USENIX Summer 1994 Technical Conference, 1994.

[QD02] S. Quinlan and S. Dorward. Venti: A New Approach to Archival Storage. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), January 2002.

[RLB03] S. C. Rhea, K. Liang, and E. Brewer. Value-based web caching. In WWW, pages 619-628, 2003.

[SCPC*02] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proceedings of USENIX Operating Systems Design and Implementation, 2002.

[SW00] N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of ACM SIGCOMM, pages 87-95, August 2000.

[TKSK*03] N. Tolia, M. Kozuch, M. Satyanarayanan, B. Karp, A. Perrig, and T. Bressoud. Opportunistic use of content addressable storage for distributed file systems. In Proceedings of the 2003 USENIX Annual Technical Conference, pages 127-140, San Antonio, TX, June 2003.

[YPL05] L. L. You, K. T. Pollack, and D. D. E. Long. Deep Store: An archival storage system architecture. In Proceedings of the IEEE International Conference on Data Engineering (ICDE '05), April 2005.

[ZL77] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23: 337-343, May 1977.
