Data deduplication is widely used in data backup systems because it significantly reduces storage capacity and network bandwidth requirements. However, deduplication performance gradually degrades as the volume of deduplicated data grows. This is because the number of fingerprints grows significantly with the backup data, so a large portion of the fingerprints has to be stored on disk drives. Locating those fingerprints then requires frequent disk accesses, which stalls the deduplication process. Furthermore, the fingerprints belonging to the same file may be scattered across the disk, generating small random disk accesses and causing significant performance degradation when the fingerprints are referenced. Additionally, a given fingerprint may appear only once during a backup, which leads to a very low cache hit ratio due to the lack of temporal locality. This paper proposes to exploit file similarity to enhance fingerprint prefetching, thereby improving the cache hit ratio and the performance of data deduplication. Furthermore, fingerprints are laid out sequentially according to the backup data stream to preserve locality and further improve performance. Experimental results demonstrate that the proposed approach effectively reduces the number of fingerprint accesses that go to disk and decreases the fingerprint query overhead, thus significantly alleviating the disk bottleneck of data deduplication.

The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in a timely fashion, so any added latency can cause substantial losses for the enterprises that depend on them. Similarity detection therefore plays a very important role in data management. Typical algorithms such as Shingle, Simhash, Traits, and the Traditional Sampling Algorithm (TSA) are widely used. Shingle, Simhash, and Traits read the entire source file to compute its similarity characteristic value, which consumes many CPU cycles, a large amount of memory, and a tremendous number of disk accesses; moreover, this overhead grows with the volume of the data set and results in long delays. TSA instead samples a few data blocks and computes fingerprints from them as the similarity characteristic value, so its overhead is fixed and negligible. However, a slight modification of the source file shifts the bit positions of the file content, so TSA inevitably fails to identify similarity after even small edits. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) that identifies file similarity in the cloud by applying a modulus to the file length. EPAS samples data blocks from both the head and the tail of the modulated file, which avoids the position shift introduced by modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and to bring the detected similarity probability closer to the actual probability. Furthermore, the paper describes a query algorithm that reduces the time overhead of similarity detection. Experimental results demonstrate that EPAS significantly outperforms existing well-known algorithms in terms of time overhead and CPU and memory occupation, and that it achieves a more favorable trade-off between precision and recall than other similarity detection algorithms.
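To make the head-and-tail sampling idea concrete, here is a minimal Python sketch of position-aware sampling. The function names, block size, number of samples per end, and the Jaccard-style overlap metric are illustrative assumptions for this post, not the parameters or the improved metric defined in the EPAS paper.

```python
import hashlib
import os

def epas_style_fingerprints(path, block_size=4096, samples_per_end=4):
    """Sketch of position-aware sampling: hash a few fixed-size blocks taken
    from the head and the tail of the file, with offsets derived from the
    file length (modulo), so that small edits in the middle do not shift
    every sampled position."""
    size = os.path.getsize(path)
    if size == 0:
        return []

    # Spread the sample offsets across each end of the file, wrapping by the
    # file length so the offsets stay valid even for very short files.
    stride = max(size // (2 * samples_per_end), 1)
    head_offsets = [(i * stride) % size for i in range(samples_per_end)]
    tail_offsets = [(size - block_size - i * stride) % size
                    for i in range(samples_per_end)]

    fingerprints = []
    with open(path, "rb") as f:
        for off in head_offsets + tail_offsets:
            f.seek(off)
            block = f.read(block_size)
            fingerprints.append(hashlib.sha1(block).hexdigest())
    return fingerprints

def similarity(fp_a, fp_b):
    """Jaccard-style overlap of two fingerprint sets; a simple stand-in for
    the paper's improved similarity metric."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / max(len(a | b), 1)
```

Because only a fixed number of blocks are read per file, the detection cost stays roughly constant regardless of file size, which is the property that distinguishes sampling-based approaches from whole-file methods such as Shingle or Simhash.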
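For the fingerprint-prefetching idea in the first abstract, a similarity-guided cache might look roughly like the sketch below. The index layout, the LRU policy, and the class and parameter names are assumptions made for illustration; they are not the data structures described in the paper.

```python
from collections import OrderedDict

class SimilarityGuidedCache:
    """Toy fingerprint cache: on a miss, prefetch the entire fingerprint
    sequence of the stored file judged most similar to the incoming file,
    so that its subsequent chunks are likely to hit in memory.  The on-disk
    index is modeled as {file_id: [fingerprints in stream order]}."""

    def __init__(self, disk_index, capacity=1 << 16):
        self.disk_index = disk_index      # file_id -> ordered fingerprint list
        self.capacity = capacity
        self.cache = OrderedDict()        # fingerprint -> file_id (LRU order)

    def _admit(self, fingerprint, file_id):
        self.cache[fingerprint] = file_id
        self.cache.move_to_end(fingerprint)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def lookup(self, fingerprint, similar_file_id=None):
        """Return True if the chunk is a duplicate.  `similar_file_id` is the
        stored file identified as most similar (e.g. by a sampling-based
        detector); its fingerprints are prefetched on a miss."""
        if fingerprint in self.cache:
            self.cache.move_to_end(fingerprint)
            return True
        # Miss: go to disk once and prefetch the similar file's fingerprints
        # in stream order, instead of fetching a single entry at a time.
        if similar_file_id is not None:
            for fp in self.disk_index.get(similar_file_id, []):
                self._admit(fp, similar_file_id)
        return fingerprint in self.cache
```

The point of the sketch is the access pattern: one sequential read of a similar file's fingerprint run replaces many small random index lookups, which is how similarity-guided prefetching relieves the disk bottleneck described above.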