Probabilistic Approach to Avoid Uncorrectable Bit Errors in Storage Systems

Authors

Bhuiyan, Masudul Hasan Masud

Issue Date

2020

Type

Thesis

Abstract

Silent data corruption in storage systems poses a significant risk to data integrity. While error correction codes (ECC) can recover from the majority of errors, a non-negligible portion escape ECC; these are referred to as uncorrectable errors. As the scale of storage systems increases, the mean time between uncorrectable errors shrinks from months to hours, necessitating efficient ways to detect and handle them. In this thesis, we propose prediction models for uncorrectable errors by analyzing 150M daily SMART logs from 143K hard drives collected over a period of five years. The models achieve up to 97% accuracy in uncorrectable bit error prediction while keeping the false positive rate below 3%. We further introduce two use cases that apply these error prediction models to (i) mitigate the I/O overhead of file transfer integrity verification on file systems and (ii) reduce the amount of I/O processed by disks with uncorrectable errors. Evaluation results show that running integrity verification only for disks with high error probability reduces the I/O overhead of file transfers by up to 97% while avoiding more than 90% of uncorrectable errors. Moreover, diverting I/O operations from high-risk disks to low-risk disks can reduce the amount of data exposed to an uncorrectable error by 80% while keeping the overhead on low-risk disks below 5%.
