Probabilistic Approach to Avoid Uncorrectable Bit Errors in Storage Systems
Authors
Bhuiyan, Masudul Hasan Masud
Issue Date
2020
Type
Thesis
Abstract
Silent data corruption in storage systems poses a significant risk to the integrity of data. While error correction codes (ECC) can recover the majority of errors, a
non-negligible portion of them escape ECC; these are referred to as uncorrectable errors. As
the scale of storage systems increases, the mean time between uncorrectable errors is
reduced from months to hours, necessitating efficient ways to detect and handle them.
In this thesis, we propose prediction models for uncorrectable errors by analyzing
150M daily SMART logs from 143K hard drives collected over the period of five
years. The models achieve up to 97% accuracy in uncorrectable bit error prediction
while keeping false positive rates below 3%. We further introduce two use cases
that use these error prediction models to (i) mitigate the I/O overhead
of file transfer integrity verification on file systems and (ii) reduce the amount of
I/O that is processed by disks with uncorrectable errors. Evaluation results show
that running integrity verification only for disks with high error probability reduces
the I/O overhead of file transfers by up to 97% while avoiding more than 90%
of uncorrectable errors. Moreover, diverting I/O operations from high-risk disks to
low-risk disks can reduce the amount of data exposed to an uncorrectable error by
80% while keeping the overhead on low-risk disks less than 5%.
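
The first use case, verifying transfers only when the destination disk is predicted to be at risk, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the 0.5 threshold, and the use of SHA-256 as the integrity check are all assumptions made for illustration, and the per-disk error probability is taken as a given input rather than produced by a real prediction model.

```python
import hashlib

# Hypothetical cutoff; the abstract does not state the threshold used.
RISK_THRESHOLD = 0.5


def needs_verification(error_probability: float,
                       threshold: float = RISK_THRESHOLD) -> bool:
    """Return True when the predicted error probability meets the cutoff."""
    return error_probability >= threshold


def sha256_of(data: bytes) -> str:
    """Hex digest used here as a stand-in integrity check."""
    return hashlib.sha256(data).hexdigest()


def check_transfer(sent: bytes, stored: bytes, error_probability: float) -> str:
    """Classify a transfer as 'skipped', 'verified', or 'corrupt'.

    `sent` is the data the sender wrote; `stored` is what was read back
    from the destination disk. Verification (and its extra read I/O) is
    performed only for disks flagged as high risk.
    """
    if not needs_verification(error_probability):
        return "skipped"
    return "verified" if sha256_of(sent) == sha256_of(stored) else "corrupt"
```

The trade-off is visible in the sketch: a low-risk disk skips the checksum entirely, so a corruption on such a disk would go unnoticed. That is why the benefit depends on the accuracy of the underlying prediction model.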