Thursday, October 11, 2012

Availability in Globally Distributed Storage Systems



This paper is a study of cloud storage systems at Google over a period of one year. They developed two different models and compared the models' predictions with the statistics they observed. With a large number of production cells, they gathered enough statistics to evaluate various encoding schemes and other system parameters.

Availability can be broadly classified into:
Component availability - the mean time to failure (MTTF) of individual system components, which include machines, disks, and racks.
Data availability - depends on node availability and the encoding scheme used. They model correlated failures to understand data availability under single-cell vs. multi-cell replication schemes.

Some terminology before going ahead:

Cell: a large collection of nodes coordinated by a higher-level management program.
Disk scrubbing: an operation that reads all of the user data and check blocks in a RAID array and relocates them if defects are found in the media.
Erasure coding: an erasure code transforms a stripe of k units into a stripe of n units, so that the data can be reconstructed from a minimum subset of units within the new stripe (e.g., Reed-Solomon encoding).
Mean time to failure: MTTF = uptime / number of failures.
Failure burst: a sequence of node failures, each within a time window of w seconds of the previous one (w is set to 120); a small sketch follows this list.
Failure domain: a set of components that can potentially suffer from a common failure source; their simultaneous failures are said to be correlated.
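
To make two of these definitions concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that computes MTTF as uptime divided by the number of failures and groups node-failure timestamps into failure bursts using the 120-second window:

```python
# Illustrative sketch, not the paper's code.

def mttf(uptime_seconds, num_failures):
    """MTTF = uptime / number of failures."""
    return uptime_seconds / num_failures if num_failures else float("inf")

def failure_bursts(failure_times, window=120):
    """Group failure timestamps into bursts: a failure joins the current
    burst if it occurs within `window` seconds of the previous failure."""
    bursts, current = [], []
    for t in sorted(failure_times):
        if current and t - current[-1] > window:
            bursts.append(current)
            current = []
        current.append(t)
    if current:
        bursts.append(current)
    return bursts

# Three failures within 120s of each other form one burst; the fourth is separate.
print(failure_bursts([0, 50, 130, 1000]))                   # [[0, 50, 130], [1000]]
print(mttf(uptime_seconds=30 * 24 * 3600, num_failures=3))  # 864000.0
```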

Problem – Failures:

They classify node failures into three categories: node restarts (software-initiated restarts), planned reboots (e.g., kernel upgrades), and unplanned reboots (crashes). The important observation is that the majority of unavailability caused by node failures is due to planned reboots, but unplanned reboots have the longest average duration because of the recovery that must be performed after an unexpected system crash.
They claim that there is not much related work on discovering patterns in correlated failures. In their study they found that only 3% of failure bursts of size greater than 10 had all their nodes in unique racks, which shows that the rack is a potential failure domain; network switches and power domains are other potential failure domains. They define a 'rack affinity' score to identify whether a failure burst is rack-correlated, uncorrelated, or anti-correlated, and they observe that larger failure bursts have higher rack affinity.
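
The paper's exact rack-affinity formula is not reproduced in this summary, so the sketch below uses a simplified proxy of my own: the fraction of failed-node pairs in a burst that share a rack, which is 0 when all failed nodes are in unique racks and approaches 1 when the burst is concentrated in a single rack.

```python
from itertools import combinations

def rack_affinity_proxy(failed_node_racks):
    """Simplified stand-in for the paper's rack-affinity score:
    fraction of failed-node pairs that land in the same rack."""
    racks = list(failed_node_racks)
    pairs = list(combinations(range(len(racks)), 2))
    if not pairs:
        return 0.0
    same_rack = sum(1 for i, j in pairs if racks[i] == racks[j])
    return same_rack / len(pairs)

print(rack_affinity_proxy(["r1", "r1", "r1", "r2"]))  # 0.5 -> rack-correlated burst
print(rack_affinity_proxy(["r1", "r2", "r3", "r4"]))  # 0.0 -> nodes in unique racks
```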

Solution - Replication and Recovery

Chunks can become unavailable for many reasons. With erasure coding schemes they can be recovered as long as the minimum number of chunks required by the code is still available within the stripe. The recovery rate, however, is limited by the bandwidth of disks, nodes, and racks, and recovering these chunks can in turn make other chunks on the node unavailable until the recovery is complete. The rack-aware placement policy avoids placing chunks of a stripe on nodes of a single rack; by spanning multiple racks it prevents correlated failures within one rack from making the stripe completely unavailable. They observe a gain of about 3x in stripe MTTF using this policy.
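
As an illustration of the placement idea (a minimal sketch under my own simplifying assumptions, not the paper's actual algorithm), a rack-aware policy can be approximated by cycling over racks so that no two chunks of a stripe share a rack unless the stripe has more chunks than there are racks:

```python
import itertools

def rack_aware_placement(stripe_chunks, nodes_by_rack):
    """Assign each chunk of a stripe to a node in a different rack,
    cycling through racks round-robin (racks repeat only if the stripe
    has more chunks than racks)."""
    rack_cycle = itertools.cycle(sorted(nodes_by_rack))
    placement = {}
    for chunk, rack in zip(stripe_chunks, rack_cycle):
        placement[chunk] = (rack, nodes_by_rack[rack][0])  # simplified node choice
    return placement

nodes_by_rack = {"rack-a": ["n1", "n2"], "rack-b": ["n3"], "rack-c": ["n4"]}
print(rack_aware_placement(["c0", "c1", "c2"], nodes_by_rack))
# {'c0': ('rack-a', 'n1'), 'c1': ('rack-b', 'n3'), 'c2': ('rack-c', 'n4')}
```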

Models and Simulations:

First, cell simulation: they developed a trace-based simulation method to analyze hypothetical scenarios and to understand the effect of the encoding choice and the chunk recovery rate. They predicted unavailability over time for a given cell with large failure bursts, and the results were close to what was actually measured for that cell.
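
A toy version of such a trace-based simulation (hypothetical structure, with stripe-to-node mappings and a minimum-availability threshold that I made up for illustration) replays node down/up events and records the intervals in which a stripe drops below the number of chunks needed to serve it:

```python
def simulate_unavailability(events, stripes, min_available=1):
    """Replay (time, node, 'down'/'up') events in order and return
    (stripe, start, end) intervals during which fewer than `min_available`
    of the stripe's chunk-holding nodes were up."""
    down, since, episodes = set(), {}, []
    for time, node, kind in sorted(events):
        if kind == "down":
            down.add(node)
        else:
            down.discard(node)
        for stripe, nodes in stripes.items():
            available = len(nodes - down)
            if available < min_available and stripe not in since:
                since[stripe] = time                       # stripe just became unavailable
            elif available >= min_available and stripe in since:
                episodes.append((stripe, since.pop(stripe), time))
    return episodes

events = [(0, "n1", "down"), (30, "n2", "down"), (90, "n1", "up"), (120, "n2", "up")]
stripes = {"s0": {"n1", "n2"}}
print(simulate_unavailability(events, stripes))  # [('s0', 30, 90)]
```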
Second, they formulated a Markov model to capture the interaction between failures better than the cell-simulation model does. This model assumes that events occur independently but at constant rates over time. They build a Markov chain whose states represent the chunk-availability condition of a stripe, with chunk recovery prioritized by state, and they extend the model to accommodate multi-cell replication.
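
A minimal sketch of such a chain, simplified to R-way replication with exponential failure and recovery rates that I picked arbitrarily (the paper measures its rates from production cells and also models erasure-coded stripes and prioritized recovery): states count the available replicas of a chunk, state 0 is absorbing, and the expected time to absorption from the fully replicated state gives an MTTF estimate.

```python
import numpy as np

def stripe_mttf(n_replicas, fail_rate, recover_rate):
    """Expected time until all replicas are lost in a birth-death Markov
    chain: state k = replicas available, each fails at `fail_rate`, and one
    replica at a time is re-created at `recover_rate`.  Solves the linear
    system for E_k, the expected time to reach state 0 from state k."""
    n = n_replicas
    A = np.zeros((n, n))  # unknowns E_1 .. E_n
    b = np.ones(n)
    for k in range(1, n + 1):
        lam = k * fail_rate
        rho = recover_rate if k < n else 0.0  # nothing to recover when fully replicated
        i = k - 1
        A[i, i] = lam + rho                   # total outgoing rate from state k
        if k >= 2:
            A[i, k - 2] -= lam                # transition down to k-1 replicas
        if k < n:
            A[i, k] -= rho                    # transition up to k+1 replicas
    return np.linalg.solve(A, b)[-1]          # start from all replicas available

# Made-up rates (per hour): each replica fails about once every 30 days,
# and a lost replica is rebuilt in about an hour on average.
print(stripe_mttf(3, fail_rate=1 / (30 * 24), recover_rate=1.0), "hours")
```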

Using this model:

  1. they could successfully capture the effect of failure bursts,
  2. they could distinguish between bursts that span racks and those that do not,
  3. since they have enough production cells, they also validated the model for various encodings and operating conditions,
  4. they found that reducing recovery time is effective only when there are few correlated failures,
  5. they observed that the gain from increased replication diminishes in the presence of correlated failures,
  6. multi-cell replication is the best way to avoid unavailability due to correlated failures, but with a tradeoff in network bandwidth consumption.


Conclusion:

In this paper they focus more on the findings of the Markov model. They study the effects of different parameters such as replication, data placement, and other system settings. The results show that most data unavailability is attributed to node failures rather than to actual disk errors, which lead to permanent data loss. As these findings come from Google clusters, which are huge in magnitude, I believe most of the observations and findings can be directly used to implement policies in future systems.

2 comments:

jon_weissman said...

Good analysis. More critical comments?

sasank said...

1) They have done a lot of related study on data recovery, data center management, failures in disk drives, and probability and statistics, and have referenced 35 papers for their study.

2) They cannot study network failures outside the cell, which are also responsible for unavailability when the only available chunk is in a geographically separated cell. They can only be cautious with multi-cell replication.

3) Disk errors cannot be avoided either; a lot of study is going on in this field, with faster devices like SSDs becoming prevalent.

4) Maybe this study can help address kernel-bug crashes by analyzing the scenarios that led to them, so the systems can evolve.

5) The only way out is to use efficient erasure coding schemes and replication parameters based on the application's requirements, and to optimize planned reboots, as they are responsible for most of the "unavailability".