Amazon S3 had a major availability incident this Sunday, and today they posted a very transparent update on their blog about the causes of the problem and the actions they are taking to prevent it from happening again.

From their report, it seems that a bit corruption in a control message (which is bound to happen in a system of such large scale), combined with a gossip protocol that (apparently) spread too much information across the system, caused mayhem in the communication between servers. When the engineers understood what was going on, they realized that the way to bring the system back to normal operation was to stop it and clear its state, which is popularly known as restarting it.

Lessons learned? Mainly, (1) if the scale is large enough, all kinds of bizarre behaviors will eventually show up, and (2) having an efficient red button to bring the system back to a clean state is very useful if you are running a long-lived system. (I'd add that how widely you spread the state of the system is a trade-off between global knowledge and robustness, but this would lead to a lengthy discussion.)

Interestingly, these are two lessons previously discussed by the operators of PlanetLab in a USENIX paper a while ago. PlanetLab has also experienced corrupted control messages (for which we typically do not compute checksums) and implemented a red button, which has already been used on at least one occasion, in December 2003.
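To make the checksum point concrete, here is a minimal sketch of a receiver that drops a corrupted control message instead of gossiping it further. This is not Amazon's or PlanetLab's actual code: the message layout (`node=status` followed by a 4-byte CRC32 trailer) and the names `pack`, `unpack`, `flip_bit` and `receive` are all assumptions made up for illustration.

```python
import zlib
from typing import Dict, Optional


def pack(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload (hypothetical wire format)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack(message: bytes) -> Optional[bytes]:
    """Return the payload if the CRC32 trailer matches, otherwise None."""
    if len(message) < 4:
        return None
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit corruption somewhere in the message."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def receive(state: Dict[str, str], message: bytes) -> bool:
    """Merge a 'node=status' message into local state only if it verifies."""
    payload = unpack(message)
    if payload is None:
        return False  # corrupted: drop it, do not merge it or pass it on
    node, _, status = payload.decode("utf-8").partition("=")
    state[node] = status
    return True


if __name__ == "__main__":
    state: Dict[str, str] = {}
    message = pack(b"node-17=healthy")
    print(receive(state, message))               # intact message is accepted
    print(receive(state, flip_bit(message, 3)))  # flipped bit is rejected
    print(state)
```

A CRC32 catches every single-bit error, so it is enough for the kind of corruption described above, but it gives no protection against deliberate tampering; that would call for a keyed MAC instead.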

3 comments:

Lauro said...

Let's shake your blog.
"trade-off between global knowledge and robustness". Don't you want to begin the discussion here? =)

Anonymous said...

Very interesting post, Nazareno! It's good to learn about real cases like the ones you described.

This discussion reminds me of something that used to be a hot topic a while ago: software rejuvenation.

Maybe the red button could be a coordinated sequence of steps, and even better, it could be pushed automatically.

control valves said...

I believe construction of such projects requires knowledge of engineering and management principles and business procedures, economics, and human behavior.