March 29, 2006


eggy lippmann

OK, sorry if I'm pointing out the obvious here...
Are all the nodes in your asset cluster perfectly homogeneous? As in, same hardware, same software, and were they all bought within a narrow timeframe?
This would explain why the whole cluster crashed simultaneously :)
In my experience, hardware failures cluster in a rather narrow window around the MTBF. If you have two computers with similar usage patterns (and proper load balancing ensures that you do), they will often die around the same time.
Hardware/software monocultures additionally introduce single points of failure: if something tickles a bug on one node, it's likely to tickle the same bug on another.
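
To put a rough number on the "die around the same time" point, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: failure times are modelled as independent normal distributions (real hardware lifetimes are not this tidy), and the means and spreads in months are made up, not measured. The function name `prob_fail_within` is hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_fail_within(window, mean_a, sd_a, mean_b, sd_b):
    """Probability that two nodes fail within `window` of each other.

    Assumes each failure time is an independent normal variable, so the
    difference of the two is normal with mean (mean_a - mean_b) and
    standard deviation sqrt(sd_a^2 + sd_b^2).
    """
    diff_mean = mean_a - mean_b
    diff_sd = math.sqrt(sd_a ** 2 + sd_b ** 2)
    upper = normal_cdf((window - diff_mean) / diff_sd)
    lower = normal_cdf((-window - diff_mean) / diff_sd)
    return upper - lower


# Illustrative numbers only (months): two identical nodes bought together...
same_batch = prob_fail_within(1.0, 36.0, 2.0, 36.0, 2.0)
# ...versus two different models with different expected lifetimes.
mixed = prob_fail_within(1.0, 36.0, 2.0, 48.0, 6.0)

print(f"same batch: {same_batch:.3f}")
print(f"mixed:      {mixed:.3f}")
```

Even with fully independent failures, the identical pair is far more likely to go down within a month of each other than the mixed pair. Shared bugs in a monoculture would only push that correlation higher, which this sketch does not model.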

