
January 13, 2005

Comments

blaze

Yeah, another problem was that people probably tried to relog a lot when it crashed out. That sets up a vicious feedback loop that makes the problem worse.

The latest fix should solve that.

Was it the RAID array rebuilding that killed the performance? How much load do you think you can handle with a safety margin?

My theory has always been that once you're over 25% load, it's time to buy more servers / processing power (to cover traffic spikes, RAID array rebuilds, software bugs, etc.).

It's this theory that makes 99.9% uptime twice as expensive as 99%.
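
A back-of-the-envelope illustration of why that rule of thumb leaves breathing room (the numbers here are made up for illustration, not anything from Linden Lab's capacity planning):

    capacity = 4000.0              # assumed healthy throughput of the box, arbitrary units
    normal_load = 0.25 * capacity  # running at the 25% ceiling
    degraded = 0.5 * capacity      # assume a failed/rebuilding array halves throughput
    spike = 2.0                    # assume traffic doubles as people relog

    needed = normal_load * spike
    print("OK" if needed <= degraded else "overloaded",
          f"- {needed:.0f} of {degraded:.0f} units needed under failure plus spike")

Run the same numbers at 50% load and the failure-plus-spike case asks for twice what the degraded box can deliver.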

Ian Linden

As soon as you blow a disk in a RAID-5 array, you're running in a degraded state: requests for blocks on the failed disk must be reconstructed from the parity information, which is a lot slower than simply reading them off a single disk. In this case we didn't have a hot spare, and didn't want to risk swapping out the drive, so the RAID wasn't rebuilding.
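
For anyone curious where that degraded-mode cost comes from: a block on the failed disk has to be rebuilt by XORing the corresponding blocks from every surviving disk, so one logical read turns into N-1 physical reads. A minimal Python sketch of the idea (purely illustrative - real controllers do this in firmware, and the layout here is made up):

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together (RAID-5 parity)."""
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    def read_block(disks, stripe, want, failed=None):
        """One logical read: 1 physical read normally, N-1 when 'want' has failed."""
        if want != failed:
            return disks[want][stripe]
        survivors = [d[stripe] for i, d in enumerate(disks) if i != failed]
        return xor_blocks(survivors)

    # Tiny example: three data disks plus one parity disk, one stripe each.
    data = [b"\x01\x01", b"\x02\x02", b"\x04\x04"]
    disks = [[b] for b in data] + [[xor_blocks(data)]]
    assert read_block(disks, stripe=0, want=1, failed=1) == b"\x02\x02"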

The new machine has an N+2 parity scheme as well as hot spares, so the impact of a failure will be smaller and the rebuild time short, but hard numbers on degraded performance are hard to come by. We'll be doing additional tests on the standby unit.

How much load we can ultimately handle on the new hardware depends entirely on whether you're talking about pre- or post-1.6. With at least half, and possibly quite a bit more, of our I/O load coming from that one query on login, this number may get very high indeed after we release 1.6.

Carnildo Greenacre

I've never done load testing in a high-load environment, but wouldn't it be possible to hook up the live server and a server being tested in parallel, so that live input is sent to both, while the output from the test server is discarded in such a way that the test server thinks it's live?

Ian Linden

That is a good way to do unit-testing on specific systems which you have multiple instances of. Doesn't really help for testing the whole grid though, and these bugs involved the interaction of many components. A test for this sort of thing *could* have been devised, but wasn't.
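
For what it's worth, the shadowing idea Carnildo describes is easy to sketch for a single TCP service: a small tee proxy forwards every client byte to both the live backend and a test backend, returns only the live backend's replies, and discards whatever the test backend says. The hostnames below are invented, and real grid traffic (many interacting components, as Ian notes) would be nowhere near this simple:

    import socket, threading

    LIVE   = ("live-db.example", 9000)    # hypothetical addresses, not real hosts
    SHADOW = ("test-db.example", 9000)
    LISTEN = ("0.0.0.0", 8000)

    def drain(sock):
        """Read and discard everything the shadow backend sends back."""
        try:
            while sock.recv(4096):
                pass
        except OSError:
            pass

    def relay(src, dst):
        """Copy live-backend replies back to the client."""
        try:
            while True:
                data = src.recv(4096)
                if not data:
                    break
                dst.sendall(data)
        except OSError:
            pass
        dst.close()

    def handle(client):
        live = socket.create_connection(LIVE)
        shadow = socket.create_connection(SHADOW)
        threading.Thread(target=drain, args=(shadow,), daemon=True).start()
        threading.Thread(target=relay, args=(live, client), daemon=True).start()
        try:
            while True:
                data = client.recv(4096)
                if not data:
                    break
                live.sendall(data)
                try:
                    shadow.sendall(data)   # best effort; the shadow may lag or die
                except OSError:
                    pass
        except OSError:
            pass
        finally:
            live.close()
            shadow.close()

    server = socket.socket()
    server.bind(LISTEN)
    server.listen()
    while True:
        conn, _ = server.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()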

Fritz Rosencrans

I have been administering finance systems under Solaris and Sybase for 5 years now, and we had similar performance and reliability issues with different RAID levels.

Experience showed us that RAID 5 is only useful for a query (read-mostly) database, because writing is an order of magnitude slower than on RAID 0+1.
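
The textbook arithmetic behind that write penalty, for anyone following along (illustrative numbers, not measurements from Fritz's systems):

    DISK_IOPS = 150      # assumed random I/Os per second per spindle
    N_DISKS   = 8

    # Physical I/Os per small random write:
    #   RAID 0+1: write the block to both mirrors                          -> 2
    #   RAID 5:   read old data, read old parity, write data, write parity -> 4
    raid01_wps = N_DISKS * DISK_IOPS / 2
    raid5_wps  = N_DISKS * DISK_IOPS / 4

    print(f"RAID 0+1: ~{raid01_wps:.0f} small writes/sec")
    print(f"RAID 5:   ~{raid5_wps:.0f} small writes/sec (worse again once a disk fails)")

On a busy array the extra seeks and rotation of the read-modify-write cycle tend to hurt more than the raw 2x suggests, which may be where the order-of-magnitude observation comes from.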

At the same time, running EITHER of these two RAID setups without at least one hot spare per stripe was asking for trouble.

In RAID 5, ALL accesses, but write accesses most dramatically, are MUCH slower if a disk fails; performance can easily become unusable. Better is two hot spares on RAID 5: that leaves one hot spare available during the period when the dead disk is being replaced - a very critical time, as SL's experience has shown - and it takes the heat off if the replacement drive can only be installed after a longer period, like a week...

RAID 0+1 has the weakness that if one disk fails, the mirroring is gone unless there is a hot spare. And if only one hot spare exists, the situation stays critical over the disk-swap period, just as with RAID 5.

With disk prices so much lower than even 3 years ago, skimping on hot spares has become ridiculously expensive - in downtime and in the cost of emergency measures.

I could not recommend RAID 5 for a system with a lot of updates, particularly if those updates (changing clothes, for example, or flying across sim boundaries) are exactly the points where users are most sensitive...

RAID 1+0, where the disks are first mirrored and then striped, would be great and more easily scalable, but it doesn't run on all hardware... and it is STILL no substitute for hot spares...

Have fun with the admin! I love the job, but it can be a killer if the system is not optimized well.

Fritz Rosencrans

An addendum, not QUITE on topic, but related to redundancy, stability and, above all, getting performance where the load is and loosening the tight CPU-to-sim coupling: a virtualization layer called Virtual Iron VFe 1.0 offers a way to address a cluster of boxes as several virtual machines. This type of virtualization COULD be a major plus for stability and performance, without an application rework...

Just a thought! But from experience I know how depressing it is to have several isolated boxes overloaded while a large number just stand around twiddling their thumbs. This seems like a way out.

You guys in OPS are almost like family to me. I wish you all the best, and I'm rooting for you.

... so please excuse me getting a bit big-mouthed here :-) ... it is meant kindly.
