Cloud stress or why computing clouds are not for everyone
Yesterday Slashdot featured a story about a study that evaluated the response times of the major cloud infrastructure providers:
http://tech.slashdot.org/story/09/08/20/0327205/Amazon-MS-Google-Clouds-Flop-In-Stress-Tests
One of the main findings was that "Response times on the service also varied by a factor of twenty depending on the time of day the services were accessed".
Unfortunately, this is not a surprise to me. One of the main issues with shared infrastructure is, well, sharing. There will always be a user or a couple of users who, for one reason or another, use the infrastructure inefficiently, and this ends up degrading everyone's performance. For example, you may have a shared database machine and a user who decides to do full backups daily. Guess what: while those backups are running, your other users will be severely impacted.
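If you want to check whether your own instances show the kind of time-of-day swing quoted above, a rough sketch follows. It is not the study's methodology. It assumes you have already collected your own (hour of day, response time in seconds) samples, and the function names `hourly_medians` and `variation_factor` are made up for illustration.

```python
from collections import defaultdict
from statistics import median


def hourly_medians(samples):
    """Group (hour_of_day, response_seconds) samples and return the median per hour."""
    by_hour = defaultdict(list)
    for hour, seconds in samples:
        by_hour[hour].append(seconds)
    return {hour: median(values) for hour, values in by_hour.items()}


def variation_factor(samples):
    """Ratio of the slowest hourly median to the fastest hourly median."""
    medians = hourly_medians(samples)
    fastest = min(medians.values())
    slowest = max(medians.values())
    if fastest <= 0:
        raise ValueError("response times must be positive")
    return slowest / fastest
```

Medians are used rather than means so that a single stalled request does not dominate an hour. A factor close to 1 means the time of day barely matters; a factor of 20 would match what the study reported.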
Things are even more complicated in the cloud, since you are usually running a virtualized instance that shares a piece of physical hardware with other virtualized instances. As such, you have very little insight into what the other instances are doing, and they may be doing a lot to degrade your performance. Even though most virtualization technologies promise isolation, i.e. controlling how much I/O or CPU a particular instance gets, practice is different. For instance, I run a number of Xen hosts and guests, and I can see that if a particular Xen guest goes crazy, i.e. starts thrashing the disk, all the other Xen guests start "seeing" higher CPU I/O wait.
This leads me to a story of sorts. Some time ago I signed up for service from an inexpensive VM vendor (we're talking $10-20/month cheap) so I could run my own web server and mail. The machine was excruciatingly slow most of the time, so slow that typing commands at the prompt took a couple of seconds, yet I wasn't running anything on it. After I installed Ganglia I noticed that CPU I/O wait was about 10% most of the time and the one-minute load (load_one) averaged 4. Remember, I hadn't even installed anything on this machine. They moved me to a different machine, but the same thing happened, so I cancelled the service. The company was obviously oversubscribing its machines, or it had lots and lots of "abusers".
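You don't need Ganglia to get a quick look at the same two numbers on a Linux guest. The sketch below is a rough stand-in, not what Ganglia does internally: it assumes the standard /proc/stat layout, where the fifth value on the aggregate "cpu" line is iowait, and the helper names are hypothetical.

```python
import os


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line from /proc/stat into a list of ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum user..steal only; guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


def load_one():
    """One-minute load average, the same figure Ganglia reports as load_one."""
    return os.getloadavg()[0]
```

Take two `read_cpu_sample()` readings a few seconds apart and pass them to `iowait_percent()`. On an idle guest, a result that stays around 10% with `load_one()` near 4 is the pattern described above, and it points at the neighbours or the host rather than at anything you are running.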
I am not trying to say that clouds are useless, but for years (since EC2 was in beta) I have heard a lot of preposterous claims about clouds, even to the point where it was suggested we should run a backup data center on EC2 since it was "cheaper". Even if we could get past the security concerns, i.e. not being able to run VLANs, having your traffic cross a shared bridged network interface, etc., I just don't see how you can rely on clouds to deliver if you need any type of reliable performance. Sure, you could try to get clever with the load balancer, but there is always the potential that a set of your visitors will end up on a web server that is affected by someone else's process or, worse, that all of a sudden your site is terribly slow and there is literally no explanation for it. Try to explain that to your boss.
That said, there are obviously cases where clouds could be great, e.g. when you need to scale quickly from, let's say, a handful of machines to dozens of machines and then dispose of them when you are done. There are likely other scenarios, but you really have to evaluate it application by application.