Archive for December, 2009

Cloud infrastructure performance

Monday, December 14th, 2009

I read with interest a post about measuring disk I/O performance on EC2.

http://stu.mp/2009/12/disk-io-and-throughput-benchmarks-on-amazons-ec2.html

It is a good test; however, the results were not unexpected. The problem with shared infrastructure is not that it provides subpar performance, but that any infrastructure a customer can "modify" will run into "abuse": one or a couple of customers will use the infrastructure unevenly and affect other customers. I have blogged about virtualization stress points before, e.g.

http://vuksan.com/blog/2009/12/04/cloud-cartography-load-co-residence-detection/

http://vuksan.com/blog/2009/09/01/cloud_computings_achilles_heel/

I have also in the past been in charge of operations for an e-commerce SaaS startup, and we would see this issue quite often. For instance, we had two customers that did about the same amount of yearly sales, yet one customer's infrastructure utilization (number of disk ops, DB bandwidth etc.) was 3-4 times higher than the other's. At times they would "abuse" a shared database so much that it affected everyone else. We resolved it when a colleague figured out that we could QoS traffic to the database. That way only the abusing customer would be affected if they did anything crazy. It also helped that we ran both the infrastructure and the application, so we could quickly determine what was normal and what was not. I suspect this problem becomes much trickier in clouds, since you have very little idea what applications are running and what is normal.

One other thing to point out is that some of the "abuse" may be inadvertent. Coders are sloppy and occasionally (more often than one would hope) things start leaking memory and the machine starts thrashing the disk. Add to that improper monitoring, and if you are one of the lucky duckies sharing a piece of physical hardware with them, your performance will go down the drain. I recall a tweet some time ago where the person was scratching his head because untarring a tarball took 15 minutes on one EC2 instance and 45 minutes on another.

There are certainly solutions to these problems; however, they require a lot more work. I think clouds are great :-) and use them extensively, but you should be aware of some of the drawbacks. It also helps to design your app in such a way that it doesn't rely on a centralized relational database (often a bottleneck).

Cloud cartography – load based co-residence detection

Friday, December 4th, 2009

Some weeks ago @krishnan and I had a tweet conversation regarding a claim he heard at an Amazon webcast, where the speaker said that cloud cartography attacks are impossible due to Amazon's use of virtual interfaces to separate customers' traffic. I responded that any such claim should make anyone sceptical (not in those words :-) ). Specifically, I pointed out that the paper addresses other ways of detection, e.g.

Section 8.2 - Load-based co-residence detection

I have written in the past about Cloud Computing's Achilles Heel, which dealt with performance degradation when a misbehaving instance is running on the same piece of hardware as your own instance. I did not think of cartography in those cases, but today, while making a large backup of a virtual instance, I thought let's try load-based co-residence detection :-) so on a different virtual instance running on the same machine I typed

dd if=/dev/zero of=testfile bs=1M count=15000

This simply creates a roughly 15 GB file filled with zeroes. Check out what happens to the network performance of the machine that was being backed up

Network Performance degradation with misbehaving instance

Performance dives from an average of about 15 Mbytes/sec to between 0 and 2 Mbytes/sec. For completeness, here is the CPU utilization graph

CPU utilization with misbehaving client

I was actually quite surprised at the magnitude of the degradation. I'd say this may be an even more successful co-residence detection attack than network probing, since you could generate legitimate HTTP traffic to a site of interest (or a node of interest), throw tons of load at it and see if you notice response degradation.
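The measurement half of that idea, watching response times before and during heavy load, can be sketched roughly as below. This is only an illustration: the function names `sample_latency` and `mean` and the URL in the usage comment are made up for this example, and it assumes `curl` and `awk` are installed. Point it at a host you control.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the example URL are illustrative assumptions.

# Print the total request time (in seconds) for COUNT sequential
# requests to URL, one value per line.
sample_latency() {
    url=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # -o /dev/null discards the body; -w prints curl's timing variable
        curl -o /dev/null -s -w '%{time_total}\n' "$url"
        i=$((i + 1))
    done
}

# Read one number per line on stdin and print the arithmetic mean
# with three decimals. Prints nothing if there is no input.
mean() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.3f\n", sum / n }'
}

# Usage (against a host you control):
#   sample_latency http://my-instance.example/ 20 | mean   # baseline
#   ...start the heavy job on the other instance, then run it again
#   and compare the two averages.
```

A large jump in the average while the other instance is busy would be consistent with the degradation shown in the graphs above, although a single comparison like this cannot rule out unrelated causes such as network congestion.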

There are obviously ways to mitigate some of these issues, e.g. tightly control who can connect to your instances within the cloud, cycle your own instances so that they keep "moving around", etc. Unfortunately, that comes at the price of additional complexity and work.

Quick way to determine SSL certificate expiration

Tuesday, December 1st, 2009

If you need a quick way to determine when a certain SSL certificate expires, you can use one of the following approaches. In both examples the server I am checking is called webserver.domain.com.

If you have Nagios plugins installed you could type

# /usr/lib/nagios/plugins/check_http -p 443 -S -C 15 webserver.domain.com
CRITICAL - Certificate expired on 11/01/2009 11:23.

That's easy. However, what if you don't have the Nagios plugins? In that case you can do the same with OpenSSL and s_client. Look for the notAfter field.

# echo | openssl s_client -connect webserver.domain.com:443 | openssl x509 -noout -dates
...
notBefore=Nov  1 11:23:30 2008 GMT
notAfter=Nov  1 11:23:30 2009 GMT

Easy :-) .
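If you would rather have a number of days than a date to eyeball, the notAfter line can be converted with a small helper. This is a rough sketch, not a polished tool: `days_left` is a made-up name, and it assumes GNU date (the `-d` option), so it will not work unchanged with BSD or macOS date.

```shell
#!/bin/sh
# Hypothetical helper; days_left is not part of OpenSSL or the Nagios plugins.
# Assumes GNU date (-d); BSD/macOS date needs different flags.

# $1: a line such as "notAfter=Nov  1 11:23:30 2009 GMT"
# $2: optional "now" as seconds since the epoch (defaults to the current time)
# Prints whole days until expiry; the value is negative if already expired.
days_left() {
    expiry=${1#notAfter=}                          # strip the field name
    expiry_epoch=$(date -d "$expiry" +%s) || return 1
    now_epoch=${2:-$(date +%s)}
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage (needs network access to the server being checked):
#   line=$(echo | openssl s_client -connect webserver.domain.com:443 2>/dev/null \
#          | openssl x509 -noout -enddate)
#   days_left "$line"
```

The `-enddate` option prints only the notAfter line, which keeps the parsing trivial. If all you need is a yes/no answer, `openssl x509 -noout -checkend SECONDS` reports through its exit status whether the certificate expires within that many seconds.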