Archive for October, 2009

Infrastructure redundancy is not cheap

Tuesday, October 6th, 2009

There was quite a discussion on Twitter about the BitBucket outage, which initially appeared to be a failure of Amazon EC2/EBS. More about the outage can be found here. Bret Piatt was kind enough to write up his view of the situation:

http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-fundamental-design-concept/

In principle I do agree with his suggestions and his conclusion, i.e. that availability is a fundamental design concept. I do, however, disagree that "warm" redundancy is cheap. In my own view and experience, redundancy is extremely expensive if you are going to do it right. Redundancy is not just a matter of adding more hardware, systems, monitoring software and failover policies; it is a process in which you continuously have to make sure that the redundancy works. For instance, a successful backup strategy doesn't consist of simply getting a backup device yet never testing the backups by doing an actual restore. As many organizations have discovered, backups do break, media gets corrupted, etc., and you can suffer a devastating blow. So if you want to do redundancy right you have to invest lots and lots of time practicing. For example, running fire drills is a useful tool, as is doing periodic site failovers, i.e. run on site A for two weeks, then during low-traffic times fail over to site B, run for two weeks, then go back to site A, and so on. That certainly ain't cheap.

I'd also point out that "warm" redundancy is in lots of instances riskier than "hot" redundancy, since you may only discover that the redundancy doesn't work when you have to fail over, whereas with "hot" redundancy issues tend to crop up much earlier, allowing you to stay on top of them more readily.

That said, the discussion over how you are responsible for your own availability reminds me of "individual responsibility" (for my international readers, this is a hot topic in the United States). Sure, you should "own" your redundancy; however, that may often be impractical or too expensive. Not everyone is blessed with copious resources.

Keeping an eye on binary log growth

Thursday, October 1st, 2009

Recently I got a report that some pages on the site were extremely slow. The web server metrics didn't show anything new; however, the MySQL DB metrics showed a definite change:

[Graph: MySQL server CPU utilization]

i.e. at the end of Week 38 there is an increase in CPU utilization of nearly 60%. Interestingly enough, there was a new software release at the end of Week 38, which pointed to either a bug or a new feature. Luckily I have been collecting MySQL metrics using this gmetric script. This led me to these two graphs:

[Graph: MySQL updates]

[Graph: MySQL inserts]

So nearly double the number of inserts and nearly triple the updates. Using mysqlbinlog I analyzed the update and insert statements, identified the two culprit INSERT and UPDATE statements and sent them off to the developers.
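For the curious, this kind of binary log analysis can be roughed out in a few lines of Python. This is only a sketch and not the exact procedure I used: it assumes statement-based logging (so the decoded log contains the SQL text), it assumes each statement starts on its own line, and the function name `count_writes` is made up for illustration.

```python
import re
from collections import Counter

# Leading verb and target table of a simple INSERT or UPDATE statement.
# Backticks are stripped before matching, so `db`.`table` becomes db.table.
STATEMENT_RE = re.compile(
    r"^\s*(INSERT|UPDATE)\b(?:\s+IGNORE)?(?:\s+INTO)?\s+([\w.]+)",
    re.IGNORECASE,
)


def count_writes(lines):
    """Count INSERT/UPDATE statements per (verb, table) in decoded binlog text.

    `lines` is any iterable of text lines, e.g. the output of mysqlbinlog.
    Lines that do not start with INSERT or UPDATE are ignored.
    """
    counts = Counter()
    for line in lines:
        match = STATEMENT_RE.match(line.replace("`", ""))
        if match:
            verb = match.group(1).upper()
            table = match.group(2).lower()
            counts[(verb, table)] += 1
    return counts


def top_writers(lines, limit=10):
    """Return the `limit` busiest (verb, table) pairs with their counts."""
    return count_writes(lines).most_common(limit)
```

Piping the mysqlbinlog output into a small script that calls `top_writers(sys.stdin)` gives a quick ranking of which tables are being hammered, which is usually enough to tell the developers where to look.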

I also observed that, had I watched the binary log growth, I might have identified this earlier, since there were a lot more binary logs for the period since the release. Thus the MySQL average binary log growth rate gmetric was born :-) . Now all I need to do is find out what the normal growth rate is and, if it goes outside of that norm, use Nagios to send me a non-urgent alert.
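To give an idea of what such a metric might compute, here is a minimal Python sketch. It is not the actual gmetric script. It approximates the growth rate from the size and modification time of the binary log files, and the function names, the one-hour window and the 50% tolerance are all assumptions picked for illustration.

```python
import glob
import os


def scan_binlogs(pattern):
    """Collect (mtime, size_in_bytes) for every file matching the glob pattern."""
    stats = (os.stat(path) for path in glob.glob(pattern))
    return [(st.st_mtime, st.st_size) for st in stats]


def growth_rate(files, now, window_seconds=3600):
    """Approximate bytes/second of binary log written during the last window.

    `files` is an iterable of (mtime, size_in_bytes) pairs. A file counts
    toward the window if it was last modified inside it, which overcounts
    slightly when a file was started before the window opened.
    """
    cutoff = now - window_seconds
    recent_bytes = sum(size for mtime, size in files if mtime >= cutoff)
    return recent_bytes / window_seconds


def is_abnormal(rate, baseline, tolerance=0.5):
    """True when rate deviates from baseline by more than the given fraction."""
    return abs(rate - baseline) > tolerance * baseline
```

In practice you would call `scan_binlogs` with whatever glob matches your binary logs (the location depends on your MySQL configuration), feed the result and `time.time()` to `growth_rate`, and hand the number to gmetric. The `is_abnormal` check is the sort of thing a Nagios plugin could wrap once a sensible baseline is known.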

How can you make clouds better

Thursday, October 1st, 2009

I have perhaps been overly critical of clouds. I do not think clouds are useless. They certainly have their place and usefulness. I just dislike the hype around clouds since I find it completely misplaced. I see clouds primarily as a way of easily "creating" and "disposing" of hardware, i.e. if you need a couple of extra machines you can create them at the press of a button. When you are done you can dispose of them. However, they also have some major drawbacks, which I have alluded to in the past, i.e. in Cloud Computing's Achilles Heel and Trouble with Cloud Computing.

That said, there are a number of ways clouds can be improved. Some may be impractical, some may be expensive and some may be overly complicated, but they are certainly options. Shall we go through the list :-)

  1. Don't use virtualization - you could certainly implement a cloud where the instance you get runs on raw hardware. That way you are guaranteed to have a dedicated piece of hardware that is not affected by other users' (mis)use. This would be a lot more expensive, but some people may be willing to pay for it.
  2. Intelligent web traffic load balancing - one of the big issues in general is that load balancing methods such as round-robin or least connections are imperfect, since a server may get slow for a number of reasons and a portion of your clients may receive a "substandard" response. The chances of that happening in shared environments are greater. Thus you would need to devise a load balancer which can "intelligently" figure out which server is slow and deal with it accordingly, by either taking it out of the pool or sending less traffic to it.
  3. Similar to 2., for your relational DB traffic you would have to use a DB cluster and devise a way of preferring "faster" DB servers. This part in some ways "scares" me the most, since if you start using synchronous replication you have to wait until all members of the cluster have committed the change. If you start doing asynchronous replication you run the risk of DB inconsistencies, which you will have to resolve.
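To make option 2 a bit more concrete, here is a toy Python sketch of one possible approach. It is not a description of any existing load balancer. It keeps an exponentially weighted moving average of each server's response time, sends traffic in inverse proportion to that average, and stops sending traffic to servers that are much slower than the fastest one. The class name, the smoothing factor `alpha` and the `eject_factor` threshold are all assumptions made for illustration.

```python
import random


class LatencyAwareBalancer:
    """Toy balancer that prefers servers with lower observed response times."""

    def __init__(self, servers, alpha=0.2, eject_factor=3.0):
        self.alpha = alpha                # weight given to the newest sample
        self.eject_factor = eject_factor  # "too slow" = this many times the best
        self.latency = {server: None for server in servers}

    def record(self, server, seconds):
        """Fold one observed response time into the server's moving average."""
        seconds = max(seconds, 1e-6)  # guard against division by zero later
        previous = self.latency[server]
        if previous is None:
            self.latency[server] = seconds
        else:
            self.latency[server] = (
                self.alpha * seconds + (1 - self.alpha) * previous
            )

    def weights(self):
        """Return a traffic weight per server; 0.0 means out of the pool."""
        known = [v for v in self.latency.values() if v is not None]
        if not known:
            return {server: 1.0 for server in self.latency}
        best = min(known)
        result = {}
        for server, average in self.latency.items():
            if average is None:
                result[server] = 1.0  # unmeasured: treat like the fastest
            elif average > self.eject_factor * best:
                result[server] = 0.0  # far slower than the best: eject
            else:
                result[server] = best / average
        return result

    def pick(self, rng=random):
        """Choose a server at random, in proportion to its weight."""
        current = self.weights()
        servers = list(current)
        return rng.choices(servers, weights=[current[s] for s in servers], k=1)[0]
```

Note the obvious hole: an ejected server never receives traffic again, so its average never recovers. A real implementation would need some form of periodic probing to let it back in, which is exactly the kind of extra work and QA I am talking about. The same weighting idea could in principle be applied to picking among DB servers in option 3, but it does nothing for the replication consistency problem.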

There are likely other options, but you can see from the above that this gets real complicated, real quick. It can be done; it is just a lot of work and a lot of QA. Hopefully someone comes up with a generic solution for problems 2. and 3. and those become non-issues.