January 22nd, 2010
You know the pitch. Each time you create an account on a “secure” site you are forced to come up with a complex password, i.e. it must contain a number, a capital letter and perhaps a special character such as + or -. Trouble is that policies differ: one site enforces a minimum length, another a maximum length, some don’t allow special characters, etc. At one point in time these rules made sense and were required to maintain basic security, but they may not make sense today.
Ages ago computer systems (in particular UNIX systems) stored passwords in a hashed format. You can read more about cryptographic hashes on Wikipedia. The trouble is that these hashes were available for any user to see, i.e. you could copy the password file (/etc/passwd) or use YP/NIS tools to get a list of all password hashes in an organization. Having the password file does not tell you what the passwords are. However, since a particular password always converts to the same hash, you can take a word dictionary, compute the hash of every word and compare the results against the hashes in your password file. If you find a match you have now “discovered” that user’s password. This is often referred to as off-line password cracking since it allows you to derive passwords without interacting with the target system. It has many advantages: you can try millions of passwords quickly and the target system’s administrator will not be alerted. Because of this, password policies were instituted that mandated password complexity, since a complex password, e.g. 9pc_miu, would be very hard or nearly impossible to break (it may take years). This made sense then.
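To make the off-line attack concrete, here is a minimal Python sketch. It assumes unsalted SHA-256 hashes purely for illustration (classic UNIX systems used salted crypt(3) hashes, which changes the details but not the idea). The user names, the word list and the function names are all made up.

```python
import hashlib


def hash_password(password: str) -> str:
    """Hash a password. Unsalted SHA-256 is an assumption made for illustration."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def dictionary_attack(stolen_hashes: dict, wordlist: list) -> dict:
    """Return {user: password} for every stolen hash that matches a dictionary word."""
    # Compute the hash of every candidate word once, then look hashes up.
    lookup = {hash_password(word): word for word in wordlist}
    return {
        user: lookup[digest]
        for user, digest in stolen_hashes.items()
        if digest in lookup
    }


# Hypothetical copy of a password file: user name -> password hash.
stolen = {
    "alice": hash_password("123456"),
    "bob": hash_password("9pc_miu"),
}
cracked = dictionary_attack(stolen, ["password", "123456", "letmein"])
print(cracked)  # {'alice': '123456'}
```

The trivial password falls immediately, while the complex one survives only because it is not in the word list. Nothing here ever touches the target system.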
However it doesn’t make much sense now since on most systems regular users have no access to the password hashes. On UNIX systems a “shadow” file (/etc/shadow) is used to hide them, or you may be using LDAP which has the capability of hiding password hashes, etc. The only users that have access to those hashes are administrators, and they have other ways of acquiring your passwords anyway. Thus your real exposures, in order of importance, are
- Trivial or easily guessable passwords, e.g. 123456, 1234, date of birth
- Using the same password across different sites. This is a problem if e.g. site A.com gets hacked and the hackers are able to determine your password and log into site B.com
I actually feel that password complexity breeds poor security since people will write down complex passwords instead of remembering them. Just think how many times you have seen passwords on post-it notes stuck to someone’s monitor. Perhaps it is time to scrap password complexity rules and use something simpler.
Posted in Uncategorized | No Comments »
January 20th, 2010
At a previous job we set up a geographically redundant data center for availability and business continuity reasons, because even the best data centers will have outages. No matter what a vendor tells you, processes are never followed fully. You can also have a major disaster with a critical piece of your hardware that may cripple or disable your whole infrastructure, e.g. a switch goes crazy.
The service we provided was critical so the highest availability was imperative. Management wanted an active-active setup, i.e. use both data centers in a load-balanced fashion. However that would have entailed an extensive application rewrite due to the nature of our application and the level of database transactions involved. Thus we settled on a hot-cold configuration where we would have an active (hot) site that was serving customers and a cold site that was kept up to date via replication. In case of trouble (as determined by ops) we would fail over from the hot site to the cold site. This is fairly straightforward except for the part where you are actually failing things over: your hot site is down, you break off replication, change DNS entries and start up all the necessary services, yet due to DNS caching some of your customers are still pointing to your “dead” site. Depending on the browser this could last 30 minutes or more. Did I mention this service was critical?
We went through the list of possible options on how to resolve this:
1. Use an outside party’s load balancer(s), i.e. off-site load balancer(s) that would proxy traffic to the site that was live. This seemed like a plausible idea, however we didn’t like the fact that we were introducing yet another failure point and adding latency due to the extra round-trip.
2. Change the DNS TTL to 2 minutes. However that was also insufficient due to differing browser behavior. For example IE 6 (perhaps later versions too) will cache DNS entries for 30 minutes
http://support.microsoft.com/kb/263558
3. Use round-robin DNS, a.k.a. multiple DNS A records, with a “twist”
What we did there is put both of our data centers’ IPs into the A records for our site, i.e.
www.domain.com IN A 1.2.3.4
www.domain.com IN A 9.8.7.6
What happens with most browsers is that they will attempt the first IP and, if they get a connection refused, they will try the next (and the next if you have more than 2). This actually works quite well, e.g. even if the browser was being served from 1.2.3.4, if 1.2.3.4 all of a sudden goes down it will fail over to 9.8.7.6 in sub-second time. The “twist” we added was that we only answered on the active colo’s IP and refused connections on the inactive one. If we needed to fail over we’d just activate one colo and deactivate the other. Quick failovers, here we come.
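The client-side behavior this trick relies on can be sketched in a few lines of Python. This is only an approximation of what browsers do, not their actual implementation. The function names are made up, and the demo uses two local ports as stand-ins for the active and inactive colos.

```python
import socket


def connect_first_available(addresses, timeout=1.0):
    """Try each (ip, port) in order; return (socket, address) for the first that accepts."""
    last_error = None
    for address in addresses:
        try:
            return socket.create_connection(address, timeout=timeout), address
        except OSError as exc:  # includes ConnectionRefusedError and timeouts
            last_error = exc
    raise ConnectionError(f"no address accepted a connection: {last_error}")


def demo():
    """Return (address_used, active_address) for a local two-"colo" setup."""
    # Stand-in for the active colo: a listening socket on an ephemeral local port.
    active = socket.socket()
    active.bind(("127.0.0.1", 0))
    active.listen(1)
    active_address = active.getsockname()
    # Stand-in for the inactive colo: a port that was bound and then closed again,
    # so connection attempts to it are refused.
    inactive = socket.socket()
    inactive.bind(("127.0.0.1", 0))
    inactive_address = inactive.getsockname()
    inactive.close()

    conn, used = connect_first_available([inactive_address, active_address])
    conn.close()
    active.close()
    return used, active_address
```

Note that the fallback only triggers on a connection-level failure. Anything that accepts the TCP connection on the client's behalf hides that failure from the client.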
This all worked great for some time until we started receiving isolated reports that people weren’t able to access our site. Investigating the issue further we discovered that all of the people having connectivity issues were behind a transparent HTTP proxy. In this particular case the transparent proxy would not return connection refused but “page not found” or something similar, neutralizing our clever hack.
Obviously if your audience is different and you know your users don’t use proxies you could use this approach, however this doomed it for us.
Posted in Networking | 4 Comments »
January 15th, 2010
There are numerous ways to monitor the health and performance of your web site. Some of the popular ways are:
- measure the response time of a particular URL on your site. If it exceeds a threshold (which is site dependent) it is time to investigate
- compare pertinent metrics such as the number of created sessions, HTTP connections, etc.
- watch the CPU utilization/load of the machine
Unfortunately most of these are flawed since they don’t provide you with the most important metric, which is how fast the site is for your customers. The above metrics are not useless and do help paint the picture, but they may give you a false sense of how fast your site is: the URL you are checking may respond quickly while some other part of the site, e.g. a newly introduced feature, behaves terribly. I have found one of the best metrics to watch is the 90th percentile request response time. Basically, you take every request passing through your web servers, log the time it takes to serve each one, sort the times from fastest to slowest and take the time at the 90th percentile. Therefore if your 90th percentile is 1 second, 90% of the requests were served in a second or less and 10% took longer. You may be asking yourself “so what?”. Here is why:

So for at least a couple of minutes 10% of your visitors/requests were waiting for more than 17 seconds to have their requests served. That can’t be good for business and you may want to investigate the cause.
You could also consolidate response times from different web servers on one graph to get this:

It may not look like much but it is pretty clear if an individual web server starts acting up.
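For reference, the percentile calculation itself is tiny. Below is a minimal Python sketch using the nearest-rank definition (one of several common percentile definitions). The response times are invented and this is not the script from the link below.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds for ten requests, two of them very slow.
response_times = [0.2, 0.3, 0.25, 0.4, 0.35, 0.5, 0.45, 0.3, 17.0, 18.5]
p50 = percentile(response_times, 50)
p90 = percentile(response_times, 90)
print(p50, p90)  # 0.35 17.0
```

The median looks perfectly healthy here while the 90th percentile exposes the slow tail, which is exactly why the metric is worth graphing.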
How do you get in on the fun? The steps for adding Apache real-time metrics, which also cover the 90th percentile response time, are at this URL
http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats
I want to thank Ben Hartshorne (@maplebed) for making me aware of this metric.
Tags: Monitoring
Posted in Monitoring, Systems Management | 1 Comment »
December 14th, 2009
I read with interest a post about measuring disk I/O performance on EC2.
http://stu.mp/2009/12/disk-io-and-throughput-benchmarks-on-amazons-ec2.html
It is a good test, however the results were not unexpected. The problem with shared infrastructure is not that it provides subpar performance but that in any infrastructure which can be “modified” by a customer you will run into “abuse”, where one or a couple of customers use the infrastructure unevenly and affect other customers. I have blogged about virtualization stress points before, i.e.
http://vuksan.com/blog/2009/12/04/cloud-cartography-load-co-residence-detection/
http://vuksan.com/blog/2009/09/01/cloud_computings_achilles_heel/
I have also in the past been in charge of operations for an e-commerce SaaS startup and we would see this issue quite often. For instance we had two customers that did about the same amount of yearly sales, yet one customer’s infrastructure utilization (number of disk ops, DB bandwidth, etc.) was 3-4 times higher than the other’s. At times they would “abuse” a shared database so much that it affected everyone else. We resolved it when a colleague figured out that we could QoS traffic to the database. That way only the abusing customer would be affected if they did anything crazy. It also helped that we ran both the infrastructure and the application so we could quickly determine what was normal and what was not. I suspect this problem becomes much trickier in clouds since you have very little idea what applications are running and what is normal.
One other thing to point out is that some of the “abuse” may be inadvertent. Coders are sloppy and occasionally (more often than one would hope) things start leaking memory and the machine starts thrashing the disk. Add to that improper monitoring, and if you are one of the lucky duckies on the same piece of physical hardware as them your performance will go down the drain. I recall a tweet some time ago where the person was scratching his head because untarring a tarball took 15 minutes on one EC2 instance and 45 minutes on another.
There are certainly solutions to these problems, however they require a lot more work. I think clouds are great and use them extensively, however you should be aware of some of the drawbacks. It also helps if you design your app in such a way that it doesn’t rely on a centralized relational database (often a bottleneck).
Posted in Cloud Computing | No Comments »
December 4th, 2009
Some weeks ago @krishnan and I had a tweet conversation regarding a claim he heard at an Amazon webcast, where the speaker said that cloud cartography attacks are impossible due to Amazon’s use of virtual interfaces to separate customers’ traffic. I responded that any such claim should make anyone sceptical (not in those words). Specifically I cited that the paper addresses other ways of detection, i.e.
Section 8.2 – Load-based co-residence detection
I have written in the past about Cloud Computing’s Achilles Heel, which dealt with performance degradation when a misbehaving instance is running on the same piece of hardware as your own instance. I did not think of cartography in those cases, but today while making a large backup of a virtual instance I thought let’s try load-based co-residence detection, so on a different virtual instance running on the same machine I typed
dd if=/dev/zero of=testfile bs=1M count=15000
This simply creates a file of roughly 15 GB filled with zeroes. Check out what happens to the network performance of the machine that was being backed up

Performance dives from an average of about 15 Mbytes/s to between 0 and 2 Mbytes/s. For completeness here is the CPU utilization graph

I was actually quite surprised at the magnitude of the degradation. I’d say this may be an even more successful co-residence detection attack than network probing, since you could generate legitimate HTTP traffic to a site of interest (or a node of interest), throw tons of load at your own instance and see if you notice response degradation at the target.
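The decision step of such a test can be sketched in a few lines of Python: compare throughput (or response time) samples taken before and during the load. The samples below are invented to resemble the numbers quoted above, and the 0.5 threshold is an arbitrary assumption.

```python
from statistics import median


def degradation_ratio(baseline, under_load):
    """Median throughput under load divided by median baseline throughput."""
    return median(under_load) / median(baseline)


def looks_co_resident(baseline, under_load, threshold=0.5):
    """Flag likely co-residence when throughput falls below `threshold` of the baseline.

    The default threshold is an arbitrary assumption, not a measured value.
    """
    return degradation_ratio(baseline, under_load) < threshold


# Made-up samples in Mbytes/s: before and while the dd command was running.
baseline = [15.2, 14.8, 15.5, 14.9, 15.1]
under_load = [1.8, 0.4, 2.0, 0.0, 1.1]
print(looks_co_resident(baseline, under_load))  # True
```

A real test would need repeated on/off cycles to rule out coincidence, since shared hardware is noisy even without anyone probing it.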
There are obviously ways to mitigate some of these issues, e.g. tightly control who can connect to your instances within the cloud, cycle your own instances so that they keep “moving around”, etc. Unfortunately that comes at the price of additional complexity and work.
Posted in Cloud Computing | No Comments »
December 1st, 2009
If you need a quick way to determine when a certain SSL certificate expires you can use the following approaches. In both examples the server I am trying to check is called webserver.domain.com.
If you have Nagios plugins installed you could type
# /usr/lib/nagios/plugins/check_http -p 443 -S -C 15 webserver.domain.com
CRITICAL - Certificate expired on 11/01/2009 11:23.
That’s easy. However what if you don’t have Nagios plugins? In that case you can do the same with OpenSSL and s_client. Look for the notAfter field.
# echo | openssl s_client -connect webserver.domain.com:443 | openssl x509 -noout -dates
...
notBefore=Nov 1 11:23:30 2008 GMT
notAfter=Nov 1 11:23:30 2009 GMT
Easy.
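If you would rather script the check, here is a rough Python equivalent built on the standard ssl module. The function names are mine. Note one difference from the commands above: with default verification an already expired certificate fails the handshake instead of returning its dates.

```python
import socket
import ssl
from datetime import datetime, timezone


def parse_not_after(not_after: str) -> datetime:
    """Parse the notAfter format shown above, e.g. 'Nov 1 11:23:30 2009 GMT'."""
    parsed = datetime.strptime(not_after, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Days from `now` until expiry, rounded down; negative once the certificate has expired."""
    return (parse_not_after(not_after) - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter string of the certificate presented by host:port.

    Uses default verification, so an expired certificate raises
    ssl.SSLCertVerificationError rather than returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example using the date from the output above and a made-up "today".
today = datetime(2009, 10, 17, 11, 23, 30, tzinfo=timezone.utc)
print(days_until_expiry("Nov 1 11:23:30 2009 GMT", today))  # 15
```

Something like `days_until_expiry(fetch_not_after("webserver.domain.com"), datetime.now(timezone.utc))` would then give the remaining days for a live server.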
Posted in Networking, Systems Management | No Comments »
November 5th, 2009
Word of warning to all who use MySQL (yes, you poor souls). By default MySQL 5.0 and 5.1 will substitute storage engines if the one you requested is not available. It doesn’t happen too often but when it does happen it is quite bad. For instance, when setting up a new MySQL database something went wrong during the creation of the InnoDB logs and thus MySQL decided to DISABLE the InnoDB storage engine. Unfortunately this was not caught and DBs were built that really needed the InnoDB storage engine since they required foreign keys and other fun stuff. In their “awesomeness” MySQL developers decided that the default behavior should be to simply substitute (replace) InnoDB with MyISAM. There is a warning, however no error message is displayed and an import will continue unabated. Thus in my case things worked for a while until oddities were discovered which were traced back to the engine substitution. Unfortunately at that point it is fairly difficult to fix the problems since some of the constraints may be broken.
To avoid such a situation make sure you add the following statement to the [mysqld] section of my.cnf
sql_mode="NO_ENGINE_SUBSTITUTION"
To verify which engines are active, type the following at the MySQL shell prompt
SHOW ENGINES;
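If you want a provisioning script to catch this before any schema is loaded, the check is simple once you have the SHOW ENGINES rows in hand. The Python sketch below deliberately leaves out the database driver and works on plain (engine, support) pairs. The function name and sample rows are made up.

```python
def unavailable_engines(show_engines_rows, required=("InnoDB",)):
    """Return the required engines that the server does not report as usable.

    `show_engines_rows` is an iterable of (engine, support) pairs taken from the
    first two columns of SHOW ENGINES; support is YES, DEFAULT, NO or DISABLED.
    """
    support = {engine.lower(): status.upper() for engine, status in show_engines_rows}
    return [
        name
        for name in required
        if support.get(name.lower()) not in ("YES", "DEFAULT")
    ]


# Hypothetical rows resembling the failure described above.
rows = [("MyISAM", "DEFAULT"), ("InnoDB", "DISABLED"), ("MEMORY", "YES")]
print(unavailable_engines(rows))  # ['InnoDB']
```

An empty list means every required engine is usable. Anything else is a reason to stop before importing.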
Posted in Systems Management | No Comments »
October 6th, 2009
There was quite a discussion on Twitter about the BitBucket outage, which initially appeared to be a failure of Amazon EC2/EBS. More about the outage can be found here. Bret Piatt was kind enough to write up his view of the situation
http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-fundamental-design-concept/
In principle I do agree with his suggestions and his conclusion, i.e. that availability is a fundamental design concept. I do however disagree that “warm” redundancy is cheap. In my own view and experience redundancy is extremely expensive if you are going to do it right. Redundancy is not just a matter of adding more hardware, systems, monitoring software and failover policies but a matter of process, where you continuously have to make sure that the redundancy works. For instance a successful backup strategy doesn’t consist of simply getting a backup device and then never testing the backups by doing an actual restore. As many organizations have discovered, backups do break, media gets corrupted, etc. and you can suffer a devastating blow. So if you want to do redundancy right you have to invest lots and lots of time practicing. For example running fire drills is a useful tool, as is doing periodic site failovers, i.e. run on site A for two weeks, then during low traffic times fail over to site B, run for two weeks, then go back to site A, and on and on. That certainly ain’t cheap.
I’d also point out that “warm” redundancy is in lots of instances riskier than “hot” redundancy since you may discover that redundancy doesn’t work when you have to failover whereas in “hot” redundancy issues may crop up much earlier allowing you to stay on top more readily.
That said, the discussion over how you are responsible for your own availability reminds me of “individual responsibility” (for my international readers this is a hot topic in the United States). Sure you should “own” your redundancy, however that may often be impractical or too expensive. Not everyone is blessed with copious resources.
Posted in Cloud Computing, Systems Management | No Comments »
October 1st, 2009
Recently I got a report that some pages on the site were extremely slow. Looking at the web server metrics didn’t show anything new, however the MySQL DB metrics showed a definite change

MySQL server CPU utilization
At the end of Week 38 there is an increase in CPU utilization of nearly 60%. Interestingly enough there was a new software release at the end of Week 38, which pointed to either a bug or a new feature. Luckily I have been collecting MySQL metrics using this gmetric script. This led me to these two graphs


So nearly double the number of inserts and nearly triple the updates. Using mysqlbinlog I analyzed the update and insert statements, identified the two culprit INSERT and UPDATE statements and sent them off to the developers.
I also observed that had I watched the binary log growth I might have identified this earlier, since there were a lot more binary logs for the period since the release. Thus the mysql average binary log growth rate gmetric was born. Now all I need to do is find out what the normal growth rate is and, if it goes outside of that norm, use Nagios to send me a non-urgent alert.
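The arithmetic behind such a metric is simple. The Python sketch below is not the gmetric script itself, just an illustration. It assumes binary logs are named mysql-bin.NNNNNN (this depends on your log-bin setting) and the "normal" range is whatever you determine for your own site.

```python
from pathlib import Path


def total_binlog_bytes(directory, pattern="mysql-bin.[0-9]*"):
    """Sum the sizes of all binary logs in `directory`; the file name pattern is an assumption."""
    return sum(path.stat().st_size for path in Path(directory).glob(pattern))


def growth_rate(previous_sample, current_sample):
    """Bytes per second between two (unix_timestamp, total_binlog_bytes) samples."""
    (t0, size0), (t1, size1) = previous_sample, current_sample
    if t1 <= t0:
        raise ValueError("samples must be in chronological order")
    # Purged logs shrink the total; treat that as no growth rather than negative growth.
    return max(size1 - size0, 0) / (t1 - t0)


def outside_norm(rate, normal_low, normal_high):
    """True when the growth rate falls outside the range considered normal."""
    return not (normal_low <= rate <= normal_high)


# Two made-up samples taken five minutes apart.
rate = growth_rate((0, 1_000_000), (300, 4_000_000))
print(rate)  # 10000.0
```

Feeding `rate` to gmetric gives you the graph, and `outside_norm` is roughly what the Nagios check would evaluate.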
Posted in Uncategorized | No Comments »
October 1st, 2009
I have perhaps been overly critical of clouds. I do not think clouds are useless. They certainly have their place and usefulness. I just dislike the hype around clouds since I find it completely misplaced. I see clouds primarily as a way of easily “creating” and “disposing” of hardware, i.e. if you need a couple of extra machines you can create them at the press of a button. When you are done you can dispose of them. However they also have some major drawbacks which I have alluded to in the past, i.e. in Cloud Computing’s Achilles Heel and Trouble with Cloud Computing.
That said, there are a number of ways clouds can be improved. Some may be impractical, some may be expensive and some may be overly complicated but they are certainly options. Shall we go through the list?
1. Don’t use virtualization – you could certainly implement a cloud where the instance you get runs on raw hardware. That way you are guaranteed to have a dedicated piece of hardware that is not affected by other users’ (mis)use. This would be a lot more expensive but some people may be willing to pay for it.
2. Intelligent web traffic load balancing – one of the big issues in general is that load balancing methods such as round-robin or least connections are imperfect, since a server may get slow for a number of reasons and a portion of your clients may receive a “substandard” response. The chances of that happening in shared environments are greater. Thus you could devise a load balancer which can “intelligently” figure out which server is slow and deal with it accordingly, by either taking it out of the pool or sending less traffic to it.
3. Similar to 2. for your relational DB traffic – you would have to use a DB cluster and devise a way of preferring “faster” DB servers. This part in some ways “scares” me the most, since if you start using synchronous replication you have to wait until all members of the cluster have committed the change. If you start doing asynchronous replication you run the risk of DB inconsistencies which you will have to resolve.
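To give a flavor of what “intelligent” could mean in option 2, here is a minimal Python sketch that weights servers inversely to their recent average response time and ejects anything slower than a cutoff. The server names, latencies and cutoff are hypothetical, latencies are assumed to be positive, and a real balancer would also need health checks and some smoothing of the measurements.

```python
import random


def latency_weights(avg_latency, eject_above=None):
    """Map server -> share of traffic, inversely proportional to recent average latency.

    Servers slower than `eject_above` seconds are taken out of the pool (share 0).
    """
    weights = {}
    for server, latency in avg_latency.items():
        if eject_above is not None and latency > eject_above:
            weights[server] = 0.0
        else:
            weights[server] = 1.0 / latency
    total = sum(weights.values())
    if total == 0:
        raise RuntimeError("every server was ejected from the pool")
    return {server: weight / total for server, weight in weights.items()}


def pick_server(weights, rng=random):
    """Pick one server at random, honoring the computed traffic shares."""
    servers = list(weights)
    return rng.choices(servers, weights=[weights[s] for s in servers], k=1)[0]


# Hypothetical recent average response times in seconds.
recent = {"web1": 0.1, "web2": 0.2, "web3": 5.0}
weights = latency_weights(recent, eject_above=2.0)
```

With these numbers web1 gets about two thirds of the traffic, web2 one third and web3 none. The same weighting idea could be applied to read traffic against DB replicas, though that does nothing for the consistency concerns in option 3.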
There are likely other options but you can see from the above that this gets real complicated, real quick. It can be done; it is just a lot of work and a lot of QA. Hopefully someone comes up with a generic solution for problems 2. and 3. and those become non-issues.
Posted in Cloud Computing | No Comments »