Archive for January, 2010

Password complexity madness

Friday, January 22nd, 2010

You know the pitch. Each time you create an account for a "secure" site you are forced to come up with a complex password, i.e. you need to have a number, a capitalized letter, perhaps a special character such as + or -. The trouble is that policies differ: one site requires a minimum length, another imposes a maximum length, some don't allow special characters, etc. At one point in time this made sense and was required to maintain basic security, but it may not make sense today.

Ages ago computer systems (in particular UNIX systems) used to store passwords in a hashed format. You can read more on cryptographic hashes on Wikipedia. The trouble is that these hashes were available for any user to see, i.e. you could copy the password file (/etc/passwd) or use YP/NIS tools to get a list of all password hashes in an organization. Once you have the password file you still do not know what the passwords are. However, since a particular password always converts to the same hash, you can take a word dictionary, compute the hash of each word and check whether any of them match a hash in your password file. If you find a match you have now "discovered" the user's password. This is often referred to as off-line password cracking since it allows you to derive passwords without interacting with the target system. This has many advantages since you can try millions of passwords quickly and the target system's administrator will not be alerted. Based on this fact, password policies were instituted that mandated password complexity, since a complex password, e.g. 9pc_miu, would be very hard or nearly impossible to break (it may take years). This made sense then.
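To make the off-line attack concrete, here is a minimal sketch of the dictionary approach. It is an illustration under assumptions, not how any particular cracking tool works: unsalted SHA-256 stands in for the hash (historic UNIX systems used crypt(3), which also mixes in a salt), and the user names, word list and function names are all made up for the example.

```python
import hashlib


def hash_password(password):
    # Stand-in hash for illustration only (assumption): unsalted SHA-256.
    # Old UNIX systems used crypt(3) with a salt instead.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def crack(stolen_hashes, wordlist):
    """Return {user: password} for every stored hash matched by a dictionary word."""
    # Hash every dictionary word once, then look up each stolen hash.
    lookup = {hash_password(word): word for word in wordlist}
    return {
        user: lookup[stored]
        for user, stored in stolen_hashes.items()
        if stored in lookup
    }


# Hypothetical copy of a password file: user -> stored hash.
stolen_hashes = {
    "alice": hash_password("123456"),
    "bob": hash_password("9pc_miu"),
}

# Hypothetical (tiny) word dictionary.
wordlist = ["password", "123456", "letmein"]

recovered = crack(stolen_hashes, wordlist)
print(recovered)
```

Only the account with the dictionary password is recovered. The complex password survives because it is not in the word list, which is exactly the situation the old complexity policies were designed for.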

However, it doesn't make much sense now since on most systems regular users have no access to the password hashes. On UNIX systems a "shadow" file (/etc/shadow) is used to hide them, or you may be using LDAP, which has the capability of hiding password hashes, etc. The only users that have access to those hashes are administrators, and they have other ways of acquiring your passwords anyway. Thus your real exposures, in order of importance, are

  • Trivial or easily guessable passwords, e.g. 123456, 1234, date of birth
  • Using the same password across different sites, e.g. if site A.com gets hacked and the hackers are able to determine your password, they can log into site B.com
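As a rough illustration of the first exposure, a site could reject trivial passwords with a check along these lines. This is only a sketch: the list of common passwords is a tiny made-up sample, and the function name and date formats are assumptions chosen for the example.

```python
from datetime import date

# Hypothetical, deliberately tiny sample of commonly used passwords.
COMMON_PASSWORDS = {"123456", "1234", "password"}

# Date layouts to compare against, e.g. 19800115, 15011980, 01151980.
DATE_FORMATS = ("%Y%m%d", "%d%m%Y", "%m%d%Y", "%d%m%y", "%m%d%y")


def is_trivial(password, birth_date=None):
    """Return True if the password is a common one or is built from the birth date."""
    if password.lower() in COMMON_PASSWORDS:
        return True
    if birth_date is not None:
        # Keep only the digits so that 1980-01-15 and 1980/01/15 are caught too.
        digits = "".join(ch for ch in password if ch.isdigit())
        return any(digits == birth_date.strftime(fmt) for fmt in DATE_FORMATS)
    return False


print(is_trivial("123456"))
print(is_trivial("1980-01-15", date(1980, 1, 15)))
print(is_trivial("9pc_miu", date(1980, 1, 15)))
```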

I actually feel that password complexity breeds poor security since people will write down complex passwords instead of remembering them. Just think how many times you have seen passwords on post-it notes on someone's monitor. Perhaps it is time to scrap password complexity and use something simpler.

Cool DNS tricks you can’t use for fail-overs

Wednesday, January 20th, 2010

At a previous job, for availability and business continuity reasons, we set up a geographically redundant data center, because even the best data centers will have outages. No matter what a vendor tells you, processes are never followed fully. You can also have a major disaster with critical pieces of your hardware that may cripple or disable your whole infrastructure, e.g. a switch goes crazy.

The service we provided was critical, so the highest availability was imperative. Management wanted an active-active setup, i.e. both data centers used in a load-balanced fashion. However, that would have entailed an extensive application rewrite due to the nature of our application and the level of database transactions involved. Thus we settled on a hot-cold configuration, where we would have an active site that was serving customers and a cold site that was kept up to date via replication. In case of trouble (as determined by ops) we would fail over from the hot site to the cold site. This is fairly straightforward except for the part where you are actually failing things over: your hot site is down, you break off replication, change DNS entries and start up all the necessary services, yet due to DNS caching some of your customers are still pointing to your "dead" site. Depending on your browser this could last 30 minutes or more. Did I mention this service was critical?

We went through the list of possible options for resolving this:

1. Use outside-party load balancer(s), i.e. off-site load balancer(s) that would proxy traffic to the site that was live. This seemed like a plausible idea, however we didn't like the fact that we were introducing yet another failure point and adding latency due to the extra round-trip.

2. Change the DNS TTL to 2 minutes. That was also insufficient due to differences in browser behavior. For example, IE 6 (perhaps even later versions) will cache DNS entries for 30 minutes:

http://support.microsoft.com/kb/263558

3. Use round-robin DNS, a.k.a. multiple DNS A records, with a "twist"

What we did there was put both of our data centers' IPs into the A record for our site, i.e.

www.domain.com   IN A 1.2.3.4
www.domain.com   IN A 9.8.7.6

What happens with most browsers is that they will attempt the first IP and, if they get a connection refused, they will try the next (and the next if you have more than 2). This actually works quite well, e.g. even if the browser was being served from 1.2.3.4, if 1.2.3.4 all of a sudden goes down it will fail over to 9.8.7.6 in sub-second time. The "twist" we added was that we only answered on the active colo's IP and refused connections on the inactive one. If we needed to fail over we'd just activate one colo and deactivate the other. Quick failovers here we come :-) .
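The client-side behavior this trick relies on can be sketched roughly as follows. This is a simplified model of what a browser does, not actual browser code: the function name, the port and the timeout are assumptions, and the `connect` parameter exists only so the logic can be exercised without a real network.

```python
import socket


def connect_first_available(addresses, port=80, timeout=1.0,
                            connect=socket.create_connection):
    """Try each address in order and return (address, connection) for the first success."""
    last_error = None
    for address in addresses:
        try:
            return address, connect((address, port), timeout)
        except OSError as exc:
            # Covers "connection refused" as well as timeouts; move on to the next IP.
            last_error = exc
    raise ConnectionError(f"no address accepted a connection: {last_error}")


def fake_connect(target, timeout):
    # Stand-in for the network: 1.2.3.4 is the inactive colo and refuses connections.
    host, _port = target
    if host == "1.2.3.4":
        raise ConnectionRefusedError("connection refused")
    return f"connection to {host}"


# Both A records as a resolver might hand them to the client.
a_records = ["1.2.3.4", "9.8.7.6"]
address, connection = connect_first_available(a_records, connect=fake_connect)
print(address)
```

Note that the fallback only happens because the inactive address actively refuses the connection. Anything in the path that accepts the connection on its behalf defeats the logic.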

This all worked great for some time, until we started receiving isolated reports that people weren't able to access our site. Investigating the issue further, we discovered that all of the people having connectivity issues were behind a transparent HTTP proxy. In this particular case the transparent proxy would not return connection refused but "page not found" or something similar, neutralizing our clever hack :-( .

Obviously, if your audience is different and you know your users don't use proxies you could use this approach, however this doomed it for us.

Monitoring your website performance via 90th percentile response time

Friday, January 15th, 2010

There are numerous ways to monitor the health and performance of your web site. Some of the popular ways are

  • measure response time of a particular URL on your site. If it exceeds a threshold (which is site dependent) it is time to investigate
  • compare pertinent metrics such as the number of created sessions, http connections, etc.
  • watch CPU utilization/load of the machine

Unfortunately most of these are flawed since they don't provide you with the most important metric, which is how fast the site is for your customers. The above metrics are not useless and do help paint the picture, but they may give you a false sense of how fast your site is: the URL you are checking may be quite fast while some other part of the site, perhaps due to a newly introduced feature, may be behaving terribly. I have found that one of the best metrics to watch is the 90th percentile request response time. Basically, you take every request passing through your web servers, log the time it takes to serve it, sort the times from fastest to slowest, then take the 90th percentile time. Therefore if your 90th percentile is 1 second, it means that 90% of the requests have been served in under a second and 10% in more than a second. You may be asking yourself "so what?". Here is why.

So for at least a couple of minutes 10% of your visitors/requests were waiting for more than 17 seconds to have their requests served. That can't be good for business and you may want to investigate the cause.
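The calculation itself is small. Here is a minimal sketch using the nearest-rank method, which is one of several common percentile definitions and not necessarily the one used by the tooling linked below. The function name and the sample response times are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)  # fastest to slowest
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds, one per logged request.
response_times = [0.2, 0.3, 0.25, 0.4, 0.35, 0.5, 0.45, 0.6, 0.9, 17.0]

p90 = percentile(response_times, 90)
print(f"90th percentile: {p90} s")
```

In practice you would compute this over a short sliding window (say, the requests of the last minute) and graph the result over time.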

You could also consolidate response times from different web servers on one graph and you get this.

It may not look like much but it is pretty clear if an individual web server starts acting up.

How do you get in on the fun? You can look at the steps for adding Apache real-time metrics, which also cover the 90th percentile response time, at this URL

http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats

I want to thank Ben Hartshorne (@maplebed) for making me aware of this metric.