Archive for the ‘Systems Management’ Category

Tracking web clients in real time

Tuesday, April 20th, 2010

Recently I have been working on identifying abusers of our service (spammers, crawlers, etc.) more quickly. We already have a process that rotates web logs on all web servers hourly and then processes them, extracting per-IP access info. On occasion abusers get quite aggressive and set off some of our alarms by causing an excessive number of log errors. The trouble is that because logs are processed on the hour, there is a window of time in which we may spend extra time tracking down the cause of the log errors. I figured it would help if the IP tracker was real-time. Luckily we have already been using a package called Ganglia Logtailer

http://bitbucket.org/maplebed/ganglia-logtailer/

which processes our web logs every minute and publishes metrics such as the number of HTTP 200/300/400/500 hits and the average and 90th percentile response time. All I had to do was send the IP data to a storage engine of my choice. Initially I thought I could use mySQL, however I decided against it for the following reasons

  1. Currently we can get up to 2500 hits/sec, so processing them every minute would result in roughly 150k inserts (2500 x 60), which mySQL may have some trouble processing in a short amount of time.
  2. I don't need this data after a couple of hours.

I looked at Redis, which has some interesting features around sets, however I decided to use memcached since we were already using it, and if I ever wanted a more persistent storage engine I could replace it with memcachedb or Tokyo Cabinet with no changes to the code.

Implementation

Implementation consists of two pieces

1. A modified Ganglia Logtailer class that inserts data into memcached. You can find a VarnishMemcacheLogtailer class on the Bitbucket logtailer site which implements this. All you have to do is modify the location of the memcached server (set to localhost). The current implementation aggregates data per hour, i.e. all the numbers are hourly numbers. It would be trivial to do it for 10 minute or 1 minute periods.

2. A client application that displays data from memcached. I wrote a PHP interface that shows the top 20 IPs from the web servers. It can be downloaded from here

http://bitbucket.org/vvuksan/realtime-iptracker
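The two pieces above can be sketched roughly as follows. This is a minimal illustration, not the actual VarnishMemcacheLogtailer or PHP tracker code: a plain dict stands in for memcached, and the key layout (`ip-<hour>-<address>`) and the function names are assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts: float) -> str:
    """Return an hourly bucket label such as '2010042014' (UTC) for a Unix timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d%H")


def record_hits(store: dict, hits) -> None:
    """Add (timestamp, ip) pairs to per-hour, per-IP counters.

    `store` is a plain dict standing in for memcached; the key layout
    'ip-<hour>-<address>' is hypothetical.
    """
    counts = Counter((hour_bucket(ts), ip) for ts, ip in hits)
    for (bucket, ip), n in counts.items():
        key = f"ip-{bucket}-{ip}"
        store[key] = store.get(key, 0) + n


def top_ips(store: dict, bucket: str, limit: int = 20):
    """Return the `limit` busiest IPs for one hourly bucket, busiest first."""
    prefix = f"ip-{bucket}-"
    rows = [(k[len(prefix):], v) for k, v in store.items() if k.startswith(prefix)]
    # Sort by hit count descending, then by address so ties are stable.
    return sorted(rows, key=lambda row: (-row[1], row[0]))[:limit]
```

Note that memcached itself cannot list keys by prefix, so a real version would also have to keep track of which IPs it has seen in each hour. Switching to 10 minute or 1 minute periods would only mean changing the bucket label.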

Tracker looks something like this

Update: I do realize Splunk would be great for this kind of purpose. The trouble is that for the amount of logs we create we'd have to get a really large Splunk license, and those are quite expensive.

Building Redhat/CentOS KVM images on Ubuntu 9.10

Thursday, March 11th, 2010

This is a quick recipe for creating a Redhat/CentOS KVM image on Ubuntu 9.10 (karmic). First make sure you have Virtualization (VT) turned on. For example, Dell laptops have it disabled by default; go into the BIOS and enable it. To check whether it is turned on run

egrep '(vmx|svm)' /proc/cpuinfo

If this comes out empty VT is not enabled and KVM will not work.
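If you would rather do this check from a script, the same test can be sketched in Python. This is only an illustration; the function name is made up for the example, and unlike the egrep above it looks only at the `flags` lines and matches whole flag names.

```python
def virtualization_flags(cpuinfo_text: str) -> set:
    """Return which of the VT flags (vmx for Intel, svm for AMD) appear in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            # Flags are whitespace separated; keep only the two we care about.
            found |= {"vmx", "svm"} & set(value.split())
    return found
```

On a Linux machine you would call it as `virtualization_flags(open("/proc/cpuinfo").read())`. An empty result means VT is not enabled.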

Install kvm packages

sudo apt-get install qemu-kvm

Edit /etc/qemu-ifup to add virbr0 as the bridge to which the KVM guest should attach itself. Comment out the first line below and add the second, e.g.

#/usr/sbin/brctl addif ${switch} $1
/usr/sbin/brctl addif virbr0 $1

The same change needs to be made in /etc/qemu-ifdown, i.e.

#/usr/sbin/brctl delif ${switch} $1
/usr/sbin/brctl delif virbr0 $1

Download CentOS 5.4 Boot ISO image e.g.

wget http://www.gtlib.gatech.edu/pub/centos/5.4/isos/x86_64/CentOS-5.4-x86_64-netinstall.iso

Create an empty image (last argument is the image size)

kvm-img create -f qcow2 centos5.img 10G

Launch install (-m is memory size)

sudo kvm -hda centos5.img -cdrom CentOS-5.4-x86_64-netinstall.iso -m 512 -boot d \
       -net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap

Install CentOS however you like. When you are done, your CentOS install will reboot and try to boot off the CD-ROM again. At this point shut down the KVM guest by closing the window. To run the installed system, remove the -cdrom and -boot options, e.g.

sudo kvm -hda centos5.img -m 512 \
       -net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap

Note: I am setting a fixed MAC address. You can leave it off and it will be generated randomly every time you start up a kvm instance.

Monitoring your website performance via 90th percentile response time

Friday, January 15th, 2010

There are numerous ways to monitor the health and performance of your web site. Some of the popular ways are

  • measure response time of a particular URL on your site. If it exceeds a threshold (which is site dependent) it is time to investigate
  • compare pertinent metrics such as the number of created sessions, http connections, etc.
  • watch CPU utilization/load of the machine

Unfortunately most of these are flawed since they don't provide you with the most important metric: how fast the site is for your customers. The above metrics are not useless and do help paint the picture, but they may give you a false sense of how fast your site is. The URL you are checking may be behaving quite fast while some other part of the site, due to a newly introduced feature, may be behaving terribly. I have found one of the best metrics to watch is the 90th percentile request response time. Basically, you take every request passing through your web servers, log the time it takes to serve each one, sort them from fastest to slowest, then take the 90th percentile time. Therefore if your 90th percentile is 1 second, it means that 90% of the requests have been served in a second or less and 10% took longer. You may be asking yourself "so what?". Here is why:

So for at least a couple of minutes 10% of your visitors/requests were waiting for more than 17 seconds to have their requests served. That can't be good for business and you may want to investigate the cause.

You could also consolidate response times from different web servers on one graph and you get this.

It may not look like much but it is pretty clear if an individual web server starts acting up.
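To make the calculation concrete, here is a minimal sketch of the sort-and-pick step described earlier. It uses the simple nearest-rank definition of a percentile, which is an assumption on my part; Ganglia Logtailer may compute it slightly differently.

```python
def percentile(values, pct: int = 90):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)                    # fastest to slowest
    rank = (pct * len(ordered) + 99) // 100     # ceil(pct% of n), as a 1-based rank
    return ordered[max(rank, 1) - 1]
```

For example, with eight requests served in 1 second and two served in 17 and 18 seconds, `percentile(...)` returns 17 even though most requests were fast, which is exactly the kind of problem an average tends to hide.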

How do you get in on the fun? The steps for adding Apache real-time metrics, which also cover the 90th percentile response time, are at this URL

http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats

I want to thank Ben Hartshorne (@maplebed) for making me aware of this metric.

Quick way to determine SSL certificate expiration

Tuesday, December 1st, 2009

If you need a quick way to determine when a certain SSL certificate expires you can use one of the following approaches. In both examples the server I am trying to check is called webserver.domain.com.

If you have Nagios plugins installed you could type

# /usr/lib/nagios/plugins/check_http -p 443 -S -C 15 webserver.domain.com
CRITICAL - Certificate expired on 11/01/2009 11:23.

That's easy. However, what if you don't have Nagios plugins? In that case you can do the same with OpenSSL and s_client. Look for the notAfter field.

# echo | openssl s_client -connect webserver.domain.com:443 | openssl x509 -noout -dates
...
notBefore=Nov  1 11:23:30 2008 GMT
notAfter=Nov  1 11:23:30 2009 GMT

Easy :-) .
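If you want to turn the notAfter value into a number of days (for a cron job, say), a small sketch using Python's standard ssl module could look like this. The function name is made up for the example, and it expects the date string exactly as openssl prints it after `notAfter=`.

```python
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (Unix time) and a notAfter date; negative if already expired."""
    expires = ssl.cert_time_to_seconds(not_after)  # parses e.g. 'Nov  1 11:23:30 2009 GMT'
    return (expires - now) / 86400.0
```

In practice you would pass `time.time()` as `now` and warn when the result drops below some threshold, such as the 15 days used in the check_http example.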

Don’t let mySQL substitute engines

Thursday, November 5th, 2009

Word of warning to all who use mySQL (yes, you poor souls). By default mySQL 5.0 and 5.1 will substitute storage engines if the one you requested is not available. It doesn't happen too often, but when it does it is quite bad. For instance, when I was setting up a new mySQL database something went wrong during creation of the InnoDB logs, and mySQL decided to DISABLE the InnoDB storage engine. Unfortunately this was not caught, and DBs were built that really needed InnoDB since they required foreign keys and other fun stuff. In their "awesomeness" mySQL developers decided that the default behavior should be to simply substitute (replace) InnoDB with myISAM. A warning is issued, but no error message is displayed and an import will continue unabated. Thus in my case things worked for a while until oddities were discovered, which were traced back to the engine substitution. Unfortunately at that point it is fairly difficult to fix the problems since some of the constraints may be broken.

To avoid such a situation make sure you add the following statement to the [mysqld] section of my.cnf

sql_mode="NO_ENGINE_SUBSTITUTION"

To verify which engines are active, type the following at the mySQL shell prompt

SHOW ENGINES;

Infrastructure redundancy is not cheap

Tuesday, October 6th, 2009

There was quite a discussion on Twitter about the BitBucket outage, which initially appeared to be a failure of Amazon EC2/EBS. More about the outage can be found here. Bret Piatt was kind enough to write up his view of the situation

http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-fundamental-design-concept/

In principle I do agree with his suggestions and his conclusion, i.e. that availability is a fundamental design concept. I do however disagree that "warm" redundancy is cheap. In my own view and experience redundancy is extremely expensive if you are going to do it right. Redundancy is not just a matter of adding more hardware, systems, monitoring software and failover policies; it is a matter of process, where you continuously have to make sure that the redundancy works. For instance, a successful backup strategy doesn't consist of simply getting a backup device and never testing the backups by doing an actual restore. As many organizations have discovered, backups do break, media gets corrupted, etc., and you can suffer a devastating blow. So if you want to do redundancy right you have to invest lots and lots of time practicing. For example, running fire drills is a useful tool, as is doing periodic site failovers, i.e. run on site A for two weeks, then during low traffic times fail over to site B, run for two weeks, then go back to site A, and on and on. That certainly ain't cheap.

I'd also point out that "warm" redundancy is in many instances riskier than "hot" redundancy, since you may only discover that redundancy doesn't work when you have to fail over, whereas with "hot" redundancy issues may crop up much earlier, allowing you to stay on top of them more readily.

That said, the discussion over how you are responsible for your own availability reminds me of "individual responsibility" (for my international readers, this is a hot topic in the United States). Sure, you should "own" your redundancy, however that may often be impractical or too expensive. Not everyone is blessed with copious resources.

Nagios alerts based on Ganglia metrics

Monday, September 14th, 2009

Have you ever wanted to alert based on Ganglia metrics? Well, you can :-)

You can find the source code for the plug-in here.

Instructions on how to set it up are here.