Quantcast

Archive for the ‘Monitoring’ Category

Analyzing your backend web page response times

Thursday, July 15th, 2010

I have blogged about in the past about some of the ways you can monitor your web site performance e.g how to monitor your site using 90th percentile response times, beauty of aggregate line graphs and tracking web clients in real time.

Most recently we wanted to get better insight into how our site and more specifically backend is performing. We wanted a tool that could provide us with per URL/page metrics such as

  • total number of requests
  • aggregate compute time
  • average request time
  • 90th percentile time (you can find more explanation what it means at monitor your site using 90th percentile response times) - this eliminates most of the really slow response times that may really affect your averages

Initial plan was to build a basic set of reports to tell us what are the pages with excessive response times or large total (aggregate) compute times. Next and yet to be implemented portion was to be able to analyze data in real time so that we'd have another data point to use in troubleshooting in case there is a site slow down.

Basic requirements for the tool were these

  • Capable of crunching 100+ million daily entries
  • Real-time analysis
  • Produce multiple metrics with potential to add more down the line
  • Low footprint

An obvious way to do this is to store all data in a heavy duty data store like a relational/SQL database or something MapReduce capable. Trouble is we may be doing in logging in excess of 3,000 hits per second (all dynamic content as static assets are served from the CDN). Doing that many inserts per second on a SQL-type database will be tricky unless you have powerful hardware. Next obvious problem is to scan through hundreds of millions or billions of rows will be slow even if I use MapReduce unless of course you throw tons of hardware at it. We wanted a low footprint remember.

Instead we decided to go with a key/value store. Major pluses were that footprint is relatively low and it performs very fast. Downside was I would not be able to run any sophisticated queries. Since we already have an app that uses memcached to give us real-time view per IP number of accesses we ended up using it for this purpose as well.

Implementation

I have been working for a while now with ganglia-logtailer which is a Python framework to crunch log data and submit it to Ganglia. There are a number of good pieces from it we could reuse and we did. What we ended up is a two part tool. A Python based log parsing piece and a PHP based web GUI and computation part. Division of "labor" was roughly this

  • Python part parses the logs and creates entries/keys where the value in each key represent all the response times observed on a particular server and URL in a particular time period ie. one hour
  • PHP part takes the list once the time period has ended, calculates total time, average time and 90th percentile times and stores computed values in memcache so that retrieval later can be quicker.

Graphing is achieved using simple CSS graphs while time based series are done using OpenFlashChart. I did look at Dygraphs for Javascript/DHTML based graphing however couldn't figure how to plot hourly values. I could only do daily values.

Tool is operational and so far it has led us to the realization that our mobile web pages are overall much slower than their corresponding web pages. This is due to the way we handle mobile ads since most feature phones don't support Javascript so we have to download the ad which introduces a slight delay. We did figure out that we could use Javascript on Webkit browsers similar to what we do for regular browsers so that should help a bit. We are also chasing some of the other "leads" regarding inconsistent performance for particular pages on some of the servers.

Next steps are to adapt parsing code to work with ganglia-logtailer which would give us real-time reporting. I don't expect too many problems with that. Also graphing could use some more love. Perhaps I'll even do standard deviation calculations :-) .

Anyways you can download source code from here

http://github.com/vvuksan/pagetime-analyzer

You know what to do :-) .

Obligatory screenshots

Hourly overview sorted by aggregate time in seconds (you can sort by any column)

This is the average response time (over an hour) for a particular URL on separate server instances

Daily view of performance for a particular URL

Store your cron output for analysis and correlation with cronologger

Tuesday, July 6th, 2010

For the longest time I have wanted to get rid of dozen or so cron messages I receive every morning about things like DB backups, DB cleanups/vacuums, reporting etc. There are a number of solutions out there to help you manage the cron spam such as cronic, shush and cronwrap. They help by e-mailing you only if there is a problem however don't store the cron output itself. To get around that issue I have developed cronologger which can be downloaded from

http://github.com/vvuksan/cronologger

Cronologger is a BASH script that stores all the cron output into a database. I am using CouchDB since it is a great document oriented database that allows me to add attachments (blobs) to a document. I assume it would not be hard to use MongoDB, Riak and others.

Some of the benefits of this utility are

  • Reduce cron spam
  • Provide the ability to correlate adverse affects by overlaying cron events on e.g. Ganglia graphs
  • Provide a better report of all the batch jobs that ran, diff them with past jobs if they should look the same, etc.
  • Provide the ability to easily view what is currently running on the whole infrastructure ie. job_duration < 0
  • Review historical output

I am still working on web GUI for most of these things. I will gladly accept patches and new contributions.

Tip: To get view a list of documents in a CouchDB database you can use the _utils view e.g. http://localhost:5984/_utils/

Overlay deploy timeline on Ganglia graphs

Monday, June 28th, 2010

Don't you sometimes wish you could have a visual indicator of when code has been deployed in production. Something like this :-)

Shows deploy time line on a load graph

This is how you can add deploy timeline to your Ganglia graphs or for that matter to any tool that uses RRDs such as Cacti, Munin, Collectd etc.

Background

RRDtool supports so called VRULEs which are

VRULE:time#color[:legend][:dashes[=on_s[,off_s[,on_s,off_s]...]][:dash-offset=offset]]

Draw a vertical line at time. Its color is composed from three hexadecimal numbers specifying the rgb color components (00 is off, FF is maximum) red, green and blue followed by an optional alpha. Optionally, a legend box and string is printed in the legend section. time may be a number or a variable from a VDEF. It is an error to use vnames from DEF or CDEF here. Dashed lines can be drawn using the dashes modifier. See LINE for more details.

What we want to do is add a VRULE for each deployment. For example those three lines above have been generated using these VRULEs

VRULE:1277731886#FF00FF:"Deploys" VRULE:1277721886#FF00FF VRULE:1277711886#FF00FF

Implementation

Easiest way to add these to Ganglia is to modify graph.php in Ganglia Web. You need to look for following two lines at the end of the file

$command .=  array_key_exists('extras', $rrdtool_graph) ? ' '.$rrdtool_graph['extras'].' ' : '';
$command .=  " $rrdtool_graph[series]";

Then append your own VRULEs ie.

$command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";

Obviously you have to pull in the $time info from where you keep track of your deploy times. You can also get creative by using different colors for different deploys, change legend labels, add VRULEs to only certain graphs ie. load, CPU etc. This is a quick and dirty way to do it

$deploy_times = array(1278082860,1279393200);
foreach ( $deploy_times as $key => $time ) {
  # Put deploys label only once.
  if ( $key == 0 )
     $command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";
  else
     $command .= " VRULE:" . $time . "#FF00FF";
}

Now you just have to make sure you append deploy times in the array.

Alternate implementations

Alternate implementation is to create a RRD file whenever you do deploys then overlay that graph on top of an existing graph. Trouble is you have to worry about scaling the graph. Never could get it quite right.

Credit

Thanks goes to the Circonus guys :-) since they made me think of vertical lines instead of trying the RRD overlay. Also thanks to @toredash for pointing me in the right RRDtool direction by suggesting HRULE.

Beauty of aggregate line graphs

Saturday, June 5th, 2010

If you saw a graph like this

90th percentile response time consolidated line graph

Would it mean anything to you :-) ? First time I was introduced to it I thought they were pointless since you couldn't really see much. That was until I saw something like this

Netstat consolidated line graph

This was was post release. Can you spot something wrong :-) ? Obviously color scheme is somewhat off in the last graph which we later reworked (visible in the top graph). We then have another set of graphs where you can drill down per host aggregations as we are running multiple Resin instances on the same machine so you could find the misbehaving instance.

You can make these graphs pretty easily by using Ganglia's custom report graphs. I will try and post some of the ones we use in next couple days.

For those wondering what is 90th percentile response time you can read my Monitoring your website performance via 90th percentile response time.

Tracking web clients in real time

Tuesday, April 20th, 2010

Most recently I have been working on being able to more quickly identify abusers of our service ie. spammers, crawlers etc. We already have a process that rotates web logs on all web servers hourly then processes them extracting per IP access info. On occasion abusers get quite aggressive and cause some of our alarms to go off by causing excessive number of log errors etc. Trouble is that due to logs being processed on the hour there is a window of time where we may spend extra time trying to track down the cause of log errors. I figured it would help if the IP tracker was real-time. Luckily we have already been using a package called Ganglia Logtailer

http://bitbucket.org/maplebed/ganglia-logtailer/

which processes our web logs every minute and publishes metrics such as number of HTTP 200/300/400/500 hits, average and 90th percentile response time. All I had to do was send the IP data to a storage engine of my choice. Initially I thought I could use mySQL however decided against it due to following reasons

  1. Currently we can get up to 2500 hits/sec so processing them on the minute would result in roughly 150k inserts which mySQL may have some trouble processing in short amount of time.
  2. I don't need this data after couple hours.

I looked at Redis which has some interesting features around sets however I decided to use memcached since we were already using it and if I ever wanted to use a more persistent storage engine I could replace it with memcachedb or Tokyo Cabinet with no changes to the code.

Implementation

Implementation consists of two pieces

1. Modified Ganglia Logtailer class that inserts data into memcached. You can find a VarnishMemcacheLogtailer class on the Bit Bucker logtailer site which implements this. All you have to do is modify the location of the memcached server (set to localhost). Current implementation aggregates data per hour ie. all the numbers are hourly numbers. It would be trivial to do it for 10 minute or 1 minute periods.

2. Client application that displays data from memcached. I wrote a PHP interface that shows top 20 IPs from the web servers that can be downloaded from here

http://bitbucket.org/vvuksan/realtime-iptracker

Tracker looks something like this

Update: I do realize Splunk would be great for this kind of a purpose. Trouble is that for the amount of logs we create we'd have to get a really large Splunk license and those are quite expensive.

Monitoring your website performance via 90th percentile response time

Friday, January 15th, 2010

There are numerous ways to monitor the health and performance of your web site. Some of the popular ways are

  • measure response time of a particular URL on your site. If it exceeds a threshold (which is site dependent) it is time to investigate
  • compare pertinent metrics such as the number of created sessions, http connections, etc.
  • watch CPU utilization/load of the machine

Unfortunately most of these are flawed since they don't provide you with the most important metric and that is how fast is the site for you customers. Above metrics are not useless and do help paint the picture but they may provide you a false sense of how fast your site is since the URL you are checking may be behaving quite fast however some other part of the site due to a newly introduced feature may be behaving terribly. I have found one of the best metrics to watch is the 90th percentile request response time. Basically, you take every request passing through your web servers, log the time it takes to serve them, sort them from fastest to slowest then take the 90th percentile time. Therefore if your 90th percentile is 1 second it means that 90% of the requests have been served in under a second and 10% in more than a second. You may be asking yourself "so what?". Here is why ?

So for at least couple minutes 10% of your visitors/requests were waiting for more than 17 seconds to have their requests served. That can't be good for business and you may want to investigate the cause.

You could also consolidate response times from different web servers on one graph and you get this.

It may not look like much but it is pretty clear if an individual web server starts acting up.

How do you get on the fun ? You can look at the steps how to add Apache real-time metrics which also covers the 90th percentile response time on this URL

http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats

I want to thank Ben Hartshorne (@maplebed) for making me aware of this metric.

Nagios alerts based on Ganglia metrics

Monday, September 14th, 2009

Have you ever wanted to alert based on Ganglia metrics. Well you can :-)

You can find the source code here for the plug in here.

Instructions how to set it up are here.