Beauty of aggregate line graphs

June 5th, 2010

If you saw a graph like this

90th percentile response time consolidated line graph

Would it mean anything to you :-) ? The first time I was introduced to graphs like this I thought they were pointless since you couldn't really see much. That was until I saw something like this

Netstat consolidated line graph

This was post release. Can you spot something wrong :-) ? Obviously the color scheme is somewhat off in the last graph, which we later reworked (visible in the top graph). We also have another set of graphs where you can drill down into per host aggregations, since we are running multiple Resin instances on the same machine, so you can find the misbehaving instance.

You can make these graphs pretty easily by using Ganglia's custom report graphs. I will try to post some of the ones we use in the next couple of days.

For those wondering what 90th percentile response time is, you can read my post Monitoring your website performance via 90th percentile response time.
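As a rough illustration of the metric itself, here is a minimal Python sketch that computes a percentile with the nearest-rank method. The function name and the sample response times are made up for illustration, and this is not necessarily how Ganglia or any logtailer computes it.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 gives the 1-based rank
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical response times (seconds) for ten requests, one of them very slow
samples = [0.12, 0.10, 0.15, 0.11, 0.09, 0.13, 0.14, 2.50, 0.10, 0.12]
p90 = percentile(samples, 90)
average = sum(samples) / len(samples)
```

With these samples the single slow request drags the average well above what most visitors saw, while the 90th percentile stays close to the typical response time.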

Devops homebrew part deux

May 27th, 2010

This is the second part to the devops homebrew post.

I forgot a couple of things in my first post so here are a couple of other observations

Change is an ongoing process

All the changes I talked about in the first post took a long time. It took more than a year to get issues assessed, discussed, designed, implemented and tested, so don't expect quick progress. It's like open heart surgery where you don't have time to stop everything and start from scratch.

No hardcoded paths

Perhaps this one should be obvious, however it is really important to make the app relocatable ie. the app should assume all the files it needs are within its container. This means that every file reference should be relative to the base container directory e.g. all the WARs and configuration files should be placed in /run/base and the startup script would pass that as a variable ie. -DBASEDIR=/run/base. The application should then use BASEDIR instead of /run/base.
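The same idea in a few lines of Python, purely as a sketch: it assumes the base directory arrives in an environment variable named BASEDIR (standing in for the -DBASEDIR Java system property), and the file name is hypothetical.

```python
import os

def resolve(relative_path, env=os.environ):
    """Build a path under the base directory instead of hardcoding /run/base."""
    base = env.get("BASEDIR")
    if not base:
        raise RuntimeError("BASEDIR is not set")
    return os.path.join(base, relative_path)

# Hypothetical config file; only the base directory differs between installs
config_path = resolve("conf/app.properties", {"BASEDIR": "/run/base"})
```

Moving the app to another directory then only requires changing the value passed in by the startup script.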

Tools, tools, tools

One of the critical operations responsibilities is providing and building tools for use by other groups such as technical support, development, QA etc. This goes beyond using tools for configuration management and deployment and includes building tools that enable other groups to do their jobs more effectively. For instance at one job we used to interface to hundreds of external LDAP/IMAP sources for authentication/authorization purposes. This was fraught with problems since the owners of these services would often e.g. misconfigure firewalls (not whitelist the right IP), have expired or self-signed SSL certificates, use wrong LDAP base DNs etc. This would chew up a lot of professional services, dev and ops time since looking at the application logs often gave incomplete answers. It could also take a couple of iterations to fix the problem, chewing up even more time. We ended up building a simple web page that enabled professional services to quickly validate the service ie. does DNS resolve, can I open up a TCP connection to the target port, is the SSL certificate expired etc. This greatly reduced the work load and time to resolution. In another job technical support would often need production settings, however due to compliance reasons they couldn't have unfettered access to the systems. For them we built a web app that allowed a read-only view of the needed settings. I'm sure you can think of other cases where a little automation can yield you huge efficiencies.
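The three checks mentioned above (DNS, TCP connection, certificate expiry) can be sketched with the Python standard library. This is not the tool we built, just an illustration; the function names and the default port 636 (LDAP over SSL) are assumptions.

```python
import socket
import ssl
import time

def dns_resolves(host):
    """True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, None)) > 0
    except socket.gaierror:
        return False

def tcp_port_open(host, port, timeout=5.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def days_until(not_after, now=None):
    """Days between now and a certificate notAfter string
    such as 'Jan  5 09:34:43 2018 GMT' (negative if in the past)."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0

def cert_days_left(host, port=636, timeout=5.0):
    """Days until the server certificate expires.
    An expired or self-signed certificate fails the handshake with
    ssl.SSLError, which is itself a useful diagnostic."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return days_until(tls.getpeercert()["notAfter"])
```

Running the checks in this order mirrors how you would troubleshoot by hand: there is no point testing the certificate if the port is not reachable.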

Use underpowered QA environments

This may be controversial since lots of people are of the opinion that QA should be as close to an exact replica of production as possible. This is true if you are doing performance tests, however if you have an underpowered environment some issues are likely to crop up that otherwise wouldn't. It is very hard to simulate production load so having underpowered environments gives you valuable data points. For example our primary QA environment ran on a couple of virtualized servers with a modest disk space allocation ie. 10 GB. On more than one occasion we caught serious code deficiencies when the growing query log (turned on in QA) triggered low disk space alerts. If we had bigger disks we might have missed these. This doesn't preclude having a separate environment just for running performance tests; just use the underpowered environment for everything else.

Dev vs ops

There is often conflict between dev and ops due to stereotypes and poor communication, but very often also due to misaligned business goals. For instance I have very often seen/experienced conflict with devs when they were under intense pressure to deliver a feature on a tight deadline. This often happens in startups that cater to large businesses, universities or government organizations where a large sales deal is contingent on a particular feature being implemented. It leads to poor implementation, poor QA, production issues etc. which coupled with poor division of labor causes frustration and resentment. Being woken up numerous times in the middle of the night due to a production issue quickly wears people out. Therefore it is important to strike a balance between ops and dev goals and overall business goals.

One of the possible approaches is to get together and discuss the following issues

  • Ops, dev and QA should jointly assess new product functionality and how it affects each of these groups. Very often product management and sales and marketing will discuss new features only with dev who may not appreciate the difficulty of certain ops decisions.
  • Division of responsibility - discuss whose responsibility it is to fix things when they break. There is a spectrum here, from ops doing first level troubleshooting and then handing off to developers, to developers running and deploying in production with ops in a supportive role running the services and tools that enable the application.
  • Off hours coverage - this is probably the most contentious one since no one likes being woken up at night, however developers should be on the hook for "pager duty". It doesn't have to be regularly but at least once in a while. That is really the only way for them to walk in ops' shoes. For some organizations this may be a non-issue since their stuff never breaks in off hours ;-) .
  • Ops should involve devs in running production by educating them about monitoring and performance gathering systems so that they can see the effect of their coding first hand. For instance you can implement "monitoring duty" where each week someone different from either the dev or ops team is tasked to review performance metrics looking for things that are out of whack.
  • Discuss how you can make each other's lives easier. There are always areas where you can complement each other's skills and create something that helps everyone.
  • Most importantly, don't forget that a dose of humility goes a long way :-) .

Vonage the new Baby Bell

May 13th, 2010

It is sometimes amazing to me how new upstarts morph into their own arch enemies. Case in point is Vonage. For years I used to have Vonage service at home as a backup phone service. I was on a 500 minute plan for $14.99+taxes. This was a great plan for me as I didn't use the phone much. However at some point they decided that was too little money and they hiked up the price to $16.99 (something like that). It may seem like a small difference but I figured I may be better off elsewhere. I ended up switching to Galaxy Voice, which I am using to this day, since they had more flexible calling plans.

We recently expanded our office space and we needed a phone line added to a conference room. Since I had my old Vonage adapter at home I figured I would bring it and we'd use it. I thought it would be as easy as going to Vonage's web site, supplying the phone adapter ID and my credit card number and I would be set. It wasn't so. After entering the phone ID I got this message

The MAC address you entered is associated with an existing Vonage account. Please call our Customer Care department at 1-866-293-5676 for immediate assistance.

I called the number and spoke to someone in customer service. This took about 20 minutes while the person kept re-asking for the same data and concluded that they couldn't help me and that I would have to talk to tech support. The tech support guy was equally unhelpful. Basically I could not activate a device that was ever used before since the system "knew" about it. Talk about having a piece of useless technological trash. At that point I was sufficiently frustrated to end the call. I tweeted about my experience and a day later I was contacted by Vonage's Twitter team about having someone at customer service contact me. I thought I'd give it a go. I got a call and this experience was not a whole lot better than the previous ones. The person kept asking me for my personal information including name, billing address, the credit card number I used for paying bills and the e-mail address I used. Since this was more than a year ago and I have dozens of e-mail addresses I said I couldn't remember. At that point I ended the call since I was sufficiently frustrated. I was willing to give these people money yet they were making me jump through all these hoops. I don't get it.

It occurred to me later that this was very similar to experiences that I had with a local phone company when I would move and I would have to get through all these bureaucratic hoops to make sure all my features stayed the same after I moved.

Installing RedHat 6 Enterprise DomU under Xen

May 11th, 2010

Recently I downloaded the RedHat 6 Enterprise beta (RHEL6). I wanted to install it as a Xen guest (DomU) on top of an existing CentOS 5 Xen host. Unfortunately it did not work out of the box. I ran

virt-install --prompt

on the Xen host which let me install RHEL6 however when the install rebooted I was greeted with this error message

fs = fsimage.open(file, get_fs_offset(file))
IOError: [Errno 95] Operation not supported

Fortunately Karanbir Singh had a blog post about this at

http://www.karan.org/blog/index.php/2010/04/28/rhel6-xen-domu-on-a-centos-5-dom0

The differences I found were that I had to make the root partition an ext2 filesystem as well. I also found that I couldn't review the partition layout if I ran the installation in text mode; I had to use VNC to be able to set proper partition types.

Customizing iomega StorCenter ix4-200d with ipkg

April 28th, 2010

I have the iomega StorCenter ix4-200d. It is a nice little NAS with a number of decent features including an rsync server. Unfortunately there were a couple of things I wanted fixed, for example rsync was at version 2.6.9 which does not support incremental recursion (added in rsync 3.0). The machine runs a custom Linux distribution so I figured someone must have figured out how to customize it. I found part of the answer here

www.krausam.de/?p=33

To enable SSH you need to log in as administrator to your StorCenter then go to https://<storcenterIP>/support.html. Turn on SSH access. StorCenter will reboot. Then you will be able to ssh into the box as root where password is your admin password with soho prepended ie. if your web gui password is secret then root password is sohosecret.

The post has a way to bootstrap Debian on the box, however I found an easier solution ie. StorCenter ships with the ipkg utility which is similar to the apt-get and yum commands. To enable proper repositories I searched and found them here

http://forum.synology.com/enu/viewtopic.php?f=40&t=5823

An easy way to add them is to cut and paste the following

cat <<EOF > /etc/ipkg.conf
src cross http://ipkg.nslu2-linux.org/feeds/optware/cs08q1armel/cross/unstable
src native http://ipkg.nslu2-linux.org/feeds/optware/cs08q1armel/native/unstable
EOF

Then type

ipkg update

After that you can check the list of available packages by typing

ipkg list | less

To install packages type

ipkg install <package_name>

Please note that packages are installed in /opt so adjust paths accordingly e.g. screen is installed in

/opt/bin/screen

Hope this helps someone

Tracking web clients in real time

April 20th, 2010

Most recently I have been working on being able to more quickly identify abusers of our service ie. spammers, crawlers etc. We already have a process that rotates web logs on all web servers hourly then processes them, extracting per IP access info. On occasion abusers get quite aggressive and cause some of our alarms to go off by causing an excessive number of log errors etc. Trouble is that due to logs being processed on the hour there is a window of time where we may spend extra time trying to track down the cause of log errors. I figured it would help if the IP tracker was real-time. Luckily we have already been using a package called Ganglia Logtailer

http://bitbucket.org/maplebed/ganglia-logtailer/

which processes our web logs every minute and publishes metrics such as number of HTTP 200/300/400/500 hits, average and 90th percentile response time. All I had to do was send the IP data to a storage engine of my choice. Initially I thought I could use MySQL, however I decided against it for the following reasons

  1. Currently we can get up to 2500 hits/sec so processing them on the minute would result in roughly 150k inserts (2500 x 60) which MySQL may have some trouble processing in a short amount of time.
  2. I don't need this data after a couple of hours.

I looked at Redis which has some interesting features around sets however I decided to use memcached since we were already using it and if I ever wanted to use a more persistent storage engine I could replace it with memcachedb or Tokyo Cabinet with no changes to the code.

Implementation

Implementation consists of two pieces

1. Modified Ganglia Logtailer class that inserts data into memcached. You can find a VarnishMemcacheLogtailer class on the Bitbucket logtailer site which implements this. All you have to do is modify the location of the memcached server (set to localhost). The current implementation aggregates data per hour ie. all the numbers are hourly numbers. It would be trivial to do it for 10 minute or 1 minute periods.

2. Client application that displays data from memcached. I wrote a PHP interface that shows top 20 IPs from the web servers that can be downloaded from here

http://bitbucket.org/vvuksan/realtime-iptracker

Tracker looks something like this

Update: I do realize Splunk would be great for this kind of a purpose. Trouble is that for the amount of logs we create we'd have to get a really large Splunk license and those are quite expensive.
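To make the hourly per-IP aggregation a bit more concrete, here is a minimal Python sketch. It is not the VarnishMemcacheLogtailer code; the key layout and function names are made up, and FakeCache is only an in-memory stand-in that mimics the add/incr/get calls a memcached client typically offers so that the example runs without a server.

```python
import time
from collections import Counter

class FakeCache:
    """In-memory stand-in for a memcached client (add/incr/get only)."""
    def __init__(self):
        self.data = {}

    def add(self, key, value):
        # Like memcached add: store only if the key does not exist yet
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta=1):
        # Like memcached incr: fails (None) if the key does not exist
        if key not in self.data:
            return None
        self.data[key] += delta
        return self.data[key]

    def get(self, key):
        return self.data.get(key)

def hour_bucket(ts):
    """Hourly bucket name, e.g. 2010042013 for 1pm UTC on April 20th 2010."""
    return time.strftime("%Y%m%d%H", time.gmtime(ts))

def record_hits(cache, ips, ts):
    """Add one batch of hits (one entry per request) to the hourly counters."""
    bucket = hour_bucket(ts)
    for ip, hits in Counter(ips).items():
        key = "ip-%s-%s" % (bucket, ip)
        if cache.incr(key, hits) is None:    # counter does not exist yet
            if not cache.add(key, hits):     # another writer created it first
                cache.incr(key, hits)
```

Counting hits per IP in memory first means each one minute batch costs one cache operation per distinct IP rather than one per request.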

Devops homebrew

April 9th, 2010

There has been quite a bit of discussion about Devops and what it means. @blueben has suggested we start a Devops patterns cookbook so people can learn what worked or didn't work. This is the description of the environment we implemented at a previous job. Some of these things may or may not work for you. I will try to keep it short.

Environment background

7 distinct applications/products that had to be deployed and tested ie. base/core application, messaging platform, reporting app etc. All applications were Java based running on either Tomcat or JBoss.

Application design for deployment

These are some of the key points

  1. Application should have sane default configuration options. Any option should be overrideable by an external file. In most cases you only need to override database credentials (host, username, password). Goal is to be able to use the same binary across multiple environments.
  2. Application should expose key internal metrics. We for instance asked for a simple key/value pairs web page ie. JMSenqueue=OK etc. This is important because there are lots of things that can break inside the application which external monitoring may miss like JMS message can't be enqueued, etc.
  3. Keep release notes actions to a minimum. Release notes are often not followed or partially followed thus make sure point 1. is followed and/or try to automate everything else.
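Point 1 can be sketched as follows. Our applications were Java, so this Python version is only an illustration of the idea; the option names and the key=value override format are assumptions.

```python
# Defaults shipped inside the binary (hypothetical option names)
DEFAULTS = {
    "db.host": "localhost",
    "db.user": "app",
    "db.password": "",
    "cache.size": "1000",
}

def parse_overrides(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    overrides = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: %r" % line)
        overrides[key.strip()] = value.strip()
    return overrides

def effective_config(defaults, override_text):
    """Defaults first, then whatever the external file overrides."""
    config = dict(defaults)
    config.update(parse_overrides(override_text))
    return config

# Contents of a hypothetical per-environment override file
qa_overrides = """
# QA database
db.host = qadb01
db.password = s3cret
"""
config = effective_config(DEFAULTS, qa_overrides)
```

The same binary can then be dropped into QA, staging or production with only the small override file differing.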

Continuous Integration

We used CruiseControl for Continuous Integration. It was used solely to make sure that someone didn't break the build.

Creating releases

Developers are in charge of building and packaging releases. This is primarily because QA or Ops will not know what to do if a build fails (this is Java remember). Each release has to be clearly labeled with the version and tagged in the repository. For example Location 1.1.5 will be packaged as location-1.1.5.tar.gz. Archives should contain only WAR (Tomcat) or EAR (JBoss) files and DB patch files. Releases are to be deposited into an appropriate file share ie. /share/releases/location.

Deployment

In order to eliminate most manual deployment steps and support all the different applications we decided to write our own deployment tool. First we started off with a data model which roughly broke down to

  1. Applications – can use different app server containers ie. Tomcat/JBoss, may/will have configuration files that can be either key/value pairs or templates. For every application we also specified a start and stop script (hotdeploy was not an option due to bad experiences with our code).
  2. Domains/Customers – we wanted a single Dashboard that would allow us to deploy to multiple environments e.g. QA staging (current release), QA development (next scheduled release), Dev playbox, etc. Each of these domains had their own set of applications they could deploy with their own configuration options

First we wrote a command line tool that was capable of doing something like this

$ deployer --version 1.2.5 --server web10 --domain joedev --app base --action deploy

What this would do is

  1. Find and unpack the proper app server container e.g. jboss-4.2.3.tar.gz
  2. Overlay WAR/EAR files for the named version e.g. base-1.2.5.tar.gz
  3. Build configuration files and scripts
  4. Stop the server on the remote box (if it's running)
  5. Rsync the contents of the packaged release
  6. Make sure Apache AJP proxy is configured to proxy traffic and do Apache reload
  7. Start up the server
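The steps above can be sketched as a dry run in Python. This is not the original tool: the options are assumed to be spelled with double hyphens, the list of actions and the default container name are illustrative, and the plan only describes what would be done instead of doing it.

```python
import argparse

def build_parser():
    """Command line options modeled on the deployer example."""
    parser = argparse.ArgumentParser(prog="deployer")
    parser.add_argument("--version", required=True)
    parser.add_argument("--server", required=True)
    parser.add_argument("--domain", required=True)
    parser.add_argument("--app", required=True)
    parser.add_argument("--action", required=True,
                        choices=["deploy", "stop", "restart", "reconfig", "undeploy"])
    return parser

def deploy_plan(version, server, domain, app, container="jboss-4.2.3"):
    """Return the ordered deployment steps as human readable strings."""
    release = "%s-%s.tar.gz" % (app, version)
    return [
        "unpack app server container %s.tar.gz" % container,
        "overlay WAR/EAR files from %s" % release,
        "build configuration files and scripts for domain %s" % domain,
        "stop app server on %s if it is running" % server,
        "rsync packaged release to %s" % server,
        "check Apache AJP proxy configuration and reload Apache",
        "start app server on %s" % server,
    ]

args = build_parser().parse_args(
    ["--version", "1.2.5", "--server", "web10",
     "--domain", "joedev", "--app", "base", "--action", "deploy"])
plan = deploy_plan(args.version, args.server, args.domain, args.app)
```

Keeping the plan as data makes it easy to print before running it, which is handy when a batch script is about to touch a whole set of machines.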

One of the main reasons we started off with a command line tool is that we could easily write batch scripts to upgrade a whole set of machines. This was borne out of the pain of having to upgrade 200 instances via a web GUI at another job.

Once deployer was working we wrote a web GUI that interfaced with it. You could do things like View running config (what config options are actually on the appserver), Stop, Restart, Deploy (particular version), Reconfig (apply config changes) and Undeploy. We also added the ability to change or add configuration options to the application specific override files. A picture is worth a thousand words. This is a tiny snippet of how it approximately looked for one domain

This was a big win since QA or developers no longer needed to have someone from ops deploy software.

DB patching

Another big win was "automated" DB patching. Every application would have a table called Patch with a list of DB patches that were already applied. We also agreed that every app would have dbpatches directory in the app archive which would contain a list of patches named with version and order in which they should be applied e.g.

  • 2.54.01-addUserColumn.sql
  • 2.54.02-dropUidColumn.sql

During deployment the startup script would compare the contents of the Patch table with the list of files in dbpatches and apply any missing ones. If a patch script failed an e-mail would be sent to the QA or dev in charge of the particular domain.
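The comparison step can be sketched like this in Python. It assumes patch names always follow the major.minor.sequence-description.sql pattern from the example above, and it only works out which patches to apply and in what order; actually running the SQL and recording it in the Patch table is left out.

```python
import re

# Matches names such as 2.54.01-addUserColumn.sql
PATCH_NAME = re.compile(r"^(\d+)\.(\d+)\.(\d+)-.+\.sql$")

def sort_key(name):
    """Numeric sort key so that 2.10.01 sorts after 2.9.01."""
    match = PATCH_NAME.match(name)
    if not match:
        raise ValueError("unexpected patch name: %r" % name)
    return tuple(int(part) for part in match.groups())

def missing_patches(available, applied):
    """Patches present in dbpatches but not yet recorded as applied, in order."""
    done = set(applied)
    return sorted((name for name in available if name not in done), key=sort_key)

# Hypothetical state: files in the archive vs rows in the Patch table
available = ["2.54.02-dropUidColumn.sql", "2.54.01-addUserColumn.sql",
             "2.53.10-addIndex.sql"]
applied = ["2.53.10-addIndex.sql"]
todo = missing_patches(available, applied)
```

Sorting on the numeric parts rather than on the raw file name avoids surprises once a version component reaches two digits.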

A slightly modified process was used in production to try to reduce down time ie. things like adding a column could be done at any time. Automated process was largely there to make QA's job easier.

QA and testing

When a release was ready QA would deploy the release themselves. If there was a deployment problem they would attempt to troubleshoot it themselves then contact the appropriate person. Most of the time it was an app problem ie. a particular library didn't get committed etc. This was a huge win since we avoided a lot of "waterfall" problems by allowing QA to self-service.

Production

The production environment was strictly controlled. Only ops and a couple of key engineers had access to it. The reason was that we tried to keep the environment as stable as possible. Thus ad hoc changes were frowned upon. If you needed to make a change you would either have to commit a change into the configuration management system (Puppet) or use the deployment tool.

Production deployment

The day before the release QA would open up a ticket listing all the applications and versions that needed to be deployed. On the morning of the deployment (that was our low traffic time) someone from ops, someone from development and the whole QA team would engage in deploying the app and resolving any observed issues.

Monitoring

Regular metrics such as CPU utilization, load etc. were collected. In addition we kept track of internal metrics and set up adequate alerts. This is an ongoing process since over time you discover what your key metrics are and what their thresholds are ie. number of threads, number of JDBC connections etc.

Things that didn't work so well or were challenging

  1. One of the toughest parts was getting developers' attention to add "goodies" for ops. Specifically exposing application internals was often put off until eventually we would have an outage and the lack of the metric resulted in an extended outage.
  2. The deployment tool took a couple of tries to get right. Even as it was there were a couple of things I would have done differently ie. not relying on a relational database for the data model since it made it difficult to create diffs (you had to dump the whole DB). I'd likely go with JSON so that diffs could be easily reviewed and committed.
  3. Other issues I can't recall right now :-)

Wrapup

This is the shortest description I could write. There are a number of things I glossed over and omitted so that this is not too long. I may write about those on another occasion. Perhaps the key take away should be that Ops should focus on developing tools that either automate things or allow its customers (QA, dev, technical support, etc.) to self-service themselves.

Update: There is a second part to this post

Devops religion wars

April 6th, 2010

I have been trying to stay out of the devops arguments but it seems that they are slowly devolving into religious wars. It seems that each group ie. devops and non-devops is convinced that they are in possession of "eternal self-evident truths" and that everyone else is an unenlightened hater or similar. Case in point is the following post

http://brian.moonspot.net/devops-dealnews

Brian describes their devops process which seems reasonable to me. What is most important is that it works for him, his group and his site.

Unfortunately the comments devolve from there. A non-devops person raises a good point about the process, however does it with poor style and insulting language. The response is to compare the devops and non-devops approaches with giving a man a fish vs. teaching someone to fish. It goes on from there. It's all just too silly. Firstly I am not aware of a definitive devops definition. Secondly every environment is different. What may work for you may not work everywhere else. I really doubt that continuous deployment would work if your web app was used in providing emergency medical care. That said things have changed and availability expectations have increased so cooperation between development and ops is critical. Therefore let's try to stop with the silly arguments and try to learn from each other. Most of all avoid insulting language. I realize we all get frustrated at times but it really devalues your view.

Performance case for private clouds

March 12th, 2010

A couple of weeks ago I read this post on the memcached mailing list. Key quote

We currently run a cluster of aproximately 40 memcache servers with about 6.5 gb of ram each machine using m1.medium ec2 instances. I was in the process of reducing the number of servers while increasing the memory size for each from 6 to about 30gb. Now i've started noticing that some servers seem to hit certain bandwidth limitations not consistenly though since i have some servers pushing 6mb/sec and some at 4mb having packet los and tcp timeouts.

.....

I've replaced the instances hoping this will give me an instance on a better area or on a less congested switch but i still have the issue on the same server.

This surprised me at first since my understanding was that as EC2 instances get bigger there are fewer and fewer "neighbors" on the compute node you have to deal with, and in theory less chance they may impact your performance. Yet this person was having more issues with bigger instances than smaller ones. Thinking about it some more I realized that the reality may be exactly the opposite ie. the bigger the instance size, the more likely it is that the instances on that node are being used for a "big" workload e.g. a busy relational database. This could lead to inconsistent node performance. Inconsistent node performance is a bad thing since it makes troubleshooting problems much harder and also provides a poor end user experience. Providing slow/substandard performance to a fraction of your visitors may not seem like much but if you are a retailer it's lost sales.

Another thing to note is that lots of performance problems are subtle. Just the other day we had an issue when upgrading our F5 load balancers. We upgraded from version 9.4.3 to 10.1. The upgrade was mostly uneventful and everything seemed to work, however after about a day we observed a fall in traffic going to our web caching tier. It looked like this

Request to our web caching pool

We also had another graph from a different source that "corroborated" this behavior. We spent a lot of time trying to identify what the problem was since F5 wouldn't believe there was a problem when the only evidence was a couple of graphs. To cut a long story short the upgrade "fixed" a behavior where certain objects were served out of F5 memory instead of being passed onto our web caching tier. It was apparently broken in the previous release and we didn't even know it. There are plenty of other cases where things have "broken" and only by observing metrics were we able to determine that there was an actual problem. Inconsistent behavior makes that job extremely difficult if not impossible since it becomes much harder to isolate problems.

Getting back to the initial problem, one of the obvious strategies is to keep cycling machines until you get more performance but as evidenced by the poster that was less than successful. Also what happens if your performance degrades after you have filled up your 30 GB memcache? What then? You could try and launch another machine but that may spike up the load on your database server. Not a pleasant set of options.

Instead what you could do is the following

  1. Find cloud providers that don't use virtualization but deploy directly onto raw hardware (I have heard that they exist even though they are like Bigfoot, hard to find). This will eliminate most of the inconsistent node performance issues. The downside is it may be more expensive.
  2. Stick with virtualization but implement a private cloud where you have more insight into load on underlying hardware and control how images are deployed onto host machines. More on this point later.
  3. Hybrid between approach 1. and 2.

I personally think hybrid approach may be best since some workloads are best handled without going through the virtualization layer. As far as virtualization is concerned it is best to strategically place services based on resource utilization. Network services will fit into 3 broad levels of utilization ie. high resource utilization services such as relational databases, application servers, medium resource utilization services such as web servers and low resource utilization services such as DNS, monitoring, memcache etc. Trouble with public clouds is that you have no insight or say on what type of workloads are run. Instead if you had deployment control you could pair a relational DB image and memcache image on the same physical piece of hardware. That would likely work fine. If the performance of one degraded you could take appropriate action  ie. move memcache image or look for the root cause of performance degradation. Since you have access to the underlying hardware you can isolate problems which will surely help in getting down to the root of the problem. The cons of the approach are increased complexity, cost and additional management overhead.

Even if you choose to adopt the above approach you still could use public clouds for things like static content storage, image resizing, development and QA systems etc. For really critical operations I would stick with raw hardware and/or private clouds.

Building Redhat/CentOS KVM images on Ubuntu 9.10

March 11th, 2010

This is a quick recipe on how to create a Redhat/CentOS KVM image on Ubuntu 9.10 (karmic). First make sure you have Virtualization (VT) turned on. For example Dell laptops will have it disabled by default. Go into BIOS and enable it. To check whether it is turned on run

egrep '(vmx|svm)' /proc/cpuinfo

If this comes out empty VT is not enabled and KVM will not work.
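If you would rather do the same check from a script, here is a small Python sketch of it. The function name is made up; it simply looks for the vmx (Intel) or svm (AMD) flag on the flags lines of /proc/cpuinfo, which is what the egrep above is doing in a looser way.

```python
def vt_flags(cpuinfo_text):
    """Return the hardware virtualization flags (vmx and/or svm) found
    on the 'flags' lines of /proc/cpuinfo style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            found.update(flag for flag in value.split() if flag in ("vmx", "svm"))
    return found

# Shortened, made up sample of what /proc/cpuinfo contains
sample = "processor\t: 0\nmodel name\t: Example CPU\nflags\t\t: fpu vme de pse vmx sse2\n"
has_vt = bool(vt_flags(sample))
```

On a real machine you would pass in open("/proc/cpuinfo").read() instead of the sample text.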

Install kvm packages

sudo apt-get install qemu-kvm

Edit /etc/qemu-ifup to add virbr0 as the bridge to which the KVM guest should attach itself. Comment out the existing brctl line and add a new one that uses virbr0 e.g.

#/usr/sbin/brctl addif ${switch} $1
/usr/sbin/brctl addif virbr0 $1

Same change needs to be done in /etc/qemu-ifdown ie.

#/usr/sbin/brctl delif ${switch} $1
/usr/sbin/brctl delif virbr0 $1

Download CentOS 5.4 Boot ISO image e.g.

wget http://www.gtlib.gatech.edu/pub/centos/5.4/isos/x86_64/CentOS-5.4-x86_64-netinstall.iso

Create an empty image (last argument is the image size)

kvm-img create -f qcow2 centos5.img 10G

Launch install (-m is memory size)

sudo kvm -hda centos5.img -cdrom CentOS-5.4-x86_64-netinstall.iso -m 512 -boot d \
       -net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap

Install CentOS however you like. When you are done your CentOS install will reboot and try to boot off the CD-ROM. At this point shut down the KVM guest by closing the window. To run it remove the cdrom references and boot option e.g.

sudo kvm -hda centos5.img -m 512 \
       -net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap

Note: I am setting a fixed MAC address. You can leave it off and it will be generated randomly every time you start up a KVM instance.
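If you need a distinct fixed address for each guest, a few lines of Python can generate one. This is only a sketch: the function name is made up, and it keeps the 00:16:3e prefix used in the commands above (the prefix assigned to Xen guests) and randomizes the last three bytes.

```python
import random

def random_mac(prefix="00:16:3e", rng=random):
    """Return a MAC address made of the given 3-byte prefix
    followed by three random bytes, e.g. 00:16:3e:xx:xx:xx."""
    tail = [rng.randint(0x00, 0xff) for _ in range(3)]
    return prefix + ":" + ":".join("%02x" % byte for byte in tail)

mac = random_mac()
```

Generate it once per guest and keep it in the launch script so the guest sees the same network card on every boot.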