Quantcast

Archive for the ‘Systems Management’ Category

Provision to cloud in 5 minutes using fog

Tuesday, July 20th, 2010

Most recently I have been working on disaster recovery project where we are assembling documentation, processes and code to be able to fire up our whole environment in the cloud in case of a major disaster. At Velocity Conference I met Wesley Beary who is the main developer for fog, a Ruby cloud computing library. What appealed to me about fog is that it has varying support for different clouds so that we are not stuck using a provider due to our non-portable code. Now off to couple quick example to get you going.

To install fog you will need to install Ruby Gems. If you have them type

  sudo gem install fog

The install may fail if you don't have the libxslt and libxml2 dev libraries. On my Ubuntu laptop I resolved it by doing

  sudo apt-get install libxslt1-dev libxml2-dev

On Centos/RHEL 5 I had to do

   yum install libxslt-devel libxml2-devel

Create a file called config.rb which contains your credentials e.g.

#!/usr/bin/ruby

@aws_access_key_id = "XXXXXXXXXXXXXXXXXX"
@aws_secret_access_key = "AXXZZZZZZZZZZZZZZZZZZ"
@aws_region = "us-east-1"

Let's start with the basics. Let's get our currently running instances and what images are available

#!/usr/bin/ruby

require 'rubygems'
require 'fog'

# Import EC2 credentials e.g. @aws_access_key_id and @aws_access_key_id
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
    :aws_access_key_id => @aws_access_key_id,
    :aws_secret_access_key => @aws_secret_access_key )

# Get a list of all the running servers/instances
instance_list = connection.servers.all

num_instances = instance_list.length
puts "We have " + num_instances.to_s()  + " servers"

# Print out a table of instances with choice columns
instance_list.table([:id, :flavor_id, :ip_address, :private_ip_address, :image_id ])

###################################################################
# Get a list of our images
###################################################################
my_images_raw = connection.describe_images('Owner' => 'self')
my_images = my_images_raw.body["imagesSet"]

puts "\n###################################################################################"
puts "Following images are available for deployment"
puts "\nImage ID\tArch\t\tImage Location"

#  List image ID, architecture and location
for key in 0...my_images.length
  print my_images[key]["imageId"], "\t" , my_images[key]["architecture"] , "\t\t" , my_images[key]["imageLocation"],  "\n";
end

Let's spin up a m1.large instance

#!/usr/bin/ruby
require 'rubygems'
require 'fog'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_access_key_id
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
 :aws_access_key_id => @aws_access_key_id,
 :aws_secret_access_key => @aws_secret_access_key )

server = connection.servers.create(:image_id => 'ami-1234567',
 :flavor_id =>  'm1.large')

# wait for it to be ready to do stuff
server.wait_for { print "."; ready? }

puts "Public IP Address: #{server.ip_address}"
puts "Private IP Address: #{server.private_ip_address}"

This may take a while so please be patient.  You could obviously spin up a number of these instances without waiting for any of them to be available then use connection.servers.all to get a list of running instances.

Now let's destroy a running instance

#!/usr/bin/ruby
require 'rubygems'
require 'fog'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_access_key_id
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
    :aws_access_key_id => @aws_access_key_id,
    :aws_secret_access_key => @aws_secret_access_key )

instance_id = "1-123456"

server = connection.servers.get(instance_id)

puts "Flavor: #{server.flavor_id}"
puts "Public IP Address: #{server.ip_address}"
puts "Private IP Address: #{server.private_ip_address}"

server.destroy

There is tons more out there although this gets me going :-) . Now off to playing with R.I. Pienaar's ec2-boot-init.

Thanks to Wesley Beary for answering questions about fog and Ian Meyer for pointing out Chef Fog code.

#!/usr/bin/ruby

require 'rubygems'
require 'fog'
require 'pp'

# Import EC2 credentials e.g. @aws_access_key_id and @aws_access_key_id
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
:aws_access_key_id => @aws_access_key_id,
:aws_secret_access_key => @aws_secret_access_key )

# Get a list of all the running servers/instances
instance_list = connection.servers.all

num_instances = instance_list.length
puts "We have " + num_instances.to_s()  + " servers"

# Print out a table of instances with choice columns
instance_list.table([:id, :flavor_id, :ip_address, :private_ip_address, :image_id ])

###################################################################
# Get a list of our images
###################################################################
my_images_raw = connection.describe_images('Owner' => 'self')

my_images = my_images_raw.body["imagesSet"]

puts "\n###################################################################################"
puts "Following images are available for deployment"
puts "\nImage ID\tArch\t\tImage Location"

for key in 0...my_images.length
print my_images[key]["imageId"], "\t" , my_images[key]["architecture"] , "\t\t" , my_images[key]["imageLocation"],  "\n";
end

###################################################################
# Get a list of all instance flavors
###################################################################
flavors = connection.flavors()

print "\n\n============\nFlavors\n============\n"
#flavors.table([:bits, :cores, :disk, :ram, :name])
flavors.table

CouchDB views creation problems

Wednesday, July 14th, 2010

I have had a frustrating time creating views in CouchDB using curl. Executing following command I would get

$ curl -s -X PUT -H "text/plain;charset=utf-8" -d cronview.json http://localhost:5984/cronologger/_design/cronview
{"error":"bad_request","reason":"invalid UTF-8 JSON"}

I checked and rechecked JSON, used the same JSON using CouchDB's Futon to no avail. Finally I found the answer here

http://stackoverflow.com/questions/2461798/error-about-invalid-json-with-couchdb-view-but-the-jsons-fine

The -d option of curl expects the actual data as the argument!

If you want to provide the data in a file, you need to prefix it with @:

curl -X PUT -d @keys.json  $CDB/_design/id

Non-Dell SSDs/drives not supported until Q2 2011

Wednesday, June 16th, 2010

I am writing up this post so perhaps I can save some poor sysadmin from chasing their own tales. If you ever receive following error message using PERC H700 or H800 controllers

Jun 15 14:00:17 db07 Server Administrator:  Storage Service EventID: 2335  Controller event log: PD 04(e0x20/s4) is  not supported:  Controller 0 (PERC H700 Integrated)
Jun 15 14:00:18  db07 Server Administrator: Storage Service EventID: 2334  Controller  event log: Inserted: PD 05(e0x20/s5):  Controller 0 (PERC H700  Integrated)
Jun 15 14:00:18 db07 Server Administrator: Storage  Service EventID: 2335  Controller event log: PD 05(e0x20/s5) is not  supported:  Controller 0 (PERC H700 Integrated)

It is due to following

http://www.standalone-sysadmin.com/blog/2010/04/dell-reverses-position-on-3rd-party-drives/

Please note this will not be fixed until Q2 2011.

Beauty of aggregate line graphs

Saturday, June 5th, 2010

If you saw a graph like this

90th percentile response time consolidated line graph

Would it mean anything to you :-) ? First time I was introduced to it I thought they were pointless since you couldn't really see much. That was until I saw something like this

Netstat consolidated line graph

This was was post release. Can you spot something wrong :-) ? Obviously color scheme is somewhat off in the last graph which we later reworked (visible in the top graph). We then have another set of graphs where you can drill down per host aggregations as we are running multiple Resin instances on the same machine so you could find the misbehaving instance.

You can make these graphs pretty easily by using Ganglia's custom report graphs. I will try and post some of the ones we use in next couple days.

For those wondering what is 90th percentile response time you can read my Monitoring your website performance via 90th percentile response time.

Devops homebrew part deux

Thursday, May 27th, 2010

This is the second part to the devops homebrew post.

I forgot couple things in my first post so here are couple other observations

Change is an ongoing process

All the changes I talked about in the first post took a long time. It took more than a year to get issues assessed, discussed, designed, implemented and tested so don't expect quick progress. It's like an open heart surgery where you don't have time stop everything and start from scratch.

No hardcoded paths

Perhaps this one should be obvious however it is really important to make the app relocatable ie. app should assume all the files it needs are within it's container. This means that every file reference should be relative to the base container directory e.g. all the WARs and configuration files should be placed in /run/base and startup script would pass that as a variable ie. -DBASEDIR=/run/base. Application should then use BASEDIR instead of /run/base.

Tools, tools, tools

One of the critical operations responsibilities is providing and building tools for use by other groups such as technical support, development, QA etc. This goes beyond using tools such as configuration management and deployment but also building tools that enable other groups to do their jobs more effectively. For instance at one job we used to interface to hundreds of external LDAP/IMAP sources for authentication/authorization purposes. This was fraught with problems since often these services would e.g. misconfigure firewalls (not whitelist the right IP), have expired or self-signed SSL certificates, use wrong LDAP base DNs etc. This would chew up a lot of professional services, dev and ops time since looking at the application logs often gave incomplete answers. Also it could take couple iterations to fix the problem chewing up even more time. We ended up building a simple web page that enabled professional services to quickly validate the service ie. does DNS resolve, can I open up a TCP connection to the target port, is SSL certificate expired etc. This greatly reduced work load and time to resolution. In another job technical support would often need production settings however due to compliance reasons couldn't have unfettered access to the systems. For them we built a web app that allowed read-only view to the needed settings. I'm sure you can think of other cases where little automation can yield you huge efficiencies.

Use underpowered QA environments

This may be controversial since lots of people are of the opinion that you should try to have as close to the exact replica of production in QA. This is true if you are doing performance tests however if you have an underpowered environment some issues are likely to crop up that otherwise wouldn't. It is very hard to simulate production load so having underpowered environments gives you valuable data points. For example our primary QA environment ran on couple virtualized servers with modest disk space allocation ie. 10 GB. On more than one occasion we caught serious code deficiencies when the growing query log (turned on in QA) triggered low disk space alerts. If we had bigger disks we may have missed these. This doesn't preclude having a separate environment just for running performance test just use the underpowered environment for everything else.

Dev vs ops

There is often conflict between dev and ops due to stereotypes, poor communication but very often misaligned business goals. For instance I have very often seen/experienced conflict with devs when they were under intense pressure to deliver a feature on a tight deadline. This often happens in startups that cater to large businesses, universities or government organizations where a large sales deal is contingent on a particular feature being implemented. It leads to poor implementation, QA, production issues etc. which coupled with poor division of labor causes frustration and resentment. Being woken up numerous times in the middle of night due to a production issue quickly wears people out. Therefore it is important to strike a balance between ops and dev goals and overall business goals.

One of the possible approaches is to get together and discuss following issues

  • Ops, dev and QA should jointly assess new product functionality and how it affects each of these groups. Very often product management and sales and marketing will discuss new features only with dev who may not appreciate the difficulty of certain ops decisions.
  • Division of responsibility - discuss whose responsibility is to fix things when they break. There is a spectrum here where ops can do first level troubleshooting then hand it off to developers to developers running and deploying in production and ops providing a supportive role running services and tools that enable the application
  • Off hours coverage - this is probably the most contentious one since no one likes being woken up at night however developers should be on hook for "pager duty". It doesn't have to be regularly but at least once in a while. That is really only way for them to walk in ops shoes. For some organizations this may be a non-issue since their stuff never breaks in off hours ;-) .
  • Ops should involve devs in running the production by educating them about monitoring and performance gathering systems so that they can see effect of their coding first hand. For instance you can implement "monitoring duty" where each week someone different from either dev or ops team is tasked to review performance metrics looking for things that are out of whack.
  • Discuss how you can make each other life's easier. There are always areas where you can complement each others skills and create something that helps everyone.
  • Most important don't forget that a dose of humility goes a long way :-) .

Tracking web clients in real time

Tuesday, April 20th, 2010

Most recently I have been working on being able to more quickly identify abusers of our service ie. spammers, crawlers etc. We already have a process that rotates web logs on all web servers hourly then processes them extracting per IP access info. On occasion abusers get quite aggressive and cause some of our alarms to go off by causing excessive number of log errors etc. Trouble is that due to logs being processed on the hour there is a window of time where we may spend extra time trying to track down the cause of log errors. I figured it would help if the IP tracker was real-time. Luckily we have already been using a package called Ganglia Logtailer

http://bitbucket.org/maplebed/ganglia-logtailer/

which processes our web logs every minute and publishes metrics such as number of HTTP 200/300/400/500 hits, average and 90th percentile response time. All I had to do was send the IP data to a storage engine of my choice. Initially I thought I could use mySQL however decided against it due to following reasons

  1. Currently we can get up to 2500 hits/sec so processing them on the minute would result in roughly 150k inserts which mySQL may have some trouble processing in short amount of time.
  2. I don't need this data after couple hours.

I looked at Redis which has some interesting features around sets however I decided to use memcached since we were already using it and if I ever wanted to use a more persistent storage engine I could replace it with memcachedb or Tokyo Cabinet with no changes to the code.

Implementation

Implementation consists of two pieces

1. Modified Ganglia Logtailer class that inserts data into memcached. You can find a VarnishMemcacheLogtailer class on the Bit Bucker logtailer site which implements this. All you have to do is modify the location of the memcached server (set to localhost). Current implementation aggregates data per hour ie. all the numbers are hourly numbers. It would be trivial to do it for 10 minute or 1 minute periods.

2. Client application that displays data from memcached. I wrote a PHP interface that shows top 20 IPs from the web servers that can be downloaded from here

http://bitbucket.org/vvuksan/realtime-iptracker

Tracker looks something like this

Update: I do realize Splunk would be great for this kind of a purpose. Trouble is that for the amount of logs we create we'd have to get a really large Splunk license and those are quite expensive.

Building Redhat/CentOS KVM images on Ubuntu 9.10

Thursday, March 11th, 2010

This is a quick recipe on how to create a Redhat/CentOS KVM image on Ubuntu 9.10 (karmic). First make sure you have Virtualization (VT) turned on. For example Dell laptops will have it disabled by default. Go into BIOS and enable it. To check whether it is turned on run

egrep '(vmx|svm)' /proc/cpuinfo

If this comes out empty VT is not enabled and KVM will not work.

Install kvm packages

sudo apt-get install qemu-kvm

Edit /etc/qemu-ifup to add virbr0 as the bridge to which KVM guest should attach itself. Comment out line below and add lines below e.g.

#/usr/sbin/brctl addif ${switch} $1
/usr/sbin/brctl addif virbr0 $1

Same change needs to be done in /etc/qemu-ifdown ie.

#/usr/sbin/brctl delif ${switch} $1
/usr/sbin/brctl delif virbr0 $1

Download CentOS 5.4 Boot ISO image e.g.

wget http://www.gtlib.gatech.edu/pub/centos/5.4/isos/x86_64/CentOS-5.4-x86_64-netinstall.iso

Create an empty image (last argument is the image size)

kvm-img create -f qcow2 centos5.img 10G

Launch install (-m is memory size)

sudo kvm -hda centos5.img -cdrom boot.iso -m 512 -boot d \
       -net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap

Install CentOS however you like. When you are done your CentOS install will reboot and try to boot off the CD-ROM. At this point shut down the KVM guest by closing the window. To run it remove the cdrom references and boot option e.g.

sudo kvm -hda centos5.img -m 512 \
       -net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap

Note: I am setting a fixed MAC address. You can leave it off and it will be generated randomly every time you start up kvm instance.

Monitoring your website performance via 90th percentile response time

Friday, January 15th, 2010

There are numerous ways to monitor the health and performance of your web site. Some of the popular ways are

  • measure response time of a particular URL on your site. If it exceeds a threshold (which is site dependent) it is time to investigate
  • compare pertinent metrics such as the number of created sessions, http connections, etc.
  • watch CPU utilization/load of the machine

Unfortunately most of these are flawed since they don't provide you with the most important metric and that is how fast is the site for you customers. Above metrics are not useless and do help paint the picture but they may provide you a false sense of how fast your site is since the URL you are checking may be behaving quite fast however some other part of the site due to a newly introduced feature may be behaving terribly. I have found one of the best metrics to watch is the 90th percentile request response time. Basically, you take every request passing through your web servers, log the time it takes to serve them, sort them from fastest to slowest then take the 90th percentile time. Therefore if your 90th percentile is 1 second it means that 90% of the requests have been served in under a second and 10% in more than a second. You may be asking yourself "so what?". Here is why ?

So for at least couple minutes 10% of your visitors/requests were waiting for more than 17 seconds to have their requests served. That can't be good for business and you may want to investigate the cause.

You could also consolidate response times from different web servers on one graph and you get this.

It may not look like much but it is pretty clear if an individual web server starts acting up.

How do you get on the fun ? You can look at the steps how to add Apache real-time metrics which also covers the 90th percentile response time on this URL

http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats

I want to thank Ben Hartshorne (@maplebed) for making me aware of this metric.

Quick way to determine SSL certificate expiration

Tuesday, December 1st, 2009

If you need a quick way to determine when a certain SSL certificate expires you can utilize following approaches. In both examples server I am trying to check is called webserver.domain.com.

If you have Nagios plugins installed you could type

# /usr/lib/nagios/plugins/check_http -p 443 -S -C 15 webserver.domain.com
CRITICAL - Certificate expired on 11/01/2009 11:23.

That's easy. However what if you don't have Nagios plugins. In that case you can do the same with OpenSSL and s_client. Look for notAfter field.

# echo | openssl s_client -connect webserver.domain.com:443 | openssl x509 -noout -dates
...
notBefore=Nov  1 11:23:30 2008 GMT
notAfter=Nov  1 11:23:30 2009 GMT

Easy :-) .

Don’t let mySQL substitute engines

Thursday, November 5th, 2009

Word of warning to all who use mySQL (yes you poor souls). By default mySQL 5.0 and 5.1 will substitute storage engines if the one you requested is not available. It doesn't happen too often but when it does happen it is quite bad. For instance when setting up a new mySQL database something went wrong during creation of InnoDB logs and thus mySQL decided to DISABLE InnoDB storage. Unfortunately this was not caught and DBs were built that really needed InnoDB storage engine since they required foreign keys and other fun stuff. In their "awesomeness" mySQL developers decided that the default behavior should be to simply substitute (replace) InnoDB with myISAM. There is a warning however no error message is displayed and an import will continue unabated. Thus in my case things worked for a while until oddities were discovered which were traced back to the engine substitution. Unfortunately at that point it is fairly difficult to fix the problems since some of the constraints may be broken.

To avoid such a situation make sure you add following statement to my.cnf

sql_mode="NO_ENGINE_SUBSTITUTION"

To verify what engines are active on mySQL shell prompt type

SHOW ENGINES