Bootstrapping your cloud environment with puppet and mcollective

July 28th, 2010

This is a "recipe" for bootstrapping your whole environment after a disaster, i.e. your data center goes dark, or when you are migrating from one environment to another. This guide differs from others in that it uses mcollective and DNS to give you greater flexibility in deploying and bootstrapping environments. Alternative approaches include ec2-boot-init by R.I. Pienaar and Grig Gheorghiu's Bootstrapping EC2 images as Puppet clients.

Intro

You will need two disk images, your code repository and your DB backup. With those you can rebuild your whole environment from scratch in a relatively short period of time. This could be adapted to generic cloud provisioning, however the use case I'm trying to address is disaster recovery. We are using DNS so that we can keep hostnames consistent between environments, i.e. mail01 will be a mail server in all environments instead of domU-1-2-3-4 in one, rack-2345 in another etc.

Set up a master node image

The master node is the node that controls all the other nodes. Most importantly it contains all your configuration management data. You will need to install the following

  • mcollective with ActiveMQ
  • DnsMasq
  • Puppet from Puppet Labs

1. You will need to get a DNS name from a dynamic DNS provider such as DynDNS. Once you have that you will need to write a shell script that runs at boot and points that DNS name at your EC2 private IP. Let's say we want our controller station to be known as controller.ec2.domain.com. We can do something like this

IP=`facter ipaddress`
change_my_dns_ip controller.ec2.domain.com
# Delete any entries from hosts
sed -i "/controller.ec2.domain.com/d" /etc/hosts
echo "${IP}     controller.ec2.domain.com" >> /etc/hosts

2. Set up ActiveMQ to be used with mcollective http://code.google.com/p/mcollective/wiki/GettingStarted
3. Set up mcollective

Configure controller.ec2.domain.com as the stomp host in your mcollective configuration for both client and server configuration.

4. Install dnsmasq. You don't need to configure anything since by default dnsmasq reads /etc/hosts and serves those names over DNS

5. Install puppetmaster and configure it any way you want

6. Image it

Set up a generic/worker node image

You will need to install the following

  • mcollective
  • puppet agent

1. On the worker node, configure the server piece of mcollective and make sure stomp.host points to the master, i.e. controller.ec2.domain.com.

2. Create a reboot agent (we'll discuss how to use it later). Please visit http://code.google.com/p/mcollective/wiki/SimpleRPCIntroduction for an example. Create a new file, e.g. reboot.rb, and paste this code in it

module MCollective
  module Agent
    class Reboot<RPC::Agent
      # Reboots the node that receives the request
      def reboot_action
        `/sbin/shutdown -r now`
      end
    end
  end
end

Copy the resulting file to the mcollective agents directory

3. Add the following script to the bootup sequence

# Look up the controller's IP address
MASTER=`host controller.ec2.domain.com | grep address | cut -f4 -d" "`
# Only rewrite resolv.conf if it has not been done already
IS_ALREADY_SET=`grep -c ec2.domain.com /etc/resolv.conf`
if [ "$IS_ALREADY_SET" -lt 1 ]; then
  sed -i "s/^search .*/search ec2.domain.com/g" /etc/resolv.conf
  sed -i "s/^nameserver/nameserver ${MASTER}\nnameserver/g" /etc/resolv.conf
fi
# Set the hostname from the reverse DNS lookup of our IP
IP=`facter ipaddress`
MY_HOST=`/bin/ipcalc --silent --hostname ${IP} | cut -f2 -d=`
hostname ${MY_HOST}

This tells your worker nodes to use the controller's DNS for resolving names and sets the hostname.
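The resolv.conf part of that script can also be written as a small Ruby function, which is easier to test than sed one-liners. This is only a sketch: the function name and the sample addresses are made up for illustration, and unlike the sed command it inserts the master once, before the first nameserver line, rather than before every nameserver line.

```ruby
# Sketch: rewrite resolv.conf content so the controller is consulted first.
# point_resolver_at_master is a hypothetical helper, not part of any tool above.
def point_resolver_at_master(resolv_conf, master_ip, search_domain)
  # Leave the content alone if it already references our domain.
  return resolv_conf if resolv_conf.include?(search_domain)

  master_added = false
  lines = resolv_conf.lines.map(&:chomp).flat_map do |line|
    if line.start_with?("search ")
      ["search #{search_domain}"]
    elsif line.start_with?("nameserver") && !master_added
      master_added = true
      ["nameserver #{master_ip}", line]
    else
      [line]
    end
  end
  lines.join("\n") + "\n"
end

# Sample content and addresses are placeholders.
original = "search compute-1.internal\nnameserver 172.16.0.23\n"
puts point_resolver_at_master(original, "10.1.2.3", "ec2.domain.com")
```

Running it a second time on its own output returns the content unchanged, which mirrors the IS_ALREADY_SET check in the shell version.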

4. Get the mcollective puppet plugin from github

5. Image it

Bringing up the environment

You will need to start the master instance first since that's the instance that everyone will be talking to. As soon as it's up you can start up as many instances as you'd like.

While you wait, rsync your puppet manifests and configurations to the master node.

To find out which nodes are up and available, issue mc-ping from the master. You should get a response similar to this

# mc-ping
controller.ec2.domain.com               time=77.21 ms
domu-12-31-55-11-22-18.compute-1.internal time=188.76 ms

The trouble is that hostnames on the worker nodes are set to Amazon names. We want to make them recognizable, e.g. mail01.

To do so, simply add the IP of the worker instance and its name to /etc/hosts on the master, e.g.

echo "10.1.2.3      mail01.ec2.domain.com" >> /etc/hosts

Reload the dnsmasq configuration, i.e.

/etc/init.d/dnsmasq reload
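Appending with echo works the first time, but if you redo it for the same name you end up with stale entries. A minimal sketch of an "upsert" for hosts file content follows. The upsert_host name is hypothetical, and it works on a string rather than touching /etc/hosts directly, so you would still have to write the result back and reload dnsmasq yourself.

```ruby
# Sketch: return hosts file content with exactly one entry for `name`.
# upsert_host is a hypothetical helper; it does not read or write /etc/hosts.
def upsert_host(hosts_content, ip, name)
  kept = hosts_content.lines.map(&:chomp).reject do |line|
    # Ignore trailing comments, then check whether the name is one of the fields.
    line.sub(/#.*/, "").split.include?(name)
  end
  (kept + ["#{ip}\t#{name}"]).join("\n") + "\n"
end

# Sample addresses are placeholders.
hosts = "127.0.0.1\tlocalhost\n10.9.9.9\tmail01.ec2.domain.com\n"
puts upsert_host(hosts, "10.1.2.3", "mail01.ec2.domain.com")
```

The old 10.9.9.9 line is dropped and the new address is appended, so a reused name never resolves to a machine that no longer exists.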

Adding the entry gives you reverse DNS resolution of the node. For it to take effect you will need to reboot the worker node. We already have the reboot agent on the worker nodes, so all we have to do is run the following command on the master node

./mc-rpc -F hostname=domu-12-31-55-11-22-18 reboot reboot

This will seek out the domU-1-2-3-4 host and reboot it (--arg is irrelevant so put anything). Once the machine is up it will advertise its new name :-) i.e. running mc-ping will show you this

# mc-ping
controller.ec2.domain.com           time=47.59 ms
mail01.ec2.domain.com               time=80.71 ms

Now let's activate puppet. From the master node run

# mc-puppetd -F hostname=mail01 runonce

 * [ ============================================================> ] 1 / 1

Finished processing 1 / 1 hosts in 1051.23 ms

Once that is done puppetca should give you this

# puppetca --list
mail01.ec2.domain.com

Sign it

# puppetca --sign mail01.ec2.domain.com

Now you can simply run

# mc-puppetd -F hostname=mail01 enable

and off you go. Now lather, rinse, repeat to get the rest of the instances going. You would certainly want to automate this further but I leave that exercise to you :-) .
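As a starting point for that automation, here is a small sketch that lists, in order, the commands from this walkthrough for a single worker. It only builds and prints the command strings and runs nothing. The bootstrap_steps name and its arguments are made up for illustration.

```ruby
# Sketch: the per-node commands from the walkthrough, in order.
# bootstrap_steps is a hypothetical helper; nothing is executed here.
def bootstrap_steps(ip, amazon_hostname, short_name, domain)
  fqdn = "#{short_name}.#{domain}"
  [
    %(echo "#{ip}      #{fqdn}" >> /etc/hosts),
    "/etc/init.d/dnsmasq reload",
    "./mc-rpc -F hostname=#{amazon_hostname} reboot reboot",
    "mc-puppetd -F hostname=#{short_name} runonce",
    "puppetca --sign #{fqdn}",
    "mc-puppetd -F hostname=#{short_name} enable"
  ]
end

puts bootstrap_steps("10.1.2.3", "domu-12-31-55-11-22-18", "mail01", "ec2.domain.com")
```

A real script would have to wait between the reboot and the puppet steps until the node shows up in mc-ping under its new name, and that part is deliberately left out here.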

If you are looking for an easy cross-cloud API check out my "Provision to cloud in 5 minutes using fog".

Next Boston DevOps meetup

July 21st, 2010

For the next Boston DevOps meetup we'll try something new: Jeff Buchbinder of FreeMed Software fame and I will talk about "Deploying your way into happiness". If you want a flavor of the kinds of things we'll talk about, check out my Devops homebrew post. We will go into much more detail with actual code snippets and some of the omitted nitty gritty details. We will also open the floor for questions.

The meetup is on August 3rd, 2010 from 6-8 pm and we'll be meeting at Microsoft's New England R&D center. I expect we'll start presenting around 6:45 or so.

Please register at

http://www.eventbrite.com/event/770217742

since we need to provide building security at NERD with the list of people attending.

Provision to cloud in 5 minutes using fog

July 20th, 2010

Most recently I have been working on a disaster recovery project where we are assembling documentation, processes and code to be able to fire up our whole environment in the cloud in case of a major disaster. At Velocity Conference I met Wesley Beary, who is the main developer of fog, a Ruby cloud computing library. What appealed to me about fog is that it has varying support for different clouds, so we are not stuck with a provider because of non-portable code. Now off to a couple of quick examples to get you going.

To install fog you will need RubyGems. If you have it, type

  sudo gem install fog

The install may fail if you don't have the libxslt and libxml2 dev libraries. On my Ubuntu laptop I resolved it by doing

  sudo apt-get install libxslt1-dev libxml2-dev

On CentOS/RHEL 5 I had to do

   yum install libxslt-devel libxml2-devel

Create a file called config.rb which contains your credentials e.g.

#!/usr/bin/ruby

@aws_access_key_id = "XXXXXXXXXXXXXXXXXX"
@aws_secret_access_key = "AXXZZZZZZZZZZZZZZZZZZ"
@aws_region = "us-east-1"

Let's start with the basics. Let's get our currently running instances and what images are available

#!/usr/bin/ruby

require 'rubygems'
require 'fog'

# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
    :aws_access_key_id => @aws_access_key_id,
    :aws_secret_access_key => @aws_secret_access_key )

# Get a list of all the running servers/instances
instance_list = connection.servers.all

num_instances = instance_list.length
puts "We have " + num_instances.to_s()  + " servers"

# Print out a table of instances with choice columns
instance_list.table([:id, :flavor_id, :ip_address, :private_ip_address, :image_id ])

###################################################################
# Get a list of our images
###################################################################
my_images_raw = connection.describe_images('Owner' => 'self')
my_images = my_images_raw.body["imagesSet"]

puts "\n###################################################################################"
puts "Following images are available for deployment"
puts "\nImage ID\tArch\t\tImage Location"

#  List image ID, architecture and location
for key in 0...my_images.length
  print my_images[key]["imageId"], "\t" , my_images[key]["architecture"] , "\t\t" , my_images[key]["imageLocation"],  "\n";
end

Let's spin up an m1.large instance

#!/usr/bin/ruby
require 'rubygems'
require 'fog'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
 :aws_access_key_id => @aws_access_key_id,
 :aws_secret_access_key => @aws_secret_access_key )

server = connection.servers.create(:image_id => 'ami-1234567',
 :flavor_id =>  'm1.large')

# wait for it to be ready to do stuff
server.wait_for { print "."; ready? }

puts "Public IP Address: #{server.ip_address}"
puts "Private IP Address: #{server.private_ip_address}"

This may take a while so please be patient.  You could obviously spin up a number of these instances without waiting for any of them to be available then use connection.servers.all to get a list of running instances.

Now let's destroy a running instance

#!/usr/bin/ruby
require 'rubygems'
require 'fog'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
    :aws_access_key_id => @aws_access_key_id,
    :aws_secret_access_key => @aws_secret_access_key )

instance_id = "i-123456"

server = connection.servers.get(instance_id)

puts "Flavor: #{server.flavor_id}"
puts "Public IP Address: #{server.ip_address}"
puts "Private IP Address: #{server.private_ip_address}"

server.destroy

There is tons more out there although this gets me going :-) . Now off to playing with R.I. Pienaar's ec2-boot-init.

Thanks to Wesley Beary for answering questions about fog and Ian Meyer for pointing out Chef Fog code.

#!/usr/bin/ruby

require 'rubygems'
require 'fog'
require 'pp'

# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'

# Set up a connection
connection = Fog::AWS::EC2.new(
    :aws_access_key_id => @aws_access_key_id,
    :aws_secret_access_key => @aws_secret_access_key )

# Get a list of all the running servers/instances
instance_list = connection.servers.all

num_instances = instance_list.length
puts "We have " + num_instances.to_s()  + " servers"

# Print out a table of instances with choice columns
instance_list.table([:id, :flavor_id, :ip_address, :private_ip_address, :image_id ])

###################################################################
# Get a list of our images
###################################################################
my_images_raw = connection.describe_images('Owner' => 'self')

my_images = my_images_raw.body["imagesSet"]

puts "\n###################################################################################"
puts "Following images are available for deployment"
puts "\nImage ID\tArch\t\tImage Location"

for key in 0...my_images.length
  print my_images[key]["imageId"], "\t" , my_images[key]["architecture"] , "\t\t" , my_images[key]["imageLocation"],  "\n";
end

###################################################################
# Get a list of all instance flavors
###################################################################
flavors = connection.flavors()

print "\n\n============\nFlavors\n============\n"
#flavors.table([:bits, :cores, :disk, :ram, :name])
flavors.table

Analyzing your backend web page response times

July 15th, 2010

I have blogged in the past about some of the ways you can monitor your web site performance, e.g. how to monitor your site using 90th percentile response times, the beauty of aggregate line graphs and tracking web clients in real time.

Most recently we wanted to get better insight into how our site, and more specifically the backend, is performing. We wanted a tool that could provide us with per URL/page metrics such as

  • total number of requests
  • aggregate compute time
  • average request time
  • 90th percentile time (for more on what it means see monitor your site using 90th percentile response times) - this eliminates most of the really slow response times that may heavily skew your averages

The initial plan was to build a basic set of reports to tell us which pages have excessive response times or large total (aggregate) compute times. The next, yet to be implemented, portion was to analyze data in real time so that we'd have another data point for troubleshooting in case of a site slowdown.

Basic requirements for the tool were these

  • Capable of crunching 100+ million daily entries
  • Real-time analysis
  • Produce multiple metrics with potential to add more down the line
  • Low footprint

An obvious way to do this is to store all data in a heavy duty data store like a relational/SQL database or something MapReduce capable. Trouble is we may be logging in excess of 3,000 hits per second (all dynamic content, as static assets are served from the CDN). Doing that many inserts per second on a SQL-type database will be tricky unless you have powerful hardware. The next obvious problem is that scanning through hundreds of millions or billions of rows will be slow, even with MapReduce, unless of course you throw tons of hardware at it. We wanted a low footprint, remember.

Instead we decided to go with a key/value store. Major pluses were that the footprint is relatively low and it performs very fast. The downside was that I would not be able to run any sophisticated queries. Since we already have an app that uses memcached to give us a real-time view of the number of accesses per IP, we ended up using it for this purpose as well.

Implementation

I have been working for a while now with ganglia-logtailer, which is a Python framework to crunch log data and submit it to Ganglia. There are a number of good pieces in it we could reuse, and we did. What we ended up with is a two part tool: a Python based log parsing piece and a PHP based web GUI and computation part. The division of "labor" was roughly this

  • Python part parses the logs and creates entries/keys where the value in each key represents all the response times observed on a particular server and URL in a particular time period, e.g. one hour
  • PHP part takes the list once the time period has ended, calculates total time, average time and 90th percentile times and stores computed values in memcache so that retrieval later can be quicker.
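To make the computation step concrete, here is a minimal sketch of summarizing one key's list of response times. It is written in Ruby purely for illustration (the tool itself does this in PHP), the summarize name is made up, and the nearest-rank method for the 90th percentile is an assumption rather than what the tool necessarily uses.

```ruby
# Sketch: summarize the response times (in seconds) stored under one key.
# Nearest-rank percentile is an assumption; the actual tool may compute it differently.
def summarize(times)
  return nil if times.empty?

  sorted = times.sort
  total = sorted.inject(0.0) { |sum, t| sum + t }
  # ceil(0.9 * n) using integer arithmetic to avoid floating point surprises
  rank = (sorted.length * 9 + 9) / 10
  {
    :requests => sorted.length,
    :total    => total,
    :average  => total / sorted.length,
    :p90      => sorted[rank - 1]
  }
end

# Nine quick requests and one very slow one (made-up sample data).
stats = summarize([0.12, 0.10, 0.15, 0.11, 0.13, 0.14, 0.09, 0.16, 0.10, 4.20])
puts stats.inspect
```

With this sample the single 4.2 second request pulls the average up to roughly 0.53 seconds while the 90th percentile stays at 0.16, which is why both numbers are worth keeping.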

Graphing is achieved using simple CSS graphs while time based series are done using OpenFlashChart. I did look at Dygraphs for Javascript/DHTML based graphing however couldn't figure how to plot hourly values. I could only do daily values.

The tool is operational and so far it has led us to the realization that our mobile web pages are overall much slower than their corresponding web pages. This is due to the way we handle mobile ads: most feature phones don't support Javascript, so we have to download the ad on the server side, which introduces a slight delay. We did figure out that we could use Javascript on Webkit browsers, similar to what we do for regular browsers, so that should help a bit. We are also chasing some of the other "leads" regarding inconsistent performance for particular pages on some of the servers.

Next steps are to adapt parsing code to work with ganglia-logtailer which would give us real-time reporting. I don't expect too many problems with that. Also graphing could use some more love. Perhaps I'll even do standard deviation calculations :-) .

Anyways you can download source code from here

http://github.com/vvuksan/pagetime-analyzer

You know what to do :-) .

Obligatory screenshots

Hourly overview sorted by aggregate time in seconds (you can sort by any column)

This is the average response time (over an hour) for a particular URL on separate server instances

Daily view of performance for a particular URL

CouchDB views creation problems

July 14th, 2010

I have had a frustrating time creating views in CouchDB using curl. Executing the following command I would get

$ curl -s -X PUT -H "text/plain;charset=utf-8" -d cronview.json http://localhost:5984/cronologger/_design/cronview
{"error":"bad_request","reason":"invalid UTF-8 JSON"}

I checked and rechecked the JSON and used the same JSON in CouchDB's Futon, to no avail. Finally I found the answer here

http://stackoverflow.com/questions/2461798/error-about-invalid-json-with-couchdb-view-but-the-jsons-fine

The -d option of curl expects the actual data as the argument!

If you want to provide the data in a file, you need to prefix it with @:

curl -X PUT -d @keys.json  $CDB/_design/id

Store your cron output for analysis and correlation with cronologger

July 6th, 2010

For the longest time I have wanted to get rid of the dozen or so cron messages I receive every morning about things like DB backups, DB cleanups/vacuums, reporting etc. There are a number of solutions out there to help you manage the cron spam such as cronic, shush and cronwrap. They help by e-mailing you only if there is a problem, however they don't store the cron output itself. To get around that issue I have developed cronologger which can be downloaded from

http://github.com/vvuksan/cronologger

Cronologger is a BASH script that stores all the cron output into a database. I am using CouchDB since it is a great document oriented database that allows me to add attachments (blobs) to a document. I assume it would not be hard to use MongoDB, Riak and others.

Some of the benefits of this utility are

  • Reduce cron spam
  • Provide the ability to correlate adverse effects by overlaying cron events on e.g. Ganglia graphs
  • Provide a better report of all the batch jobs that ran, diff them with past jobs if they should look the same, etc.
  • Provide the ability to easily view what is currently running on the whole infrastructure ie. job_duration < 0
  • Review historical output

I am still working on a web GUI for most of these things. I will gladly accept patches and new contributions.

Tip: To view a list of documents in a CouchDB database you can use the _utils view, e.g. http://localhost:5984/_utils/

Overlay deploy timeline on Ganglia graphs

June 28th, 2010

Don't you sometimes wish you could have a visual indicator of when code has been deployed in production? Something like this :-)

Shows deploy time line on a load graph

This is how you can add a deploy timeline to your Ganglia graphs or, for that matter, to any tool that uses RRDs such as Cacti, Munin, Collectd etc.

Background

RRDtool supports so called VRULEs which are

VRULE:time#color[:legend][:dashes[=on_s[,off_s[,on_s,off_s]...]][:dash-offset=offset]]

Draw a vertical line at time. Its color is composed from three hexadecimal numbers specifying the rgb color components (00 is off, FF is maximum) red, green and blue followed by an optional alpha. Optionally, a legend box and string is printed in the legend section. time may be a number or a variable from a VDEF. It is an error to use vnames from DEF or CDEF here. Dashed lines can be drawn using the dashes modifier. See LINE for more details.

What we want to do is add a VRULE for each deployment. For example those three lines above have been generated using these VRULEs

VRULE:1277731886#FF00FF:"Deploys" VRULE:1277721886#FF00FF VRULE:1277711886#FF00FF

Implementation

The easiest way to add these to Ganglia is to modify graph.php in Ganglia Web. You need to look for the following two lines at the end of the file

$command .=  array_key_exists('extras', $rrdtool_graph) ? ' '.$rrdtool_graph['extras'].' ' : '';
$command .=  " $rrdtool_graph[series]";

Then append your own VRULEs ie.

$command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";

Obviously you have to pull in the $time info from where you keep track of your deploy times. You can also get creative by using different colors for different deploys, change legend labels, add VRULEs to only certain graphs ie. load, CPU etc. This is a quick and dirty way to do it

$deploy_times = array(1278082860,1279393200);
foreach ( $deploy_times as $key => $time ) {
  # Put deploys label only once.
  if ( $key == 0 )
     $command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";
  else
     $command .= " VRULE:" . $time . "#FF00FF";
}

Now you just have to make sure you append deploy times to the array.
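One way to avoid editing the array by hand is to have your deploy script record a timestamp in a file every time it runs. Below is a minimal sketch in Ruby, assuming a plain text file with one epoch timestamp per line. The file name and both function names are hypothetical, and graph.php would need matching code to read the same file.

```ruby
# Sketch: keep deploy times in a plain text file, one epoch timestamp per line.
# The file name and helper names are hypothetical.
def record_deploy(path, time = Time.now)
  File.open(path, "a") { |f| f.puts(time.to_i) }
end

def deploy_times(path)
  return [] unless File.exist?(path)
  File.readlines(path).map { |line| line.strip }.reject { |l| l.empty? }.map { |l| l.to_i }
end

log = "deploy_times.txt"
record_deploy(log)               # call this from the deploy script
puts deploy_times(log).inspect   # the list the graphing code would turn into VRULEs
```

Appending rather than rewriting keeps the full deploy history, so older graphs (last week, last month) still show their vertical lines.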

Alternate implementations

An alternate implementation is to create an RRD file whenever you do deploys, then overlay that graph on top of an existing graph. Trouble is you have to worry about scaling the graph. I never could get it quite right.

Credit

Thanks goes to the Circonus guys :-) since they made me think of vertical lines instead of trying the RRD overlay. Also thanks to @toredash for pointing me in the right RRDtool direction by suggesting HRULE.

Velocity Conference 2010 takeaways

June 27th, 2010

Velocity 2010 was an excellent conference. Following are my takeaways from the conference. There is tons more, but these are some of the things that made a good impression and are likely not hard to do

Web performance optimization

Mobile performance optimization

Most of the recommendations have been taken off Maximiliano Firtman's Mobile Web High Performance. You can view slides here.

  • Avoid JQuery unless you really need it. Check out slide 90. It takes 1.8 seconds on iPhone and 4 seconds on Android to download and parse JQuery. Use mobile optimized frameworks such as baseJS and XUI
  • Avoid DNS lookups and minimize number of requests since they are slow
  • Embed CSS and Javascript on the home page. After onload download external CSS and JS.
  • Use inline images (slide 56) and pictograms
  • Avoid redirects
  • Use native constructs especially for Webkit browsers e.g. -webkit-text-stroke
  • Keynote announced their Mobile Testing tool for desktops that looks promising http://mite.keynote.com/

SSL/Security

  • According to Google SSL overhead these days is pretty minimal. Around 1% on today's servers.
  • Pet peeve about the presentation is they were advising everyone to use less secure key lengths ie. 1024 bits and RC4 cipher to improve performance. It is true that adding SSL to insecure connections is certainly an improvement but it should be qualified. E-mail probably fine. Financial sites probably bad.

Scalability

  • Hidden Scalability Gotchas in Memcached and Friends by Neil Gunther (author of Guerrilla Capacity Planning) and Shanti Subramanyam discussed their findings around memcached. They used quantitative analysis to compare different memcache versions. Based on their analysis using Neil's model, memcache 1.4.5 has higher contention than 1.2.8.

Culture

GangliaView – automatically rotate Ganglia metrics

June 16th, 2010

GangliaView is a simple web app that allows you to automatically rotate selected Ganglia metrics. We use it to rotate key metrics with large graphs showing last hour and last day and smaller graphs showing last week and last month. A sample screen looks like this

GangliaView is derived from CactiView with a number of changes to make it work with Ganglia and removal of frames. You can download it from here

http://github.com/vvuksan/ganglia-misc

Non-Dell SSDs/drives not supported until Q2 2011

June 16th, 2010

I am writing up this post so perhaps I can save some poor sysadmin from chasing their own tail. If you ever receive the following error messages using PERC H700 or H800 controllers

Jun 15 14:00:17 db07 Server Administrator:  Storage Service EventID: 2335  Controller event log: PD 04(e0x20/s4) is  not supported:  Controller 0 (PERC H700 Integrated)
Jun 15 14:00:18  db07 Server Administrator: Storage Service EventID: 2334  Controller  event log: Inserted: PD 05(e0x20/s5):  Controller 0 (PERC H700  Integrated)
Jun 15 14:00:18 db07 Server Administrator: Storage  Service EventID: 2335  Controller event log: PD 05(e0x20/s5) is not  supported:  Controller 0 (PERC H700 Integrated)

It is due to the following

http://www.standalone-sysadmin.com/blog/2010/04/dell-reverses-position-on-3rd-party-drives/

Please note this will not be fixed until Q2 2011.