
Archive for September, 2009

Nagios alerts based on Ganglia metrics

Monday, September 14th, 2009

Have you ever wanted to alert based on Ganglia metrics? Well, you can :-)

You can find the source code for the plugin here.

Instructions on how to set it up are here.
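For anyone who has not written a Nagios check before, here is a minimal Python sketch of how a metric value can be turned into an alert state. It is not the linked plugin's code. The function name `check_threshold` and the example thresholds are made up for illustration, and it assumes that higher values are worse (as with load_one). The only fixed part is the standard Nagios plugin exit-code convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.

```python
# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(metric_name, raw_value, warn, crit):
    """Map a metric value to an (exit_code, status_line) pair.

    Hypothetical helper: assumes higher values are worse (e.g. load_one).
    """
    try:
        value = float(raw_value)
    except (TypeError, ValueError):
        # Anything that is not a number cannot be compared to thresholds
        return UNKNOWN, f"UNKNOWN - {metric_name} is not numeric: {raw_value!r}"

    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"{LABELS[code]} - {metric_name} = {value}"

# A real plugin would print the status line and then call sys.exit(code),
# since Nagios reads the state from the process exit code.
```

For example, `check_threshold("load_one", "7.5", warn=5, crit=10)` gives a WARNING state, while a non-numeric value ends up as UNKNOWN rather than a false OK.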

Software doesn’t run itself

Sunday, September 13th, 2009

Perhaps I should no longer be surprised, but I am, by the article mentioned in this blog post:

http://www.nakedcapitalism.com/2009/09/another-lehman-mess-no-one-can-run-the-software.html

In particular, this:

Once it went bankrupt, the staff who supported these systems “evaporated”, according to Steven O’Hanlon, president of Numerix, a pricing and valuation company which is working with Lehman Brothers Holding Inc to unwind the derivatives portfolio.

These days computer systems are the lifeblood of your company, so allowing critical technical staff to simply "evaporate" is mind-boggling. Granted, the company imploded, but I would still think that someone going into bankruptcy should have figured out that they needed to set aside money to pay for the maintenance of those systems.

The ultimate problem, as pointed out in the blog post on Naked Capitalism, is that documentation is usually skimped on since it "doesn't provide value". I would also add that when people say "the code is documented" they usually don't mention whether their systems infrastructure is documented. That can sometimes be an even bigger impediment. At a previous job there was a Perl CGI script that most people didn't know about and even fewer understood. If that script didn't work, our whole load balancing infrastructure would "mysteriously" fail, since app servers wouldn't register themselves with the web servers, leading to a full-blown outage. It was such an obscure "feature" that you could literally spend weeks chasing other avenues because the cause was so non-obvious.

Also, I would not take comfort in having the source code to an application. A lot of customers of startups will write into their contracts that if the startup goes bust they get access to the source code. That may sound nice, but it doesn't mean you will necessarily be able to run it. Running most complex pieces of software involves so many "secret" recipes and undocumented workarounds that you should really be cautious.

In closing, if you care that your software runs, make sure you keep around at least a couple of folks who have run it.


Simple “web service” for Ganglia metrics

Friday, September 11th, 2009

Here is a simple PHP script that lets you get current Ganglia metrics. You will need a Ganglia web installation. Drop this script somewhere your web server can serve it, then invoke it via e.g.

http://mygangliaserver/ganglia/metric.php?server=web1&metric_name=load_one

Here server is the name of the server for which you want metrics, and metric_name is the exact name of the metric you are looking for, e.g. load_one, disk_free etc. The only thing returned is either an ERROR message or the actual value.

<?php

$GANGLIA_WEB = "/var/www/html/ganglia";

include_once "$GANGLIA_WEB/conf.php";
include_once "$GANGLIA_WEB/get_context.php";
# Set up for cluster summary
$context = "cluster";
include_once "$GANGLIA_WEB/functions.php";
include_once "$GANGLIA_WEB/ganglia.php";
include_once "$GANGLIA_WEB/get_ganglia.php";

# Both query parameters are required; an empty needle would break strpos()
$server = isset($_GET['server']) ? $_GET['server'] : '';
$metric_name = isset($_GET['metric_name']) ? $_GET['metric_name'] : '';
if ( !is_string($server) || !is_string($metric_name) || $server === '' || $metric_name === '' ) {
    die("ERROR: server and metric_name parameters are required");
}

$found = 0;

# Find the FQDN of the supplied server name. This is a substring match,
# so "web1" also matches "web10.domain.com"; the first match wins.
foreach ( array_keys($metrics) as $host ) {
    if ( strpos($host, $server) !== false ) {
        $fqdn = $host;
        $found = 1;
        break;
    }
}

if ( $found == 1 ) {
    if ( isset($metrics[$fqdn][$metric_name]['VAL']) ) {
        echo($metrics[$fqdn][$metric_name]['VAL']);
    } else {
        echo("ERROR: Metric value not found");
    }
} else {
    echo("ERROR: Host not found");
}

?>

Nothing fancy. It contains only rudimentary error checking, so please be gentle :-) . Feel free to extend it to satisfy your needs. Also, this is likely not scalable if you have hundreds of hosts and tons of requests.
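If you want to consume the script from another program, a small client could look like the Python sketch below. The parameter names server and metric_name and the "ERROR" prefix come from the script above. The function names, the timeout, and the decision to parse the value as a float are assumptions for illustration; Ganglia also has string metrics, which this sketch would reject.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def metric_url(base_url, server, metric_name):
    """Build the query URL for metric.php."""
    query = urlencode({"server": server, "metric_name": metric_name})
    return f"{base_url}?{query}"


def parse_response(body):
    """Return the metric as a float, or raise ValueError on an ERROR reply."""
    text = body.strip()
    if text.startswith("ERROR"):
        raise ValueError(text)
    # Assumption: the metric is numeric; float() raises ValueError otherwise
    return float(text)


def fetch_metric(base_url, server, metric_name, timeout=10):
    """Fetch one metric value over HTTP (timeout in seconds is an assumption)."""
    url = metric_url(base_url, server, metric_name)
    with urlopen(url, timeout=timeout) as resp:
        return parse_response(resp.read().decode("utf-8", errors="replace"))
```

With the example URL from this post, the call would be `fetch_metric("http://mygangliaserver/ganglia/metric.php", "web1", "load_one")`. Raising on an ERROR reply means a caller cannot mistake a missing host for a metric value.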

Broken hostname resolution and PAM don’t mix

Wednesday, September 9th, 2009

I don't mean PAM the cooking spray but Pluggable Authentication Modules. I was asked to change some DNS settings for a set of hosts, i.e. move them from one domain to another, e.g. from domain.com to domain.net. At the end of the process the head node all of a sudden started refusing logins with the following error message

fatal: Access denied for user vvuksan by PAM account configuration

It took some hair pulling, but after a while I concluded that the head node's hostname was still set to the old name, e.g. server5.domain.com, which was no longer resolvable. As soon as the hostname was changed, i.e.

% hostname server5.domain.net

things automagically started working again. Hope this prevents someone from going bald :-) .
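A quick way to check for this condition before it bites is to ask whether the machine's own hostname still resolves. The Python sketch below does only that; it does not reproduce or diagnose the PAM failure itself. The function name `hostname_resolves` and the injectable `resolver` argument are made up for illustration, while `socket.gethostname` and `socket.gethostbyname` are standard library calls.

```python
import socket


def hostname_resolves(hostname=None, resolver=socket.gethostbyname):
    """Return (hostname, ok), where ok says whether the name resolves.

    With no argument, the local machine's configured hostname is checked.
    """
    name = hostname or socket.gethostname()
    try:
        resolver(name)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return name, False
    return name, True


if __name__ == "__main__":
    name, ok = hostname_resolves()
    print(f"{name}: {'resolves' if ok else 'DOES NOT RESOLVE'}")
```

Run on each host after a domain move, this would flag a machine whose hostname still points at the old domain.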

Cloud computing’s Achilles Heel

Tuesday, September 1st, 2009

I have touched upon this issue before; however, here are some illustrations of what I think is cloud computing's Achilles heel. It has to do with shared hardware and virtualization. In my case I have a Drupal site running in a Xen guest on top of a Xen host. For whatever reason, while being indexed by a Google bot, Apache went "crazy", allocating tons and tons of memory and swapping heavily. At this point the Xen guest is nearly unusable since the load is close to 100.

[Image: xen-guest]

Now let's look at what is happening to the underlying Xen host, i.e. the one that runs the Xen guest

[Image: xen-host]

Yikes. If you had another instance on this particular Xen host, you can bet that instance would be severely affected. The trouble is that you may not really be aware of it, since you do not have access to the underlying hardware. You may be scratching your head over why, all of a sudden, you are getting subpar performance. Also, if you are a cloud provider, how do you deal with situations like this? Do you simply shut down machines that exceed certain performance thresholds? What if this happens to be a production database server which is doing a database dump and should be "allowed" to thrash the disk? What if you shut it down and corrupt the customer's database? It gets real tricky real quick.

Also forget about oversubscription. You need only one poorly behaving guest to ruin it for everyone else, and the more you oversubscribe, the higher the risk of performance degradation.