Saturday, August 7, 2010

Nagios with check_mk

We don't do a lot of printing here at the homestead, but when we do we usually find that the printer isn't working. It is an older network printer that is a bit flaky. So I obviously needed to install Nagios, an enterprise-class monitoring package, to let me know when it stopped working. I had a CentOS machine lying around that I wasn't doing anything with, so one rainy Saturday afternoon I set to work. I followed the quick start instructions for Fedora without encountering much trouble. I had a couple of dependency issues (no Apache) but was able to get everything through yum easily. As an aside, yum has really made life easier; I remember the days when, if you forgot to select gcc during the install, you simply reformatted and started over because it was easier.

I also wanted to be able to do some trending (not on the printer, mind you; that would be silly). It seems like some people use a different product for trending and Nagios just for the monitoring, but Nagios does a great job of trending as well with help from an add-on. Of the various add-ons for Nagios that do trending, pnp4nagios seemed to fit me best: it is easy to use and generates appealing graphs. As long as the Nagios check returns performance data properly, pnp4nagios can graph it with no additional setup.

So it was time to set up my first host. This is where I think people start to get turned off by Nagios. For every host you have to edit a file listing all the services the host has. This means thinking about what you want to monitor, having some idea of what Nagios can monitor, and, when adding new monitors, going back and updating potentially many config files. There are some plugins that will let you do this in a web GUI, but I didn't try any so I cannot speak to them. Despite the difficult config, Nagios is nice: the web interface is clear, I could see that my file server and printer were both up, and I got nice alerts when my printer went offline.

I then set out to investigate check_mk, which solves the difficult configuration of monitored hosts, Nagios's biggest negative. Check_mk is an auto-discovery tool for services on a host. You tell it which host machines you want to monitor, install a small client on each host, and it takes care of the rest. Anything check_mk can monitor it will find on the host and monitor; it also has built-in integration with pnp4nagios, so you can trend all of the check_mk monitors as well. You will still need to figure out how to monitor things that are not at the OS layer (custom applications, databases, etc.), but you will be building on a solid foundation.

Installation of check_mk was very simple. NOTE: you will probably want to grab the latest version, as the new 1.1.6 is out now (I admit I procrastinated a while before writing this post, but the steps should otherwise be the same):
wget <URL of check_mk-1.1.2.tar.gz>
tar -xvf check_mk-1.1.2.tar.gz
cd check_mk-1.1.2
When I ran the setup script it asked me lots and lots of questions; I took the defaults for all of them. To install the client on the target machine to be monitored:
rpm -i check_mk-agent-1.1.2-1.noarch.rpm
Next, add the host name of the machine to be monitored to the check_mk config file (under /etc/check_mk/) and run two commands to auto-generate the Nagios host config files:
check_mk -I alltcp
check_mk -R
These commands will also restart the Nagios server. One note: if you are running on a slow host (like hardware sitting in your basement), sometimes two Nagios processes will end up running at the same time. Kill them both (killall -9 nagios), restart Nagios (service nagios start), and you should be fine.

Browse over to the web UI and you should see your monitored host; click on it and you should see a healthy number of monitored services. To add additional hosts, simply repeat the last two steps from above.

With the combination of Nagios and check_mk you get the great monitoring server of Nagios without the headache or learning curve that is the traditional complaint of Nagios users. Skip the NRPE or remote shell invocation stuff and go straight to check_mk. You can add non-check_mk services to these hosts (and other things) through standard Nagios configuration files. You will want to add a new "cfg_file" property to your nagios.cfg file pointing at a file that holds your custom configurations; in this new file, define a new service check using the same host name used to set up check_mk. When done you should see your service along with the check_mk ones, and because it is in a separate file check_mk will not overwrite it when doing updates. Nagios may also be a bit chatty in the beginning, so even with check_mk it still takes a little tuning so that alerts are not going off all the time, though you may also be surprised to discover a number of problems on your network if this is the first time you are setting up a non-homegrown monitoring solution.
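
The separate-file approach could look something like the sketch below. It is only an illustration: the host name "fileserver", the service description and the output path are hypothetical, and it assumes the stock "generic-service" template and "check_http" command from the Nagios sample configuration are present in your install.

```shell
#!/bin/sh
# Sketch: write a hand-maintained service definition into its own file so
# that check_mk's generated config never touches it.
# All names below (file path, host, description) are hypothetical examples.

write_custom_service() {
    # $1 = output file, $2 = host name (same name as in the check_mk config),
    # $3 = service description, $4 = Nagios check command
    cat > "$1" <<EOF
define service {
    use                     generic-service
    host_name               $2
    service_description     $3
    check_command           $4
}
EOF
}

# Written to a scratch location here; move it next to your other Nagios
# object files once you are happy with it.
out="${TMPDIR:-/tmp}/custom_services.cfg"
write_custom_service "$out" "fileserver" "Intranet HTTP" "check_http"
cat "$out"
```

To have Nagios load the file, add a cfg_file= line with its final path to nagios.cfg, then run "nagios -v" against that nagios.cfg to verify the configuration before restarting.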

For additional reading, check out the jboss2nagios plugin for JBoss monitoring and Icinga, a recent fork of Nagios. I didn't get a chance to look into Icinga much, as I didn't find it until I was already down the path with Nagios. My initial thoughts were that the web UI looks much more modern, but I don't think check_mk works with it yet.

Sunday, June 6, 2010

xen time drift and ntp

Just a quick post about how I coerced some xen machines that could not keep time into keeping time; maybe someone will comment with a better way, or this will help someone else. I believe the problem crops up when you have more guest machines than processor cores in the host, but regardless, the symptom is massive time drift on the DomU machines (the guest VMs). I had a machine that was counting to 5 whilst everyone else was counting to 4 (one one thousand, two one thousand, three one thousand, four one thousand...). Over the course of an evening this could get it way out of sync, to the point of becoming more than just an annoyance.
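
To put rough numbers on that, here is a small sketch of the arithmetic. It takes the "counting to 5 while everyone else counts to 4" description literally, which is an assumption; the real drift was not that regular. The 1000 second figure is ntpd's default panic threshold, which comes up later in this post.

```shell
#!/bin/sh
# Sketch: how far ahead does a guest clock get if it counts `guest` seconds
# for every `real` seconds of real time? Integer arithmetic only.

drift_seconds() {
    # $1 = elapsed real seconds, $2 = guest count, $3 = real count
    echo $(( $1 * ($2 - $3) / $3 ))
}

seconds_until_drift() {
    # $1 = drift to reach in seconds, $2 = guest count, $3 = real count
    echo $(( $1 * $3 / ($2 - $3) ))
}

echo "drift after 8 hours: $(drift_seconds 28800 5 4) seconds"
echo "time to drift 1000 seconds: $(seconds_until_drift 1000 5 4) seconds"
```

At that rate the guest gains two hours overnight and would cross the 1000 second mark in a little over an hour.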

Googling the issue, there seem to be many people with similar problems, but there didn't seem to be a solution out there other than to dig into what is wrong with the xen machine's clock. There I found everything from "my BIOS clock is wrong" (not my problem) to advice about interrupts and how they can be flaky when virtualized. One clever guy had even used tickadj to work out a tick ratio close enough to real life that ntpd could adjust the clock correctly. This seemed like it might be fragile in my case, as I am not sure the clock is predictably wrong, just that it is wrong.

After much digging into how ntp works I found a solution that works for me running CentOS 5 and xen 3. First you need to install ntpd and set it to run on startup:
yum install ntp
chkconfig ntpd on

Next you need to take the guest machine off of the host's clock. The permanent way of doing this is to add a line to your /etc/sysctl.conf file telling the machine not to use the host machine's clock (the change will take effect on restart):
echo "xen.independent_wallclock = 1" >> /etc/sysctl.conf

Finally you will need to tinker with ntp so that it does not panic when the time shift is too large, and so that it ignores the jitter you get with a wildly shifting clock (as was my case). The panic setting should be applied to any VM using ntp and not just wacky ones that drift: if a VM is suspended, ntp will see on resume that the time shift is more than 1000 seconds and will refuse to adjust the clock. To make these changes, edit the default ntp.conf file, which should be located at "/etc/ntp.conf". You need to remove or comment out 2 lines:
server # local clock
fudge stratum 10
Next, add the following line to the very top of the file; it will not work unless this line is the first directive in the config file.
tinker panic 0 dispersion 1.000
Ideally you should have one server set up as a master ntp server in your network that syncs off of a public ntp server, with all of your other servers syncing off of it. This is a general ntp best practice and not really specific to this problem, but given that I am already tinkering with ntp's default values I think it is even more important here. Below is a sample ntp.conf file that is based on the default with my changes in it:
#dispersion 1.000: Ignore high jitters and offsets as local clock drifts wildly on xen
#panic 0: set time even if time shift is more than 1000 seconds
tinker panic 0 dispersion 1.000
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery

# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict -6 ::1
server ntp01

# Drift file. Put this in a directory which the daemon can write to.
# No symbolic links allowed, either, since the daemon updates the file
# by creating a temporary in the same directory and then rename()'ing
# it to the file.
driftfile /var/lib/ntp/drift

# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
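
The edits described above could be scripted along the following lines. This is only a sketch: it assumes the local clock server line still carries its "# local clock" comment and that the only fudge line in the file belongs to the local clock, and it writes an edited copy to a scratch path (a hypothetical location) rather than touching /etc/ntp.conf.

```shell
#!/bin/sh
# Sketch: produce an edited copy of an ntp.conf with the tinker line first
# and the local clock lines commented out. The original file is not modified.

patch_ntp_conf() {
    # $1 = source ntp.conf, $2 = destination for the edited copy
    {
        echo "tinker panic 0 dispersion 1.000"
        # Comment out the local clock "server" line and any "fudge" line.
        sed -e '/^server.*# local clock/s/^/#/' \
            -e '/^fudge/s/^/#/' "$1"
    } > "$2"
}

if [ -r /etc/ntp.conf ]; then
    patch_ntp_conf /etc/ntp.conf "${TMPDIR:-/tmp}/ntp.conf.new"
    # Show what changed; diff exits non-zero when the files differ.
    diff /etc/ntp.conf "${TMPDIR:-/tmp}/ntp.conf.new" || true
fi
```

Review the copy by hand before putting it in place; you will still need to point the server line at your own time source.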

Finally, reboot. The reboot takes care of making all the changes active, but if you cannot reboot you can try setting the VM to not use the clock of the host machine at runtime and then restarting the ntpd service. The independent_wallclock bit did not work for me this way, so your mileage may vary.
echo 1 > /proc/sys/xen/independent_wallclock
service ntpd stop
service ntpd start
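
A quick way to see whether the setting actually stuck is to read the same proc entry back. The helper below is a hypothetical sketch, not part of any xen tooling; it only reports what the file contains.

```shell
#!/bin/sh
# Sketch: report whether the guest is running on its own wallclock.

wallclock_state() {
    # $1 = file to read (defaults to the xen proc entry used above)
    f="${1:-/proc/sys/xen/independent_wallclock}"
    if [ ! -r "$f" ]; then
        echo "unavailable"      # not a xen guest, or the entry is missing
    elif [ "$(cat "$f")" = "1" ]; then
        echo "independent"      # guest keeps its own time
    else
        echo "host"             # guest still follows the host clock
    fi
}

echo "wallclock: $(wallclock_state)"
```

If it reports "independent", running "ntpq -p" a few minutes after restarting ntpd should list your time server along with the current offset.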

Monday, April 26, 2010

Gantt Project and Bugzilla "Integration" part 2 with gpath

A number of people (okay, one person) asked for the code behind my Bugzilla Gantt Project update script, so here is a rewritten and simplified version. I tried out the gpath notation within Groovy as well, which is quite nice. It is very quick and natural to code in once you get the hang of it. One thing that I found lacking documentation was how to take a Groovy node and output it as XML. Here is a code snippet that renders the Node variable "theNode" into a file called "test.xml", which I hope is helpful:
def formatter = new XmlNodePrinter(new PrintWriter(new File("test.xml")))
formatter.setPreserveWhitespace true
formatter.print(theNode) //writes the node out and flushes the writer

Here is the Groovy script to update a Gantt Project file from a Bugzilla database. You will need to update the section under "user definable variables" with your project file name, the new project file name, and connection information for your Bugzilla database. The script requires that you map your tasks to Bugzilla bugs with a custom column (called "Bug" by default) and that you have a custom column called "Bug Status"; this is where the bug status information will be placed (it will overwrite whatever is in there now). Other than that, add the MySQL driver to your classpath, make sure you have Groovy installed, and away you go.
Again, this is a rudimentary solution, but... well, read my previous post for that story.

import groovy.sql.Sql

//user definable variables
def sql = Sql.newInstance("jdbc:mysql://HOST/BUGZILLA DB", "BUGZILLA USER", "BUGZILLA PASS", "com.mysql.jdbc.Driver") //enter params to connect to bugs database
def filename = "proj.gan" //file name of the source project plan
def target = "rev_proj.gan" //target name for the new project plan
def bugPropName = "Bug" //what custom column the bug number is in
def bugStatusPropName = "Bug Status" //what custom column the bug status is written to

//parse the file so we can use gpath
def xml = new File(filename).getText()
def root = new XmlParser().parseText(xml)

def bugProp
def bugStatusProp
//map the bug properties into their internal ids
for (aTaskProp in root.tasks.taskproperties.taskproperty) {
    if (bugPropName.equalsIgnoreCase(aTaskProp['@name'])) {
        bugProp = aTaskProp['@id']
    } else if (bugStatusPropName.equalsIgnoreCase(aTaskProp['@name'])) {
        bugStatusProp = aTaskProp['@id']
    }
}

if (bugProp == null || bugStatusProp == null) {
    println("Must have a custom property setup called 'Bug' and one for 'Bug Status' in your project")
    System.exit(1)
}

//closure used to set the custom property
def setOrAddCustomProperty = { taskNode, propId, value ->
    //loop through again from the beginning if we found the bug and look for the
    //property we need to set. It may be before the bug, order is not guaranteed
    def foundIt = false
    for (aCustompropertySub in taskNode.customproperty) {
        if (propId == aCustompropertySub['@taskproperty-id']) {
            aCustompropertySub['@value'] = value
            foundIt = true
        }
    }
    //Add in the property to the XML if we didn't find it
    if (!foundIt) {
        new Node(taskNode, "customproperty", ["taskproperty-id": propId, "value": value])
    }
}

//Update the tasks with bugzilla info (status)
for (aTask in root.tasks.task) {
    //loop through the customproperty elements where we store bug numbers and bug status
    for (aCustomproperty in aTask.customproperty) {
        //only act on the customproperty that holds the bug number
        if (bugProp == aCustomproperty['@taskproperty-id']) {
            def bugNum = aCustomproperty['@value']
            //This is where you would add additional updates from Bugzilla
            def bugStatus = ""
            sql.eachRow("select bug_status from bugs where bug_id = ${bugNum}") { row ->
                bugStatus = row[0]
            }
            setOrAddCustomProperty(aTask, bugStatusProp, bugStatus)
        }
    }
}

//output the xml to the file specified in the user set variables
def formatter = new XmlNodePrinter(new PrintWriter(new File(target)))
formatter.setPreserveWhitespace true
formatter.print(root)

Sunday, March 21, 2010


Today I had a chance to play with skipfish, which is a web vulnerability scanner. On my machine running OS X 10.5 I had a couple of things I needed to do to get it running. I had to install libidn using MacPorts, modify the Makefile, copy the dictionary, and then make everything:
sudo port install libidn
cp dictionaries/default.wl skipfish.wl
I got an error during make, "report.c:744: warning: passing argument 3 of ‘scandir’ from incompatible pointer type", which was fixed by editing the Makefile and changing the line:
CFLAGS_GEN = -Wall -funsigned-char -g -ggdb -D_FORTIFY_SOURCE=0
change to:
CFLAGS_GEN = -Wall -funsigned-char -g -ggdb -D_FORTIFY_SOURCE=0 -I/opt/local/include -L/opt/local/lib
After typing make again Skipfish was compiled and working.
I needed form-based authentication for my scan, which has been tricky with other web scanners I have tried; skipfish made it pretty easy with a novel approach. Once you have logged into whatever site you are trying to test with a web browser, find your session cookie and pass it with the "-C" switch to skipfish. For example:
./skipfish -C JSESSIONID=MYSESSIONID1234 -X /logout.jsp -o /tmp/outputDir http://localhost:8080
This will scan localhost using an existing session identified by MYSESSIONID1234 and will ignore any link containing /logout.jsp (so as not to destroy the session). The HTML report will be generated in the /tmp/outputDir folder.
The output is clean but uninformative, so you may need some hints to be able to do anything useful with it. Skipfish looks at fundamental problems anyway (SQL injection as opposed to, say, the existence of some specific DLL or known Apache bug), so specific solutions are not appropriate. All in all, a very useful and easy to use tool.

Sunday, February 14, 2010


The other day I set out to find a mail server that would accept all mail coming to it, not deliver it, and then display it back to whoever wanted to see it. The idea was that instead of setting up fake email addresses for people to check the results of various black box testing scenarios, we could all see the email that was generated by the application. I was 100% sure that such a thing existed, and honestly surprised when I found that it did not (at least I could not find anything).
There is a fair amount of software out there that comes close to what I want: Dumbster and Wiser for unit testing, and a myriad of tools that sit on a desktop and intercept email (Papercut looked like a neat one, but I never tried it, having a Mac and all). But there was nothing suitable for a server used by multiple people.

I decided therefore to make one. I had been looking for a small Grails project to do in order to really understand how Grails works beyond a tutorial-level understanding. A couple of weekends later, mockemail is in a usable, albeit rough around the edges, form. It is open source and runs on the Grails framework. I have deployed it and we are using it internally and nothing has crashed, so at least in this I am successful. I can also say it was a nice departure to actually develop something for a change, even on a small scale.

Mockemail is a web-based server that is very easy to set up and run. You download it and turn it on (maybe change the port number to 25). Any mail that is sent to it will be stored locally and not delivered to the final recipient. No checks are done against the address; it simply accepts the mail and stores it. The mail can then be viewed by anyone who is interested with no need to log out and log in; in other words, any email that the server receives is viewable by anyone who wants to see it. It is open source and has been made available here for free on SourceForge in the hopes that someone else finds it useful.

In a later post I hope to talk about what I did and the problems I ran into while building it.

Saturday, January 9, 2010

Diary of setting up an Ubuntu Enterprise Cloud

Below is my experience with getting an Ubuntu Enterprise Cloud set up. This is really a Eucalyptus cloud setup that is packaged up by Ubuntu, so I may use the terms somewhat interchangeably.
Day 1
I requisitioned 2 servers, installed Ubuntu Server 9.10, and chose the Enterprise Cloud option from the main installation screen. I did this once without an Internet connection and had all kinds of problems with exchanging the keys between the cloud controller and the node:

warning: //var/lib/eucalyptus/keys//node-cert.pem doesn't exists!
warning: //var/lib/eucalyptus/keys//cluster-cert.pem doesn't exists!
warning: //var/lib/eucalyptus/keys//node-pk.pem doesn't exists!
warning: //var/lib/eucalyptus/keys//cloud-cert.pem doesn't exists!

Trying scp to sync keys to: eucalyptus@!://var/lib/eucalyptus/keys/...
usage: scp [-1246BCpqrv] [-c cipher] [-F ssh_config] [-i identity_file]
[-l limit] [-o ssh_option] [-P port] [-S program]
[[user@]host1:]file1 ... [[user@]host2:]file2

ERROR: could not synchronize keys with !
The configuration will not have this node.
Hint: to setup passwordless login to the nodes as user eucalyptus, you can
run the following commands on node !:
sudo -u eucalyptus mkdir -p ~eucalyptus/.ssh
sudo -u eucalyptus tee ~eucalyptus/.ssh/authorized_keys > /dev/null
Be sure that authorized_keys is not group/world readable or writable

So I followed the above directions and did many other things without any luck. Eventually I reinstalled both the node and the cloud controller machine while they were connected to the Internet and then did a distro update and reboot:
sudo apt-get update
sudo apt-get dist-upgrade
After that the key exchange process worked. I did this a couple of times and had to do some or part of the above directions to get it to work.
With everything appearing to work, I tried to get Elasticfox working so I could do something useful with my cloud; however, I could not get it to accept a new region. I believe this is related to a recent update where Amazon added some features that made it incompatible with Eucalyptus, though perhaps this will be addressed soon. I did eventually find a fork of Elasticfox called Hybridfox that is made to work with Eucalyptus, but for now I was relegated to the command line tools to try and get an image up and running. The Ubuntu website made mention of a "register-uec-tarball" script which, when used with one of their pre-supplied images, would register it with the cloud controller. I found the script at:
Unfortunately it didn't work out of the box and I was not really interested in fixing it.

./ euca-centos-5.3-i386.tar.gz centos i386
Mon Dec 21 20:08:16 EST 2009: ====== extracting image ======
can't find image
cleaning up /tmp/

One note: if you browse the image store through the web interface you can download an image with a click of a button, and that seems to work without problems. But you are limited to what is available through the image store, which was a beta media server and 2 different versions of Ubuntu, and I had my heart set on CentOS.
I did manage to get my image to register using the command line instructions (scroll down to "STEP 7: Run an Image"). The steps refer to a non-existent EMI environment variable in step 3. Its value can be retrieved from the management console or from the euca-describe-images command (it will be the one with image.manifest.xml in the file name).
Below is what the commands looked like for my install. Note that if you are trying to replicate this, yours will differ based on the image you are trying to install (again, I was doing CentOS), and the EKI, ERI, and EMI values will be different.
gunzip centos.tar.gz
tar -xvf centos.tar
cd euca-centos-5.3-i386/
euca-bundle-image -i kvm-kernel/vmlinuz-2.6.28-11-server --kernel true
euca-upload-bundle -b centos-kernel-bucket -m /tmp/vmlinuz-2.6.28-11-server.manifest.xml
euca-register centos-kernel-bucket/vmlinuz-2.6.28-11-server.manifest.xml
(set the printed eki to $EKI [word after IMAGE, export EKI=eki-41CD162F in my case])

euca-bundle-image -i kvm-kernel/initrd.img-2.6.28-11-server --ramdisk true
euca-upload-bundle -b centos-ramdisk-bucket -m /tmp/initrd.img-2.6.28-11-server.manifest.xml
euca-register centos-ramdisk-bucket/initrd.img-2.6.28-11-server.manifest.xml
(set the printed eri to $ERI [word after IMAGE, export ERI=eri-A4B3177E in my case])

euca-bundle-image -i centos.5-3.x86.img --kernel $EKI --ramdisk $ERI
euca-upload-bundle -b centos-image-bucket -m /tmp/centos.5-3.x86.img.manifest.xml
euca-register centos-image-bucket/centos.5-3.x86.img.manifest.xml
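
Rather than copying the ids by hand, the "word after IMAGE" can be picked out with a small helper. The function below is a hypothetical sketch and assumes the register command prints a single line of the form "IMAGE <id>", as described above.

```shell
#!/bin/sh
# Sketch: pull the id out of "IMAGE <id>" output so it can be exported
# without copying it by hand.

image_id() {
    # Reads command output on stdin, prints the word after the first IMAGE.
    awk '$1 == "IMAGE" { print $2; exit }'
}

# Demonstrated on a canned line; with the real tools you would pipe the
# euca-register output into image_id instead of using echo.
EKI=$(echo "IMAGE eki-41CD162F" | image_id)
export EKI
echo "$EKI"
```

The same helper works for the ramdisk and machine image ids, since all three register steps print the id in the same position.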

You should get back an image id and be able to see it in the cloud manager. Now use this image id to create the server:
euca-run-instances emi-E5C0150D -k mykey -t m1.small
Done! I was ready to break out in a jig when, instead of getting my VM, I got this message:
FinishedVerify: Not enough resources: vm instances.

Day 2
I found the “euca-describe-availability-zones verbose” command, which when run showed I had no availability (i.e. no nodes running). One note: some output I found while googling appears to list the registered nodes, but I never saw this in my own output. I am not sure if there are just different versions of this command or if it only outputs node information in some cases. In the end (I am spoiling the ending, but narrators are allowed to be omniscient) I couldn't find anything that would actually tell you what nodes really existed and what state they were in. So given it was a new day and I had no available resources, I decided to start over with a newly requisitioned laptop for my node that had a little more horsepower.

First I reinstalled the controller, then the node. Now I was back to my original problem with the key synchronization, so I did the apt-get dist-upgrade, ran through the whole key syncing process, and it completed without error. I crossed my fingers and ran “euca-describe-availability-zones verbose”: still showing no availability. The logs showed nothing obvious, but I am not sure what I am looking for, so that doesn't mean very much coming from me. I posted a message on the forum for help. I never did get a reply, but I had a theory that perhaps my hardware was not up to the task. I requisitioned a laptop that I knew had the Intel VT extensions and reinstalled everything again, with no luck. I posted a message to the Eucalyptus forum and went home.
Day 3
I got a response on the Eucalyptus forum and tried their suggestion of opening up ports, but this didn't help. I requisitioned a new laptop for my testing environment to use as the cloud controller (we had ordered a new laptop to replace one that was getting old and didn't have a working delete key. "Just don't make any mistakes for a couple days and you won't need the delete key," I said as I stole off with the machine). I reinstalled the cloud controller and now got stuck on installing the credentials. I did the dist-upgrade and thankfully this went away. I rebooted the cloud controller, then the node, so they came up in the right order, tried to see what my availability was now, and got a strange error instead:
Failure: 408 Request Timeout
Thankfully this turned out to be something simple. There appears to be a known bug where on a new boot things may come up in the wrong order. The workaround is to restart the eucalyptus service on the node:
sudo service eucalyptus stop
sudo service eucalyptus start
Now back on the cloud controller I check the availability and finally it is registering the node. I now try to start up a new instance:
euca-run-instances emi-E5AC1512 -k mykey -t m1.small
Then with the handy dandy watch command:
watch -n5 euca-describe-instances
I was able to watch it terminate right after it started up. Not the desired effect. I remembered at this point reading somewhere that you may need to do some BIOS tinkering to get the virtualization settings enabled. So I shut everything down, went into the BIOS, and enabled virtualization on both the cloud controller and the node. I started everything back up and tried again. By now I had found Hybridfox and was able to use its GUI to deploy my new instance. One note on Hybridfox: the directions say to use the "Query Id" and "Secret Key". I took these to be just dummy values, but you need to get them by logging into the cloud controller web interface and clicking on show keys at the bottom of the login screen.
I still needed to do some work to get the networking all working, but there was my VM in my private cloud happily running.
Although I can't say this was easy to get up and running, it is a frightfully nifty piece of tech. Both the cloud controller and the node need to have the VT extensions, and have them turned on, so you need a certain level of hardware to play with it (oh, and the laptop with the missing delete key did in fact get replaced). Eucalyptus has re-implemented much of the Amazon cloud infrastructure, and not just the EC2 part. There is a commercial company behind it as well for those that need that level of support and peace of mind.