Monitoring and Configuration Management to restart services
20 Nov 2011
This is the text of the talk I gave at the LSPE meetup in November 2011.
Good Evening. Tonight I’m going to talk about something that I hope will free up your time from firefighting during the day and help you sleep better at night, instead of catching up on your sleep during these talks. I’m going to talk about setting up your servers so that they can recover from faults automatically, without your intervention. Once you’ve followed this self-healing recipe you’ll be freed from firefighting your most common system failures.
My name is Greg Retkowski, and I’m an Operations Engineer at OnLive, a cloud gaming startup. I’ve sysadmin’ed at a dozen or so internet startups around the bay area since moving out here in 1997. Today I’m going to talk about the self-healing setup I used at another startup I worked at, a company called Avvenu. This setup ties together your network monitoring and your configuration management system so that common faults that your monitoring system detects can be quickly fixed by your configuration management system.
This talk is based on an article I wrote a few years ago for O’Reilly. Originally it used NAGIOS and Cfengine. As most people are more familiar with Puppet, I’ve updated this talk to use it instead. With just a few small changes to tools you’re already using, you’ll be able to tie these two systems together to resolve faults as they occur.
What’s in it for me?
So, how does this setup help you?
First, it’ll free you up from firefighting interruptions. I didn’t like getting paged late at night to fix common problems that a configuration management system could rectify. For example, we had some custom apache modules that would sometimes crash the apache daemon. I’d have to VPN in and restart apache by hand, a fix that was easy to automate.
Second, it can react faster than a human can. Once I implemented it, a pleasant side effect was that the setup would resolve common issues even faster than a human could. When I was paged I’d log in and find that the system had already self-corrected. Outages were shortened by removing the human from the loop.
Third, it’s a hedge against technical debt. We don’t want to get software from engineering that crashes every millionth request, but sometimes we do, and something like this can get us through till the next release.
The first tool is NAGIOS. Most people in this room are already familiar with NAGIOS. It’s the most popular open-source monitoring package. It runs service checks against services and notifies sysadmins when things fail. It also has the capability of running an external script when a failure occurs, and we’ll leverage that in our setup.
The next tool is Puppet. When I mention puppet some people think of a scary doll with strings, but in this case I’m talking about the configuration management system developed by Luke Kanies. It checks the configuration of a host against policies you create, and updates the host to match your policies. Puppet has many capabilities that help in our system: it can correct corrupted config files, fix directory permissions, and ensure processes are running. In most installations puppet runs only, say, twice an hour. However, it can also be run on demand, and run remotely, and we’ll use this capability in our setup.
As an aside, if there’s one skill that’ll be crucial to have in the next five years it’s going to be a familiarity with configuration management systems be that puppet or chef. It’s nearly impossible to manage large server farms without them. If you haven’t investigated either of them yet I recommend you do.
High Level diagram
This is a high-level diagram of how it all ties together. Nagios monitors services, and when a fault occurs it triggers an RPC mechanism to tell puppet to run. Puppet is configured to ensure apache is running, and if it isn’t, it restarts it via its init script.
Setting up puppet to start downed services
This is our puppet policy for our apache server. This may look like gibberish if you aren’t familiar with puppet, so I’ll quickly walk through it.
This class ‘httpd’ tells puppet how to configure apache on our system. It says we should have the apache package installed, there’s a bunch of config file definitions we skip in this example, and then at the bottom we tell puppet that the service ‘apache’ requires the ‘apache’ package and that puppet should ensure it is running.
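The original slide with the policy isn’t reproduced here, but a minimal sketch of such a class might look like this (the package, service, and file names are illustrative, and the config file resources are elided as in the talk):

```puppet
# Sketch of the 'httpd' class; resource names are illustrative.
class httpd {

  # Make sure the apache package is installed.
  package { 'apache':
    ensure => installed,
  }

  # ... config file definitions elided, as in the talk ...

  # Keep the apache service running; it depends on the package.
  service { 'apache':
    ensure  => running,
    require => Package['apache'],
  }
}
```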
When puppet runs, if it finds the apache process is missing, it’ll restart it via the apache init script. This will be how puppet will restart apache if nagios notifies it that it isn’t running.
Setting up nagios with a postfail script
Now we’ll talk about the changes to Nagios. To make this work you’ll need to
configure nagios to call an event handler script whenever a service goes into
a different state. You’ll need to make changes in two places. First the
services config file:
The important lines in this file are the ‘event_handler_enabled’ and the ‘event_handler’ lines. The first tells nagios to turn on an event handler for state changes for this service. ‘event_handler’ tells nagios what event handler to use.
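A sketch of such a service definition, with the two lines in question, might look like this (host and service names are illustrative):

```
define service{
        use                     generic-service
        host_name               web01
        service_description     HTTP
        check_command           check_http
        event_handler_enabled   1
        event_handler           handle_puppetrun
        }
```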
The next file to update is the commands config file - we add a command handle_puppetrun, which invokes the handle_puppetrun shell script with several arguments. This will tell the script what host is affected and what the service state is. I’ve wrapped the lines here, but the command_line line must be all on the same line.
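A plausible version of that command definition follows; the plugin path is an assumption and will vary by installation. The `$HOSTNAME$`, `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros are expanded by Nagios when the handler fires (the `command_line` below is wrapped for readability but must be a single line):

```
define command{
        command_name    handle_puppetrun
        command_line    /usr/local/nagios/libexec/eventhandlers/handle_puppetrun
                        $HOSTNAME$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
        }
```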
Setting up the glue between nagios and CM
Now we are going to set up the RPC glue between Nagios and Puppet. All of these examples use Puppet 2.6.
Setting up the puppet daemon on the host
First you’ll need to edit puppet’s auth config, and add a stanza that will allow it to accept remote requests to kick off puppet runs.
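As a sketch, the stanza would look something like this in `auth.conf`; the monitoring hostname is illustrative and should match your Nagios server’s certificate name:

```
# Allow the monitoring host to trigger puppet runs via the REST API
path /run
method save
auth any
allow nagios.example.com
```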
You’ll also need to create an empty namespaceauth config; this works around a known issue with 2.6.
Next you’ll need to tell puppet to listen for incoming requests. You can do this by adding a stanza like this to your main puppet config.
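The stanza is short; something like this in `puppet.conf` should do it:

```
[agent]
    listen = true
```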
You could also use the ‘listen’ flag on the command line when invoking the puppet agent.
For testing, you should use this command line. It keeps the puppet agent in the foreground and prints debugging information to your console, which makes troubleshooting much easier. I recommend running it this way at first while you debug your configuration.
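One plausible invocation under Puppet 2.6 (exact flags may vary with your version) is:

```
puppet agent --no-daemonize --verbose --debug --listen
```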
Testing puppetrun from your monitoring host
You should now try invoking puppetrun from the command line, as the nagios user. This will ensure the end-to-end communication is working and that your nagios server will be able to fire off puppet when it needs to. If you are running the puppet agent in debug mode on your apache server you should see it running through its configuration.
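On the Nagios server, the test invocation looks something like this (the hostname is illustrative):

```
sudo puppetrun --host web01.example.com
```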
When this isn’t working it’s often because the puppet certs aren’t issued to both hosts, or because the users running the commands don’t have access to the certs. Check both of these if you have trouble. In my installation I added an entry to the sudoers file so that nagios can invoke puppetrun as root and thereby have access to the certs.
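The sudoers entry for that would look something like this; the path to puppetrun is an assumption and should match your installation:

```
# Edit with visudo; let the nagios user run puppetrun as root
nagios ALL=(root) NOPASSWD: /usr/sbin/puppetrun
```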
The puppetrun invocation script
Once you are satisfied that puppet is communicating properly, you’ll need the last piece: the handle_puppetrun shell script. It goes into the Nagios plugins directory. Once it’s in place, make sure it is executable by the nagios user.
Nagios calls the script on all state changes, so the script looks for CRITICAL errors, and then for either a HARD failure or a third SOFT failure. In either case it calls puppetrun with the remote hostname, which causes puppet to run on that remote host.
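The original script isn’t reproduced here, but a minimal sketch of that logic follows. The argument order is an assumption and must match the macros passed on the `command_line` in your commands config; the `PUPPETRUN` variable is only there so the decision logic can be exercised without a real puppetrun binary:

```shell
#!/bin/sh
# handle_puppetrun -- sketch of a Nagios event handler that kicks puppet.
# Usage: handle_puppetrun <host> <state> <statetype> <attempt>
# PUPPETRUN is overridable so the logic can be tested without puppetrun.
PUPPETRUN="${PUPPETRUN:-sudo puppetrun}"

handle_event() {
    host="$1"; state="$2"; statetype="$3"; attempt="${4:-0}"
    case "$state" in
    CRITICAL)
        # Fire on a HARD failure, or on the third SOFT failure.
        if [ "$statetype" = "HARD" ] || [ "$attempt" -ge 3 ]; then
            $PUPPETRUN --host "$host"
        fi
        ;;
    esac
}

handle_event "$@"
```

Because the handler fires on every state change, the `case` deliberately ignores OK, WARNING, and UNKNOWN transitions so puppet is only kicked when a service has actually gone down.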
And this is how it works once it is deployed.
Here’s our nagios instance happily monitoring our network.
And here I segfault the apache process
Nagios notices that apache is down and calls our handle_puppetrun script
Puppet gets invoked on our webserver, and restarts apache for us
And here Nagios has noticed that our webserver has recovered
And our network is happy again.
Other monitoring packages and CM tools
There’s no reason you couldn’t apply this to other network monitoring systems or configuration management tools. I originally had this running under cfengine, and you could use chef as well. Other monitoring systems also support event handlers, just as nagios does.
To find the examples and the original O’Reilly article, visit my site for this talk at this URL.
To wrap up, I’ve shown a setup where your network can self-correct for its most common failures. We’ve used tools that many of you are already using, tied together in a novel way. And I’m hopeful I’ve freed you up from some firefighting, so you’ll sleep better at night and be more productive during the day.