It is often remarked that backups are the most important thing many site managers know they should do but don’t. There’s another important task that gets less attention and probably deserves more: site monitoring. Without it, you have no reliable means of knowing when your site suffers an outage, even one lasting only a few minutes, during which you lose traffic or, worse, revenue. With monitoring you can know about the issue and, more importantly, respond to it immediately.
I would speculate that, unfortunately, more than 95% of site managers learn of outages from their users or stumble upon them when they try to access their own site. It goes without saying that this is not the way you want to run things, especially if that site happens to be your primary representation on the web. If you run a blog, monitoring is important; if you run a burgeoning e-commerce site, it’s an absolute necessity.
Types of Monitoring
When it comes to site monitoring, there are several different classes of checks that can be performed:
- Service-level – These checks monitor each of the important services your site is supposed to offer, such as HTTP, FTP or email (SMTP). They work by connecting to the appropriate port to determine whether the service is indeed running. This answers a simple up/down question, but the check may also gather some quantitative information, such as how long it took to establish the connection. Service-level checks are always external checks: probes, from outside the server, of the things that need to be running to fully support end-users. They won’t tell you why something is down. However, most sites will find service-level checks to be enough.
- System-level – These are the digital equivalent of looking under the hood: checks that examine all aspects of the physical (or virtual) server’s functionality, from the basic health of such things as memory, disk and bandwidth, to the status of various devices, such as RAID arrays. Unlike service-level checks, system-level checks can help you detect conditions such as low memory or low disk space before they become problems.
- Network-level – These tests examine the network’s routers and switches. You probably shouldn’t worry about these unless you own the networking equipment and that equipment is capable of being monitored, typically via the Simple Network Management Protocol (SNMP). If you’re on rented servers, chances are you won’t be able to perform these checks. However, a good hosting facility will perform them itself.
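To make the service-level idea concrete, here is a minimal sketch in Python of the kind of probe described above: open a TCP connection to a port, report up or down, and time how long the connection took. The function name `check_service` and its `timeout` parameter are my own illustration, not something taken from any particular monitoring product, and a real service would run this from several outside locations rather than from one machine.

```python
import socket
import time


def check_service(host, port, timeout=5.0):
    """Attempt one TCP connection; return (is_up, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Refused, timed out, unreachable or unresolvable: all count as "down".
        # The probe cannot say why, only that the connection failed.
        return False, time.monotonic() - start
```

Calling `check_service("www.example.com", 80)` (a placeholder host) would answer the up/down question for HTTP; ports 25 and 21 would do the same for SMTP and FTP. Note that a successful connection only shows that something is listening on the port, not that the service behind it is answering correctly.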
Monitoring Options
There are two ways to approach the monitoring problem. The first is to operate your own monitoring infrastructure using your own servers and one of the many commercial and non-commercial solutions, such as Nagios. But unless you have numerous servers, I don’t recommend this, for two reasons. First, the learning curve associated with setting up and testing the monitoring software can be considerable. (Nagios, in particular, can be somewhat intimidating at first; of course, if you have a larger infrastructure, it is entirely worth the trouble and I highly recommend it.) Second, since it makes little sense to install the monitoring software on the very server being monitored, an extra machine will be required. (Actually, in theory it means TWO machines, because you need to be able to monitor the machine that’s doing the monitoring. Otherwise, if the monitoring server crashed, you wouldn’t know about it.) This level of commitment means that most sites, which have but one server, will end up at least doubling their infrastructure just for the sake of monitoring. Most would find this an impractical solution.
The alternative to running your own monitoring infrastructure is to rely on one of the many 3rd-party services that have popped up in recent years. For a small monthly fee, you’ll get redundant monitoring servers, tests from multiple locations around the world, short setup times, SMS alerts and so forth.
Examples of site monitoring services include Pingdom, Watchmouse, Site24x7, InternetSeer, mon.itor.us and many more. All of them are what I would categorize as drive-by monitoring services because they perform only service-level checks. This is both good and bad: good, because service-level checks require no software installation on the server being monitored, which makes for quick setup; bad, because if you’re hoping to get some of the proactive benefits of system-level monitoring you’re out of luck. Still, they’re adequate for most purposes and are unquestionably better than no monitoring at all.
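For contrast, a system-level check has to run on the server itself, which is exactly why these services can’t offer it without installing software. A minimal sketch, assuming Python is available on the server: the function name `disk_check` and the 10% threshold are hypothetical choices of mine, and a real agent would also watch memory, RAID status and so on.

```python
import shutil


def disk_check(path="/", min_free_fraction=0.10):
    """Return (ok, free_fraction) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (bytes)
    free_fraction = usage.free / usage.total
    # ok is False once free space falls below the chosen threshold.
    return free_fraction >= min_free_fraction, free_fraction
```

Run on a schedule, a check like this can raise a warning while the disk is merely getting full, rather than after the site has already gone down.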
There is yet another, higher class of monitoring services, which specialize in performing both service-level and the more rigorous system-level checks. The costs for these services are ordinarily significantly higher than for purely service-level monitoring. You can find them by searching Google with the keywords, ‘system monitoring.’ If you have more than a handful of servers these services are probably worth looking into.
Whether you choose to install your own monitoring infrastructure or use one of the services I’ve discussed, the important thing to remember is that monitoring is of the utmost importance; any of the choices I’ve outlined is better than no monitoring at all. The few minutes it takes to set one up is time very well spent.
