Web Sites Gone Wild

by Michael Johnston

I call this blog The Well Run Site because I tend to write about a subject near and dear to my heart: how to run web sites properly, by sweating the small stuff and stressing the importance of testing and monitoring, among other things. I’ve even published a prioritized list of things you should look after.

I’m no hypocrite, and I always do my best to practice what I preach. But shit happens.

I don’t post here as often as I should, mostly because I’m so busy handling tasks for my clients. I suppose that’s a good thing: with so many people out of work, I’m thankful there is a demand for what I do. The downside is that The Well Run Site doesn’t get as much attention as it should.

This afternoon, I discovered a problem with this blog that meant for the past several weeks I’ve been serving up readable but very ugly web pages, all because of a plugin conflict that generated mangled CSS. The problem was entirely my fault: I installed a new plugin and didn’t completely test the site afterwards. I should know better, and the irony of being caught not doing something on my own site that I fervently advocate to others is embarrassing.

This brings to mind a problem I had about six months ago with a client’s site. Not long after a redesign that was, in part, created to improve that site’s SEO, I began to notice a very dramatic decline in its search traffic. ‘Dramatic’ actually understates the scope of the problem: after a little digging in Google Analytics and testing search results, I realized that the site had been virtually erased from Google’s index.

My first thought was that we’d been blacklisted for some unknown reason, perhaps for serving up malware or who knows what else. After spending the better part of a day checking everything I could think of and finding nothing out of the ordinary, I looked at the site’s robots.txt file and found this:

User-agent: *
Disallow: /

That’s a doozy. The robots.txt file tells search engine spiders which parts of a site they may and may not crawl. The directives in this example say, “No spider should crawl any page on this site.” Ouch! Though not every spider honors the directives in robots.txt, Google certainly does, and with nothing left that it was allowed to crawl, that site’s pages dropped out of its index. That explained why search traffic had disappeared.
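To see how a well-behaved spider reads those two lines, here is a minimal sketch using the robots.txt parser in Python's standard library. The helper name `is_crawlable` and the example.com URL are my own placeholders, and Python's parser only approximates what Google's crawler actually does, so treat this as an illustration rather than a definitive test.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots_txt lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the text directly
    instead of downloading it from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The file found on the client's site: every path is off limits to every spider.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# An empty Disallow value blocks nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

if __name__ == "__main__":
    # example.com stands in for the real site.
    print(is_crawlable(BLOCK_ALL, "https://example.com/"))
    print(is_crawlable(ALLOW_ALL, "https://example.com/"))
```

Run against the live robots.txt right after every deployment, a check along these lines would have flagged the problem the same day instead of weeks later.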

But knowing what this would do, who in their right mind would add that file? I certainly hadn’t put it there. Given my long background in system administration and general security matters, my first impulse was to assume that the site had been breached and that the attacker had replaced the default robots.txt file with the malicious one shown here. But before assuming the worst, I checked the modification date of the file and found it was identical to the date we’d rolled out the aforementioned SEO-related changes. That was too coincidental to ignore.

I called the developer who’d worked on the project with me, explained the situation to him and asked if he had any idea how that file had gotten there. After a long pause, he replied (with pain in his voice), “That was my fault. I didn’t want the pages on the development server to get indexed so I added the file. I forgot about it when we copied the source to the live site.”

Mystery solved. It wasn’t malicious; it was simply an oversight. And there was nothing I could say to the guilty party that would make him feel worse than he already did upon learning of his own mistake. I replaced the bad robots.txt file with the correct version and within days search traffic returned. For a small error, though, the consequences had been pretty large: thousands of search engine visitors had been lost.

Over the years I’ve seen lots of things like this happen, things that are usually the result of careless mistakes. No one sets out to make mistakes like this; they just happen. There are quite literally thousands of things you need to remember if you want to have a site where everything is done according to best practices and nothing is left to chance. No one person can remember them all. No one. Even large teams sometimes miss things, like forgetting to renew a domain name or SSL certificate. At my last startup, I put together a monitoring server that made tens of thousands of quantitative and qualitative checks on our network every hour, and still things would occasionally slip through our web of detection.
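As one small example of the kind of check a monitoring setup might run, here is a sketch that warns when an SSL certificate is close to expiring. It is not the code from that monitoring server. The function names and the 30-day threshold are assumptions of mine, and the date string is in the format Python reports under `notAfter` in `ssl.SSLSocket.getpeercert()`.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's notAfter time.

    not_after looks like "Jan  1 00:00:00 2030 GMT". now is seconds since
    the epoch and defaults to the current time (hypothetical helper).
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires - now) // SECONDS_PER_DAY)


def needs_renewal(not_after, warn_days=30, now=None):
    """True when fewer than warn_days remain; 30 is an arbitrary default."""
    return days_until_expiry(not_after, now) < warn_days


if __name__ == "__main__":
    # Made-up expiry date for illustration only.
    print(needs_renewal("Jan  1 00:00:00 2030 GMT"))
```

Passing `now` explicitly keeps the check easy to test without waiting for a real certificate to age.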

The problem is vexing and widespread. I’m sure I don’t know anyone who hasn’t had something similar happen to them. I’ve been thinking quite a lot about this subject recently and I’ve come to some conclusions about how to address the issue. In a few weeks, I’ll post on this subject again.

  • Kminks

    Pretty funny ending. These days everyone expects a saboteur

