“If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.” – Weinberg’s 2nd Law
If this was Weinberg’s 2nd law, his first must have been a real humdinger.
The Cause of WordPress.Com's Misery
Weinberg was a prophet, albeit one of uncertain origins. It isn’t clear when this famous adage was first coined or if Weinberg was even a real person. I’ve been quoting his (her?) words for at least 20 years, so they must certainly be older than that. Googling the quote turns up lots of references but no clear biographical information or dating. It has been a source of amusement for many since it first appeared, but it’s also a warning to those who, when they stop chuckling, wisely ponder its deeper meaning. I have, and whenever I’ve experienced a calamitous failure of trivial origin, there is Weinberg, softly whispering in my ear, “Dipshit!” I’ve heard the words and grasped the point, but, like everyone else in the industry, I haven’t always heeded the call for change.
It seems certain that this adage dates to a time when software was, more or less, the entire system, before systems had reached the multi-layered level of complexity that is now the norm. Computers have always been complicated, but in Weinberg’s time, there was probably only machine failure and software failure. Networks, routers, switches, clustered web servers, load balancers, multi-master replicated database servers, and all the other components that constitute a modern web site didn’t even exist. Weinberg blamed programmers because, then at least, they were likely the only ones to blame. Now, our problems are bigger and there are more potential culprits.
So, Weinberg’s law, while remaining fundamentally true, needs some revision. Programmers took the brunt in his era, but today the fault could just as easily lie with the database administrator who inadvertently drops the wrong table, the system administrator who power-cycles the wrong server or a system architect who failed to foresee a critical design weakness that brought an otherwise finely crafted system to its knees, like an orchestra with half its instruments out of tune and a conductor who has lost control of what is being played.
Weinberg’s 2nd Law was very much on my mind yesterday as I recounted two small errors that had major real-world consequences for me. Then, last night, WordPress.Com went down, taking millions of blogs with it. For about an hour, any site hosted on their network was inaccessible, from the smallest blog no one ever reads all the way up to some very big names, like GigaOm and parts of CNN. After fixing the problem, Matt Mullenweg, founder of WordPress, responded:
The cause of the outage was a very unfortunate code change that overwrote some key options in the options table for a number of blogs. We brought the site down to prevent damage and have been bringing blogs back after we’ve verified that they’re 100% okay.
Decoding the phrase, “a very unfortunate code change,” I deduce that one or more people at WordPress may be clearing out their desks today.
Matt, now that your phone has stopped ringing and things are a little quieter, do you hear that voice? The one softly repeating, “Dipshit, dipshit….”? It’s Weinberg calling. Again.
The true nature of this outage may never be known. I’m sure WordPress wouldn’t want to suffer more embarrassment in front of their customers and investors by revealing that best practices were not followed, that this epic fail might never have happened if the ‘unfortunate code change’ had been properly tested in the first place. While insignificant web sites run by one person might get a free pass in situations like this, a large organization responsible for a big chunk of the web’s blogs can’t credibly claim lack of resources or ignorance of process.
WordPress.Com undoubtedly has tons of redundancy and monitoring in place to spot problems and reduce the possibility of failure, as any large-scale modern web infrastructure must. The point of all this is to construct a system that has no single points of failure, one where, no matter what you unplug, intentionally or otherwise, the system continues to run. But if you peel away the layers of finely crafted computing armor to peek at the system’s core – its software and databases – you will easily find its most critical weaknesses, where single points of failure abound and where one false move causes the intricate structure to completely collapse. One unfortunate code change brings the whole house of cards crashing down.
WordPress.Com is not unique in this respect. Testing and procedural questions aside, their systems are undoubtedly designed with the same weaknesses that exist in every other system. This particular failure is what has exposed them to scrutiny, but similar high-profile failures of other large-scale services (Google, Amazon, et al.) merely illustrate that no one, not even the most adept technology organizations on the planet, is or can be immune from such a fundamental weakness. Sooner or later, Weinberg and his cousin Murphy will arrive to humble every system’s builders.
Eliminating this intractable problem forever will involve a fundamental change in how systems are designed and built, for the woodpecker didn’t just destroy the system, he also built it.
We have met the woodpecker and he is us.
Postscript: A lot has been written about this outage, much of it by irate bloggers whose sites were affected and with whom I can sympathize. Of all the headlines I’ve read, perhaps the best was by TechCrunch: “WordPress Gives Us The VIP Treatment, Goes Down On Us Again.” But as good as it may be, it can’t hold a candle to the mother of all cheeky headlines, written by The New York Post in 1983: “Headless Body in Topless Bar.”