Category: Software Quality

  • How Effective are Software Factories?

    Software factories are truly excellent. They are highly reliable, with an error rate near zero. But here's the catch: software factories may be something different from what you think they are.

    What Are Factories?

    We all know what factories are.

    [Image: P-39 Airacobra assembly line]
    A factory is one of those big plants where parts and assemblies go in one end, and through a series of steps, get turned into finished goods.

    Factories have played a major role in creating the modern world, by magnifying the effort of humans with machines and power.

    What are Factories Conceptually?

    The purpose of a factory is to produce identical copies of a thing you already have and know how to build. Henry Ford's factory didn't produce the very first Model T — his design engineers did that.

    [Image: 1919 Ford Model T Highboy Coupe]
    Then they created a factory to churn out lots of copies of the original Model T. Everyone credits the Ford factory with producing cars at low cost. All factories accomplish their core function of producing copies at low cost by replacing labor with machines, and by reducing the amount and the skill of the remaining labor.

    In addition, huge amounts of effort have been poured into factory cost-effectiveness and quality. Supply-chain optimization (how to most effectively get the inputs to a factory) is well-understood at this point. Similarly, both the theory and practice of factory output optimization is highly advanced. Methods for assuring consistent high quality have also been developed, starting for example with statistical process control.

    The Dream of the Software Factory

    We all know that software departments aren't very good at churning out great software. Wouldn't it be great if we could build a factory for software — a factory that spits out great, high-quality software, on time and on budget, just like all the other factories?

    No need to dream — it's been done! There are big companies behind these factories, there are lots of books about how to build them, scary books about how it's being done better in Japan, everything you'd want.

    Of course, when you look more closely, it's all hype. Most software isn't built in that kind of software factory; if it were, they'd have long since taken over software development, and no such thing has taken place.

    The Reality of the Software Factory

    Fortunately, there really is a software factory. It's effective, efficient, and its quality is so near to 100% that it's not worth measuring. Furthermore, software factories are widely used — they're so much a part of programming life that no one thinks much about them.

    One of the most widely used software factories is the cp utility. This amazing utility does exactly what a factory is supposed to do — it makes you an exact copy of the thing you want to have. Amazing! This factory is also highly adaptable. If you change what it is you want to churn out, it can still make a copy of it.

    I wish the cp utility worked for cars. If it did, I could point it at my car and it would give me an identical copy. I could then make changes to my car, sic ol' cp on it again, and shazaam — a copy of the modified car! Cool! Sadly, cp doesn't work on cars (yet) — it only works on software — though let's not forget it also works on data, which isn't too shabby…
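    In Python terms, the whole "software factory" fits in a few lines of standard library. Here's a minimal sketch (the function name is mine, chosen to make the point):

```python
import shutil

def software_factory(original_path: str, copy_path: str) -> str:
    """Do exactly what a factory does: produce an exact copy of an
    artifact that already exists and has already been designed."""
    shutil.copy2(original_path, copy_path)  # copies contents and metadata
    return copy_path
```

    Change the original all you like; the "factory" still makes a perfect copy of whatever it's pointed at — which is precisely the adaptability described above.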

    Factories and Software

    I can only hope that people keep coming back to comparing the process of building software to a factory because computer software is so consistently bad. The motivation must be great indeed for so many people to fall for such an obviously flawed metaphor.

    After all, factories are for building copies of things that are fully known and understood — like Model T cars — things that have already gone through an extensive design and prototyping process. Ooohh — I like that — let's make lots of them! Kind of like the iPhone, right? Foxconn didn't get a call from Steve Jobs asking them to start cranking out iPhones until the Apple design engineers had already designed and built them. That widely-reported last-minute switch of the glass surface of the phone happened after Steve played with the prototype and didn't like it — before the factory started doing its thing.

    So repeat after me: designing is what you do to create the first copy of something; if you like it, a factory is used to crank out copies.

    If you like a piece of software, cp (or the relevant copy utility) is all you need to make a copy of it! The only reason software engineers get involved is if you want something different. For which a "software factory" as properly understood is simply not relevant.

    Conclusion

    There's good news: Software factories exist! They are universally used in the software community! They work: they work consistently; they work quickly; they work flawlessly. Be happy.

  • Field-Tested Software

    When you're at war, your software needs to work — not in the lab, but in reality. In the field! You don't have time to test your software in the lab, and you don't care whether it works in the lab. You need field-tested software. Software that works — in the field — where you need it to work.

    Normal Software QA

    Normal software QA pays lots of attention to the process of defining, building and deploying software. You hear phrases like "Do it once and do it right;" "quality by design;" "we don't just test, quality is part of our process." There are lots of them. They all, one way or another, promote the illusion that mistakes can somehow be avoided, and that we can — finally — have a software release that works, and works the way it's supposed to. This time we're going to take the time, spend the money, and do it right!

    How did that work out for you? Most often, it's like predictions of the end of the world. The date comes, the world is still here, and people try to avoid talking about it. Similarly with that great this-time-we're-doing-it-right release, the release comes, there are roughly the usual problems, and people try to avoid talking about it. Or there were fewer problems, but the expense and time were astronomical. Or there were fewer problems, but not much got released. Whatever.

    Here are some favorite phrases: "It worked in the lab!" "How could we have anticipated that case?" "The test database just wasn't realistic enough." "Joey So-and-So really let us down on test coverage." "We had the budget to do it right, but not enough time." "We didn't have enough [tools] [training] [experienced people] [cooperation from X] [lab equipment]…" Excuses, every one. Perhaps there's a fundamental reason why we always fail?

    This is a subject your CTO and your chief architect should stop ignoring and pay serious attention to. It isn't the only subject, but it sure should be #1 on the list.

    QA Should be Field-Based

    Who cares how the software operates anywhere except in production?? The lab environment is always different from the production environment. And the most embarrassing problems are the ones where it worked in the lab but failed when deployed. The number of potential causes is endless: different machines; different loads; different network delays; different database contents; different user behavior; different practically-anything!

    Given this, why wouldn't you test your software on the actual machines it will run on when it's put into production? Of course, you don't load-balance normal user traffic to the test build as though everything were hunky-dory. That's just asking for trouble. But it's not hard to send a copy of the traffic to the test machine. That alone tells you huge amounts. Did it crap out with a normal load? Now there's a real smoke test. And there's lots more you can do as well.

    Conclusion

    Your customers don't care how your software worked in the lab. They only care how it works for them. Yes, that's awfully self-centered, but that's just how they are, and no one is likely to talk them out of it. So live with it, and shift from pointless lab testing and back-office quality methods to actual field-testing of your software. Yes, it's messy, dirty and uncontrolled — but it's real life! It's where your software has to run! Better it should get used to it sooner rather than later.


  • Why Software Quality Stinks

    We know our users want high quality software. We know our software, in general, stinks.  What’s our problem?

    We claim to value quality

    [Image: Dilbert comic strip]
    We say we value quality highly. Who will admit to wanting poor quality?

    But we value almost everything else more

    [Image: Dilbert comic strip]
    Who would design a lawn mower without a “dead-man’s” switch? It’s the kind of thing where you have to keep pressing it to keep the lawn mower running; if anything goes wrong, the mower stops and nothing bad is likely to happen. But when software is involved … watch out!
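    The dead-man's switch logic is simple enough to sketch in a few lines of Python — the machine may run only while the "all is well" signal stays fresh (the class name and timeout are illustrative):

```python
import time

class DeadMansSwitch:
    """Allow the machine to run only while the operator keeps signaling."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_ok = time.monotonic()

    def press(self):
        """The operator's keep-alive signal."""
        self.last_ok = time.monotonic()

    def may_run(self) -> bool:
        # If the signal has gone stale, fail safe: stop the machine.
        return (time.monotonic() - self.last_ok) <= self.timeout_s
```

    The key property is that silence means stop — the safe outcome requires no action at all.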

    There was a case this year of a person being crushed by an elevator.

    [Image: elevator accident news photo]
    The elevator, of course, was controlled by software. The software was written to respond to someone pressing a button to go to a floor, but not to make sure all systems were “go” before sending the signal to start moving. So, while the person was entering the elevator, the bad software started the “go to a floor sequence” without proper checking, and before anyone could react …

    [Image: body wheeled away]
    The stories at the time blamed the workers for failing to follow procedures. They probably didn’t follow them. But why would software be written to even make this outcome possible???
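    The missing safeguard is an interlock check: refuse to move unless every system reports "go." A minimal sketch, with made-up sensor names:

```python
def may_start_motion(door_closed: bool, doorway_clear: bool, brakes_ok: bool) -> bool:
    """Allow the go-to-a-floor sequence only if every interlock says 'go'.

    Any doubt, any failed check, means the car stays put."""
    interlocks = [door_closed, doorway_clear, brakes_ok]
    return all(interlocks)  # one False anywhere and nothing moves
```

    The point isn't the three particular checks (those are assumptions); it's that motion is gated on the checks at all, which is what the elevator software evidently failed to do.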

    Why does bad software happen?

    Poor quality doesn’t need to be achieved – poor quality happens all on its own, without our having to do anything to make it happen! That’s why it’s so widespread! It’s just like when you’re writing natural language. Who thinks that the text of Twelfth Night that we read today was Shakespeare’s first draft?

    Beyond the fact that high quality is something that must be achieved, here are the most basic reasons it is so rare:

    • High Quality is unrewarding

    The typical incentive structure for quality practically guarantees we won’t get much of it. High-quality software doesn’t get kudos – it’s expected. You may get slapped if the quality is stupendously bad. When it’s adequate or better? Yawn.

    • Our focus is wrong

    There is a long tradition of attempts to improve quality in software. Think test-driven development. Think about all the people preaching how “quality” is different than (and better than) “mere testing.” Decades of effort have resulted in little to no progress, except for spending more time and money.  Chief among the errors in focus is the way we put our effort into assuring the quality of the new stuff we’re building instead of what customers mostly care about, which is whether the old stuff continues to work.

    • Bad quality is a side-effect of poor development methods

    If you are using project-management-style techniques to build software, achieving high quality is nearly impossible. At least if you’re using “grow the baby” techniques, it’s possible to maintain a reasonable level of quality.

    Conclusion

    Most software stinks. We know it. We give lip service to quality, but little more. When we apply standard methods to achieve higher quality, we are rarely rewarded for our efforts. QA is one of the lowest status jobs in software, and most likely to be cut when there is pressure of any kind.

    Given the situation, we get exactly the quality we should expect.

  • Internet Software Quality Horror Shows

    Whether the software is a cool social app, an academic website or a real business, there is a common theme: the software is poorly designed and, even worse, it just breaks. As in falls flat on the floor, waves its arms in surrender, and just gives up. And not just once — it keeps breaking! As I've said before, we really need a revolution in software quality.

    Cool Social Apps

    Hey, social is where it's at — how can billions of Facebook users be wrong? Before long, there will be as many FB users as McDonald's has sold hamburgers (billions and billions)!

    Those guys must be great programmers, huh? I mean, just look at their office:

    [Image: Facebook office]

    Here's one of them giving a talk at a conference:

    [Image: Facebook programmer speaking at a conference]

    See how cool he is? He's just wearing a t-shirt, not even "business casual."

    The other social media are just as cool. Here's a "chill" Twitter office:

    [Image: Twitter office space]

    And Jack Dorsey, the Twitter CEO — quite the opposite of a buttoned-down financial guy, huh?

    [Image: Jack Dorsey]

    It's perfectly obvious that these guys must write just the coolest, most awesome code ever. There's no way people this cool could make elementary programming mistakes, particularly when their application is so very dead-simple, and hardly ever changes — they could spend practically all their time being cool and polish up some already-faultless code a couple times a day, and still be OK.

    Except this little detail, which I scraped from my own screen, and which I personally have seen countless times:

    [Image: Twitter fail whale]
    Yes, the famous Twitter fail whale. I think Twitter got tired of all the publicity their "cute" failure message was getting them, so they reverted to something more discreet; here's an example:

    [Image: Twitter overload message]

    FB is just as bad, of course, and they've always tried to minimize the message when they screw up:

    [Image: Facebook "no more posts" message]
    Apparently, FB is incapable of keeping even the most recent day's worth of updates on-line — you should try going back in history and seeing how far you get. Oh, you thought the stuff you wrote was your data, did you?

    Naturally, it makes sense to consider that you get what you pay for; all these cool social apps are, after all, free. You can hardly complain when something you didn't pay for is flakey — return it and demand a full refund!

    So let's turn to a more promising field. Everybody's supposed to go to college and learn stuff, so…

    Academia

    Let's see if the universities do any better. I was just on a local college's website, and it was even worse than Twitter — Twitter's code knew it was screwing up and put up the fail whale. In this case, any number of links I hit encountered badly broken code:

    [Image: Bergen error page]
    Oh, alright. The colleges are perpetually underfunded, and putting up a website that works isn't a high priority compared to … all the other things they spend money on. I guess.

    Probably a real business does it better, right?

    Profit-making Big Company

    Even more so, an essential public service, like the cable company! Those guys have the money, the funding, the experience and the mandate to do it right. Let's pick the case where their motivation is the highest: collecting money.

    Oops.

    Just a few days ago, I was on my local cable provider's site trying to access my account. Here's what I got:

    [Image: TW error screen]

    Not just once, but repeatedly, for hours!

    But maybe it's just TW that's got problems — surely all the other big companies do things great, with their huge staffs and policies and procedures and all, right?

    Sadly, no. Here's just one personal example from Verizon:

    [Image: Verizon login error]

    Summary

    There's no getting around it. Software is just bad. Everywhere. We can speculate about why this is the case, but let's agree on the facts: it's bad, and not getting better.

  • Software Quality: Theory and Reality

    The theory of software quality makes my head hurt; the reality makes me want to cry.

    There is a great deal of material written about software quality. It's a HUGE subject. It's also a diverse subject with lots of experts and lots to study. There is one simple reason for this: Software quality is a horrible %^*%^* mess, and it's not getting better!!!

    Software Quality Theory Makes My Head Hurt

    Just scan through the Wikipedia article on the subject and your head will probably hurt too.

    I particularly like this alert at the top of the Software Quality Factors section:

    This section needs attention from an expert on the subject. See the talk page for details. WikiProject Software or the Software Portal may be able to help recruit an expert. (September 2008)

    Note that they've been seeking this expert for nearly three years!

    Big government agencies have whole organizations devoted to the subject. For example, there's DACS, the Department of Defense (DoD) Information Analysis Center (IAC). What does DACS do? Read this (warning: reading this may make your head hurt):

    Designated as the DoD Software Information Clearinghouse, specifically aimed to serve as an authoritative source for state-of-the-art software information providing technical support for the software community, the DACS offers a wide variety of technical services and supports the development, testing, validation, and transitioning of software engineering technology to the defense community, industry, and academia. DACS subject areas encompass the entire software life cycle and include software engineering methods, practices, tools, standards, and acquisition management. Also included are programming environments and language techniques, software failures, test methodologies, software quality metrics and measurements, software reliability, software safety, cost estimation and modeling, standards and guides for software development and maintenance, and software technology for research, development, and training.

    I could go on and on, but my head hurts, so I'll stop.

    Software Quality Reality Makes Me Want to Cry

    With all these impressive-sounding things, books, conferences, experts, criteria, methods and certifications, software quality should be totally nailed, right? To the contrary: something is nailed when … people stop talking about it! Take the disease smallpox, for example. It's nailed! There aren't theories, experts, or much of anything beyond historical references and scare-talk about potential re-emergence.

    This is one of the better summaries of the reality of software quality that I've seen; ironically, it's from a zombie website for obsolete software written for a long-obsolete machine, that is/had been(?) run by a couple people from some little island in the Caribbean.

    [Image: screenshot of the summary]


  • Software Quality Horror Tales: Electronic Diversity Visas

    The State Department of the US has inflicted unimaginable pain and suffering on tens of thousands of people throughout the world through their electronic Diversity Immigrant Visa program. It's a highly visible and public example of what's wrong with software development in general, and software quality in particular. Sadly, it's no different in principle from countless numbers of other projects, doomed from inception by inappropriate standards and techniques.

    The Facts

    The Diversity Lottery Program is just that — a lottery. More than ten million non-US-citizens worldwide apply for tens of thousands of slots that can lead to US citizenship. Only this time, after notifying the "winners" — many of whom started spending money they didn't have in order to comply with the requirements to complete the process — the State Department cancelled the lottery and invalidated the results. Why? A bug in the computer program that chose the winners.

    The Human Consequences

    From the Wall Street Journal:

    Ever since he traveled from his home near Yaounde, Cameroon, on a scholarship to Michigan State University in 2009, Dieudonné Kuate dreamed of immigrating to the United States.

    As a visiting graduate student in epidemiology, he marveled at the sophistication of the chemistry labs and the excellence of the teaching. There was no comparison to his university in Yaounde, where he shared a cramped 27-square-foot room with three other students.

    One of eight children, Mr. Kuate grew up on a poor farm in the western plateau town of Banjoun. His parents couldn't read or write. Mr. Kuate is the only child in his family to complete university. "My dreams have been to be a top researcher in my field of specialty. The only place I see these goals being realized is the United States," says the 31-year-old Mr. Kuate, who returned to Yaounde last year and finished his Ph.D. in chemistry.

    For the past six years, Mr. Kuate has applied for the State Department's annual green-card lottery, and, like 15 million other people, he applied again this year. The 20-year-old program offers about 50,000 people a year a chance to win permanent residence in the U.S.—and a ticket to the American Dream.

    Denied six times, Mr. Kuate finally saw his number come up on May 1.

    "There is no English word to express my happiness when I discovered that I was selected," said Mr. Kuate, whose first name means "God given."

    [Image: photo by Emmanuel Tumanjong for The Wall Street Journal]

    Within days, his older brother sold family land in Banjoun for around $4,400 for Mr. Kuate to use for application fees, medical examinations and to start a new life in the U.S., he said. His mother believed God had intervened: "According to her, I was going to travel to the white man's country and see how to help other family members who have not gone far in book work," he said.

    But on May 13, those hopes were abruptly dashed. After logging on to the State Department website, Mr. Kuate said, "I saw a message saying the lottery had been canceled."

    Mr. Kuate was among 22,000 people around the world mistakenly informed last month that they had won the lottery. There had been a technical glitch and the lottery would have to be held again, the State Department said, explaining that a computer had selected 90% of the winners from the first two days of the application window instead of the full 30-day registration period.

    The software

    It's pretty clear that this is one of the more trivial programming jobs on the planet. I shudder to think how much it cost to build, how long it took, and the whole environment that was created that made it (I'm sorry to say) likely that a horrific bug like this would be inflicted on so many innocent people.

    Since I have no access to the code or project documents, I will comment on a couple of things that are publicly available.

    Take a look at the Department's page that announces the status of the 2012 lottery. Play around with it a bit, as I did.

    Did you want to find more information? Did you take advantage of the kind offer to provide more information:

    More information is available on our website:

    http://dvlottery.state.gov

    Perhaps, then, you noticed that the link leads you to the same page you're already reading!! Kafka couldn't have done it better. No doubt this was the careful work of the Division of Self-referential self-reference of the Department of Redundancy Department.

    Did you take note of the fact that all entries were submitted electronically between October 5, 2010 and November 3, 2010? Which implies that starting on November 4, 2010, they had all their input data? All they had to do was run the lottery program a couple times on the input, run some checks to make sure it was working properly on the new data set, then run it "for real" and publish the results. To be generous, this should have taken about a day. OK, it's the government, we'll give them a week. Really? Geeez…alright, a month! No. NO WAY. $%&$%^& SIX MONTHS!!!!??? ^&(^^&* MORE THAN SIX MONTHS???!!!

    With that much time, this should have been the most proven-to-be-perfect program in history. PhD students should have been able to break new ground in proving the certainty of correctness of this program. It should have been possible to run it a number of times that compares favorably to the number of grains of sand on all the beaches on planet earth.
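    For the record, the unbiased draw this program was supposed to perform is close to a one-liner with Python's standard library. A sketch (the entry format is assumed):

```python
import random

def draw_winners(entries, n_winners, seed=None):
    """Select winners uniformly at random from ALL entries, no matter
    when in the 30-day registration window each entry was submitted."""
    rng = random.Random(seed)
    return rng.sample(entries, n_winners)  # sampling without replacement
```

    The sanity check is just as easy: tag each entry with its submission day, draw, and verify that winners from days 1–2 make up roughly 2/30 of the total — exactly the check that the reported bug (90% of winners drawn from the first two days) would have failed instantly.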

    I love the fact that there's a transcript of a statement on the subject by "David Donahue, the Deputy Assistant Secretary of State for Visa Services." The statement location and date are unspecified. The date of posting is not given. The fact that he made a statement verbally rather than just talking with the public via the web site kind of implies that he's incapable of writing or typing. His "internet department" (or whatever) must be responsible for the web site. And it implies that he still has a job! For some reason, I find that one really annoying! I guess you can screw over incredible numbers of people on behalf of the US Government and suffer no personal consequences. It must be OK to do that!

    There's a lot more to be said about this fiasco, but I'm tired.

    Conclusion

    Software quality. We need a revolution! Stop the Horror! End the Terror!


  • Software Quality: Horror, Failure, Tragedy and Ineptitude

    Most software stinks. Too much software is a real horror show. Loads of software is discovered to be rotten to the core before it is widely inflicted on customers, leading to missed budgets, deadlines and crippled businesses. An amazing amount of software, already bad and late, is still inflicted on the world, sometimes creating true tragedies. The only things in software that are more riddled with stupidity and incompetence than designing and building it are the methods used for assuring its quality.

    We should be running around in panic, trying to strengthen the levees against the rising flood of software problems; instead, we have smugly promoted standards, careers and processes. In the face of repeated, costly failures, the standard prescription is to do more of the same, except spend more money and take more time.

    "Software quality" should be added to the cynical pantheon of "southern efficiency" and "northern charm," words that we know just don't belong together.

    I have spent more than 40 years in the software industry, and have made my share of mistakes. I'm no more perfect than anyone else. But through my personal experience of writing a great deal of code over decades, and now from my vantage point of getting an inside look at the work of many software groups, some unmistakable patterns have emerged.

    The key observation is that software development, as it is taught and practiced nearly everywhere, is not a collection of best practices that are constantly being refined and improved over time. It is a disorganized mish-mash of ideas badly adapted from other fields whose practices are simply inapplicable to software, and then patched and elaborated to make them sound good. This observation applies even more strongly to software quality.

    I accuse no one of bad intentions. It does, however, make me angry and sad that our field is so burdened with pre-scientific, faith-based dogma that doesn't appear to be getting better. We don't need evolution here. We need revolution.

    I hope to illustrate these claims in a series of posts. They deserve a cover like this one:

    [Image: Tales from the Crypt comic cover]

    Are better methods available? Yes. They. Are.

  • Why Computer Software is So Bad

    There are aspects of the theory and practice of computer software that drive me nuts.

    I look at a widely accepted theory or practice, and am aghast that the cancer spread and became part of the mainstream. I look at a genuinely good practice and am shocked at the way everyone lies and slithers to make it sound to the unsuspecting that whatever lousy thing they do is a shining example of a good thing. I look at a nearly-universal approach to problem solving that may have made sense many Moore's-Law generations ago, but has long since been rendered irrelevant by advances in hardware. And I look at proven, understandable techniques that improve productivity by whole-number factors that are spurned and/or ignored by most people in the field.

    How significant is this stuff I'm complaining about? A minimum of a factor of 10. Sometimes a great deal more. So it matters! This is theoretical, but it has real-world, practical implications. It's the only way, for example, that small companies can beat large, established ones that have software staffs 10 to 1,000 times larger.

    I think I understand some of the causes of this deplorable state of affairs. But that doesn't make me feel better. And it definitely doesn't cure the sick or empower the unproductive.

    A number of my private-distribution papers go into these subjects in considerable depth. But I thought it might be interesting to summarize a couple of the main observations here.

    Among the causes of bad software are:

    The blind and deaf leading the blind. In the majority of cases, the people in charge have little to no personal experience of creating software, and no interest in how it's done. This is about as sensible as having someone in charge of a baseball team who not only can't play baseball, but can't see onto the field where it's being played; all he can do is see the score at the end and hear reports from the players about how things are going. So everything turns into politics — convincing the blind, clueless boss that you're the great contributor, everyone else is a chump. The larger the organization, the more likely it is to be led by such a genius.

    Process over substance. The larger and older the organization, the more likely it is to elevate the importance of process, and the more elaborate and all-consuming that process is likely to be. The more process there is, the less code gets written, and the productivity of the innocent few who actually want to work gets ground down to the abysmal norm.

    Lawyers. Shoot the lawyers! Shoot them! They and their legal ways are a plague. When lawyers see a problem, they want to write a law or create a regulation that makes the problem go away. Except that instead of saying this in simple, results-oriented terms (e.g. "programmers should not be allowed to die unnecessarily while writing code"), they say it in terms of incredibly elaborate, micro-managing, this-is-how-thou-shalt-do-it regulations (e.g., "programmers should be offered a nutritional meal no later than 12 noon local time each day consisting of no less than…" and 538 similar instructions, constantly growing). If regulations of this kind actually led to good results it would be one thing. All I should have to say at this point is "credit card theft" and PCI.

    Fashion over function. Programmers are supposed to be nerds, aren't they? How can programmers let their decisions be made by "fashion," whatever that's supposed to be in the world of software? I refer you to the first point, about software teams being led by people who are clueless. These people want to seem smart in the eyes of their bosses and peers, who know even less than they do. So they make decisions that are guided by C-office fashion trends, which are usually laughably out of step with true optimal decisions.

    A preference for bad new ideas over good older ones. How do bad ideas catch on? Some company promotes them. Books are written. Appealing rhetoric is created and repeated. Suddenly a fashion trend is born. Someone can appear to be smart and modern just by advocating the cool new stuff and sounding smart, without actually knowing anything or doing anything. Everyone involved thinks they're helping advance the state of software, when in fact they're digging the hole of despair deeper. It's remarkably like when the dumbest guy in Scotland moves to England, thereby increasing the average intelligence of both places.

    A preference for bad old ideas over good newer ones. There aren't many good new ideas; mostly there are old good ideas that are still good. But because conditions change (like processors getting fast and memory getting cheaper), ideas emerge to respond intelligently to the new state of affairs (like this one), instead of blindly and stupidly continuing on as though nothing had changed. Like most people do.

    Wrong scope; usually too narrow. This is classic optimization over too narrow a range. Much bad software results from compartmentalism, or from simply failing to look at things through the eyes of the ultimate consumer. This is an incredibly common mistake. The trouble gets really bad when the practices that result from the little "silos of excellence" get elevated into industry "best practices." (Whenever I hear the phrase "best practice," it's really hard for me to avoid getting a serious look and pronouncing in deep tones "you folks had best practice a bit longer before you inflict yourselves on the world of paid, professional programming.")

    I'm sure there are many more causes of bad software than the ones I've listed here — we haven't even gotten into bad programmers in their innumerable incarnations! But every item on the list above thoroughly deserves inclusion, all the more because most of them are not listed by most people as causes of software malaise.

  • Single Point of Failure: Logical vs. Physical

    People who want to build a highly available computer system tend to focus on eliminating single points of failure. This is a good thing. But we tend to focus only on the physical layer. We don't even notice the single points of failure at the logical layer. Logical single points of failure are just as likely to result in catastrophe as physical ones, and it's high time we started paying attention to them!

    Why? Keeping your system up and running is the most important job of an organization's techies.

    Physical redundancy

    We eliminate single points of failure by having more than one of every component, and a structure that enables the system to keep running with a failed component, while allowing it to be repaired or replaced. For example, here is a typical redundant system diagram (credit Wikipedia).

    220px-Distribfaultredudance
    And here are instructions for replacing a hot-swap drive in a redundant design (credit IBM):

    Dw1hb008

    Logical vs. Physical

    We are familiar with the concept of logical and physical in computing. All of computing is built on layers and layers of logical structures. Frequently, we call something a "physical" layer which is actually just the next layer down the stack of logical layers; we call it "physical" only because it is "closer" to the actual physical layer than the layer we call "logical." A good example is in databases, where it is common to have a "physical" database design and a logical (higher-level) one; of course, calling it a "physical" design is a joke, since there's nothing physical about it.

    Keep eliminating physical single points of failure

    I am not arguing that physical redundancy isn't important or that we should stop eliminating physical single points of failure. When I see people running important computer systems that have single points of failure, I tend to wonder how often the people in charge were dropped on their heads onto concrete sidewalks as children, and how they manage to feed themselves.

    The principle is simple: if you have just one of a thing, and that thing breaks, you're screwed. For example, if you have your database on one physical machine and that machine breaks, no more database.

    What is a logical single point of failure?

    I think people don't pay attention to logical single points of failure because it just isn't something anyone talks about. It's not part of the discourse. Let's change that!

    A prime example of a logical single point of failure is a program. You create physical redundancy by running the program on several machines. Great. That covers machine (physical) failures. What about program (logical) failures? After all, program failures (i.e., bugs) hurt us far more often than machine failures. But somehow, we don't think of a program as a logical single point of failure. We think that the program has bugs, that our QA and testing weren't good enough, and that we should redouble our QA efforts. And somehow, miraculously, for the first time in history, create a 100% bug-free program. Ha, ha, ha, ha, ha, ha, ha.

    Suppose you have version 3.2 of your program running in your data center. If that program is running on more than one machine, you have eliminated the physical single point of failure. If version 3.2 is the only version of the program that's running in the data center, then version 3.2 is a logical single point of failure. The only way to eliminate it is to have another version of the program also running in the data center!
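    The idea can be made concrete with a small sketch (all the names here are hypothetical, not a real deployment API): a dispatcher keeps the previous known-good version running alongside the new one, so a bug in version 3.2 degrades a request instead of destroying the service.

```python
# Hypothetical sketch: two versions of the same service deployed side by side.
# If the current version (3.2) fails on a request, the previous known-good
# version (3.1) can still serve it, so version 3.2 is no longer a logical
# single point of failure. The handlers below are illustrative only.

def handle_v3_2(request):
    # the new code path, which may contain a fresh bug
    if request.get("payload") is None:
        raise ValueError("v3.2 regression: missing payload not handled")
    return {"version": "3.2", "result": request["payload"].upper()}

def handle_v3_1(request):
    # older, battle-tested code path kept running as a logical backup
    payload = request.get("payload") or ""
    return {"version": "3.1", "result": payload.upper()}

def dispatch(request):
    """Try the new version first; fall back to the old one on failure."""
    try:
        return handle_v3_2(request)
    except Exception:
        return handle_v3_1(request)

print(dispatch({"payload": "hello"}))   # served by 3.2
print(dispatch({"payload": None}))      # 3.2 fails, 3.1 answers
```

    The point is not this particular fallback scheme; it's that eliminating the logical single point of failure requires a second, independent version of the logic to exist at all.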

    Eliminate logical single points of failure, too!

    Smart people already have a data center discipline that eliminates logical single points of failure.

    Suppose there is a new version of Linux you think should be deployed. Do you stop the data center, upgrade all the machines, and start things up again? While that may be the most "efficient" thing to do, from a redundancy point of view, it's completely insane. Smart people just don't do it. They put the new version of Linux on just one of their machines, and see how it goes. If it runs well, they will deploy it to another machine, and eventually all the machines will have it. It's called a "rolling upgrade."
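    The rolling-upgrade discipline can be sketched roughly like this (the Machine class and health check are illustrative stand-ins, not a real orchestration API):

```python
# Hypothetical sketch of a rolling upgrade: deploy the new version to one
# machine at a time, health-check it, and only then move on. At every point
# the rest of the fleet is still running the known-good version.

class Machine:
    def __init__(self, name, version):
        self.name = name
        self.version = version

    def deploy(self, version):
        self.version = version

    def healthy(self):
        # stand-in for a real health check (smoke tests, error rates, etc.)
        return True

def rolling_upgrade(machines, new_version):
    for m in machines:
        old = m.version
        m.deploy(new_version)
        if not m.healthy():
            m.deploy(old)  # roll back the one bad machine
            raise RuntimeError(f"{m.name} failed on {new_version}; halting")
        # at most one machine is ever out of the working pool

fleet = [Machine(f"node{i}", "5.9") for i in range(4)]
rolling_upgrade(fleet, "5.10")
```

    If the new version misbehaves, the damage is confined to one machine, and the upgrade halts before the bad version becomes the only version running.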

    Same thing with the application. Great web sites change their applications frequently, but using the rolling upgrade discipline, and if they're really smart, with live parallel testing as well.
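    Live parallel testing can be sketched as follows (the stable/candidate functions are invented for illustration): the trusted version answers every request, while the candidate silently handles a copy of the same traffic, and any divergence is recorded for inspection rather than served to a customer.

```python
# Hypothetical sketch of live parallel ("shadow") testing: every request is
# served by the trusted version, while a copy also goes to the candidate
# version and the answers are compared. Mismatches are logged, never served.

mismatches = []

def stable(x):
    return x * 2

def candidate(x):
    # new implementation under test; has a bug for negative inputs
    return x * 2 if x >= 0 else 0

def serve(x):
    answer = stable(x)     # the customer always gets this
    shadow = candidate(x)  # candidate runs on the same live traffic
    if shadow != answer:
        mismatches.append((x, answer, shadow))
    return answer

for x in (3, -1, 5):
    serve(x)
```

    The candidate's bug is discovered on real traffic without ever becoming a logical single point of failure for real users.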

    Getting into the details of application design, the best people go beyond this method to create another logical layer, so that the things that change most often are stored as "data," in a way that changes are highly unlikely to bring down the site. A simplistic example is content management systems, which are nothing more than a way of segregating the parts of a site that change often from those that don't, and keeping the frequently changed parts (the content) in a non-dangerous format.
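    A toy illustration of that segregation (names are hypothetical): the daily-changing content lives in plain data, and the rarely-changing rendering code is written so that bad or missing content degrades gracefully instead of taking down the site.

```python
# Hypothetical sketch of the "content as data" idea behind a CMS: the parts
# that change daily live in a plain data structure that cannot crash the
# rendering code; the code itself changes rarely and goes through its own
# careful rolling upgrades.

content = {
    "headline": "Spring Sale",
    "body": "Everything 20% off this week.",
}

def render(page_content):
    # missing or malformed entries degrade gracefully instead of failing
    headline = page_content.get("headline", "")
    body = page_content.get("body", "")
    return f"<h1>{headline}</h1><p>{body}</p>"

# editors change the data daily; no code deploy, no new logical failure point
content["headline"] = "Summer Sale"
print(render(content))
```

    An editor's typo can make a page ugly, but it cannot crash the renderer, which is exactly the property the extra logical layer buys you.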

    Conclusion

    There is little that is more important than keeping your system available to your customers. Eliminating single points of failure is a cornerstone activity in this effort. Many of us are well aware of physical single points of failure, and eliminate them nicely. It's time for more of us to include logical single points of failure in our purview, and to eliminate them with the same vigor and thoroughness that we do the physical kind.
