Category: Data Center

  • Summary: Computer Data Centers and Networking

    This is a summary with links to my posts on computer data centers and networks. The two subjects are intimately related because the whole point of networks is to connect computers with each other.

    Like everything else in computing, fashions have a strong impact. A bit over a decade ago, the world of computing started yammering about “the cloud” and “virtualization.” These things were the hot subjects all the cool kids talked about. If you weren’t driving towards moving to the cloud, you were obsolete. The reality of course was much different.

    https://blackliszt.com/2011/12/the-name-game-of-moving-to-the-cloud.html

    The Cloud is just another virtue-signaling fashion word like Big Data.

    https://blackliszt.com/2012/04/im-tired-of-hearing-about-big-data.html

    The fact that the underlying reality hardly changes at all is way beyond the technical knowledge of the vast majority of the people who talk about it.

    https://blackliszt.com/2012/05/the-cloud-and-virtualization.html

    When something like “the cloud” heats up, all the vendors related to it rush to promote their products as ideally suited to the new thing.

    https://blackliszt.com/2013/04/storage-vendors-in-the-cloud.html

    The managers of large in-house data centers rarely know much about what they’re buying and what the alternatives could be. As a result, they often spend way more than they need to on equipment, something which competing cloud vendors are less likely to do.

    https://blackliszt.com/2014/08/data-center-managers-spend-too-much-on-equipment.html

    Like anything where people are involved, there are politics and fights over who should perform a given function. One of the classics is making replicas of the data in DBMSs for safety purposes. The data center people want to use their mechanism and the DBMS people want to use theirs. There is a clear right answer, but it only wins sometimes.

    https://blackliszt.com/2014/03/replication-good-idea-storage-replication-nah.html

    One of the key things people who write applications want is for them to “scale,” i.e., be able to handle any load without slowing down. The way some systems applications are built, based on decades-old designs, can get in the way. But there are solutions.

    https://blackliszt.com/2014/02/obstacles-to-scaling-centralization.html

    People also want their applications to continue to be available in the event of data center failure. For people working in data centers, there are perverse incentives at work.

    https://blackliszt.com/2010/04/is-your-site-working-do-you-really-care.html

    To be fair to data center managers, applications originally written decades ago still have to run in modern data centers. It’s tough!

    https://blackliszt.com/2010/01/paleolithic-mainframes-discovered-alive-in-data-center.html

    Way back in 2015, it was clear that hardware had evolved to make dramatic improvements possible in software.

    https://blackliszt.com/2015/08/the-data-center-of-the-future.html

    With software people ignorantly focused on network-connected services as the way to build scalable applications, they ensure that the incredible hardware power will be minimally utilized.

    https://blackliszt.com/microservices/

    Network neutrality was a hot topic and still comes up. The idea is that everyone should be charged the same for internet access and services. When you dig into the subject with actual knowledge of the technology, you see that the whole furor is driven by virtue-signaling ignorance.

    https://blackliszt.com/2014/11/net-neutrality-it-aint-broke-dont-fix-it.html

    One of the arguments made about net neutrality and legal privacy provisions was unusually divorced from reality.

    https://blackliszt.com/2017/04/on-the-internet-youre-naked-and-for-sale.html

    Years later, a weak form of net neutrality was repealed. There were demonstrations, with US Senators making predictions about the disaster that would ensue. Here’s an analysis of the non-disaster a year later.

    https://blackliszt.com/2019/06/the-aftermath-of-the-net-neutrality-disaster.html

  • The Data Center of the Future

    The elements of the data center of the future are mostly available today. Everyone is pretty used to the data centers they've been using for a while, which are thoroughly grounded in the past. So they keep building new copies of hoary architectures. But the parts and ideas are available to anyone who chooses to avail themselves…

    Ancient Computing History

    Back during the first internet bubble, say around the year 2000, data centers would likely have lots of Intel Pentium Pro microprocessor chips in their servers. It was an amazing device at the time. It had over 5 million transistors on the chip. It was so powerful that it was used in the first supercomputer to reach the teraFLOPS performance mark. Pretty amazing.

    But building internet-scale applications was still hard. The clever software engineers of the time worked out ways to distribute the work among a collection of computers to get the job done, quickly and reliably. It was called, appropriately, "distributed computing."

    Ancient Storage History

    Back in the halcyon days before the internet, disks were just hooked up to the computers whose storage they maintained.

    [Figure: DAS to SAN to VSAN — direct-attached storage (DAS)]

    In internet-scale data centers, that wasn't good enough. There was always too much storage where it wasn't needed, and not enough where it was. Those bright computer guys got another good idea: we already have a local area network for connecting servers to each other. How about a storage area network for connecting computers to storage?

    [Figure: DAS to SAN to VSAN — a storage area network (SAN)]

    Ka-ching! Problem solved!

    Today

    Things have moved along. The latest microprocessor chips from Intel, for example the Xeon Processor E7 v2 family, have grown from millions of transistors to … Billions of transistors. That's Billion, with a "B," as in 1,000 times more transistors than the number in the Pentium Pro line. Instead of a single, single-threaded core, there are 15 dual-threaded cores, a total of 30 effective processors, each awesomely faster than the single core in the Pentium Pro. And it supports about 25 times more main memory.

    Each server with a Xeon E7 can handle at least 30 times the workload of the Pentium Pro, maybe 100 times. Here's a picture of the evolution:

    [Figure: Data Center — the evolution from Pentium Pro-era servers to Xeon E7 servers]
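
    To put rough numbers on that claim, here's a back-of-envelope sketch in Python. The per-thread speedup factor is purely an assumption for illustration; real numbers vary enormously by workload.

        # Back-of-envelope: one Xeon E7 v2 server vs. one Pentium Pro server.
        pentium_pro_threads = 1   # single core, single thread
        xeon_threads = 15 * 2     # 15 cores x 2 hardware threads each

        # Assumed per-thread speedup of a Xeon thread over a Pentium Pro core;
        # values chosen to bracket the post's range, not measurements.
        for per_thread_speedup in (1, 2, 3):
            multiplier = (xeon_threads * per_thread_speedup) // pentium_pro_threads
            print(f"{per_thread_speedup}x per thread -> ~{multiplier}x the workload")
        # Prints 30x, 60x, 90x: the "at least 30 times, maybe 100" range.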

    The average data center architecture? Unchanged. Still emphasizing distributed computing, LAN and SAN, all the stuff invented to solve a problem of limited computing power that has long since disappeared.

    The Future Data Center (for most people)

    This data center can be yours in 2015 — if you choose to build it.

    The core principle is simple: use the cores! And everything else that's there! Get rid of obsolete architectures, chief among them that of "distributed computing" (with just a couple of exceptions), and drastically reduce the parts count. Here's what it might look like:

    [Figure: Data Center of the future — a few powerful many-core servers, no SAN]
    Note all the cores in the big (in power, but physically small) server boxes — plenty of room for "distributed computing," inside a single chip. There are loads of cores; devoting a couple to handling storage functions eliminates boatloads of parts and connections, the whole SAN, without loss. You can easily arrange that most of the time, the storage is connected to the box the apps that use it run on. If not? No problem — just a single hop to the storage software on the right box, and you've got yourself a virtual SAN, faster and for less money.
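
    Here's a minimal sketch of the "devote a couple of cores to storage" idea, assuming Linux; serve_storage() is a hypothetical stand-in for real storage-service software, not any particular product.

        # Sketch: reserve cores 0-1 for a storage service, give apps the rest.
        # Assumes Linux (os.sched_setaffinity); serve_storage() is hypothetical.
        import os
        import multiprocessing

        STORAGE_CORES = {0, 1}  # assumed set of cores dedicated to storage

        def serve_storage():
            os.sched_setaffinity(0, STORAGE_CORES)  # pin self to storage cores
            # ... the virtual-SAN / block-service software would run here ...

        def run_apps():
            other_cores = set(range(os.cpu_count())) - STORAGE_CORES
            os.sched_setaffinity(0, other_cores)    # apps get the remaining cores
            # ... application workload runs here ...

        if __name__ == "__main__":
            multiprocessing.Process(target=serve_storage).start()
            run_apps()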

    Conclusion

    The data center of the future is still in the future for most people. The parts are there. The concepts are there. But old habits appear to be hard to break…

  • Data Center Managers Spend too much on Equipment

    There are lots of complicated IT decisions to make. Buying hardware should be one of the easy ones. Most data center managers do make it easy — for themselves. But way too expensive for their organizations.

    Piles of money are spent on data center equipment

    According to a recent Gartner report, more than $140 billion will be spent on data center equipment this year. That sounds like a big number. It is a big number. But then when you read that IT spends more than twice that amount on enterprise software, and then three times that on IT services (nearly a trillion dollars), maybe it doesn't sound so big.

    Getting back to reality, most of the companies I usually deal with don't spend billions, hundreds of millions or even tens of millions a year on equipment. But it's still a lot to them!

    The Huge Spending is rarely examined

    The smaller companies I'm closest to spend remarkably little time seriously questioning the huge (to them) amount of money they spend on hardware every year. Cutting the number by a significant factor can make a huge difference to them.

    I see this curious lack of interest from the other side as well. Some of our companies have great equipment to sell that enables their customers to get more for less. While they love to go into details about how wonderful their stuff is and how hard it was to make it wonderful, the bottom line is simple: it delivers more and costs less. You would think this simple message would be easy to deliver, and quickly result in lines out the door to buy stuff. Not so! Getting more for less turns out to be pretty low on the priority list for most data center managers.

    Bergdorf and Target

    The fact is, most data center managers have no idea what they're buying, and their managers know even less. No one knows how much various things "should" cost, and it changes all the time anyway. If you claim you saved your organization money, it's hard for anyone to evaluate the truth of the claim, so you don't get much credit for it. Whereas if something goes wrong with stuff you bought, it's clear where the finger of blame will be pointing.

    The situation is amazingly different from normal life. Most data center managers, in effect, shop at Bergdorf Goodman.

    The Bergdorf name and high levels of service make them feel like they'll come out looking sharp. What if they could support that new application (the new school year) at Target?

    What if they could get lots of everyday things that are perfectly adequate for less there?

    The trouble is that we all know about clothes. We have lots of personal experience wearing them and seeing them on others. We know what they mean and have an idea of what they cost. But when it comes to data center equipment, even most of the professionals are clueless!

    The result is that they're petrified of making a mistake and taking the blame, so the default position is "buy more of what I already have." And above all, take comfort in big brands. This explains why the vast majority of data center purchases go to the computer equivalent of high-end, services-rich Bergdorf, while places like Target and Walmart remain tiny little places by comparison. The computer equivalent of Ikea, which requires assembly at home? Practically non-existent.

    Computer people are supposed to be so smart!

    Yes, the reputation is that computer people are smart, sophisticated folks. They deal with a deep, complex, rapidly-changing set of products and services. Their skills are increasingly at the heart of most organizations, both private and government. It certainly takes a great deal of skill to avoid embarrassment with all the jargon, not to mention the realities of buying the optimal equipment at an optimal price.

    There is a solution to the complexity. The buyers understand that very few people are in a position to judge how well they're doing at their jobs. So long as things don't break too often, they can spin how well they're doing. They understand that no one has a sense of what's expensive and what's cheap. The vendors get this too. They shower their clients with service and support. They're not like Target, they're like Bergdorf. They help you pick the right things, the things that will look great in your data center. If they cost five times more than you could get at one of those crappy, wander-the-aisles-you're-on-your-own stores, who cares? You come out lookin' good! And that's what matters.

    It turns out the computer people are smart — at advancing their personal careers. They've figured it out and are executing well on it, and who's to say otherwise?

    That's the status quo among data center managers.

    Along comes the Cloud

    There are storm clouds on the horizon. It's called, oddly enough, the "cloud," which is just a modern term for an outsourced data center that's easier to use than the ones usually built by data center managers. It works. It's flexible. It's cheap! Most of the people who build successful clouds make decisions that are closer to Target than Bergdorf when buying hardware. And, guess what, it works just fine.

    Smart data center managers typically "embrace" the cloud, which in reality is along the lines of "keep your friends close and your enemies closer." But that's a long story for another time.

    Data Center Spending

    Data center managers have jobs that are as complex and challenging as they get. It's hard to learn the basics of everything they have to know, much less keep up with all the changes. Most of the ones who keep their jobs have evolved simple, effective methods for buying equipment, in collaboration (collusion?) with the leading vendors. The methods guarantee that more will be spent on data center equipment by whole-number factors, but the stuff they buy mostly works, and they get to have great careers. While the future is looking a bit "cloudy," I suspect that these resourceful people will work something out so that their own futures remain sunny.

  • Replication: Good Idea! Storage replication? Nah!

    Everyone knows losing your data is a bummer. If you're in charge of your organization's data, you know that losing data is the shortest path to "don't let the door hit you on the way out."

    All the ways to assure your data is still available when you wake up tomorrow share a common theme: "make a copy." This is such a popular theme that it has turned into a theme-and-variations: "make a copy; make another copy; copy the copy; etc."

    This sounds simple, but we all know that in computing, stuff is supposed to be complicated. Sure enough, this simple "just copy it" theme has gotten mired in hotly competing ways to get it done. And of course, there are politics — whose responsibility is it to assure against loss?

    So let me boil it down: there are two basic ways to do the copy:

    1. The guys in charge of the data, the storage guys, should copy the data from the original bunch of storage to a second bunch of storage.
    2. The guys who write the data, the applications or systems guys, should get their applications or systems to talk to each other and write the data twice.

    The only reason this is hard is that politics and history are involved. If you had fresh, educated people starting from scratch, it would be no contest: way number 2 wins, almost every time. It's faster, cheaper and easier than way number 1. But since when can we wave a magic wand and eliminate politics and history? The reality is, storage guys own the data, they want to protect it, and so they (usually) really, really, REALLY want to be in charge.

    Here's why they shouldn't be.

    You've got two sites, number 1 and 2. Each one of them has a database and a bunch of storage. Transactions come into site 1 and get written to storage.

    Here's a simple transaction that might be written to the database.

    It's a SQL statement that says the DBMS should write the transaction into the transaction table. The transaction contains the usual fields, things like the unique ID for the transaction, the account number it's applied to, the amount of the transaction, etc. This is usually a simple string, a line or two long.
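
    Something along these lines, where the table and column names are made-up illustrations rather than any real schema:

        # A made-up example of the kind of INSERT the post describes.
        # Table and column names are illustrative assumptions.
        insert_stmt = (
            "INSERT INTO transactions (txn_id, account_number, amount, txn_date) "
            "VALUES ('T-0001937', '4005-1234', 125.00, '2014-03-01');"
        )
        print(len(insert_stmt), "bytes -- a simple string, a line or two long")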

    When the database processes the transaction, it gets complicated, of course.

    When the Insert statement goes to the DBMS, the DBMS has to write the transaction itself, but it also has to write at least a couple of the fields to index tables, kind of like card catalogs in old-style libraries that let you find where things are. Indices typically use well-known things called b-trees, which may require a couple of writes to create a multi-level index, for the same reason you put related files into sub-folders so you have some chance of finding them later. There will certainly be an index for the transaction ID and one for the account number. Finally, there's a log to enable the DBMS to figure out what it did in case bad things happen.

    All this happens when the Insert transaction comes in. One simple request to the DBMS, many writes and updates to the storage, usually involving reading in big blocks of data, modifying a small part of the block, and writing the whole thing out again.
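
    A rough model of that write amplification; every count and size here is an illustrative assumption in the spirit of the description above, not a measurement of any particular DBMS.

        # Rough model of the storage writes behind one small INSERT.
        # All counts and sizes are illustrative assumptions.
        BLOCK_SIZE = 8 * 1024  # assume 8 KB storage blocks

        writes_per_insert = {
            "transaction table page": 1,
            "txn_id index (multi-level b-tree)": 2,
            "account_number index (multi-level b-tree)": 2,
            "DBMS log": 1,
        }

        blocks = sum(writes_per_insert.values())
        print(f"one ~120-byte INSERT -> {blocks} block read-modify-writes, "
              f"~{blocks * BLOCK_SIZE // 1024} KB rewritten")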

    Now we come to the crux of the matter: how do we get the data over to site 2? Does the DBMS at site 1 talk with his buddy at site 2 to get it done, or are the relevant storage blocks in site 1 copied over to site 2?

    In the diagram, I show the DBMS doing the job in green and the storage doing the job in red.

    You'll notice that the DBMS only has to send a tiny amount of data over to site 2, essentially the insert statement. Once it's there, DBMS #2 updates all the storage, something it's really good at doing.

    To replicate the data once it's been stored (in red), HUGE amounts of data need to be sent over the network to site #2. It's not unusual for the ratio to be hundreds or thousands to one.

    Sending data between sites is a relatively slow and expensive operation. That's why, if you want replication that's fast, reliable and inexpensive, you want the application to do the job, not the storage.
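
    To see the ratio, here's the same back-of-envelope arithmetic applied to the two replication paths, carrying over the assumed sizes from the sketch above.

        # Green path vs. red path, using the assumed sizes from the sketch above.
        STATEMENT_BYTES = 120    # the INSERT statement itself
        BLOCK_SIZE = 8 * 1024    # assumed 8 KB blocks
        BLOCKS_TOUCHED = 6       # table page + index pages + log

        dbms_bytes = STATEMENT_BYTES                 # DBMS #1 ships the statement
        storage_bytes = BLOCKS_TOUCHED * BLOCK_SIZE  # storage ships the dirty blocks

        print(f"DBMS replication:    {dbms_bytes:>6} bytes over the wire")
        print(f"storage replication: {storage_bytes:>6} bytes over the wire")
        print(f"ratio: ~{storage_bytes // dbms_bytes}:1")   # ~400:1 here

    With bigger blocks or more indexes, the ratio climbs into the thousands.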

    The storage replication people don't like to talk about the things that go wrong, but of course they do. What happens if some of the blocks get over but others don't? Or they're out of order? Or syncing with the database doesn't happen? Or any number of other bad outcomes.

    Other applications

    I'm using a database application to illustrate the principle, but similar dynamics work out with other applications. All major databases can replicate (Oracle, MySQL, SQL Server, MongoDB, etc.), the major file systems can replicate (for example Microsoft has VSS), and all the hypervisors can replicate.

    The hypervisors are amazing. The first thing the storage guys will come back with is how many different applications you have to fiddle with to protect their data. The answer of substance is that the incremental effort for each application is truly trivial, well under 1%. The quick answer is that hypervisors (VMware, Hyper-V, etc.) are universal, and their replication is superior to storage replication. This is exactly why, as organizations move their data centers to the cloud, they are abandoning expensive, inefficient storage vendor-lock-in features like replication in favor of doing it in the hypervisor.

    Conclusion

    You have to protect and preserve your data. Non-negotiable. The storage guys used to have a monopoly on it. But their high-priced, inefficient copy methods are rapidly giving way to more effective, modern ways that save money and are nearly standard in the SLA-centric world of cloud computing.

     

  • Obstacles to Scaling: Centralization

    Want to build a scalable application? Use a scalable architecture. What's a scalable architecture? Simple. A scalable architecture is "shared nothing," an architecture in which nothing is centralized. This seems to be harder to achieve the "deeper" you go into the stack; many software architects still seem to like centralized databases and storage. It's sad: centralized databases and/or storage are the most frequent cause of problems, both technical and financial, in the systems I see.

    Scalability

    Scaling is a simple concept. As your business grows, you should be able to grow your systems to match, with no trouble. Linear scalability is the goal: 11 servers should be able to do 10% more work than 10 servers. Adding a server gives you a whole server's worth of additional capacity. With anything less, you don't have linear scalability.

    This is what we normally enjoy with web servers, due to the joys of web architecture and load balancers.

    Sadly, this is often not what we normally enjoy with databases, because of mindless clinging to obsolete practices and concepts.
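
    One way to see why a centralized component breaks linear scaling is a toy contention model in the spirit of Amdahl's law; the 5% figure is an assumed fraction of work that must pass through the shared component, purely for illustration.

        # Toy model: throughput when some fraction of every request must pass
        # through one shared component (e.g., a central database).
        def relative_throughput(servers, serial_fraction):
            return servers / (1 + serial_fraction * (servers - 1))

        for n in (1, 10, 11, 100):
            linear = relative_throughput(n, 0.0)    # shared nothing: truly linear
            shared = relative_throughput(n, 0.05)   # assumed 5% hits a central DB
            print(f"{n:>3} servers: shared-nothing {linear:.1f}x, "
                  f"central bottleneck {shared:.1f}x")

    In the bottlenecked case, going from 10 to 11 servers buys you a fraction of a server's worth of capacity, not a whole server's worth.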

    Databases

    Databases are a wonderful example of a tool that was invented to solve a hard problem and has created a lot of value — but has turned into a self-contained island of specialization that tends to cause more problems than it solves.

    Databases are a Classic example of a Software Layer

    Most people in software seem to think that having layers is a good thing. Software layers are, with few exceptions, a thing that is very, very bad! The existence and necessity of the layer tends to be accepted by everyone. It's so complicated that it requires specialists. The specialists are special because they know all about the layer and what it can do. They compete with other specialists to make it do more and more. Their judgments are rarely questioned. Sadly, they are wrong all too often both on matters of strategy and detailed tactics. All these characteristics of software layers apply to the database.

    Database pathology is a classic result of the speed of computer evolution

    Databases were invented by smart people who had a hard problem to solve. But the fact that they have persisted as a standard part of the programmer's toolkit, essentially unchanged, is a classic side-effect of the fact that computer speed evolves much more quickly than the minds and practices of the programmers who use them. This concept is explained and illustrated here.

    How to fix the problem

    There are a couple of approaches, depending on how radical you are.

    • Fix the scalability problem by moving beyond databases

    If you have the chance, you should do yourself and everyone else a favor and move to the modern age. As I show in detail here, the fierce speed of computer evolution has solved most of the problems that databases were designed to solve. The problem no longer exists! Get over it and move on!

    • Fix the scalability problem by moving to shared nothing

    If you're not willing to risk being burned at the stake for the heresy of claiming that a problem involving a bunch of data can be solved nicely without a database, there are almost always things you can do to fix the typical centralized database pathologies.

    The desire to have all the data in a single central DBMS is strong among database specialists. This desire is what fuels the incredible amount of money that goes to high-end solutions like Oracle RAC. The desire is completely understandable. It's not unlike when a bunch of guys get together: bragging rights go to the one with the coolest car or truck.

    However understandable, this desire is misguided, counter-productive and remarkably ignorant of fundamental DBMS concepts, like the difference between logical and physical embodiments of a schema. There is no question that there needs to be a single, central logical DBMS. But physical? Go back to database school, man! All you need to do is apply a simple concept like sharding, which in some variation is applicable to every commercial schema I've ever seen, and you've gone most of the way to the goal of a shared-nothing architecture, which gives you limitless linear scaling. Game over!
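
    Here's a minimal sketch of what sharding amounts to, assuming the account number is the routing key; the shard host names are made up for illustration.

        # Minimal hash-sharding sketch: one logical schema, several physical
        # databases. Shard hosts and the routing key are assumptions.
        import hashlib

        SHARDS = [
            "db-shard-0.example.internal",
            "db-shard-1.example.internal",
            "db-shard-2.example.internal",
            "db-shard-3.example.internal",
        ]

        def shard_for(account_number: str) -> str:
            # Stable hash: a given account always routes to the same shard.
            digest = hashlib.sha256(account_number.encode()).digest()
            return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

        print(shard_for("4005-1234"))  # all of this account's rows live here

    Each shard shares nothing with the others, so capacity grows by adding shards; the logical schema stays single and central while the physical one does not.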

    Analysis

    Computers evolve far more quickly than software, which itself evolves far more quickly than the vast majority of programmers. There is nothing in human experience that evolves so quickly. This fact explains a great deal of what goes on in computing.

    I've found that the more layers a given computer technology is "away from" the user, the more slowly it tends to change, i.e., the farther in the past its "best practices" tend to be rooted. In these terms, databases are pretty deeply buried from normal users, metaphorically many archaeological layers below the surface. They are "older" in evolutionary terms than more modern things like browsers. Similarly, storage is buried pretty deep. That's why most of the people who devote their professional careers to them are mired in old concepts. If you think about it, you realize that DBMS and storage thinking strongly resembles thinking about those ancient beasts that used to rule the earth, mainframes!

    Conclusion

    Most software needs to be scalable. "Shared nothing" is the key architectural feature you need to achieve the gold standard of scalability, linear scalability. Shared nothing is common practice among layers of systems that are "close to" users, but relatively rare among the deeper layers, like database and storage. But by dragging the database function to within a decade or so of the present, and by applying concepts that are undisputed in the field, you can achieve linear scalability even for the database function, and usually save a pile of money and trouble to boot!

     

  • Cyber Security Standards are Ineffective against Insiders like Edward Snowden

    The case of Edward Snowden, the fellow who ran off with a big pile of secrets from the super-secret NSA, illustrates a problem with the mainstream approach to computer security: it's expensive, it's burdensome, and it just doesn't work! Strengthening existing standard security measures, which is what usually happens after embarrassing episodes like this, will just make things worse.

    Securing what should be secure

    Other people can argue about what various agencies should or should not be doing and whether they should be secret. Putting all that aside, there are lots of things most of us want to be kept secret, for example our health and financial records, and for sure we want to prevent unauthorized use of that information. How hard is this to accomplish?

    Apparently it's pretty hard. There are huge security compromises that take place all too often, and smaller ones with great frequency. Security breaches resemble car crash deaths: there are so many of them (tens of thousands a year in the US!), that only the most gruesome of them make the news. If an agency with a secret budget probably in the billions, whose whole mission is about secrecy, can't stop an amateur like Edward Snowden, how is it that anything stays secret?

    Approaches to Security

    The vast majority of our thinking about security threats makes a couple crucial assumptions.

    Our thinking assumes that the threat comes from an outsider, and that the outsider attacks from the outside. The outsider (we think) probes to find a weakness in our defenses, and when he finds one, smashes in and grabs what he wants.

    Regardless of the source of the threat, we assume that we can establish a procedure that will thwart any breach of security. We assume that if we are rigorous in our requirements for process, documentation, testing and much else, we can eliminate security threats.

    As the NSA case demonstrates, these assumptions are false. Regardless of your feelings about whether Snowden is a hero or a traitor, he clearly demonstrates the fact that our current approach to security is a waste of time.

    Insiders are the real threat

    The first assumption is the "bad guys out there" assumption. Huge amounts of money are spent on "intrusion detection," firewalls, and endless things that amount to building a castle wall that is high and thick so that our secrets can be protected.

    Here's what happens. The marauding knights come sauntering along and see those high walls. Naturally they check them out. They're impressed by everything about your wonderful castle: the moat, the guards, the mean-looking guys on the ramparts, the whole bit. So if you were a sensible bad guy, what would you do?

    You'd go to the nearest town, trade in your bad-guy clothes for a respectable suit or workman's clothes, or whatever the castle is looking to hire. Then you'd walk up to the employee entrance and apply for a job! Once you were inside, you'd keep your nose clean and figure out the lay of the land. Once you had it scoped, one day you'd leave at the end of your shift a much richer person than you were before, so rich that, well, you didn't bother to report to work at the castle any more.

    I was first educated about this by Paul Proctor, who gave me a copy of his 2001 book, The Practical Intrusion Detection Handbook. Most of the book is about what people want to buy, which is based on the "bad guys are out there" theory. But he has a whole chapter on "host-based intrusion detection," in which he spells out the methods and importance of detecting and thwarting bad guys who have managed to get a job working for you. This is what everyone should be doing, and all these years later, we're not!
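
    In that spirit, here's a minimal sketch of host-based insider detection: flag a user whose data reads suddenly dwarf their own history. The baseline, threshold, and sample input are all assumptions for illustration, not from Proctor's book or any real product.

        # Sketch: flag users whose daily read volume dwarfs their baseline.
        # Baselines, threshold, and the sample input are illustrative assumptions.
        from collections import defaultdict

        baseline_mb = defaultdict(lambda: 50)  # assumed typical daily MB per user
        ALERT_FACTOR = 10                      # alert at 10x a user's baseline

        def check_day(reads_mb):
            for user, mb in reads_mb.items():
                if mb > ALERT_FACTOR * baseline_mb[user]:
                    print(f"ALERT: {user} read {mb} MB vs "
                          f"~{baseline_mb[user]} MB baseline")

        # Hypothetical per-user totals pulled from host audit logs:
        check_day({"alice": 60, "bob": 45, "contractor7": 9000})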

    Tell me what to do, not how to do it!

    The second assumption is that we can define step-by-step procedures that will prevent security breaches. Hah! Not true! The vast majority of our security procedures have been written by people who are lawyers; if they're not, they're sure acting like they are!

    What we should do is tell you what to accomplish in simple terms, like "Don't murder anyone. No matter how mad or drunk you are, just don't do it. If you do, we'll execute you or put you in jail for a long time. So there." That's all you need, when you're telling someone what to accomplish.

    The equivalent for HIPAA would be something like: "Don't give anyone's health records to anyone except that person or their designated representative, like a parent if they're a kid."

    The equivalent for NSA would be: "Hey, everything we're doing here is real important stuff regarding national security, like what our name says. So don't let anyone who doesn't also work for NSA have it. Period. Ever. Otherwise, you're a traitor, and we'll nail you."

    Instead, what companies and agencies are required to do is conform to an ever-growing collection of detailed methods for supposedly getting secure. Except you spend so much time conforming to the regulations that some guy walks out the door with all your secrets!

    Here's the bad news: Snowden wasn't an exception; he's simply a particularly famous typical case in security-regulated organizations.

    Conclusion

    Edward Snowden is the tip of a security-breach iceberg. Credit cards are being stolen in spite of onerous security regulations. Health records are being compromised, in spite of increasingly onerous regulations. Our approach to security is flawed, fundamentally and by assumption. It's like we're in the water and we're trying to swim by blowing on the water. It's not working, and the solution is not to try blowing even harder. The solution is to take an aggressive, non-regulatory approach to the most likely perpetrators, insiders.

     

  • Is Your Site Working? Do You Really Care?

    Like He-Who-Must-Not-Be-Named in the Harry Potter books, I find that there is something that must-not-be-named in the world of computing. It is That-Which-Must-Not-Be-Discussed. It’s not so much that people fear it (although many of them do), it is more that it is (for reasons that mystify me) incredibly low prestige. It’s kind of like maintenance or janitorial services in an office building. The higher prestige you are, the less likely you are to mention the subject – it is simply beneath notice. Why, people who wear uniforms, for goodness sake, do the work.

    Nonetheless, this subject is incredibly high value. In fact, it’s hard to argue there is anything more valuable in the computing world. What is the subject? Well, think Haiti. Think Chile. Think Mexico. Think earthquake. Yes, earth-shaking, building-crashing, crevice-opening earthquake. What is the equivalent of an earthquake in the world of computing? You know, yes you do. If it’s a series of scary tremors, then it’s a site slow-down. If it brings down buildings, then it’s a site crash. If it brings down buildings and cars and people disappear into newly appeared holes in the ground, it’s a major outage with data loss.

    Where does making buildings earthquake-proof stand in the overall priority of things?

    It couldn’t have been too high on the priority list at RIM in the time leading up to the earthquake that struck them last December. Here is a representative story:

    http://www.nytimes.com/aponline/2009/12/23/business/AP-US-TEC-Research-in-Motion-BlackBerry-Outage.html

    What a horrible thing to happen to their business! Lots of free publicity of exactly the kind they don’t want.

    The story reminded me of similar events, less public, that have taken place at a couple of companies I know very well. The story also led me to reflect on how data center (and related development) issues are typically left to fester until they blow up. Then the alarms ring, everyone runs around, the immediate problem is fixed. What is unusual is for management to take the systemic action that is required to greatly reduce the chance of the failure recurring. Some data center operations resemble a coal-fired heating furnace – they require constant care and feeding, are cranky and don’t like change of any kind, but people just don’t want to think about it. “Upgrade to gas? I don’t have time to think about it. Maybe in next year’s budget.”

    Here’s what I find: the more august the group of people, the higher their status, the less willing they often seem to be to devote real time, effort and brain cycles to That-Which-Must-Not-Be-Discussed. This is wrong. It is so wrong, it is perverse. Change your priorities! Take a look at it now (or at least soon), when (I hope) the alarm bells are not ringing.

    Even though the articles about the RIM debacle don’t go into detail, it is reasonable to guess a couple things about the RIM operation from facts that have been revealed. Here are some of the warning signs, most of which applied to the RIM case, and some of which may or may not have. I list them here as a quick check-list to see how vulnerable your operation may be.

    • Highly complex system. RIM’s data center was said by several people to be highly complex. This is almost always a bad sign. Systems naturally become complex over time (kind of like entropy), and sometimes smart people insist for plausible-at-the-time reasons on adding complexity. The trouble is that, for a variety of understandable reasons, the more complex a system is, the more likely it is to fail when changed. It is worth working hard to reduce the number of elements in your data center and generally make it simpler.
    • Change management risk. The failure at RIM is said to have been a result of a software “upgrade.” This is one of the most common opportunities for embarrassment. All too often, people respond by reducing the frequency of change, which actually increases the chance that any one change will cause a disaster (because it is likely to be a larger, more complex change). There are methods of reducing this risk to near zero.
    • “us vs. them.” Most data center disasters I have seen happen when there is a (typically well-intentioned) strict separation between data center operations and the rest of the world.

    At minimum, it is worth a quick, objective look at the “machine room” of your operation to see if it looks and feels like the kind of place to which disasters are naturally attracted.

    Above all, get over That-Which-Must-Not-Be-Discussed – yes (dare I say), be like Harry Potter – call Voldemort by his proper name! Talk about earthquake vulnerability! And above all, do what you have to do so that, when things go wrong, your service keeps working.

     

  • Paleolithic Mainframes Discovered Alive in Data Center!

    A recent article on forbes.com quotes me on what many people find to be the surprising longevity of mainframe computers.

    Don’t things in computing just get better and better – not to mention faster, smaller and less expensive? Which implies that after a few years of use, it’s just not worth keeping the old stuff around anymore? So we throw out (oops, please excuse me, we meticulously recycle…) the useless old stuff and bring in the cool, cost-effective new stuff, right?

    Like most common wisdom in modern computing, this contains elements of truth, but isn’t quite right.

    The element of truth behind this thought is the astounding continued progress of Moore’s Law, which posits that electronics gets smaller and faster at a rate that boggles the mind. This is what gives us iPhones and portable computers that have more speed and capacity than the room-sized mainframes of the past.

    But there is more to computing than electronics. First of all, there is this little matter of physical devices that have mass and inertia, that no amount of Moore’s-Law-driven advance will free from the tyranny of the laws of physics. This leads to growing storage problems that Moore’s Law actually makes worse. See here for a description of the issue.

    Second of all, there is this thing called software. Yes, software, the invisible-to-the-human-eye “stuff” that makes all that amazing electronics actually do something. Software is really hard, complicated stuff, like most things that are essentially mental, conceptual and invisible (think math). Once some software actually gets working well enough, sensible people are loath to change it. Even worse, the amazing increases in speed and capacity of electronics mask simply awful problems in software.

    Building most real, practical production software tends to be a nightmare that rarely ends. Re-building software that more-or-less works is a nightmare in hell that visits all the circles of hell in round-robin. So if the credit card companies can process their transactions, and the software that gets the job done happens to be written in totally-out-of-fashion-squared COBOL that runs best on a mainframe – that’s a great reason for IBM to build a new implementation of the mainframe instruction set out of modern electronics (thus getting most of the benefits of all the advances), just so it can run the code. It’s kind of like a horse and buggy built out of modern materials and powered by a fuel cell – it looks funny, but it’s modern and efficient and gets the job done.

    So, yes, the electronic part of computers gets faster, better and cheaper. And the software seems to get better because it’s along for the ride, but it actually tends to get worse, which is why Paleolithic mainframes have been discovered, alive and working, in otherwise modern data centers.
