Category: Database

  • Occamality in Databases

    The accepted practice of database schema design is a good application of Occamal principles. In fact, in a broad sense, Occam optimality takes the concepts accepted in schema design and applies them to programs and the software lifecycle as a whole.

    Someone at a bank would write a system for automating checking accounts. Naturally, the account information would include the name and address of the person who owns the account. Someone else would write a system for automating savings accounts. Naturally, the account information would be essentially the same; it may be the same because someone knew all about the checking system and copied it, or it might be the same because of “convergent evolution.” The account holder moves, and somehow tells the bank. The person at the bank has to look everywhere the person’s address might be stored and change it in each place. They might miss a place, or they may enter something incorrectly. This places a burden on the bank, makes extra work, is a source of errors, and may lead to customer dissatisfaction. It’s far better to have a single place where a person is identified at the bank, and where the things we know about that person (name, address, phone number, etc.) are stored. In the database world, this is the concept of “normalization,” which essentially means: store each piece of unique data exactly once.

    The database world also has the concept of “reference” in a couple of ways. If the person actually sets up a checking account, the checking account system needs the account information for the customer. This is done by identifying each account by a unique identifier, known as a “primary key.” The primary key is associated with the master copy of the account-holder’s information. Then, in each place where the customer uses a bank service, for example a checking account, a “foreign key” to the account information is placed. This is actually a copy of the primary key. So if you are customer 123, your primary key is 123, and the number 123 is also made the foreign key for your savings account, your checking account, etc. Calling it a “foreign key” says that it’s a reference to the one place where the information is stored, and not the master copy itself.
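
    Here's a minimal sketch of normalization, primary keys and foreign keys, using Python's built-in sqlite3 module. The tables, names and customer number 123 are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- the primary key, e.g. 123
        name        TEXT NOT NULL,
        address     TEXT NOT NULL
    );
    -- Each account carries only a foreign key: a copy of the primary
    -- key that refers back to the one master copy of the customer data.
    CREATE TABLE checking_account (
        account_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        balance     REAL NOT NULL
    );
    CREATE TABLE savings_account (
        account_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        balance     REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customer VALUES (123, 'Pat Smith', '1 Elm St')")
conn.execute("INSERT INTO checking_account VALUES (1, 123, 500.0)")
conn.execute("INSERT INTO savings_account VALUES (1, 123, 2500.0)")

# The customer moves: ONE update fixes the address for every service,
# because each account merely references row 123.
conn.execute("UPDATE customer SET address = '9 Oak Ave' WHERE customer_id = 123")

row = conn.execute("""
    SELECT c.address, s.balance
    FROM savings_account s JOIN customer c USING (customer_id)
""").fetchone()
print(row)  # ('9 Oak Ave', 2500.0)
```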

    The database also uses references to eliminate redundancy in data types. When you need multiple fields that have to be distinct but actually represent the same kind of data, you define the type information once in something called a “domain,” and then each use of the domain is actually a reference to the master definition. If you change the domain, all the uses of the domain automatically change.

    Databases represent Occamal principles in another important respect: statements in the data manipulation language (today, that means statements in SQL, for example SELECT data FROM tables WHERE conditions) are limited to what needs to be done, not how it is to be done. For example, if there is a JOIN between two tables, which should be used first? There is an excellent answer to that question, for example if one table is much smaller than the other, use the smaller table, etc. By abstracting how the join is implemented from the fact that you want a join, you get to define the how of joins in exactly one place, while requesting joins in many places. In earlier data manipulation languages, you could get the job done, but you would explicitly go first to the table you thought best to start from, and then go to the next. The trouble is, this puts the knowledge of how to navigate tables to get information in many places! This is bad because everyone has to learn it and apply it correctly, and if you want to change the method because you’ve figured out a better way to do it, you have to go through all the DML in the program and make individual decisions about how and what to change. Separating what we want from how to get it was definitely an advance, because now SQL could benefit from ever-improving execution routines without having to be changed! The point here is that exactly those benefits are the ones we would expect, because the change also made all programs that use SQL closer to being Occam-optimal!
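
    The what-versus-how point can be seen in miniature with sqlite3. We state only the join we want; the planner, which lives in exactly one place, decides how to execute it (the tables are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big   (id INTEGER PRIMARY KEY, small_id INTEGER, payload TEXT);
    CREATE TABLE small (id INTEGER PRIMARY KEY, label TEXT);
""")
conn.execute("INSERT INTO small VALUES (1, 'gold')")
conn.executemany("INSERT INTO big VALUES (?, 1, 'x')", [(i,) for i in range(1000)])

# Declarative: WHAT we want, with no navigation instructions at all.
sql = """SELECT count(*) FROM big
         JOIN small ON big.small_id = small.id
         WHERE small.label = 'gold'"""
n = conn.execute(sql).fetchone()[0]
print(n)  # 1000

# The engine will even tell you the strategy it chose; that decision
# lives in the planner, not in every program that requests a join.
for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
    print(step)
```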

    It is true that many database implementations are Occam-suboptimal when considered in isolation. Because DBMS schemas are typically completely isolated from the programs that use them, programs that use databases (taking the database schema and stored procedures to be part of the program) are typically far from Occam-optimal. This is recognized in the Rails framework of the Ruby language, which unifies the data definitions of the database and the program; this is the reason why programming is so much more efficient when using Rails. Apart from Rails, it is wonderful to have modern databases as an example of an island of software in which Occamal principles are accepted and valued, and the benefits widely enjoyed. By understanding the examples of good database practice in terms of Occam optimality, we can get an idea of how the principle can be extended to the rest of software.

     

  • Software Programming Language Evolution: Impact of the DBMS Revolution

    The invention and widespread acceptance of the modern database management system (DBMS) has had a dramatic impact on the evolution and use of programming languages. It's part of the landscape today. People just accept it and no one seems to talk about the years of disruption, huge costs and dramatic changes it has caused.

    The DBMS Blasts onto the Scene

    In the 1980’s the modern relational database management system, the DBMS, blasted onto the scene. Invented by an IBM research scientist, E. F. Codd, and popularized by his collaborator Chris Date, the relational model and its Structured Query Language, SQL, changed the landscape of programming. Completely apart from normal procedural programming, you had a system in which data could be defined using a Data Definition Language, DDL, and then created, updated and read using SQL. The data definitions were stored in a nice, clean format called a schema. Best of all, the new system gave hope to all the people who wanted access to data but couldn’t get through the log jam of getting custom report programs written in a language the analysts didn’t want to be bothered to learn. SQL hid most of the ugly details because of its declarative approach.

    SQL was more than just giving access to data. There was a command to insert data into the DBMS, INSERT, one to make changes to data, UPDATE, and even one to send data to the great bit-bucket in the sky, DELETE. The system even came with transaction processing controls, so that you could perform a deduction from one user's account and an addition to another user's account and be assured that either both happened or neither did. Best of all, the system did comprehensive logging, making a permanent record of who made which changes to which data and when. Complete and self-contained!
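
    The all-or-nothing transaction guarantee can be sketched in a few lines with sqlite3; the accounts, amounts and CHECK constraint below are invented stand-ins for real business rules:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE account (
    id      INTEGER PRIMARY KEY,
    balance REAL NOT NULL CHECK (balance >= 0))""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    # Either both UPDATEs happen or neither does.
    try:
        with conn:  # sqlite3 commits on success, rolls back on exception
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        pass  # rolled back; balances unchanged

transfer(conn, 1, 2, 60.0)    # succeeds
transfer(conn, 1, 2, 500.0)   # would overdraw account 1: rolled back
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(balances)  # [(40.0,), (60.0,)]
```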

    This impressive functionality led to a problem. With users demonstrating outside the offices of the programming department, things were getting rowdy. The chants would go something like this:

    Leader: What do we want?

    Shouting crowd: OUR DATA!

    Leader: When do we want it?

    Shouting crowd: NOW!

    Everyone wanted a DBMS. They wanted access to their data without having to go through the agony of coming on bended knee to the programming department to get reports written at some point in the distant future.

    The response of languages: what could have happened

    It wouldn't have been difficult for languages to give the DBMS demonstrators what they wanted with little disruption. One possibility was changing a language so that it produced a stream of data changes for the DBMS in addition to its existing data changes. A second possibility was changing a language so that its existing data statements would be applied directly to a DBMS. Either of these alternatives would have supplied a non-disruptive entry of DBMS technology into the computing world. But that's not what happened.

    I personally implemented one of these non-disruptive approaches to DBMS integration in the mid-1980's and it worked. Here's the story:

    I was hired by EnMasse Computer, one of several upstart companies trying to build computers based on the powerful new class of microprocessors that were then emerging. EnMasse focused on the commercial market which was at the time dominated by minicomputers made by companies like DEC, Data General and Prime. Having a DBMS was considered essential by new buyers in this market, but most existing applications were written in languages without DBMS support. One of my jobs was to figure out how to address the need. I was told to focus on COBOL.

    This was a big problem because the way data was represented in COBOL didn't map well into relational database structures. What I did was get a copy of the source code of our chosen DBMS, Informix, and modify it so it could directly accept COBOL data structures and data types. I then went into the runtime system and modified it to send native COBOL read, write, update and delete commands directly to the data store, bypassing all the heavy-weight DBMS overhead. This was tested and proven with existing COBOL programs. It worked and the COBOL programs ran with undiminished speed. The net result was that unmodified COBOL used a DBMS for all its data, enabling business users full access to all the data without programming.

    I did all the work to make this happen personally, with some help from an assistant.

    I thought this was an obvious solution to the problem that everyone would take. It turns out that EnMasse failed as a business and that no one else took the simple approach that was best for everyone.

    The response of languages: what did happen

    What actually happened in real life was that a huge investment was made, with widespread disruption. Instead of burying the conversion, massive efforts were made to modify — practically re-write — programs written in COBOL and other languages so that instead of using their native I/O commands they used SQL commands, with the added trouble of mapping and converting all the incompatible data structures. More effort went into modifying a single program for this purpose than I put into making the changes at the systems level to make the issue go away. What's worse is that, because of the massive overhead imposed by DBMSs when data is manipulated through their commands instead of native methods, performance was degraded by large factors.

    While the ever-increasing speed of computers mitigated the impact of the performance penalty, in many cases it was still too much. In the late 1990's, after massive increases in computer power and speed, program creators were using the stored procedure languages supplied by database vendors to let business logic run entirely inside the DBMS instead of bouncing back and forth between the DBMS and the application. While this addressed the performance cost of using a DBMS for storage, it meant that the logic of a business application was now written in two entirely different languages, running on two different machines, usually with different ways of representing the data. Nightmare.

    The rise of OLAP and ETL

    One of the many ironies of the developments I've described is that people eventually noticed that the way data was organized for sensible computer program use was VERY different from the best ways to organize it for reporting and analysis. The terms that emerged were OLTP, On-line Transaction Processing, and OLAP, On-Line Analytical Processing. In OLTP, it's best to have data organized in what's called normalized form, in which each piece of data is stored exactly once in one place. This makes it so that a program doesn't have to do lots of work when, for example, a person changes their phone number; the program just goes to the one and only place phone number is stored and makes the change. OLAP is a different story because there's no need to update data that's already been created — just add new data.

    There were also practical details, like the fact that data was stored and manipulated by multiple programs, many of which had overlapping data contents — for example a bank that has a program to handle checking accounts and a separate program for CD's, even though a single customer could have both. This led to the rise of a special use of DBMS technology called a Data Warehouse, which was supposed to hold a copy of all a system's data. A technology called ETL, Extract Transform and Load, emerged to grab the data from wherever it was first created, convert it as needed and store it in a centralized place for analysis and reporting.
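
    A toy sketch of the Extract, Transform and Load steps in Python. The two source record layouts and every field name are pure inventions for illustration:

```python
# Two source systems with overlapping, differently-shaped customer records.
checking_rows = [{"cust": "C-123", "name": "PAT SMITH", "bal": "500.00"}]
cd_rows = [{"customer_id": 123, "holder": "Pat Smith", "amount": 2500}]

def extract():
    # Grab the data from wherever it was first created.
    yield from (("checking", r) for r in checking_rows)
    yield from (("cd", r) for r in cd_rows)

def transform(source, row):
    # Convert each source's format into one common warehouse shape.
    if source == "checking":
        return {"customer_id": int(row["cust"].lstrip("C-")),
                "name": row["name"].title(),
                "product": "checking",
                "balance": float(row["bal"])}
    return {"customer_id": row["customer_id"],
            "name": row["holder"],
            "product": "cd",
            "balance": float(row["amount"])}

warehouse = []  # the Load step: one centralized place for analysis
for source, row in extract():
    warehouse.append(transform(source, row))

print(warehouse[0]["name"], warehouse[1]["name"])  # Pat Smith Pat Smith
```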

    Given that you really don't want people sending ad hoc SQL statements to active transaction systems (queries that could easily drag down performance), and given all the factors above, it turns out that the push to make normal programs run on top of DBMS systems was a monstrous waste of time. One that continues to this day!

    Conclusion

    Nearly all programmers today assume that production programs should be written using a DBMS. While alternatives like NoSQL and key-value stores have emerged, they don't have widespread use. Since the data structures used by programs are often very different from those used by the DBMS, a variety of work-arounds have been devised, such as ORM's (Object-Relational Mappers), each of which brings its own performance penalties and labor-intensive issues. The invention and near-universal use of relational DBMS in software programming is a rarely recognized disaster with ongoing consequences.

     

  • Obstacles to Scaling: Centralization

    Want to build a scalable application? Use a scalable architecture. What's a scalable architecture? Simple. A scalable architecture is "shared nothing," an architecture in which nothing is centralized. This seems to be harder to achieve the "deeper" you go into the stack; many software architects still seem to like centralized databases and storage. It's sad: centralized databases and/or storage are the most frequent cause of problems, both technical and financial, in the systems I see.

    Scalability

    Scaling is a simple concept. As your business grows, you should be able to grow your systems to match, with no trouble. Linear scalability is the goal: 11 servers should be able to do 10% more work than 10 servers. Adding a server gives you a whole server's worth of additional capacity. With anything less, you don't have linear scalability.
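
    The arithmetic, sketched in Python with made-up request rates:

```python
def efficiency(n_servers, cluster_throughput, one_server_throughput):
    # 1.0 means perfectly linear: every server added a full server's worth.
    return cluster_throughput / (n_servers * one_server_throughput)

# Linear: one server does 100 req/s, so 11 servers do 1100.
print(efficiency(11, 1100, 100))  # 1.0

# Sub-linear (say, a shared central database caps you at 1040 req/s):
# the 11th server delivered well under a server's worth of capacity.
print(efficiency(11, 1040, 100))  # ~0.945
```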

    This is what we normally enjoy with web servers, due to the joys of web architecture and load balancers.

    Sadly, this is often not what we normally enjoy with databases, because of mindless clinging to obsolete practices and concepts.

    Databases

    Databases are a wonderful example of a tool that was invented to solve a hard problem and has created a lot of value — but has turned into a self-contained island of specialization that tends to cause more problems than it solves.

    Databases are a Classic Example of a Software Layer

    Most people in software seem to think that having layers is a good thing. Software layers are, with few exceptions, a thing that is very, very bad! The existence and necessity of the layer tends to be accepted by everyone. It's so complicated that it requires specialists. The specialists are special because they know all about the layer and what it can do. They compete with other specialists to make it do more and more. Their judgments are rarely questioned. Sadly, they are wrong all too often both on matters of strategy and detailed tactics. All these characteristics of software layers apply to the database.

    Database pathology is a classic result of the speed of computer evolution

    Databases were invented by smart people who had a hard problem to solve. But the fact that they have persisted as a standard part of the programmer's toolkit, essentially unchanged, is a classic side-effect of the fact that computer speed evolves much more quickly than the minds and practices of the programmers who use them. This concept is explained and illustrated here.

    How to fix the problem

    There are a couple of approaches, depending on how radical you are.

    • Fix the scalability problem by moving beyond databases

    If you have the chance, you should do yourself and everyone else a favor and move to the modern age. As I show in detail here, the fierce speed of computer evolution has solved most of the problems that databases were designed to solve. The problem no longer exists! Get over it and move on!

    • Fix the scalability problem by moving to shared nothing

    If you're not willing to risk being burned at the stake for the heresy of claiming that a problem involving a bunch of data can be solved nicely without a database, there are almost always things you can do to fix the typical centralized database pathologies.

    The desire to have all the data in a single central DBMS is strong among database specialists. This desire is what fuels the incredible amount of money that goes to high-end solutions like Oracle RAC. The desire is completely understandable. It's not unlike when a bunch of guys get together: bragging rights go to the one with the coolest car or truck.

    However understandable, this desire is misguided, counter-productive and remarkably ignorant of fundamental DBMS concepts, like the difference between logical and physical embodiments of a schema. There is no question that there needs to be a single, central logical DBMS. But physical? Go back to database school, man! All you need to do is apply a simple concept like sharding, which in some variation is applicable to every commercial schema I've ever seen, and you've gone most of the way to the goal of a shared-nothing architecture, which gives you limitless linear scaling. Game over!
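
    A minimal sketch of the idea in Python. The dicts stand in for separate physical database servers; the routing rule is the whole trick, and nothing is shared between shards:

```python
class ShardedStore:
    """One logical store, N physical shards -- shared nothing."""

    def __init__(self, n_shards):
        # Each dict stands in for an independent physical server.
        self.shards = [dict() for _ in range(n_shards)]

    def _shard_for(self, key):
        # One routing rule, defined once, applied everywhere.
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedStore(4)
for cust_id in ("123", "456", "789"):
    store.put(cust_id, {"customer_id": cust_id})

print(store.get("456"))  # {'customer_id': '456'}
```

    Adding capacity means adding shards; since no shard touches another, throughput grows with the number of servers.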

    Analysis

    Computers evolve far more quickly than software, which itself evolves far more quickly than the vast majority of programmers. There is nothing in human experience that evolves so quickly. This fact explains a great deal of what goes on in computing.

    I've found that the more layers a given computer technology is "away from" the user, the more slowly it tends to change, i.e., the farther in the past its "best practices" tend to be rooted. In these terms, databases are pretty deeply buried from normal users, metaphorically many archaeological layers below the surface. They are "older" in evolutionary terms than more modern things like browsers. Similarly, storage is buried pretty deep. That's why most of the people who devote their professional careers to them are mired in old concepts. If you think about it, you realize that DBMS and storage thinking strongly resembles thinking about those ancient beasts that used to rule the earth, mainframes!

    Conclusion

    Most software needs to be scalable. "Shared nothing" is the key architectural feature you need to achieve the gold standard of scalability, linear scalability. Shared nothing is common practice among layers of systems that are "close to" users, but relatively rare among the deeper layers, like database and storage. But by dragging the database function to within a decade or so of the present, and by applying concepts that are undisputed in the field, you can achieve linear scalability even for the database function, and usually save a pile of money and trouble to boot!

     

  • “Big Data:” Some Little Observations

    "Big Data" is everywhere. If only because of this, it is important, like the way Paris Hilton


    220px-Paris_Hilton_2009
    is famous for being famous.

    What's included in "Big Data?"

    If your concern is storing, serving or transmitting it, you don't care what kind of data it is — data is data, a pile of bits.

    But not all data is created equal. The easiest way to understand this is to break all the bits into relevant buckets. By far, the largest bucket is for image data (including both still pictures and videos). While the ratios vary, it's not unusual for there to be 100 bits of image data for each bit of other data.

    While there's not a commonly accepted terminology, all the rest of the data can be understood as "coded" data. This again falls into two categories. The larger portion is "unstructured" data, things like documents, blogs, e-mails and most web pages (except for the images and videos on them). The smaller portion is "structured" data, which includes all databases, forms and anything else that can show up in a report.

    When people talk about "big data," they could be talking about any of the above, but mostly people talk about it because they want to extract actionable information from it, and the source of most actionable information is structured data. So in the vast majority of cases, when people talk about "big data," they're talking about structured data.

    Did Data used to be Small and Now it's Big?

    Think about a bank statement. There's a little information about you at the top, but most of the statement is probably taken up by the transactions — money moving into and out of the account. In general terms, this is the action log, the transaction history. This pattern of having an account master and detail records is a common one.

    Now think about a web site. The site itself is like the bank statement, and the record of people visiting and interacting with it is like the transaction history, generally known as a web log.

    People generate far more transaction records interacting with the web than in other activities; for example, you probably click on hundreds of pages for each bank transaction you make. So the amount of data can be pretty big.

    The simple answer is: before the web, transaction data wasn't very big, and with the web, there's a lot more of it than there was before. Of course big data isn't just about the web; but the web has certainly gotten people to pay attention.

    So where did "Big Data" come from?

    It would be interesting to do a cultural history, but I suspect that the current interest in "big data" stems from the following factors:

    • Companies that pay attention to web logs get information about visitor behavior that can be used to make more money.
    • Internet advertising companies have done exactly this for years, and are getting really good at it.
    • Shockingly, most people don't analyze their data to improve their behaviors.
    • A closed loop system in which the results of your actions are used to enhance future actions is the clear winning strategy.
    • This requires (gulp) collecting and analyzing the relevant data, which is far larger than most people are used to dealing with.

    Thus the term "big data," which currently applies to just about any body of transaction data.

    What's "Big" about "Big Data?"

    Let's start by applying one of the fundamental concepts of computing to the question: counting. One of the first disk drives I got to use was a twelve inch removable pack developed by IBM:


    (Pictured: an IBM 2315 removable disk cartridge.)

    Its capacity was about 1MB. While that may sound small by today's standards, let's put it in perspective. Each byte is the equivalent of a character that you can type. Using a generous measure of 30 words per minute and 5 characters per word, that's 9,000 characters in an hour of continuous typing with no breaks, so the disk above has a capacity of more than 100 hours of continuous typing. That's one reason I thought the disk's capacity was huge — it easily held the source code for the FORTRAN compiler I wrote at the time, which was about a year's worth of work!
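
    If you want to check the arithmetic:

```python
wpm, cpw = 30, 5                 # generous typing speed, characters per word
chars_per_hour = wpm * cpw * 60
print(chars_per_hour)            # 9000

disk_bytes = 1_000_000           # the ~1MB disk pack
hours_to_fill = disk_bytes / chars_per_hour
print(round(hours_to_fill))      # 111 -- more than 100 hours of typing
```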

    Now let's get modern. Drives have gotten smaller while holding more and more. Here's a good visualization of the progression:


    (Pictured: six hard drive form factors.)

    We're now at the point where truly small drives (1 to 2.5 inches) hold massive amounts of data; 1TB or more is common.

    How much is that? Remember, it would take 100 hours of continuous typing to fill up the large disk pictured earlier. How many of those older disks would it take to store 1TB? About 1 million of them; if you packed them tightly, they would fill a room about 100 feet long, 100 feet wide and 10 feet high. And I would have to type for 100 million continuous hours to fill them up. Now, that's big data.
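
    Again, the arithmetic; the 0.1 cubic feet per pack is a rough guess for a cased twelve-inch cartridge:

```python
old_disk_mb = 1
one_tb_in_mb = 1_000_000
n_disks = one_tb_in_mb // old_disk_mb
print(n_disks)                    # 1000000: a million old packs per TB

hours_per_disk = 100              # typing hours to fill one pack
print(n_disks * hours_per_disk)   # 100000000 hours of typing

cubic_feet_per_pack = 0.1         # rough guess for a cased 12-inch cartridge
room_cubic_feet = 100 * 100 * 10  # 100ft x 100ft x 10ft room
print(n_disks * cubic_feet_per_pack <= room_cubic_feet)  # True
```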

    Now that we've got a sense of how big a TB is, let's get real.

    On a good day, this blog might have 100 page views, each generating a server log record. Such records vary in length, but let's say they average 100 bytes each, or 10K bytes a day. Not much.

    Let's say I caught up to the Washington Post, a site which is in the top 100 in the US. It gets about 1 million page views a day. That would be a mighty 100 million bytes a day of raw server log data. 10 days would add up to a GB of data, which means that ten thousand days, about 30 years' worth of data, would fit on one of those physically little drives pictured above that holds just 1TB of data.
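
    Here's the arithmetic:

```python
page_views_per_day = 1_000_000      # roughly a top-100 US site
bytes_per_log_record = 100
bytes_per_day = page_views_per_day * bytes_per_log_record
print(bytes_per_day)                # 100000000 -- 100MB of raw log per day

one_tb = 1_000_000_000_000
days_per_tb = one_tb // bytes_per_day
print(days_per_tb)                  # 10000 days
print(round(days_per_tb / 365, 1))  # 27.4 -- about 30 years on one small drive
```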

    The Washington Post is a major site; top 100. Their web transaction logs are the biggest data for analysis they've got. And here's what 30 years' worth of their data will fit on:

    (Pictured: a Hitachi Travelstar Z5K500 drive.)

    That's what they call "big data." This is why I instinctively drop into cynical mode when the subject of "big data" comes up. It just isn't usually very big!

    How much data do you need?

    It depends on context. If you're a website like Facebook offering a free service holding users' data, the answer is simple: you keep as much of the users' data as you feel like. You can (and if you're Facebook, regularly do) throw out data any time you feel like it, or just drop it on the floor and lose it because your programmers weren't up to dealing with it.

    If you're a money-making business that depends on data, you could probably run your business better if you

    1. Kept all the data
    2. Analyzed it
    3. Came up with useful observations, and
    4. Changed your behaviors accordingly.

    But most businesses don't do this very well, if at all. And they are feeling increasingly guilty about it. Thus the marketing drum-beat for selling everything that can possibly be labelled "big data."

    Sarcasm aside, the fact is that most businesses don't need much data in order to perform wonderfully useful analyses. The reasons are simple:

    • The things that matter the most are things you're not doing yet. The data you've got is historic. It's like if you're a comedian and the audience doesn't laugh much; no amount of big data analysis of audience reaction will help you come up with better laughs.
    • The impact of big potential changes will be seen in lots of your data. Go back to Statistics 101. How much data do you need to see that the coin you're flipping isn't a fair one? Only enough to prove that the 2 out of 3 times it comes up heads isn't a fluke.
    • In the end, how many changes can you realistically make? Hundreds? How about rank ordering them, finding the most important ones first, then moving on from there? You'll quickly get to diminishing returns.

    Finally, more important than anything else, is getting into an experimental, data-driven, closed-loop system. This is always the key to success. It's how organizations become successful, get more successful, recover from trouble, and stay on a winning path.

    Conclusion

    For better or worse, "big data" is likely to be with us for a while, at least as a technology fashion trend. Like all such fashion trends, it's a useful occasion for getting us all to check whether we're putting our transaction data to its best use in keeping us on the track we're on and getting us onto improved ones.

     

  • I’m Tired of Hearing about Big Data

    I've already complained about the so-called Cloud, and I thought that would get it out of my system, but I've had it — I'm up to here — with blankety-blank "big data."

    "Big Data" at the Harvard Library

    The Harvard Library system is REALLY BIG. Here's Widener Library:

    (Pictured: Widener Library.)

    Widener has 57 miles of bookshelves with over 3 million volumes. But that's just the start. The Harvard University Library is the largest university library system in the world. There are more than 70 libraries in the system beyond Widener, holding a total of over 12 million books, maps, manuscripts, etc.

    Here's the news: Harvard is putting it on-line! Well, not the actual books; there are little details like copyrights. But the metadata, about 100 attributes per object. According to David Weinberger, co-director of Harvard's Library Lab: "This is Big Data for books." A blog post described a day-long test run during which 15 hackers worked on a subset consisting of 600,000 items and produced various results.

    Pretty amazing, huh? That's at least a couple miles of books!

    "Big Data?" Give me a Break!

    Apparently, no one does simple arithmetic anymore. Maybe it's the combined impact of reality TV shows, smartphones, global warming and Twitter. Who knows?

    How much data does Harvard have? 12 million objects each with 100 attributes is 1.2 billion attributes. When I first started thinking about this, I gave generous estimates of the size of each metadata attribute. Then I got skeptical, dug a little, and found the actual data set. As of this month, there are 12,316,822 rows in the data set, with a compressed data size of 3,399,017,905 bytes. In case that appears to be a big number to you, it's less than 4GB.

    4GB. Your smart phone probably has more than that, and your iPad certainly does. The laptop computer I'm using right now has more than that. Yes, yes, the data is compressed. How much will it be uncompressed? I could find out, but I'm lazy, and the answer in the very worst case is going to be about the same: a very small amount of data.

    "Big Data" usually isn't

    The reason I'm really tired of hearing about Big Data is that, in the vast majority of cases, it isn't big. Not only isn't it big, it's usually kinda small. So the people who talk about it as "Big Data" are either stupid or they're liars. Either of which make me irritated.

    My acid test for whether a data set can't possibly be described as "Big Data" without embarrassment or shame is whether it fits into the memory (as in the RAM-type memory, forget about disk) of a machine you can buy off the web for less than $15,000. These days, that's a machine from (for example) Dell that has about 256GB.

    That Dell machine will hold more than 50 copies of the Harvard Library data set in memory. Who cares how big it is when uncompressed? Loads of copies of it will still fit entirely in memory.
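
    The arithmetic, using the row and compressed-byte counts quoted above and the 256GB machine:

```python
objects = 12_316_822
attributes_per_object = 100
print(objects * attributes_per_object)     # 1231682200 -- about 1.2 billion

compressed_bytes = 3_399_017_905
print(round(compressed_bytes / 2**30, 2))  # 3.17 GiB -- "less than 4GB"

ram_bytes = 256 * 2**30                    # the sub-$15,000 server's memory
copies = ram_bytes // compressed_bytes
print(copies)                              # 80 -- comfortably more than 50 copies
```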

    Do not say "Hadoop" or "clusters." If you do, I swear I'll slap you.

    Conclusion

    The conclusion is obvious. Don't even start thinking the phrase "Big Data" unless you've first applied some common sense and performed some simple arithmetic. Definitely don't utter those words around me without first having applied the sobering tonic of arithmetic. And wake up and smell reality: "Big Data" and "Cloud" are little more than words that vendors use to trick unsuspecting victims into spending money on cool new things.

    Then, of course, maybe you really do have big data — like objectively, unusually BIG data. That's cool. It's easier to work with and get value out of large data sets than ever before, and I hope you can use the latest, most productive methods for doing so. Some of the Oak companies are doing just that, and it's exciting.

  • Databases and Applications

    When databases were invented, they solved a huge problem that couldn't be solved any other way. Anyone who cares to look can see that the original problem that caused us computer programmers to invent databases has largely gone away. So why is it exactly that application programmers reflexively put their data in a database? In a surprisingly wide range of cases, it sure isn't because of necessity. Could it, perhaps, be nothing but habit and the little-discussed fact that change happens in software at roughly the same rate that change happens in glaciers?

    From the beginning of (computer) time, instructions have needed to be in memory to be executed, and data has needed to be in memory to be operated on by instructions. The memory in which instructions execute and fiddle with data has always been way faster and way more expensive than the large, slow but cheap places they are put when they're not in memory (call it storage, whether the storage is punch cards or tape or disk). It was this way at the beginning of time and it's true now.

    Think of memory as your work table. Eons ago, your work table was really tiny, like this:

    Tiny table
    You can hardly fit anything at all on it! So you'd better have a really big storage place to keep all your stuff, like a pantry:

    Pantry

    OK, that's cool. You've got all your stuff in storage, but you can only work on stuff when it's in memory (on the table). What do you need? You need to get the stuff you want to work on now from the pantry, and you need to put the stuff you're done with back in the pantry. In other words (if you're in the world of computers)…you need a database!

    The very most basic function of a database is pretty simple: its job is to shuffle your data between memory and disk. It's also nice if it keeps everything straight, avoids dropping bits on the floor, and cleans everything up when something goes wrong during the shuffling.
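    That shuffling job, stripped of indexes and query languages, is small enough to sketch. This is a toy under stated assumptions — a Python dict as the work table, a JSON file as the pantry, single process, write-then-rename for crash safety — not a real database:

    ```python
    import json, os, tempfile

    # Toy "database": shuffle a dict between memory (the work table)
    # and disk (the pantry), without dropping bits on the floor.
    def load(path):
        """Fetch your stuff from the pantry onto the work table."""
        if not os.path.exists(path):
            return {}
        with open(path) as f:
            return json.load(f)

    def store(data, path):
        """Put stuff back in the pantry. Write to a temp file, then rename:
        a crash mid-shuffle leaves the old copy intact, not half a file."""
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)  # atomic on POSIX: all-or-nothing

    accounts = load("accounts.json")
    accounts["123"] = {"name": "Pat", "balance": 42}
    store(accounts, "accounts.json")
    ```

    The write-then-rename trick is the "cleans everything up when something goes wrong" part in miniature; real databases do vastly more bookkeeping, but this is the core errand.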

    That was then. But things have changed. Remember Moore's Law? The amount of memory available to us at surprisingly reasonable prices has grown hugely. Exponentially. Our work tables now look more like this:

    Giant table
    And our pantries? Well, they've grown a bit too:

    Giant warehouse
    So how much data do you have? Run the numbers. It goes without saying that it's going to fit in storage. But how about that work table? Here's the question you have to think about:

    Will all your data fit on the work table (i.e., in memory)?

    With 64GB and even 256GB of memory available at reasonable prices, the answer is often YES! It will!
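    "Run the numbers" means exactly this and nothing more. The row counts below are illustrative assumptions, not anyone's real workload — plug in your own:

    ```python
    # Will the data fit on the work table? Multiply it out.
    # Both figures below are made-up illustrative assumptions.
    rows = 50_000_000        # e.g., years of transaction records
    bytes_per_row = 500      # a generously padded record

    data_gb = rows * bytes_per_row / 1e9
    ram_gb = 256

    print(f"{data_gb:.0f} GB of data vs {ram_gb} GB of RAM")
    print("Fits on the table!" if data_gb < ram_gb else "OK, fine, you win.")
    ```

    Fifty million half-kilobyte rows is 25GB — a tenth of the work table. Most applications are nowhere near even that.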

    Hmmmm. What was it the database does? If all my data fits in memory, why was it I needed that database???

    I know, I know. Databases can be nice for reporting and data analysis and "persistence" and a few other things. I'm not saying you never use them. But for your real application, the one that takes user requests and responds to them, if you don't need to have the database shuttling stuff between the work table and the pantry… Hmmmm.

  • Databases, Logical and Physical

    Everyone seems to assume that having a central database is a good idea. Well, I’m sorry, but “everyone” is wrong. Dead, flat-out WRONG. Get over it.

    Now (if you’re still with me), before you conclude that I’m insane or just stupid, let me point out the applicability of a concept that’s been around in computing, oh let’s say, more than half a century – the difference between “logical” and “physical.” Generally speaking, we get things done in the world of computers by concentrating on the logical world. Then we apply a logical-to-physical mapping, and that makes things work in the real world. We do this all the time. Practically everything in computing involves layers and layers of logical-to-physical mappings – you get layers because the “physical” layer you think you’re mapping to often itself turns out to be logical, and needs mapping again to another layer, and so on recursively.

    However familiar we all are with logical mappings, and however much we depend on them in our daily lives, we also work in a world of assumptions. There’s a set of practices we have grown up with or inherited, and we simply continue them. How could it be otherwise? If you spend too much time questioning assumptions, you’ll be paralyzed and never get anything done.

    The “central database” and how to implement it is one of those deep-down assumptions. There has got to be one reality, one SSN per customer, one balance. This is true and good. I accept it, as a concept that applies to a database at the logical level. But at the physical level?? Exactly one physical copy of the data in a single physical database (not counting backups, etc.)??? Hey, remember that thing we do all the time, that logical-to-physical mapping thing? How about one logical copy of the data in a single logical database, embodied in as many physical databases as are required to get the job done…now, there’s an idea…seems promising, yes?
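    The mapping being argued for fits in a few lines: one logical key space, routed across several physical stores, with the application talking only to the logical layer. Everything here is an illustrative assumption — the hash-mod-N routing rule and the in-memory dicts standing in for real database servers:

    ```python
    # One logical database, N physical ones. The application asks the
    # logical layer; the mapping decides which physical copy holds the row.
    N_PHYSICAL = 3
    physical_dbs = [{} for _ in range(N_PHYSICAL)]  # stand-ins for real servers

    def shard_for(customer_id):
        """Logical-to-physical mapping: every key lives in exactly one shard."""
        return physical_dbs[hash(customer_id) % N_PHYSICAL]

    def put(customer_id, record):
        shard_for(customer_id)[customer_id] = record

    def get(customer_id):
        return shard_for(customer_id).get(customer_id)

    put("123", {"ssn": "xxx-xx-1234", "balance": 100})
    # Still one reality, one balance — one logical copy, stored wherever
    # the mapping says. The caller never names a particular box.
    print(get("123"))
    ```

    The point of the sketch: "one SSN per customer" is a property of the logical layer (`put`/`get`), and nothing about it requires the bytes to live on a single machine.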

    So why is it that, when I walk into some data centers, there is a huge, company-existence-threatening drama about the – get this – one physical database??!!

    It’s so big, it’s slowing down, it’s freezing up almost every day now, what can we do? We’ve already (pick your favorite):

    • “upgraded” to Oracle
    • “upgraded” to Oracle RAC
    • “upgraded” to Sun/HP/IBM mega-servers
    • “upgraded” to ultra-expensive super-storage

    Whine, whine, we’ve spent all this money, “everyone” said we were doing the right thing, now the problem is back, I’m going to lose my job, customers are leaving and/or threatening, my business is under siege, I’ve had the “best” people on the case for months, I’m still in trouble, what can I do???

    Go back and re-read the opening paragraph, the one about central databases being bad – unequivocally, supremely, annoyingly BAD. Remember that paragraph? Now, re-read it in two alternate versions, remembering the concept that pervades your everyday existence in computing, logical-to-physical mappings:

    • Having a central logical database is a good (as in: terrific, why would you have it any other way) idea
    • Having a central physical database is a bad (as in: awful, what are you, stupid) idea

    Any discussion? Any questions? Everyone cowed or shamed into submission? Good.

    Now, if you’re not already in the good place, stop reading stupid sarcastic blogs and go get it done!! On the other hand, if you are already in the good place, I hereby give you permission to wallow in glorious feelings of superiority and smug satisfaction for a little break before you go back and do something useful.
