Category: Big Data

  • Summary: AI Machine Learning Big Data Math Optimization

    This is a summary with links to my posts about Big Data and the fancy algorithms and math that are variously called Artificial Intelligence, Cognitive Computing, Machine Learning, Big Data Analytics, math optimization, etc. I have worked as an insider in these fields for over fifty years.

    Huge amounts of time and money are spent on these things, but getting practical, real-world results is not so easy. You can make things worse.

    https://blackliszt.com/2023/03/making-things-worse-with-ai.html

    How do you make things better? It starts with data.

    https://blackliszt.com/2018/03/getting-results-from-ml-and-ai-1.html

    Everyone knows that data is important. There’s even a field called Big Data.

    What people call Big Data often isn’t nearly as big as people think.

    https://blackliszt.com/2013/02/big-data-some-little-observations.html

    Do the arithmetic!

    https://blackliszt.com/2012/04/im-tired-of-hearing-about-big-data.html

    For all the talk about Big Data, it’s clear that making sure that the data is accurate and complete is too much to ask.

    https://blackliszt.com/2019/03/nobody-cares-about-data.html

    Among other things, the status hierarchy in Data Science makes it clear that anyone with ambition needs to get as far away from the actual data as possible.

    https://blackliszt.com/2019/03/the-hierarchy-of-software-skills-data-science.html

    Without data integrity, your analytics is screwed. It’s worse when the data is about your health. The integrity of data in healthcare electronic medical records is a major issue.

    https://blackliszt.com/2016/06/healthcare-innovation-emrs-and-data-quality.html

    Even simple insurance provider network data is too often wrong, putting yet more obstacles between patients and the services they need.

    https://blackliszt.com/2017/01/my-cat-taught-me-about-the-state-of-healthcare-provider-data.html

    Even if all the data were perfect, no one seems to ask: why exactly is big data better?

    https://blackliszt.com/2015/07/fatal-flaws-of-big-data.html

    Is Big Data delivering results? Here's what happened with Big Data and Hadoop at Yahoo.

    https://blackliszt.com/2019/01/using-advanced-software-techniques-in-business.html

    With all the money going to Big Data, people are ignoring the massive benefits of leveraging little data.

    https://blackliszt.com/2016/08/little-data-vs-big-data.html

    The few people who study computer history have noticed that Big Data is remarkably similar to what used to be called EDW, Enterprise Data Warehouse.

    https://blackliszt.com/2015/10/big-data-and-data-warehouses.html

    Same thing with a new name. Hmm. Could it possibly be that Big Data is just a fashion trend?

    https://blackliszt.com/2013/02/the-big-data-technology-fashion.html

    https://blackliszt.com/2015/03/big-data-the-driving-force-in-computing.html

    At least for insiders, there's lots of humor to be found in data. There's even a book of the fun highlights.

    https://blackliszt.com/2022/01/data-humor-book-by-rupa-mahanti.html

    After data, you move to the amazing variety of techniques available, each of which has specific areas of applicability.

    https://blackliszt.com/2018/04/getting-results-from-ml-and-ai-2.html

    Even a narrow-sounding field like Machine Learning has incredible variety and lots of things you need to do before the algorithm can be effective.

    https://blackliszt.com/2017/02/learning-machine-and-human.html

    The simple concept of closed loop is essential to getting and keeping good results, but isn’t applied as often as it should be.

    https://blackliszt.com/2018/04/getting-results-from-ml-and-ai-3-closed-loop.html

    The application of the principles to healthcare has proven to be challenging, with high-profile failures. But there are successes.

    https://blackliszt.com/2018/08/getting-results-from-ml-and-ai-4-healthcare-examples.html

    I've discussed the application of the principles to fintech, with a focus on anti-fraud. The current leader in the field displaced an incumbent that relied on neural network technology. Here is the background on HNC, which crowed about how it used neural networks to catch credit card fraud. Which it did. Sorta.

    https://blackliszt.com/2016/06/how-blockchain-will-deliver-value.html

    Here's the story of how the absolute winner of credit card fraud detection was displaced by a superior algorithm concentrating on more relevant data.

    https://blackliszt.com/2019/12/getting-results-from-ml-and-ai-5-fintech-fraud.html

    Natural language AI has proved notoriously hard to make practical. Here’s an example of how it’s been done in fintech customer support.

    https://blackliszt.com/2019/12/getting-results-from-ml-and-ai-6-fintech-chatbot.html

    All too often, AI research is conducted in isolation. There is a great deal to do to assure that any results that are achieved can be integrated into production systems.

    https://blackliszt.com/2022/06/how-to-integrate-ai-and-ml-with-production-software.html

    Given all the issues in making AI work in practical reality, how likely is it that generative AI, the hot fashion in 2024, will decimate the job market?

    https://blackliszt.com/2024/07/how-many-jobs-will-ai-eliminate.html

    As is true in most fields, in healthcare the money and attention tend to go to expensive, fancy methods requiring PhDs instead of simple things that actually work.

    https://blackliszt.com/2015/07/cognitive-computing-and-healthcare.html

    https://blackliszt.com/2016/09/healthcare-innovation-from-washing-hands-to-ai.html

    https://blackliszt.com/2016/05/healthcare-innovation-can-big-data-and-cognitive-computing-deliver-it.html

    You don’t need AI or cognitive computing to discover or promulgate the new discoveries that humans make.

    https://blackliszt.com/2015/08/human-implemented-cognitive-computing-healthcare.html

    There are no technical obstacles to having computers do what doctors do. There is a clear path to getting it done. But the people in charge can't or won't make it happen.

    https://blackliszt.com/2025/01/ai-can-automate-what-doctors-do.html

    There are barriers to medical innovation, but there are also terrible risks.

    https://blackliszt.com/2025/02/can-ai-improve-medical-diagnosis.html

    And then there's generative AI, which spews out expert-sounding language like a doctor. Is it better?

    https://blackliszt.com/2025/07/will-ai-give-better-healthcare-advice.html

    https://blackliszt.com/2025/07/chatgpt-covid-vaccine.html

    There are many ways to spend lots of time and not get practical results.

    https://blackliszt.com/2019/03/the-hierarchy-of-software-skills-data-science.html

    Even when an optimization technique is perfected and proven in practice at scale, it can take decades for it to be used in other relevant fields.

    https://blackliszt.com/2019/08/the-slow-spread-of-linear-programming-illustrates-how-in-old-vation-in-software-evolution-works.html

    Impressive analytics performed by recognized experts are subject to the bias of producing the results the expert wants.

    https://blackliszt.com/2017/04/big-datas-big-face-plant.html

    Corruption of data or process can lead to bad results anywhere.

    https://blackliszt.com/2019/10/surprising-bugs-at-amazon-shows-how-ai-can-lead-to-disaster.html

    For cynical definitions of Big Data, Machine Learning, Deep Learning, Cognitive Computing and AI, in the tradition of Ambrose Bierce, see these.

    https://blackliszt.com/2017/01/devils-dictionary-for-21st-century-computing.html

    https://blackliszt.com/2017/01/devils-dictionary-for-21st-century-computing-2.html

    https://blackliszt.com/2017/03/devils-dictionary-for-21st-century-computing-3.html


  • Nobody Cares about Data

    Nearly everyone professes to LOVE data. Just think about all the talk about Big Data, Data Lakes and the rest. Lies. Liars lying big LIES. Everyone says they like data … until they get near it. Suddenly they develop fevers and rashes. They're allergic! Someone else will have to actually handle the data!

    Data, the foundation of AI, ML, Analytics

    All you have to do is get a job in one of these fancy subjects, and you quickly get hit with reality. When you were in school, you had wonderful exercises where you could develop your skills in deep learning, random forest, or whatever. Now in the real world, some older person assigns you to some juicy-sounding task where you'll get to use your skills. "Where's the data?" you ask. A dismissive wave of the hand tells you it's over there. You go over there and can't believe what you see. Why, it's nothing like it was in school! You try for a couple hours to clean things up. Then a couple days. It's still bad, but maybe good enough. So you run it through some models. Disaster. The system crashes and/or generates garbage. You complain. "Grow up," you're told. "This is what we've got to work with. Deal with it."

    At the end of a year, you realize you've spent half your time in meetings of one kind or another, and 90% of the "working" time has been spent trying to get the data in order. With unsatisfying results. You've got some choices to make. You can lie. You can get into management, marketing or sales. You can roll up your sleeves, forget the fancy stuff you learned in school, and become a data clean-up specialist, which is actually more like a create-decent-data-from-scratch specialist. Which is NOT what you signed up for. Waaaaahhhhh.

    What's maybe worst of all is the status. AI and machine learning are clearly the prestigious upper floors of a grand apartment building. Deep learning thinks it's the penthouse, but whatever. The lower floors are occupied by simple analytics. The ground floors are occupied by people managing the databases and Hadoop clusters, and maybe even some ETL tools.

    And then there are the basements. The sub-basement where the garbage chutes end. Where the janitors live. Where the crap from the elegant apartments is taken to be discarded. Where the water and oil and natural gas enter the building — the things the fancy people on the upper floors need to wash up, keep warm and prepare to dress elegantly. That's the floor … and the status … of the data specialists. 

    You can tell yourself until you're blue in the face that without good data, none of the fancy stuff would work. It's the foundation, dammit! The janitors probably tell themselves the same thing about the heat, cooling, hot and cold water, cleaning and garbage removal. True — but they're still janitors, wearing a uniform and passed in the halls by the upper-floor people as though they don't exist. 

    Bad data equals bad results

    There's a simple reason why the incredible potential of the Big Data movement has now morphed into AI/ML and is even incorporating Blockchain. Time passes, tick. Tick. Tick. Tick. No results! Uniform use of the future tense! Claimed successes aren't really successes, when you dig into them.

    Some of the reason is typical organizational incompetence. But much is also due to the fact that we are swimming in a sea of big data and no one wants to clean it up! It's so bad, we mostly don't acknowledge it; much easier just to ignore it.

    This problem isn't new.


    I've talked about the importance of data as the foundation of AI/ML here. I've illustrated the horrendous problem of bad medical data here. Even basic data, like what providers are where, is wrong too often. By the way, these illustrations should be considered informal tips of massive icebergs. When I talk with true experts who are themselves knee-deep in this stuff, I find the situation is … even worse.

    As today's illustration of the problem, let me show you a piece of mail I got. It's from a major corporation, one of the big regional cable companies and internet service providers. They've got decades of experience working with customers in their geography. They've got to know every address, every household, with complete histories of using their service, dropping it, signing up again. How could they not know the basic demographics and the kind of approaches that work and the ones that don't?

    Here's the mail I got (I blocked out the street number):

    [Image: Optimum ad mailer addressed to JO Black]

    Looks OK, right? Nice and clear. Specific to the town, so it feels personal. Lots of good things about it. They even designed the envelope so you could see the plastic card on the right, with an eye-catching banner over it.

    There's just one little problem. JO Black died in May 2001. More than seventeen years ago.

    I don't think there's anything else I need to say except, good job, Optimum! You're doing a great job illustrating the near-universal toxic, rotten ocean of data in which we swim, and doing your part in keeping it that way.

    Wait, you might say, this is a trivial little problem. In a way, it is: one piece of mail that shouldn't have been sent. But it's an illustration of a problem that's broad and deep. The notion that a wrongly sent piece of mail "means nothing, is trivial" is an attitude that is EXACTLY why people who care about data metaphorically wear uniforms and work out of the basement. Maybe you think Optimum is worse than all the rest. Sorry, they're not. JO Black gets a VERY slowly diminishing stream of mail at this address from a wide variety of vendors, large and small. So does Mrs. Grace Black, who died 4 years ago. So does Ms. Jessica Black, who lived here for a while before moving 20 years ago. So do Mr. Samuel Black and Ms. Elspeth Black, who never lived at this address.

    The Problem is Everywhere

    To be totally clear: the problem isn't just wasteful mail solicitations. It's everywhere, at every stage of data collection and utilization. The problem with healthcare data is immense, for example, as I've illustrated often. Bad healthcare data, which is ubiquitous, has the direct result that normal, innocent people needlessly suffer and die. It doesn't get better, because all the smart people and the important decision-makers are busy attending conferences about how AI is transforming medicine and how blockchain will solve all the medical data problems — leaving the ragged crew of people who are supposed to fix the problem ignored in the dank basement, spending their time scheming how they can at least get to the first floor, since it's perfectly obvious that no one is actually interested in … fixing the data problem!!!

    Conclusion

    Everybody says they want data. BIG data. But what they really want is a springboard to do something prestigious, which turning a toxic stream of severely polluted data into something textbook-clean is not. While hardly the only factor, this is a major factor in the widespread untalked-about failures of fancy modern techniques to deliver practical results. The plain fact is, nobody cares about data.

  • The Hierarchy of Software Skills and Status in Data Science

    There is a striking hierarchy of skills in software, as I've explained here. When you dive into any particular aspect of software, you usually find that it's got a hierarchy all its own. Data science is a subject of intense interest these days, so in this post I'll explain some of the basics of the data science skills hierarchy.

    A skills hierarchy is very much an insider's game. What most people care about is status. I talk about the basics of software status here. Remember, the skills hierarchy is a whole world away from the status hierarchy that most people care about. Don't confuse the two!

    Data Science skills

    First and foremost, it's important to understand the incredibly broad range of subjects covered by the term "data science." I attempt to explain the basics of the range in this blog post. You can be just amazing in one of those subjects, while being a neophyte in one that the outside world may consider to be "related," but which in practice is not just down the hall, but in a different building on a different campus. The general understanding of this range is SO pathetic that "data science" is typically managed as a completely independent, free-standing group. Which makes about as much sense as believing that a sous-chef belongs in something that isn't a kitchen. Or that everything that everyone does in a kitchen is basically the same thing.

    Here's one cut at the hierarchy in data science, starting from the base:

    1. Tool users. These are people who have learned how to use some software tool, or maybe a couple. Most "data scientists" fall into this category.
      • They don't understand how the tools are built or any of the underlying software.
      • They may know their tool, but aren't really clear on what the other tools are about, much less when you might consider using one.
    2. Some have broader knowledge of tools
    3. Very few have the sophistication to understand real data analysis, per my series on ML and AI; see this and the links in it: https://blackliszt.com/2018/04/getting-results-from-ml-and-ai-3-closed-loop.html
    4. Of those, even fewer can understand the underlying algorithms and follow the latest research literature
    5. Of those, even fewer can make real algorithmic advances and implement them as tools and deliver practical value.
    6. Of those who can do all of the above, it is rare to meet anyone who can also address and solve the deep tool-level problems in other domains needed to make their code practical in the real world.
    7. Finally, it is extremely rare to meet people who, in addition to their deep and broad prowess in data science and relevant skills needed to make it real-world practical, have similarly deep skills in an associated domain needed to make the data science fully effective for a business.

    The issues of this hierarchy are compounded by the usual over-selling by people who are good promoters but little else, and corporate/government big-wigs who don't want to be bothered with details, but are keen to be seen as "doing something" on such a high-visibility topic. Getting results that make a difference is pretty low on the typical priority list.

    And then of course there are the "data scientists" themselves, who most often are sincere people who are trying to do a good job as they've been taught to do it — mostly by professors and others who have no idea what real-world success looks like, much less how to bring it about.

    Finally, there is the usual "manage something that's invisible to you" phenomenon I have often discussed in this blog, which leads to so much dysfunction and so many wonderful Dilbert cartoons.

    Conclusion

    People talk as though "data science" were a thing, with the usual kind of hierarchy based on level of management and/or "experience." Those typical patterns of hierarchy just don't cut it for understanding what's going on in data science, just like they don't cut it for understanding software development. We will continue to see waste and dead-end efforts until we at least make a start at making our understanding of data science more sophisticated, and aligned with the facts on the ground.

  • Using Advanced Software Techniques in Business: Fashion or Real Value?

    Using advanced software techniques can make a dramatic positive impact on business. It's important for everyone to ensure that their software efforts aren't stuck in outmoded, last-generation tools and techniques.

    Nearly everyone, including me, agrees with this simple statement. Nearly everyone also agrees on at least the top members of a list of reasonable candidates of “advanced software techniques” that are not “out-moded” or “last-generation.” That last statement is where the best technical people part ways with the crowd.

    No, I'm not talking about hard-to-understand weirdos babbling about esoterica in some corner. The best technical people understand and support using the best and most appropriate techniques for solving a given problem, regardless of the recency of the technique or its prevalence. Sadly, it is often the case that the most-talked-about hot trends in software should not be used in any business that actually wants to spend money wisely and get stuff done, quickly and well.

    This is a BIG subject. It’s important. It’s deep. And it’s extensive. So let’s start with a big, fat, juicy example, one that was hot, hot, HOT but is now fading away, so it’s possible to talk about it somewhat more rationally. Maybe. I hope.

    Big Data and Hadoop

    There is little doubt that Big Data is a huge trend in software, though at least talking about it with that name appears to be undergoing a typical slow fade. Here is a review I wrote more than 5 years ago of the Big Data fashion trend as it existed at that time. It was everywhere you looked! Magazine covers! Ads! Conferences! Books! If you weren't somehow doing Big Data you were nothing and nobody.

    I've been working with data my entire professional life. The data has been small, medium, large, big, huge, totally awesomely huge and even gi-normous. Having constantly faced space and time constraints, I long ago settled on a fundamental concept of computing, simple but rarely practiced: count the data! It sounds ridiculous, but it's almost a secret weapon. Here is some analysis of data sizes in the context of the big data trend. Here is a more detailed example of a big data set that Harvard bragged about. Hint: the data isn't very big.

    I shouldn't have to say this, but here it is: for anything that people say is "big data," the very first step should be to … count the data. Sounds simple, but apparently it's not. It's also not common to dig into the data a bit. I guess that's because, in most cases, data that starts out looking big turns out after you count it (a rare act in itself) to be mostly unneeded or irrelevant. Which makes it not big anymore. Which means you don't need Hadoop!
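
    To make "count the data" concrete, here's a minimal sketch in Python — the file name and the "relevant" filter are invented for illustration:

        import os

        # "Count the data" before reaching for Big Data tools: how big is it
        # really, and how much of it actually matters? (file name hypothetical)
        PATH = "events.log"

        total_bytes = os.path.getsize(PATH)
        total_lines = 0
        relevant_lines = 0

        with open(PATH, errors="replace") as f:
            for line in f:
                total_lines += 1
                if "checkout" in line:  # stand-in for whatever subset matters
                    relevant_lines += 1

        print(f"{total_bytes / 2**30:.1f} GiB, {total_lines:,} lines")
        print(f"relevant: {relevant_lines:,} "
              f"({100 * relevant_lines / max(total_lines, 1):.1f}%)")
        # If the relevant subset fits comfortably in one machine's RAM,
        # it isn't Big Data.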

    Hadoop

    I know I'm being silly here. Hey, we're talking Big Data! Surely we've got some somewhere. We've got to get in an expert and crunch away so we can get those virtuous, business-enhancing juices flowing through our company.

    In this situation, at least until recently, what that meant was that you dove into the next level of detail and found out that the go-to tool was Hadoop. Dig in some more, and it sounds great. It's scalable without limit. You build your Hadoop cluster, script up some calculations, tap into the ocean of data you've got somewhere, and hear about how Hadoop spins those computers up and down and crunches all the data, using the computers that are available, even working around ones that fail without anyone having to respond to some old-style beeper or something. No wonder Hadoop is the go-to tool for Big Data!


    In the vast majority of situations, it's "decision made" time at this point. You get in your experts, they build their Hadoop cluster and away you go, climbing the Hadoop stairway to Big Data Heaven, with a glow of virtue surrounding everyone involved.

    Very few people seem to dive in and understand what Hadoop and its main programming paradigm MapReduce are all about. The Hadoop "experts" don't seem to know what the reasonable alternatives are, and when they might be applicable.

    Here's an example. In 2011, one of our large web companies had a huge problem caused by Google's move to a new search algorithm. The CTO grabbed a massive web log file, wrote some code to boil the terabytes (Big Data for sure!) of data down to the key data elements, and then loaded them into the 512GB of DRAM of his powerful laptop computer and ran some advanced machine learning against it. You can see the CTO doing the work here. A few days later he had figured out Google's algorithm, reflected it in the company's website family, and traffic increased back to nearly the pre-change norm. If he had taken the Hadoop path, he would have worked for months, spent huge amounts of money, and found that the cluster and Hadoop thing would have basically been irrelevant to the problem.

    Here are a couple things to consider:

    • Hadoop, by definition, spreads its computing over the many machines available to it in the cluster, using HDFS (the Hadoop file system) for reading and writing data.
      • It is literally thousands of times faster to get data from local memory than it is to get it from a disk-based file system. The fewer file reads and writes needed to perform a computation, the faster it will be. Hadoop doesn't care.
      • Using more computers in a cluster means that there will be more I/O than using fewer. Many important calculations can be performed on a single properly-configured machine!
    • MapReduce, the key processing engine of Hadoop, is one of those cool-sounding ideas whose job can be done perfectly well with normal code, which can do far more, vastly more efficiently — see the sketch after this list.
    • Why would anyone consider such an insanely wasteful approach? Once you know the origins, it makes sense.
      • If you're a big search engine company, you have to have loads of servers, enough to hold all the data and handle all the search queries at peak traffic times.
      • As is typical in situations like this, loads of servers will be under-used a large fraction of the day. Why not write some code that sucks up these "free" cycles and puts them to work? Why not build a framework so you can just specify what you want done, without worrying about which resources on which machines get used? Who cares if it's inefficient? It gets stuff done with the computers I already have. Brilliant!
      • Now it makes sense that Hadoop started and grew at Yahoo, copying some ideas about a narrowly-applicable (MapReduce) system and framework built at Google.
      • Except that at Yahoo, they somehow decided to make the Hadoop machines dedicated! Last I heard, they were up to, get ready … 40,000 servers. Wow.
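
    To make the "normal code" point concrete, here's a minimal sketch of the canonical MapReduce demo job — counting hits per URL in a web log — done as ordinary single-machine Python. The log file name and the Apache-style field layout are assumptions:

        from collections import Counter

        # The classic MapReduce demo -- counting hits per key -- as plain
        # single-machine code. Assumes a space-separated web log with the
        # requested URL in the 7th field (Apache-style); adjust to taste.
        hits = Counter()

        with open("access.log", errors="replace") as f:
            for line in f:
                fields = line.split()
                if len(fields) > 6:
                    hits[fields[6]] += 1  # "map" and "reduce" in one pass

        # The boiled-down result is tiny and sits comfortably in memory,
        # ready for whatever analysis comes next -- no cluster required.
        for url, n in hits.most_common(10):
            print(f"{n:10,}  {url}")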


    • With such an investment in getting value out of Big Data, Yahoo must be booming, just sky-rocketing with all the juice that has come out of the investment. Not. Why would anyone want to use an expensive, strange tool that generated no value for its originator? One word: fashion.
    • Yes, there are some narrow situations in which Hadoop might be applicable. But in the vast majority of cases, you'll spend too much time getting way too many computers to do too little processing on not all that much data, and taking way too much time to get it done.

    Conclusion

    There is no doubt — none! — that you should use advanced software techniques in your business, because they can give you a competitive edge over everyone else.

    The trouble is telling the difference between (1) value-adding advanced software techniques and (2) hype and software fashion. Even most software people have trouble telling the difference! In fact, software people who insist that there is a difference between value-adding software techniques and the latest thing that everyone is talking about run the serious risk of being marginalized, and categorized as being old farts who are unable or unwilling to do the work to learn the new methods.

    In sharp contrast to the general thinking, software is a pre-scientific, fashion-driven field that resists holding new ideas to reasonable standards of proof and evidence. This makes it tough for business executives to know what to do. There is only one approach that works: roll up your sleeves, put ego and pride to the side, and figure it out using evidence and common sense.

  • Getting Results from ML and AI: 3, Closed Loop

    Getting practical, real-world results with ML and AI involves more than getting data, doing calculations, and building models. You can do everything else right, but if you don’t get this last step right, you’ll join the rapidly growing ranks of people who may have tried hard, but ended up accomplishing little in real-world terms.

    The first part of this series laid out the issues and concentrated on the indispensable foundation of success, the data. The second part of this series dove into the analytic methods that can be used to generate value, with some advice about how to sequence the methods used.

    In this post, we’ll concentrate on the relationship between the real world and the back office analytical work. What we’ll find is that an integrated, collaborative, closed-loop relationship between measuring, calculating and real world application is the path to success.

    Loops, open and closed

    Whether you run an operation closed loop or open loop is one of those absolutely key concepts, highly correlated with success, that is rarely discussed. Generation after generation discovers it by itself, or not, nearly always without fanfare. Who talks about the key role played by the invention of the governor in 1788 by James Watt – the invention that made his steam engine practical? In that case, the governor was the newly-minted part of a steam engine that kept the engine's speed reasonably constant. With a governor, steam engines no longer ran away and destroyed themselves, as they regularly did before its use.

    [Image: centrifugal governor]

    It's important to understand that the reason a governor works is that it's an integral part of the steam engine. The engine drives the governor, which in turn controls the engine's throttle valve, admitting less steam when the engine starts running dangerously fast. This is closed-loop.

    In more modern terms, running open loop means going on and on down a path without real-world feedback and testing of your work as it is being developed. It's a little like trying to walk to a goal post at the opposite end of a football field with your eyes closed, using a carefully planned sequence of steps and turns. That's open-loop, which essentially means no feedback. But shockingly enough, a huge fraction of highly technical efforts in software and analytics operate in just this way! The people in charge insist they're experienced, they've got a thoroughly vetted plan, and everyone should let them alone to get their work done.
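
    Here's a toy illustration of the difference — all the numbers are invented. The open-loop walker executes a fixed plan blind; the closed-loop walker re-plans from measurement at every step:

        import random

        TARGET = 100.0

        def run(closed_loop: bool, seed: int = 1) -> float:
            rng = random.Random(seed)  # same disturbances for both runs
            value = 80.0
            step = (TARGET - value) / 20  # the plan: 20 equal steps
            for _ in range(20):
                value += step + rng.gauss(0, 1.5)  # the world drifts anyway
                if closed_loop:
                    step = (TARGET - value) / 5  # re-plan from measurement
            return value

        print(f"open loop:   ended at {run(False):6.1f}")
        print(f"closed loop: ended at {run(True):6.1f}")

    The open-loop run ends wherever the accumulated drift takes it; the closed-loop run homes in on the target despite the identical disturbances.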

    There are many similarities between war-time software and running closed loop for analytics. Driving towards a goal, letting nothing get in the way. Optimizing for speed, not expectations. Leaping to a place that's better than today, and then cycling improvements.

    The easiest way to see the difference is thinking about the previous posts in this series. Have you spent lots of time with data, and applied simple calculations to it? If not, you should. Once you have, … you should put your new understanding into practice! It may not be the very best solution that’s possible, but if it’s marginally better than what’s in place today, you should roll it out at least in a limited way and see how the world reacts to it. You’ll learn stuff! You may end up learning there are more variables you need to account for, different ways it needs to be applied, all sorts of things! In other words, don’t sit on the beach by the water for months – wade right in and see what it’s like. That’s when you’ll really start learning.

    The World Responds and Changes

    The key concept to understanding why running closed loop is so important is that the “world” is an incredibly complex, ever-changing set of actors. When you do something – almost anything – the world changes in response to what you did, if only in a small way. You have to run closed loop to respond appropriately as the world responds to your actions.

    Oh, you may say, I’m just the genius in the back room who’s an expert in this or that branch of ML. I’m not acting on the world. I just need the time and support to get my amazing modeling work done.

    That may be true. And that’s the problem! The whole point of doing ML/AI/etc. is to change something in the world –  and it’s guaranteed that the world will change in response! Accounting for the responsive changes is just as important as whatever it is you first put out there. Even worse, the world constantly changes independent of anything you may do. So the solution you modeled for may not be valid, given the changes that happened.

    Think about the carefully planned walk on the football field to the goal posts I described above, and how hard it would be to accomplish with your eyes closed, i.e., with no feedback. Now think about the same situation, except there's an opposing team on the field! You carefully study everything about the opposing team. You know who they are and where they are. Then the play starts and you start to execute your exquisitely planned march to the goal posts. Here's the trouble: opposing team members see what you're doing, and they change their positions! They move! Even worse, they run towards you and try to tackle you. And you are helpless, because you are carefully executing your wonderful plan with your eyes closed, unable to react to the other team's movements. Is that stupid or what? It's not just stupid, it's inconceivably stupid. That's why I spelled it out, because that's exactly how most ML and other analytics efforts are carried out. Open loop. Assuming that the world does not change in response to what you do.

    Of course, the world is unlikely to be quite as single-minded and determined as members of an opposing football team. But you'd be surprised! You're making changes in the real world. Whatever you do, there are probably losers. Losers who won't be happy, and will change their behaviors so they become winners again. Or simply fail to act in the predicted ways.

    Conclusion

    Running closed-loop is absolutely indispensable to achieving success. Put something simple in the real world and then cycle, making it better and better, using increasingly sophisticated techniques. Whatever your final crowning technique is, whether it's ML, AI or something else, success will be yours, and you'll enjoy it all along, without the risk, anxiety and likely failures of the usual highly planned methods.

  • Getting Results from ML and AI: 2

    Getting practical, real-world results with ML and AI isn’t just a matter of hiring some people with the right credentials and throwing them at the project. Most such efforts start with fanfare but then fade into failure, usually quietly. The first part of this series laid out the issues, described the path to success, and concentrated on the indispensable foundation of success, the data. Data that has to be collected, corrected, enhanced and augmented – a time-consuming process that has no “advanced algorithm” glory, but MUST be done, and done well.

    In this post, we’ll concentrate on the analytic methods that a successful project uses to generate value from the data so arduously collected and corrected.

    A Little Background

    I say I’m focused here on ML and AI. I just said that because it’s what everyone is talking about. What I’m really focused on is algorithms for understanding and getting value out of data. So I lied. Even worse, I’m not sorry – because just thinking that what’s important is to use the latest ML and AI techniques is central to the failure of most such efforts to deliver value.

    I guess I can get over my programmer-ish prissiness that things are getting new names. What I refuse to get over is that lots of important, really valuable techniques are usually left out of the grab-bag of “ML and AI.” I won’t be comprehensive, but I think a glance at the landscape might help here.

    There are a couple different ways to understand useful algorithms and how they came to be. Roughly, they are:

    • Follow the algorithm, with a fuzzy lens for the naming and details
    • Follow the academic departments that “own” the algorithm
    • Follow the problems the algorithm has proven to be good for

    These ways overlap, but provide useful angles for understanding the algorithms and where to find them.

    Let's illustrate this with an amazing, powerful algorithm that is usually sadly ignored by people who are into ML and AI. It's most often called linear programming (LP). Those who are into it think of it as one of a category of algorithms called mathematical programming. More broadly, it's normally "owned" by academic departments of operations research (OR). OR studies repeating operations, like responding to appliance repair calls or controlling the output of oil refineries as prices and costs vary, and optimizes the results. It's been used for decades for this purpose in many industries, and is being rolled out today to schedule infusion centers and operating rooms in hospitals.

    This isn’t the place to spell it all out, but knowledge of amazing algorithms like LP is scattered over departments of Engineering, Computer Science, Math, Operations Research, Statistics, AI and others. The point is simple: the world of useful algorithms and modeling techniques is vastly greater than ML and AI.

    The Natural Sequence

    There are dozens and dozens of methods that can be used to analyze and extract value from data, which after all is the point of ML and AI (and, by implication, all the other great algorithms). As I described in the prior post, there is a natural progression or sequence of methods, which roughly follows their order of discovery and/or widespread use. Success usually comes from using the methods in the rough order of their invention as you climb the mountain of understanding from simple and obvious (in retrospect) results to increasingly non-obvious and subtle results.

    I often see the following reaction to this concept, rarely articulated but often acted upon: “Why would I want to waste everyone’s time playing around with obsolete, outdated methods, when I’m an expert in the use of the most modern ML and/or AI techniques? I’m sure that my favorite ML technique … blah, blather, gobbledygook … will yield great results with this problem. Why should I be forced to use an ancient, rusting steam engine when I’m an expert in the latest rocket-powered techniques, ones that will zoom to great answers quickly?!”

    The unspoken assumption behind this modern-sounding plea is that analytical techniques, ranging from simple statistics and extending to the latest ML, are like computers or powered vehicles. With those things, the latest versions are usually WAY better than prior versions. You would indeed be wasting everyone’s time and money if you insisted on using a personal computer from the 1980’s when modern computers are many thousands of times better and faster.

    The trouble with this line of thinking is simple: the metaphor is inapplicable. It’s wrong! Analytic techniques are NOT like computers; they are like math, in which algebra does not make simple math obsolete – algebra assumes simple math and is built on it. Calculus does not make algebra obsolete – calculus assumes algebra and is built on it! And so on. Each step in the sequence is a refinement that is built on top of the earlier one. No one says, now that I know calculus, I refuse to do algebra because it’s old and obsolete. See this for more on this subject.

    So it does make sense to quickly apply simple methods to the data to get simple answers, and at the same time vet your data. No time is wasted doing this. On the other hand, if you jump straight to someone’s favorite ML technique, not only is it likely that inaccurate and incomplete data will render the results useless … you won’t even know anything is wrong! Because most ML techniques do nothing to reveal problematic data to the researcher, while simpler methods often do!

    Fundamental Analytical Concepts: Calculate it methods

    The simplest and most useful methods are ones in which you simply calculate the answer. There’s no modeling, no training, no uncertainty. These methods are highly useful for both understanding and correcting the data you’ve got. The basic methods of statistics like regression apply here, and so do the methods of data organization and presentation usually called OLAP, BI and dimensional analysis. The tools associated with a star schema in a DBMS apply here, which are roughly the same as pivot tables in Excel.

    Graphing and visualization tools are important companions to these methods; they help you really understand the numbers and see to what extent they make common sense and match reality. For example, you can see to what extent a doctor’s years of experience correlate with ordering tests or issuing prescriptions of a certain kind; or simply identify the doctors whose actions stand out from the rest. There could be a good reason why they stand out; wouldn’t you like to find out why? Maybe the doctor should be emulated by others, or maybe the doctor should be corrected; either way, you should figure it out.
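
    As a minimal sketch of this kind of calculate-it analysis, here's the doctor example in Python with pandas — all the numbers and field names are invented for illustration:

        import pandas as pd

        # Hypothetical table: one row per doctor, with years of experience
        # and how many of a certain test they order per 100 patients.
        df = pd.DataFrame({
            "doctor": ["A", "B", "C", "D", "E"],
            "years_exp": [3, 8, 15, 22, 30],
            "tests_per_100": [12.0, 11.5, 13.1, 12.4, 38.9],
        })

        # Simple calculation -- no model, no training: a correlation,
        # then the doctors whose actions stand out from the rest.
        print("experience vs. test-rate correlation:",
              round(df["years_exp"].corr(df["tests_per_100"]), 2))

        mean, std = df["tests_per_100"].mean(), df["tests_per_100"].std()
        outliers = df[(df["tests_per_100"] - mean).abs() > 1.5 * std]
        print("doctors who stand out:")
        print(outliers)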

    Until you've pursued all the lines of thinking based on these simpler methods, it's premature to move on.

    Fundamental Analytical Concepts: Solve/Optimize it methods

    These are, IMHO, the gold standard of algorithmic improvement. When applicable, they tell you how to reach a provably optimal result! No training. It takes experience and judgment to apply the generic algorithms to a particular problem set, and sometimes the problem needs to be adjusted. But the results are stellar.

    First, you create an equation that measures what you’re trying to optimize. Is it fastest time? Lowest cost? Least waste? Some combination? Whatever it is, that’s what you’ll maximize or minimize as the case may be.

    Next, you determine the constraints. You only have so many operating rooms? This kind of machine failure requires a repairman with that kind of skill? Then you put in the inputs and solve. While I'm leaving out lots of detail, that's the basic idea.
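
    Here's a minimal sketch of the mechanics with SciPy's linprog — a made-up two-procedure scheduling toy, not a real hospital model:

        from scipy.optimize import linprog

        # Toy LP (all numbers invented): two procedure types; maximize
        # total value subject to limited room-hours and surgeon-hours.
        #   maximize   3x + 2y        (linprog minimizes, so negate)
        #   subject to  x +  y <=  8  (operating-room hours)
        #              2x +  y <= 10  (surgeon hours)
        #               x, y   >=  0
        res = linprog(c=[-3, -2],
                      A_ub=[[1, 1], [2, 1]],
                      b_ub=[8, 10],
                      bounds=[(0, None), (0, None)])

        print("optimal plan:", res.x)  # provably optimal -- no training
        print("objective:   ", -res.fun)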

    These methods, usually of the OR kind, have been applied with great success for decades. In certain fields and industries, they are part of the standard operating procedure – it would be unprofessional to fail to apply them. And you would rapidly lose to the competition.

    Fundamental Analytical Concepts: Train it methods

    The training methods all require sample data sets on which to "train" the model. Selecting and controlling the data set is key, as is avoiding over-training, in which the trained model can't generalize beyond what it's been trained on, and thus loses most of its utility.
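
    A minimal sketch of guarding against over-training, using scikit-learn on synthetic data — the held-out test set is what reveals a model that has merely memorized:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        # Synthetic data stands in for the (carefully vetted!) sample set.
        X, y = make_classification(n_samples=500, n_features=10,
                                   random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # An unconstrained tree can memorize the training set -- the classic
        # over-training failure, visible only on held-out data.
        for name, model in [
            ("deep tree", DecisionTreeClassifier(random_state=0)),
            ("shallow tree", DecisionTreeClassifier(max_depth=3,
                                                    random_state=0)),
        ]:
            model.fit(X_tr, y_tr)
            print(f"{name}: train {model.score(X_tr, y_tr):.2f}, "
                  f"test {model.score(X_te, y_te):.2f}")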

    Fundamental Analytical Concepts: Train it methods: white box

    What characterizes these methods is something incredibly important: what the model does can generally be explained in human-understandable terms, i.e., it’s “white box.” This has huge value, if only to gain acceptance for what the model does – but it may also bring up problems with data that can lead to further improvements.

    There are lots of ML algorithms that are in this category. All the decision tree methods are here, among them the very important random forest method, along with methods that arose within the field of statistics such as CART.
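
    Here's a minimal sketch of the white-box property with scikit-learn: the fitted tree can be printed as human-readable rules that a domain expert can review:

        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier, export_text

        # Fit a small tree, then print it as plain if/then rules --
        # the "white box" property in action.
        iris = load_iris()
        tree = DecisionTreeClassifier(max_depth=2, random_state=0)
        tree.fit(iris.data, iris.target)

        print(export_text(tree, feature_names=list(iris.feature_names)))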


    Fundamental Analytical Concepts: Train it methods: black box

    These methods can produce amazing results, and should be used whenever necessary, i.e., whenever earlier methods in the sequence can’t be used. The fact that the model is “black box” means that it’s difficult if not impossible to understand how the model makes its decisions in human terms – even for an expert.

    These methods include neural networks in all forms, including all the variations of “deep learning.”
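
    For contrast, a minimal black-box sketch: a small neural network fit with scikit-learn. It may score well, but its fitted weights are the whole "explanation":

        from sklearn.datasets import load_iris
        from sklearn.neural_network import MLPClassifier

        # A small neural network: often accurate, but the weights below
        # are the entire "explanation" -- no human-readable rules to print.
        iris = load_iris()
        net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                            random_state=0)
        net.fit(iris.data, iris.target)

        print("accuracy:", round(net.score(iris.data, iris.target), 2))
        print("first-layer weights:", net.coefs_[0].shape)  # just numbers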

    Fundamental Analytical Concepts: Rules

    Finally, I add the indispensable attribute of success in many practical systems: human-coded rules. These can be inserted at any point in processing, as early as enhancing the data before any methods work on it, and as late as modifying the results of final processing. While not often explicitly discussed, few practitioners with successful track records avoid the use of rules altogether. They may not be pretty or fancy or elegant – but they work, darn it.
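
    A minimal sketch of what rules look like in practice, wrapped around a model's score — the thresholds, field names and model score are all invented:

        # Hand-coded rules around a model: applied to the input before
        # scoring and to the output after. All values are illustrative.
        BLOCKED_COUNTRIES = {"XX"}

        def preprocess(txn: dict) -> dict:
            txn = dict(txn)
            txn["amount"] = min(txn["amount"], 100_000)  # cap data errors
            return txn

        def postprocess(txn: dict, model_score: float) -> float:
            if txn["country"] in BLOCKED_COUNTRIES:
                return 1.0  # the rule overrides the model outright
            return model_score

        txn = preprocess({"amount": 250_000, "country": "XX"})
        print("final fraud score:", postprocess(txn, model_score=0.12))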


    More elaborate than sets of rules is the technique in AI of expert systems. This is a whole big subject of its own. Generally speaking, if you can get useful results from one of the sequence of methods up to and including white box training systems, you should do so. But important categories of problems can only be solved using expert systems, which ideally should be as white box as possible.

    Conclusion

    There is a broad range of analytic techniques that can be applied to a given problem. There is an optimal sequence for understanding the data and the problem. Going from one step in the sequence to the next, when done correctly, isn’t abandoning a method for something better, but first picking the low-hanging fruit and then moving on to catch tougher stuff. Prejudging the best technique to use before really getting your hands dirty is a mistake. Being a specialist in a particular method, e.g. “deep learning,” and confining your activities to that method alone can get you hired, paid and busy, but may lead to no useful results, or results far less useful than they could be.

  • Big Data’s Big Face-plant

    Big Data is huge. Everybody wants it. If you're not doing it, you're hopelessly antiquated. But it has serious flaws. The high-profile role played by Big Data in the recent election provides an excellent example. Calling those efforts a "face-plant" is kind. In addition to illustrating many of the glaring flaws I have previously enumerated, this face-plant clearly and explicitly demonstrates the corrosive effects of bias: the experts weren't seeking the truth — they were rooting for an outcome. Given the undeniable predictive failure, you'd think a little self-reflection might be in order. This post uses the recent election Big Data failure as an example. The flaws it illustrates, and others, are common in Big Data efforts, and are the reason why so many of the much heralded efforts result in no substantial benefit.

    The Big Data Experts

    In recent years, Big Data election experts have attained great visibility. Their pronouncements are more closely followed than those of the candidates themselves. Nate Silver has been the reigning god, but a new one exploded onto the scene this election season. Here's the story as it appeared in Wired Magazine, just days before the election:

    [Image: Wired story]

    The story got serious attention, as you can see from more than 24,000 Facebook shares. How big is this guy and organization? Real big:

    [Image: Wired story on his reach]

    Who is this guy? Read on:

    [Image: Wired story on his background]

    Clearly a massive math and science wonk. No one else gets into Caltech, much less gets a Stanford PhD in science.

    What did he say about the election? Of course the picture changed as election day drew close, but all the math pointed strongly to a Clinton victory.

    The debate as the election drew close was interesting. It wasn't whether Clinton would win — everyone thought she would — but since they're math guys and they know this isn't physics, they argued about the probability she would win, and about the margin predicted.

    Dr. Wang ratcheted the probability of a Clinton win all the way up to 99%. That's pretty darn certain! Here's his argument for why such certainty was reasonable:

    [Image: Wang's argument for 99% certainty]

    Yup, it was sure a giant surprise, all right!

    Here is his description of his calculations and why they're reasonable, if you can stand it. If not, that's OK, just skip ahead:

    [Image: Wang's description of his calculations]

    There's lots more stuff on the site. By all means check it out for a great example of self-delusion by a celebrated Professor Doctor. Here is a sample:

    [Image: sample from the Princeton Election Consortium site]

    For any readers who actually know math and science, you'll know right away that this is a specious argument: it's a lot of words that are math-y, but they bear no real relationship to the actual probability of Clinton winning.
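
    One way to see the problem: multiplying per-state certainties treats polling errors as independent, when in reality much of the error is shared across states. Here's a toy simulation — all the numbers are invented — showing how quickly the "near-certainty" evaporates:

        import random

        # Toy illustration: 10 "states," each polling 52-48. If state errors
        # were independent, losing most of them at once would be nearly
        # impossible. A modest *shared* polling error changes that entirely.
        rng = random.Random(0)
        N_STATES, TRIALS = 10, 100_000
        upsets = 0
        for _ in range(TRIALS):
            shared = rng.gauss(0, 2.5)  # error common to every state
            wins = sum(52 + shared + rng.gauss(0, 1) > 50
                       for _ in range(N_STATES))
            upsets += wins < 6  # lose the majority of states
        print(f"upset probability with correlated errors: "
              f"{upsets / TRIALS:.1%}")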

    Late afternoon of election day, he posted his last prediction:

    [Image: Wang's final 2016 prediction]

    This was not a search for truth

    How could Professor Doctor Neuroscientist Sam "Election Hero" Wang have gotten it so wrong? In addition to committing many of the standard errors and unusually bad interpretations of probability I've mentioned, there's another reason: Wang was not seeking truth. Dr. Wang was an advocate. He badly wanted an outcome. He wasn't predicting for prediction's sake — he was predicting to find out which races were close, so that scarce funds could be allocated to sway the outcome of those close races. How do we know? Here are Wang's own words in that same final post, which he repeats with emphasis in the comments:

    [Image: Wang's own words on activism]

    This also explains how he got famous — he was drizzling science-y pixie-dust on the outcome that he and many other people wanted. He told them what they wanted to hear.

    Could it be that Dr. Wang has an unblemished track record of prior predictions, and let his emotions get the best of him in the 2016 election? Sadly, no.  Look at this powerful — 98% probability! — prediction, his final one before the 2004 election:

    [Image: Wang's final prediction before the 2004 election]

    What we've got here is an advocate posing as a scientist, spouting out what his fans want to hear with lots of math-geek talk to make it sound solid, but who gets it badly wrong. Repeatedly. Surely, all right-thinking people would turn their backs on him, right? Science is about making predictions that come true, and if your predictions are wrong, you're just a promoter with no credibility, right?

    Sadly, no.

    [Image: a prospect]

    There is clearly an audience for people who tell readers what they want to hear with math-y icing on top.

    Conclusion

    The Big Data juggernaut rolls along, its momentum unabated. The face-plant of Big Data analytics in the 2016 election should have been a wake-up call, regardless of your political views, about the inherent dangers and deep biases that send all too many Big Data efforts into the gutter of failure. Everyone appears to have moved on unchanged, which makes sense, because it was never really about science and truth to begin with. It's sad to see exotic BIG data efforts getting lots of money and attention, while humble LITTLE data problems cause daily pain but stay starved for funding. See this. However, if you want to get value out of Big Data and associated technologies, be assured that it can be done. Just take this story as another note of caution.

  • Devil’s Dictionary for 21st Century Computing

    Ambrose Bierce wrote the Devil's Dictionary in 1910, delighting and edifying cynics everywhere. Stan Kelly-Bootle wrote a new version for the world of computing called the Devil's DP Dictionary in 1981, and a later edition in 1995 called the Computer ContraDictionary. These are timeless works, providing valuable insight and inspiration for cynics to this day. But there are modern computing terms that came into use after these geniuses had passed on to their reward. It's time for at least a first draft of a Computer Cynic's Dictionary for the 21st Century.

    Ambrose Bierce

    Mr. Bierce started publishing definitions many years before the first book appeared. Here is the start of a column from 1881:

    [Image: start of Bierce's 1881 column]

    You can see that from the very start, Mr. Bierce had the ability to get at the heart of things using few words.

    Stan Kelly-Bootle

    Ambrose Bierce was clearly a tough act to follow, but the new computer technology was such rich soil that Mr. Kelly-Bootle felt that an attempt had to be made. And a heroic attempt it was, providing insight and edification all these years later. The following couple of simple definitions get right to the point:

    [Image: definitions from the Devil's DP Dictionary]

    In other definitions, he gets a bit more cutting:

    [Image: more definitions from the Devil's DP Dictionary]

    Cynicism in the 21st Century

    Many new terms have entered the world of computing since Mr. Kelly-Bootle last graced us with his wisdom. Reasonable people may ask, "is cynicism dead?" "Will such juicy targets remain unskewered?"

    I have searched high and (especially) low, and found nothing but piles of dry computer-babble, peppered with ignorance and misinformation. I have yet to find a good source of penetrating definitions for any of the terms being thrown wildly about in today's discourse. I feel I have no choice but to offer some of my own definitions, sad exemplars of the type though they be, in hope of challenging those with the true, deep knowledge of a Bierce or Bootle to counter with their own superior definitions.

    Here is the first installment. Should I somehow avoid assassination, more will follow in future posts.

    Big Data

    A subject of which no self-respecting executive may claim ignorance; an expensive, ever-growing collection of hardware and software managed by people who spout a dizzying array of acronyms with confidence and certainty, with mounting expenses and benefits that are just about to be realized.

    A collection of data, presumed to be large but normally fitting in a backpack with room to spare, which is said to contain untold riches if only they can be found and unlocked with mysterious keys like Hadoop.

    An approach to analyzing incredibly huuuuge collections of data that has been recently invented, bearing no resemblance whatsoever to outdated technologies such as data warehousing and business intelligence, and sharing none of their drawbacks.

    Artificial Intelligence

    A kind of intelligence, sometimes implemented by computers, which would be decisively rejected by all right-thinking people if it were food. It is the opposite of organic, free-range, unprocessed intelligence – it is chock-full of GMO’s, fructose and artificial ingredients of many kinds.

    The growing crisis of insufficient intelligence is being addressed by some leading scientists, who are leading the way in the creation of artificial intelligence to fill in the gaps left by inadequate supplies of naturally-occurring intelligence. Like the green revolution in agriculture, many hope that this emerging “grey revolution” will put a stop to the persistent intelligence shortages that make so many miserable. While some elites sneer that artificial, non-organic intelligence is deeply harmful, most of the deprived are glad to be served intelligence of any kind, however artificial it may be, rather than their current meager diets containing precious little intelligence of any kind.

    A purposely vague term, referring to an ever-growing set of tools and techniques, that are said to do stuff that people usually do, only better. AI programs have advanced from early victories in playing checkers to wins against chess masters. They have finally achieved the pinnacle of human intelligence, winning the game show Jeopardy. After decades of marching from success to success, today's leaders of Artificial Intelligence anticipate that practical applications of the technology are certain to emerge. If not, they threaten to further inflate the definition of Artificial Intelligence to encompass normal computer programs written by ordinary human beings, at which point success will be theirs — since a computer program is, without doubt, artificial.

    Conclusion

    I expect to release more definitions in the course of this year.

  • My Cat Taught me about the state of Healthcare Provider Data

    My daughter's cat taught me a major lesson about healthcare as I described here. Pretty amazing. But Jack the cat also thought I should learn about the advanced databases that providers and insurers maintain about each other. While not as brilliant as the inter-provider EMR interchange breakthrough I've described, the databases have a similar effect to the brilliant gamification strategies for wellness implemented by leading hospitals, but take a whole different approach. The depth and extent of innovation in this industry never fails to amaze me.

    Jack's learning environment

    As I described before, the terrified cat was outdoors and I had to pick him up to bring him inside. He was scared, so he scratched and bit me. I saw my doctor and got a mis-prescription for antibiotics. Then I needed an X-ray to see what was going on inside the hand that was painful after weeks. That's the situation.

    Jack the cat decided this was an opportunity for me to learn about databases and get some extra exercise, no doubt as penance for failing to pet him well or often enough.

    The search for the X-ray provider

    First, I got a referral to a provider that was way far away from where I live. How did this happen? The doctor claims she called me twice to find out where I live and got no answer. Hmmm. I guess the information was mysteriously missing from my records and no one thought it was important to get it, and I guess the fact that I got only one message, with no request for where I live, was just … whatever. So I decided I'd better get active, rather than waiting another couple of days for a referral.

    I went onto the Anthem site — the provider of my health insurance in spite of their horrible computer security track record. I discovered a provider that is covered by them just a couple blocks from where I live:

    [Image: Anthem's listing for an X-ray provider a couple of blocks away]

    That should be an easy walk. After more fumbling with the doctor's office, I finally got them to give me a referral.

    Here's the place to which I was referred:

    [Image: the referral, showing the same provider]

    Same place. Good. I called them up, and they said no appointments were required, just show up with the referral. I walked right over, but they weren't in the building directory. Hmmm. I asked the person at the desk, who had clearly seen confused and lost people like me before. She told me they've moved, and gave me the new location. Great!

    I went back home, and discovered that someone else at my doctor's office had also given me a referral, only to a place that actually has an X-ray machine. So out I walked again, and got my medicinal dose of radiation.

    Anthem didn't know that they'd moved. The people on the phone at the X-ray place had no idea. One person at my doctor's office did know — but another one didn't. In normal life, companies that acted like these did — my doctor, the X-ray place and the insurer — would be out of business. But as we all know, healthcare isn't normal life.

    Big Data and Blockchain

    What happened with me was no big deal. Business as usual in healthcare, and in this case had no consequences beyond getting me to walk more, which is a good thing whether I decide to do it or I'm tricked into doing it.

    But let's consider the consequences of this trivial episode.

    Where are the Big Minds, the elite in healthcare, spending their oh-so-valuable time and effort? Lots of things, of course, but two of the big obsessions are Big Data and Blockchain. Each of these, for different reasons, is a holy grail of technology for healthcare, if you pay attention to the talks, conferences, articles and real dollars invested.

    Big Data is a focus because the leading thinkers and influential, powerful people are convinced that if all this healthcare data is poured into a giant Hadoop data lake and pored over by ultra-modern machine learning tools, we'll discover important things that will make us all healthier.

    We already knew that EMR's are riddled with data problems; now Jack has shed light on problems elsewhere:

    • If the data is missing or wrong, no amount of bathing in Data Lakes will cause accurate results to pop out. Bad data in, bad results out.
    • If there are protocols that have been proven to be the best for treating patients and doctors simply refuse to follow them, nothing improves.

    Blockchain has attracted the attention of leading figures among the healthcare elites because of its awesome promise to solve the problem of data interchange and effortlessly create universal health data — on which Big Data can proceed to work its magic.

    BUT … if no one cares or is allowed the time to make the data accurate and complete, and the data is no good, spreading it around hardly helps anything.

    As usual, all the attention goes to the highly visible frosting on the cake, while the underlying layers of the cake rot from inattention.

    The consequences of extraordinary cat knowledge

    This valuable knowledge about provider databases and the reliability of doctor decision making came from just a couple days of cat-sitting our daughter's cat. The experience was so rich that we decided to get a cat of our own, Priss:

    [Photo: Priss]

    We eagerly await the medical knowledge that Priss will bring our way!

  • Big Data vs. Little Data

    All the attention, hype and money is pouring into Big Data. It's the way to get big budgets, lots of attention, and big salaries. Delivering real value to normal human beings is so mundane an aspiration that it is beneath the dignity of those involved with something IMPORTANT like Big Data to notice.

    That's where Little Data comes in. If you want practical benefits that can be enjoyed this year or next, benefits that improve the lives of normal human beings, then you should start putting effort into Little Data.

    Little Data and Big Data

    The most important thing about Big Data is not that it's big. It's usually not so big! What's important is that in the world of Big Data, what you mostly think about is Data and the fact that it's Big. It's a data-centric perspective, with all sorts of specialized software, equipment and knowledge. It's also a faith — everyone involved is certain that wonderful things will soon pour out of the Big Data pipeline — once we get this, that or the other thing worked out. Of course, we can't be sure what those wonderful things are — that's what's so great about Big Data, it affects everything!

    Big Data has gotten so Big that it has become a ripe target for parody, as in a recent Dilbert cartoon:

    [Image: Dilbert cartoon on Big Data]

    The most important thing about Little Data is not that it's little. Although it almost always is. What's important is that you mostly think about the people your organization serves, where and how they waste their time or get frustrated, and how to use computers and data to make things better for them. The problem is first identified, and then the relevant data is rousted up, organized, and made part of the solution.

    Here's the big problem: all the money, attention and prestige go to Big Data. Little Data? I suspect you've never heard of it before. Getting involved with it is not likely to be a career or prestige-advancing event.

    Little Data examples

    Here's an example of "pure" Little Data I encountered. I use USAA as my bank. I needed to send a wire. I went on-line to remind myself what the requirements were.

    [Image: USAA's wire transfer requirements page]

    I gathered the necessary data and called the number they gave. The result of calling was a blizzard of Little Data efficiency and convenience.

    The first thing I heard was "I see you're calling from a number in your profile, David. Would you please say or enter your PIN code?" That's nice. It's a feature they've had implemented for a while. Saves time and makes me feel like they know me, even though I know it's all just software.

    After I entered the PIN code, I heard "I see you've been on the USAA website looking at wire transfers. Would you like to send a wire today, David?" Wow. Would I like to send a wire. WOULD I?? I sure would like to send a wire, USAA, thanks for putting two plus two together to make my interaction with you just that much more convenient. So I said "yes."

    Then I was immediately transferred to a wire transfer specialist, who already knew who I was and the wires I had already sent. Since I was sending a large wire, she went into a couple further security checks, and away we went.

    This is a single small example. It's not game-changing or earth-shaking by itself. But imagine if all organizations looked at things from their customer's point of view and found ways they could save time and increase convenience for them, like USAA obviously does. Little Data can change the world, very much for the better.

    Big Data Suppresses Little Data

    Every institution is surrounded by an extended thicket of barriers to customer service and efficiency that could easily be flattened by Little Data efforts. In most institutions, the thicket of barriers is ignored, while all the attention goes to the vague, never-ending moonshot of Big Data. The generally excellent Mount Sinai hospital system in New York is a good example.

    Mount Sinai has "mounted" a major effort in Big Data. They have been named one of the world's top ten innovative companies in Big Data!

    [Image: Mount Sinai listed among the world's top ten innovative companies in Big Data]

    Mount Sinai is the focus of a feature story in the New York Times promoting one of the many new books on Big Data.

    [Image: the New York Times feature promoting a Big Data book]

    The story leads with a profile of a new hire at Mount Sinai, "Dr. Data."

    Dr. Data leads a team of Big Data specialists working on important things that will transform medicine and health care! Soon! Well, someday, anyway.

    [Image: illustration from the Times story]

    Meanwhile, what about the many patients who have problems now and are being treated at Mount Sinai now? Can we squeeze a bit of Little Data goodness out to help them? Apparently not.

    I've described some of my personal encounters with the lack of common-sense efficiency at Mount Sinai here and elsewhere. You can read about the important appointment I had with the specialist to determine whether life-threatening potential side effects of the cancer drugs I was taking were ramping up. The appointment was confirmed by email and robo-call. I travel a couple hours to get there. I check in. So sorry, the doctor is on vacation! More details here.

    More recently, I needed a refill for a prescription by this specialist. These are the drugs that I have proven are wrongly recorded in my Mt Sinai EMR, though I'm confident that my cardiologist somehow, somewhere has the correct data — she's on top of things. I tried the on-line system to get the refill. Fail. I got on the phone, holding more than 20 minutes before being cut off. I tried the phone again later, talked with one human briefly, but was eventually dropped after more than 30 minutes. No robo-voices telling me that I was in a queue, that the wait was approximately whatever; nothing.

    What could I do? Fortunately, I'm signed up for primary care at OneMedical, so I emailed (!!!) my doctor, explained the issue and sent the data, and the next day my refills were waiting for me at my pharmacy. OneMedical is an example of a health organization that puts effort into Little Data. Mount Sinai apparently thinks that putting effort into trivial issues like mine, whose fixes don't result in fawning newspaper feature stories, is beneath their dignity. All the effort needs to go to Big Data!

    Conclusion

    I like numbers, data, analytics and computers. I firmly believe they can be wielded to make our lives better. But while we're mounting forever-in-the-future moonshots with Big Data, it would be great if we could have a concerted effort to deploy Little Data to improve everyone's lives in the here-and-now. After all, whatever healthcare-transforming wonder Big Data comes up with, you're still going to need patients showing up to appointments with people who are actually there!

  • Healthcare Innovation: EMR’s and Data Quality

    Tens of billions of dollars are being spent to implement EMR's in healthcare. There's still a long way to go. Everyone seems to agree that EMR's will make things better than they were with paper. But it's hard to imagine that things will be better if the data is incomplete, inconsistent, and simply wrong.

    The big strategic thinkers and powerful people who push EMR use ignore this issue. I guess it's a detail, beneath them, unworthy of their notice. But for anyone who lives in the world of software, numbers and math, data quality is the foundation on which everything is built. Ever hear of "bad data in, bad data out?" It's true!

    I can run some personal tests on this issue because I'm being treated for a kind of cancer at one of the world's best hospitals, Mount Sinai. I'm getting excellent care and doing well. Mount Sinai is completely up to date with EMR's. It's clear from my experience to date that my excellent care has nothing to do with the EMR — arguably, the good care I'm receiving is in spite of the EMR.

    Let's look at some details. I recently waded through the hospital website to access my medical records. If whoever designed the website had tried to make it difficult for patients to access their records, they couldn't have done much better.

    I finally managed to get a PDF for an encounter. The document makes clear that the hospital's computer graciously deigned to share information with me, the patient:

    [Image: header of the encounter note]

    The document makes equally clear that information is missing. What information isn't here? We have to guess. What an attitude.

    [Image: disclaimer that information may not be included]

    Think of an incredibly unpleasant, arrogant class of professionals. What did you come up with? My guess was lawyer. Even with lawyers, when you fire them and request your files they give them to you, minus snarky notes about how things "may be" missing.

    There was a section with my name and address. Also how to communicate with me:

    [Image: contact section listing identical Home and Mobile numbers]

    They included the identical number for Home and Mobile. You think the computer could have checked for that? This is one of the fatal flaws of the whole EMR approach: the patient is barred from entering and/or correcting his own data! In a sensible, modern system, I would have received an email or text asking me if this information was correct, and asking me to correct it if it's not. But an Enterprise EMR system with layers of security, bureaucracy, administrators, regulators and lawyers involved? Maybe next century.
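
    Catching that takes a one-line check. Here's a minimal sketch of the kind of consistency test the system could run on its own records; the field names are hypothetical, not the EMR's actual schema:

    ```python
    # Hypothetical field names; any real EMR schema will differ.
    def contact_warnings(record):
        """Flag contact entries that deserve a confirmation request to the patient."""
        warnings = []
        home, mobile = record.get("home_phone"), record.get("mobile_phone")
        if home and home == mobile:
            warnings.append("Home and Mobile are identical -- confirm with the patient.")
        return warnings

    # The situation in my record:
    print(contact_warnings({"home_phone": "212-555-0100", "mobile_phone": "212-555-0100"}))
    ```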

    Now we get to my meds. Here they are. Notice anything?

    [Image: medication list from the encounter note]

    You may notice that information is missing from the second drug, losartan. What I noticed is that the dosage is wrong. What I have actually been prescribed is 100 mg tablets. This record is from the encounter with the cardiologist who prescribed the drugs! If it's wrong, anything can be wrong!

    In my case, it makes little difference, since I'm on top of things. But not everyone is so fortunate, and this is just the kind of error that could, with a different patient and drug, have awful consequences.

    Now let's look at my "social history."

    [Image: social history section on alcohol use]

    It's wrong too. And I'm not allowed to correct it. If I did use alcohol, the entry would still be incomplete, since the amounts are missing. But I don't use alcohol, so it's not just incomplete; it's incorrect.

    Finally, let's look at my plan of care:

    [Image: plan of care section]

    An appointment. But that's wrong too! The appointment I actually have is for a diagnostic procedure, not what's written here, and the follow-up with the doctor is just missing.

    Bad data wrecks everything

    You want benefits from Big Data? Nothing good comes from data that's bad, no matter how big it is.

    There is very little data exchange among EMR's, in spite of all the tens of billions of dollars that have been spent. Here is the latest stat from the government:

    [Image: government chart showing a 14 percent data-sharing rate]

    Do you think that's bad? In principle I think it's bad, until I consider all the inconsistent and incomplete piles of crap data that's sitting out there in EMR's. Then I think of the lack of interchange as being more like keeping the bad data in isolation so it doesn't wreck anything. And who's allowed to fix it? I'm certainly not allowed anywhere near it, even though it's my data.

    Conclusion

    What's the solution? Make health care providers spend even more time bent over computer screens than they do today, which is already excessive?

    The core problem is that our whole approach to hospital, health care and provider automation is rooted in the ancient approach to "enterprise software" that was created in the days of mainframes, and lives on in the incredibly expensive, ponderous and user-hating world of modern healthcare IT. The data will become accurate, complete and high-quality when the systems are built correctly, using modern techniques, and when they interact with all concerned parties — including patients!! — to get their jobs done.

  • Healthcare Innovation: Can Big Data and Cognitive Computing Deliver It?

    Most people seem to agree that healthcare is ripe for innovation, and badly needs it. Lots of people are talking up two potential sources for that innovation: Big Data and Cognitive Computing.

    I'm strongly in favor of data, the bigger the better. But is the Big Data movement going to make a difference? I'm strongly in favor of cognition, computing, and computing that is smarter rather than dumber. But is the Cognitive Computing movement likely to make a difference? Here's a summary of some thoughts.

    Process Automation and continuous improvement

    Here is a description of the core process automation implemented by a company I've invested in, Candescent Health. It describes the process that can and should be applied to all of health care.

    The point isn’t that there’s data and analytics – the point is that there’s a closed-loop process of continuous improvement where actions are based on rules. This is the framework that is required to make anything happen. Without it, you can’t put your proposed new clinical action into practice with a double-blind A-B test and see whether the results of your analytics actually deliver benefits in the real world! Or even just deploy it!
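
    To make the loop concrete, here is a minimal sketch, with every name invented for illustration (this is not Candescent's actual system): cases are blindly routed to the current rule or a proposed one, outcomes are recorded, and the proposed rule is promoted only if its measured results are better.

    ```python
    import random

    def run_rule(arm, case_id):
        # Stand-in for applying one arm's rule to a case and measuring the outcome.
        return random.gauss(1.05 if arm == "proposed" else 1.0, 0.2)

    outcomes = {"current": [], "proposed": []}
    for case_id in range(1000):
        arm = random.choice(["current", "proposed"])  # blind, random assignment
        outcomes[arm].append(run_rule(arm, case_id))

    avg = lambda xs: sum(xs) / len(xs)
    print("promote proposed rule:", avg(outcomes["proposed"]) > avg(outcomes["current"]))
    ```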

    How about just making the basics work?

    Here is the story, illustrated by Mt Sinai hospital, about how everyone focuses on “innovation” and fancy new things, when just having the computer systems run reliably has a huge impact on patients – and unless those systems run, the results of fancy new analytics can’t be delivered to benefit patients.

    If the car won't start or run reliably, who cares how good the fancy sound and navigation systems are?

    How about making the computers work?

    I love data and analytics. But doesn’t it make sense to focus on getting the operational computer systems to actually run well before moving on to the fancy stuff?

    Paying top dollar for computers doesn't make them work

    In fact, just about anything you do with healthcare data that is going to be brought to the front line of care requires functioning computer systems to pull off – and the big healthcare systems pay Greenwich CT prices and get trailer park results.

    Clean data isn't easy to get

    Both data warehousing and the fancy new Big Data movement share the under-appreciated problem of getting good quality data in analytics-ready form. Sounds simple, but the difficulties make progress a grinding crawl on many efforts. See this for example.

    Big data sets tend to have Big problems

    Massive data sets have built-in problems that make it hard to get actionable results.

    AI: How about under-promise and over-deliver for a change?

    Skepticism about Cognitive Computing in health care is warranted. There is a rich history of over-promise and under-deliver for AI efforts in general.

    Real-world solutions waiting to be automated

    Meanwhile, there are proven gems languishing in the medical literature, just waiting to be disseminated to the front lines of health care via point-of-care computer systems.

    What can make a difference?

    There are lots of practical, tangible ways to make things better, in spite of all the obstacles to change pervading our healthcare system. Here are some examples of people doing the right thing, all of them with investments by Oak HC/FT:

    • Candescent delivers better imaging results with less expense by applying basic continuous-improvement workflow automation.
    • VillageMD delivers better results with lower cost by feeding back results and advice to PCP’s.
    • Aspire delivers better results at lower cost for end-of-life care – by having one person be in charge, managing everything from the patient's point of view.
    • Quartet makes a difference by applying behavioral health as needed to help other conditions.

    These companies embody some common themes:

    • Knock down the silos; take a patient-experience-centric point of view.
    • Apply common sense; it has huge benefits.
    • Focus on delivering results to the front line (the patient); it's hard but necessary.
    • Build a system of continuous learning and delivery; it's a pre-condition to delivering any results of analytics for patient benefit.

    Conclusion

    The big hot topics in healthcare of Big Data and Cognitive Computing are little more than fashion statements. Data, of course, is a good thing; so is having computers do smart things. But without doing some basic blocking-and-tackling and applying some practical common sense, a great deal of time, money and energy will be spent accomplishing nothing.

  • Big Data and Data Warehouses

    There’s a rapidly growing movement to take all the data that’s scattered throughout an organization, rationalize it, bring it together, and make it available for analytics that will help management to understand and ultimately to transform the business.

    This movement is taking place today. It’s explosive. It’s called Big Data and Analytics.

    This movement also started in the 1970’s, took root in the 1980’s, exploded in the 1990’s and is with us today. It’s called Data Warehousing.

    The fact that attention is slipping away from the thing called “data warehouse” and moving towards “Big Data” is a typical IT industry phenomenon. The problems are the same, the obstacles are the same and the solutions are nearly the same – but the rhetoric and software are entirely different. Only a few savvy industry insiders are aware of the game that’s being played.

    The Enterprise Data Warehouse

    The Enterprise Data Warehouse, EDW, is the industry’s holy grail. It’s the place where all an organization’s data is stored for reading and analysis. All the data from the various operational and transaction databases is extracted, transformed as required and loaded into this database. Once there, it provides a “single source of truth” for the enterprise. Since the EDW is not running transactions, reports and analytics can be run against it at will, without harming ongoing operations. Single-purpose extracts can easily be made from it to support various projects.

    The EDW makes common sense. It became a major goal for many organizations, and many are still marching towards that goal. There’s just this one little problem: getting there. Then there’s a second problem: realizing the potential value.

    There are lots and lots of organizations that don’t have to worry about having an EDW that fails to fulfill its promise, because they just get bogged down along the way and never really get there.

    Why don’t you read much about this? Simple: who wants to admit it? And if the road to the EDW ends up trapping those marching down it in impassable mud, who outside the organization is ever going to know it?

    There’s a simple little acronym in EDW that is the tip of the mud-trap in which EDW gets bogged: ETL, which means Extract, Transform and Load. That’s what you do to get the data from where it starts to where it needs to be, in the EDW. Simple, right? Oh, if only it were…

    Extract, Transform and Load

    Before you even get to the E of ETL, you have to find the data. Then you have to get access to the data, with a properly jealous operational management group anxious that you avoid screwing them up. You have to get the whole thing to start with, and then a stream of updates.

    The “database” could be nearly anything. It could be a set of ISAM files running under CICS on an IBM mainframe. In which case, you need to get your hands on the source copy books that contain the data definitions to have any hope of making sense out of them.

    It could be something nice and modern, like Oracle. But you’d better start by getting a full dump of the schemas to have any hope of navigating among what could be many hundreds of tables. Then, without an E-R diagram that’s up to date, you’ll have little chance of making sense out of the tables. Then when you get down to it, you may discover a world of stored procedures initiated by access triggers, so that your innocent “just let me read the tables” turns out to have side effects. And then, getting the updates? You’ll soon find yourself either crawling to the DBA and begging for a change log to be shipped to you, or pleading to be allowed to program in some trigger-initiated stored procedures yourself, so you can get the updates without killing the performance of the DBMS, and avoid getting set upon by a mob of angry users.

    Phew! Now you’ve gotten through the E part of one source. What if there are dozens, or hundreds?

    And then the real fun begins. The also-innocent-sounding T phase of ETL. Because T doesn’t just mean simple, no-big-deal transforms, it also means take the customer names that are represented in different ways in different places, some of which have been updated or changed independently of the others, and make it so that when you end up with a customer in the EDW, that customer represents all of your relations with exactly one customer. Having three customer rows in the EDW for one customer kind of defeats the purpose of the EDW, after all.
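
    To give a taste of what the T involves, here is a deliberately crude sketch of name matching in Python; real matching also uses addresses, account numbers, fuzzy comparison and manual review, and every detail here is illustrative:

    ```python
    import re
    from collections import defaultdict

    def match_key(name):
        """Crude matching key: lowercase, strip punctuation, ignore word order."""
        cleaned = re.sub(r"[.,]", " ", name.lower())
        return " ".join(sorted(cleaned.split()))

    # The same customer, as represented in three different source systems.
    sources = ["John J. Smith", "SMITH, JOHN J", "Smith John J"]
    merged = defaultdict(list)
    for raw in sources:
        merged[match_key(raw)].append(raw)

    print(dict(merged))  # all three variants collapse onto a single key
    ```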

    I’m just scratching the surface here, but perhaps you can get a feel for why the glowing promise of the Enterprise Data Warehouse so often ends up with the participants hungry and wounded in various ditches along the path to the promised land.

    Big Data

    Forget I ever mentioned “data warehouse”, ETL or any of that other stuff. Bzzzzzzttt! New subject!! Brand new!!! NOT related to anything else in computing, completely WITHOUT history, stemming from this brand-new EXPLOSION of data that’s just EVERYwhere. It’s the Big Data movement! Where we take all these mountains of data that are just piling up useless and turn them into business GOLD. You’re already late – everyone else is already with it. There are books, conferences, experts, the whole nine yards!

    You’ve got to get a Data Lake, and fill it up with data. Then you’ve got to rev up your Hadoop cluster and start cranking out those nuggets of business gold from all that data.

    Except, hmmm. I’ve got to find the data. Get access to it. Get it once and then get a feed of the updates. All this data from different places, it doesn’t match up well, I’ve got to clean it up. Well, maybe I’ll just dump it into the Data Lake and let the Hadoop nerds worry about it. They’ve got all these servers at their disposal, maybe the servers can work at night cleaning everything up.

    Gulp. I just looked at my nice, fresh, clean Data Lake. It’s a Data Swamp! There are snapping turtles and water moccasins swimming in there. Don’t. Like. This. Maybe I can get a transfer.

    Conclusion

    If “data warehousing” were a big success, it would have kept its name and would now handle what we now call “big data.” But no. “Data warehousing” projects are often classic IT projects that drift on forever, confronting obstacles and rarely producing results. Big Data is the new kid on the block. There aren’t (yet) decades of frustration and broken promises associated with it. Give it time. Every obstacle that DW projects encounter also rears up to challenge Big Data projects, and until solutions are found, returns will be equally elusive. And even then, there are conceptual flaws in most Big Data efforts.

  • Fatal Flaws of Big Data

    The message appears to be: if you're not way into Big Data, you're missing out on important things! For vendors and job seekers, I'm sure this is true, without reservation. For the companies that wish to benefit from the investment? Maybe not.

    The Big Data Trend is BIG!

    There's one thing for sure about Big Data: it's a Big trend.

    We have been assured that Big Data is now the driving force in computing.

    If you scan through the books, conferences and other things whose focus is Big Data, it's clearly a major fashion trend.

    Whenever something like this catches fire, everyone wants to jump on. Lots of people talk about their own "big data" that, on a closer look, isn't so big after all.

    And generally when you look at it more closely, Big Data doesn't look quite so cool.

    Is there something wrong here?

    How much better is data that's BIG compared to MEDIUM or SMALL?

    The killer assumption behind all the Big Data excitement is that Big is better than normal-size data — lots better. Makes sense, right?

    Not so fast.

    Let's spend a little time thinking about the core issues of data coverage, integrity/quality, and probability.

    Probability

    Today, we've got X data. Let's assume that, with Big Data, we've now got 100X data. Are we 100X better off?

    Let's start with something simple and universal: flipping coins. Suppose we place ads. We make money when the coin comes up heads, and lose money when it comes up tails. Our data people tell us that the odds of getting heads are 0.5, with an uncertainty of 0.1 — i.e., the chance of it coming up heads is probably 0.5, but it might be as little as 0.4 or as much as 0.6. Now we have 100X more coin flips to apply to our measurement. Great, now we're really going to start making money!

    They come back, sweaty and proud, with the answer: the probability of getting heads is 0.50, with an uncertainty of 0.01 — i.e., the chance of it coming up heads is probably 0.5, but it might be as little as 0.49 or as much as 0.51. (Precision grows only with the square root of the sample size, so 100X the data buys just 10X the precision.) Wow, we've increased our level of precision massively! How much does that increase the money we make? Hmmmmm.
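
    Here's a quick sketch of that square-root law, with sample sizes made up to match the story:

    ```python
    import math

    def margin(p, n, z=1.96):
        """Approximate 95% confidence half-width for a proportion p estimated from n trials."""
        return z * math.sqrt(p * (1 - p) / n)

    n = 100  # a sample that gives roughly the +/-0.1 uncertainty above
    print(round(margin(0.5, n), 3))        # ~0.098, i.e., about +/-0.1
    print(round(margin(0.5, 100 * n), 3))  # ~0.01: 100X the data buys only 10X the precision
    ```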

    Quality/Integrity

    Maybe the problem is that we just got lots more data points about the same thing. It didn't broaden our knowledge. Maybe we need to expand, check out the odds for not just nickels, but also dimes and quarters. Hmmmm. Let's get more ambitious. Let's track users, not just on our website, but also on 100 other websites. Tell the programmers to get going! We're going to be rolling in money from Big Data!

    The programmers seem to be having trouble matching people over different web sites. Are all these people who claim to be David Black the same person? What about that David B. Black guy? And there appear to be two really different patterns of use coming from the same IP address — maybe someone else is sharing the computer? And I just discovered that there's a David Black who appears to use the internet from Manhattan, and a David Black who uses it from some place out in New Jersey. We already know there are multiple David Blacks. This could be one person or two. Which is it? This is getting hard.

    Darn. I thought all I had to do was get loads more data and a Hadoop cluster and the money would start pouring in. Getting all that data to match up and make sense is harder than it looks. And then, when I've done it, is all I'm achieving just a higher level of certainty about what I already knew?

    Data Coverage and lift

    Alright. I've got my 100X more data. I've FINALLY sorted it out so it's high quality and matches up. Now I've got to make sure it really broadens my knowledge and gives me uplift in my results.

    So far, all I've been doing is looking at my customers' actions. I bet if I look at demographics and social media (that's lots of data, surely it qualifies as "Big"), I'll get better results. Big Data team — mush!

    Darn, darn, DARN! Yeah, all this big data stuff changes what I offer to whom for how much — but it's not making a whole lot of difference in my results. And I'm getting hammered with complaints from people who want me to stop making offers to their kids, and old customers who wonder why we don't love them anymore. Yeah, we're getting 5-10% uplift, but we're losing at least that much from our old business, not to mention all the costs we've added.

    Who's making money from the Big Data stuff? It must be the consultants, the vendors and the conferences. It's sure not me.

    Of course, I could just patch it all up, start going to conferences, bragging about how I'm an expert, and maybe I'll get a great new job. But it would be based on a lie. I'm not that kind of person.

    Conclusion

    I love data. I love exploring it, analyzing it by all available means, and understanding it. Evidence-based solutions are the only ones I'm comfortable with. Everything else is just baseless faith. If I can use math optimization, machine learning or something else to do a better job than a person could do, I'm all for it. If I can get additional data and that data will help me get better results, bring it on!

    But "Big Data" is not in principle better than "enough" data. Too little is not enough. More than you need to get the job done is a waste. Just like Goldilocks, you should want the amount of data that's "just right."


  • Big Data, the Driving Force in Computing!??

    Big Data is awfully important now, and it's poised to become the driving force in computing in 2015! If you don't think so, just look at this:

    [Image: email proclaiming Big Data the driving force in computing in 2015]

    NOT!! I've pointed out how arithmetically-challenged most things called "big data" are. I've pointed out how it's mostly a technology fashion trend with little substance. And how when you dig into it, "big data" often isn't big. Or meaningful, or relevant.

    You might think, with the emperor strutting around butt-naked, we would be embarrassed and turn away. Sadly, it appears to be getting worse. Computing is supposed to be dominated by all sorts of number-centric nerds. The Big Data trend is strong evidence that trends in computing are every bit as science-based as fascination with the Kardashians.


  • Storage For Big Data

    In Big Data, computers and storage are organized in new ways in order to achieve the scale required. The major storage companies just assert, without justification, that their old products are just fine. They're not.

    Big Data is way bigger than the biggest computers. In Hadoop, you solve the problem with an array of servers that can be as big as you like. Hadoop organizes them for linear scaling. While most storage vendors continue to plug their old centralized storage architectures and claim they’re good for Big Data, the only solution that’s actually scalable is an array of storage nodes, directly connected to the compute/storage nodes. Hadoop organizes the computing to use such an array of compute and storage nodes optimally, and it can grow without limit, for example to thousands of nodes.

    Hadoop has its own file system and database. The NAS systems pushed by legacy vendors just add expense and slow things down. The old centralized controller SAN systems are expensive and not scalable. Some vendors promote how they are good for Big Data because they use lots of SSD – but that’s way too expensive for Big Data. Others promote hybrid systems, but make them affordable by playing tricks like compression, which just add expense and slow things down.

    Exactly one vendor has a storage system that is best for Big Data: X-IO. X-IO has exactly the kind of storage nodes that Hadoop wants. Its independent storage nodes are linearly scalable, without limit. Its software makes spinning disks deliver at least twice the performance of any other system. It can optionally incorporate SSD’s for even better performance, without the distracting tricks used by others – you just get better blended performance, without effort. Because of the inherent reliability of the X-IO ISE units, you don't need as many copies of the data.

    If it's Big, if it's Cloud, if it's virtual, X-IO is the place to go for storage.

  • In storage, there is X-IO, and then there are all the others…

    In normal times, when there is no major technology disruption in the market, there are two categories of storage companies.

    Most storage revenue goes to the big names everyone knows (EMC, NetApp, etc.). These companies have comprehensive storage solutions and services to meet nearly any need. Their products are solid and meet most mainstream needs. They don’t innovate much and aren’t the most cost-effective, but they work.

    A good deal of attention in the storage industry goes to the hot new companies, which are all about the latest technologies (e.g. SSD) or features. They usually don’t do the old things as well as the established companies. But by focusing on the hot new thing, they often do that one thing pretty well, and so appeal to the usually tiny part of the market that feels the corresponding pain. If they get market momentum, they are usually bought by an established vendor.

    This is the way it works. The established companies take most of the revenue and do little innovation. There is always a flurry of new companies trying to innovate, sometimes getting traction, and getting absorbed by the established vendors.

    Then there are technology disruptions. That’s when the rules change. Suddenly the comprehensive product lines of the established vendors don’t meet the needs of the emerging landscape very well (in spite of the furious efforts of their marketing groups to claim they do), and most of the new vendors don’t get the new situation and continue to do little but exploit new devices or add features onto the existing pile.

    Today’s Technically Disrupted World

    That’s the situation we’re in today, with the combined technology disruptions of data centers employing virtualization, moving to the Cloud, and attempting to exploit Big Data. In addition, there is a new storage technology, flash (SSD), which vendors are scrambling to exploit. The situation is confusing for buyers and chaotic for vendors, since most vendors try to act as though nothing fundamental has changed. But it has!

    The Cloud is all about reliable, low-cost self-service, with tremendous automation and integration. Service, capacity and performance need to be available on-demand, with no human intervention. Everything needs to be able to grow and shrink as application needs change, with a sharp eye to capacity utilization, SLA's and costs, since it’s easier than ever to switch Cloud vendors when one stumbles or is simply no longer competitive.

    Virtualization is a key part of achieving Cloud goals, and virtualization changes the rules of the systems game. Functions that were traditionally part of storage are now performed as an integral part of operating systems and/or virtualization software, to make them more agile. This also drives the movement to software-defined networking and storage.

    Big Data is the same only more so, with its emphasis on linearly scalable arrays of compute nodes and storage nodes.

    In response to this massive technology disruption, many companies realize that brand-name vendors no longer make much sense, and are using inexpensive JBOD’s for storage and depending on massive replication by the file system to provide reliability, typically making a minimum of 3 whole copies of the data, before backups, in order to assure availability. If the alternative is an old-technology NAS or SAN, this is a smart idea, which is why its use is growing so quickly.

    X-IO

    And then there is X-IO. While X-IO is a storage company, it’s different than all the others. It was built for a vision of computing that we now call “Cloud.”

    When X-IO was started about 10 years ago as the Advanced Storage Architecture division of Seagate, its goal was to build highly compact, efficient and reliable storage building blocks using Seagate HDD’s. While the rest of the storage world was ignoring the details of the devices on which it was built, piling on features and management systems that have become obsolete in the Cloud world, the ASA group was inventing the technology of the storage “brick,” now amounting to over 50 patents and a great deal of field-hardened code that delivers more of what Cloud needs than any existing system, by far.

    All storage vendors, whether established or emerging, use the same drives from the same couple of leading vendors, mostly Seagate or Western Digital (WD). All of them except X-IO package them in roughly the same way and throw some features on top of them to “differentiate” themselves from the other guys who use the same disks. It’s just as though all cars had one of two different kinds of nearly-identical engines in them – each of the car vendors would try to distract you from the engine, and try to get you to appreciate how wonderful their steering wheels or cup holders were. That’s even true of NAS and SAN, which seem so different, but really have the same engines (disks) in them – it’s like one has front-wheel drive and the other rear-wheel drive, but under most conditions, their speed, fuel efficiency, acceleration and service frequency are identical.

    The only storage vendor that is different is X-IO, and X-IO’s difference just happens to be on all the dimensions that matter most for the new world of Cloud, virtualization and Big Data.

    X-IO’s Difference

    First of all, X-IO doesn’t build feature-encrusted storage, like a “trophy car.” It’s basic storage, a storage building block or brick, ideal for plugging into nodes in a Cloud server farm under virtualization control, or a Hadoop cluster.

    Second, and most important, is what comes from its heritage as part of Seagate. While X-IO uses the same Seagate drives that other vendors use, all the other vendors just plug the drives in and proceed to concentrate on everything but the drive. X-IO’s technology, in sharp contrast, is all about making that drive perform at its very best. You wouldn’t think there would be much that could be done. But there is! X-IO reduces the error rate of the drives so much (more than 100X) that they can be sealed in containers, which makes them take much less space, consume less power and generate less heat than the same drives in any other system. Then the X-IO software actually gets more than twice the I/O’s per second (iops) from each drive compared with any other vendor.

    Let’s think about a car rally. Most of the cars will vary greatly in size, shape, color and gizmo’s. The X-IO car will be the plain one. Imagine them in a distance race. Most of the cars will overheat or have to stop for gas pretty often. Only the X-IO car will never overheat, and it will get vastly better mileage than the others. Many of the other cars will break down along the way. X-IO won’t. Here’s the amazing thing: the X-IO car will cross the finish line in half the time of its nearest competitor.

    Now let’s think about sending an important package. Using normal cars, you’d better send 3 identical packages by different routes to make sure it gets there. With X-IO, you only need one car, and it will get there faster than any other car, using less fuel.

    In the world of Cloud, this translates into not having to buy expensive SSD drives to get performance, though X-IO has them available if you need to go even faster than X-IO normally goes. It translates into not having to over-provision to get performance. It translates into not having to store 3 or more copies of your data to assure it’s still there tomorrow. It translates into buying a half or third of the number of racks (or rows!) you would normally have to buy in order to make a given amount of data available at a given performance level. It translates into dramatically lower operating costs for those racks, which at Cloud scale and Cloud competitive pricing can be the difference between growing profitably and losing to the competition.

    No other storage vendor offers these benefits. No one but X-IO.

    Conclusion

    The “cloud” as we know it today didn’t exist when the ASA division of Seagate started inventing the deep technology that has now matured in X-IO. But its simple mantra of getting more value out of devices was a unique quest. No vendor has equaled it, and no one is even close. As new drives are released, the X-IO advantage will persist as a multiplier on whatever Seagate ships. All the other vendors will plug Seagate drives into their systems and try to distract you, drawing your attention to “anything but” the actual characteristics of the storage – its performance, space and power use, reliability. These things are old news in the old world of storage, but they’re the only thing that matters in the new world of Cloud. Which is why there are all the storage vendors – and then there’s X-IO.

  • Human and Inhuman Analytics

    While people talk about analytics in general, there are really two distinct varieties: human analytics and inhuman analytics. First, there is analytics for and by humans, i.e., numbers, tables and graphs designed by humans for human consumption and consideration. Second, there is algorithmic analytics, originally designed by humans but then set off to make observations, decisions and perhaps actions on its own. I dub this "inhuman analytics," because that's what it is. It is incredibly important to understand the differences between these two things, related in name but little else.

    Human Analytics

    When most people think about analytics, they're usually thinking about things like Data Warehouse (DW), Online Analytic Processing (OLAP), Business Intelligence (BI), and related subjects.

    This is a subject that is broad and deep, with many products and vendors that have evolved over time. But there is a simple unifying theme: these are tools intended to provide information to people, often in the form of graphics, so that those people can understand what's going on and take any action that may be appropriate.

    Oracle, for example, has a wide variety of such tools:

    [Image: Oracle's BI tool lineup]

    Microsoft also has a variety of such tools.

    [Image: Microsoft's BI tool lineup]

    Note that both companies illustrate their approach using screens and people. That's what this type of analytics is all about.

    There are a wide variety of BI tools from many vendors, in addition to open source.

    Inhuman Analytics

    Inhuman analytics, a term that no one else uses, so far as I am aware, is a whole different thing. This is also a subject that is broad and deep and undergoing constant innovation. It includes such diverse subjects as machine learning (ML), advanced statistics, operations research (OR) and related fields.

    In general, inhuman analytics are far more specialized than human analytics. They are nearly impossible for anyone but a specialist to understand. There is often lots of math involved. They are not primarily about presenting information so that it makes sense to human beings — they are about figuring stuff out that most humans wouldn't be able to figure out at all, or figure it out with a precision that exceeds human capability.

    Because of this, there aren't great pictures to illustrate inhuman analytics. But here's an illustration of the ML process from one company's ML toolkit:

    [Image: one vendor's diagram of the ML process]

    Inhuman analytics are behind a large number of modern innovations, though they rarely get credit for it, since the way they work is essentially like magic to most people. This is a vibrant subject with a rich history. I suspect I will come back to this in some future post.

    Conclusion

    Human analytics has many uses and is a good thing. The visual tools it emphasizes enable knowledgeable and motivated people to explore and understand a data set, and to track it over time. Sometimes you can even discover new things, particularly in the early stages of understanding and optimization.

    However, inhuman analytics are the serious, heavy-duty tools to help derive value from data. They can and regularly do figure things out and solve problems that are beyond human capability, even with the aid of human analytics.

    Human analytics has its place. But it's no substitute for inhuman analytics for serious value creation.


  • The Big Data Technology Fashion

    Where there are people, there are fashions. Why should technology be immune? The current fashion of "big data" is a classic exemplar of the species.

    The Books

    Books are a good place to observe the common themes of technology fashions. You'll see patterns that resemble the ones I previously pointed out for project management.

    I think it's fair to say it's not a legitimate technology trend if it's not covered in an "X for Dummies" book.


    [Image: Big Data For Dummies book cover]

    Similarly, it's got to be big. Be Revolutionary. Transform lots of stuff.


    [Image: book cover promising a Big Data revolution]

    It's got to be a big, scary thing that needs taming.


    [Image: book cover about taming Big Data]

    For any fashion trend, it's important to make sure that other things are hitched to its wagon.


    [Image: Big Data analytics book cover]

    Let's not forget that, if it's worth paying attention to, there's got to be a way to make money from it.


    [Image: book cover on making money with Big Data]

    It's never too soon to start adding layers of process and paranoia to it, to assure that costs skyrocket and that hardly anything ever gets done; in other words, governance.


    [Image: Big Data governance book cover]

    Finally, anything but anything has to have a human side.


    [Image: book cover on the human side of Big Data]

    I swear, I sometimes think there's a central planning committee for technology fashions. They plan when the next new label on something old and not all that interesting is going to come out, grab their standard set of titles, and pass them out to people to write the books.

    But then, I guess it can't really be that organized, because there are usually so very many books, each of them covering the same small set of themes over and over and over, with slightly different language. The themes always seem to include:

    • X is revolutionary; it will change lots of important stuff.
    • X is big and scary, and you need help to tame it or bring it under control.
    • There are lots of ways to screw up doing X, so you need to pay lots of money for Y to get it right.
    • You're a Dummy, but I'll help you understand what you need to know about X anyway.
    • X has a human side.

    The Conferences

    Things aren't that different with conferences. They take the themes established in the books and embellish them a bit.

    There are conferences for people who work in particular sectors.

    [Image: ad for a sector-specific Big Data conference]


    You can't pass up an opportunity to learn from the very best.

    [Image: conference ad promising lessons from the very best]

    Who can resist going to a conference which cuts through all the crap and helps you do stuff?

    [Image: ad for a how-to Big Data conference]

    Anyway, you get the idea — there are lots of conferences. The themes are predictable, even without the aid of big data or predictive analytics. Because they apply to any technology fashion trend.

    Conclusion

    Technology fashions — they are forever in fashion!

  • “Big Data:” Some Little Observations

    "Big Data" is everywhere. If only because of this, it is important, like the way Paris Hilton
    is famous for being famous.

    [Photo: Paris Hilton]

    What's included in "Big Data?"

    If your concern is storing, serving or transmitting it, you don't care what kind of data it is — data is data, a pile of bits.

    But not all data is created equal. The easiest way to understand this is to break all the bits into relevant buckets. By far the largest bucket is image data, including both still pictures and videos. While the ratios vary, it's not unusual for there to be 100 bits of image data for each bit of other data.

    While there's not a commonly accepted terminology, all the rest of the data can be understood as "coded" data. This again falls into two categories. The larger portion is "unstructured" data, things like documents, blogs, e-mails and most web pages (except for the images and videos on them). The smaller portion is "structured" data, which includes all databases, forms and anything else that can show up in a report.

    When people talk about "big data," they could be talking about any of the above, but mostly people talk about it because they want to extract actionable information from it, and the source of most actionable information is structured data. So in the vast majority of cases, when people talk about "big data," they're talking about structured data.

    Did Data used to be Small and Now it's Big?

    Think about a bank statement. There's a little information about you at the top, but most of the statement is probably taken up by the transactions — money moving into and out of the account. In general terms, this is the action log, the transaction history. This pattern of having an account master and detail records is a common one.

    Now think about a web site. The site itself is like the bank statement, and the record of people visiting and interacting with it is like the transaction history, generally known as a web log.

    People generate far more transaction records when interacting with the web than with other human activities; for example, you probably click on hundreds of pages for each bank transaction you make. So the amount of data can be pretty big.

    The simple answer is: before the web, transaction data wasn't very big, and with the web, there's a lot more of it than there was before. Of course big data isn't just about the web; but the web has certainly gotten people to pay attention.

    So where did "Big Data" come from?

    It would be interesting to do a cultural history, but I suspect that the current interest in "big data" stems from the following factors:

    • Companies that pay attention to web logs get information about visitor behavior that can be used to make more money.
    • Internet advertising companies have done exactly this for years, and are getting really good at it.
    • Shockingly, most people don't analyze their data to improve their behaviors.
    • A closed loop system in which the results of your actions are used to enhance future actions is the clear winning strategy.
    • This requires (gulp) collecting and analyzing the relevant data, which is far larger than most people are used to dealing with.

    Thus the term "big data," which currently applies to just about any body of transaction data.

    What's "Big" about "Big Data?"

    Let's start by applying one of the fundamental concepts of computing to the question: counting. One of the first disk drives I got to use was a twelve inch removable pack developed by IBM:


    [Photo: IBM 2315 disk cartridge]

    Its capacity was about 1MB. While that may sound small by today's standards, let's put it in perspective. Each byte is the equivalent of a character that you can type. Using a generous measure of 30 wpm and 5 cpw, that's 9,000 characters in an hour of continuous typing with no breaks, so the disk above has a capacity of more than 100 hours of continuous typing. That's one reason I thought the disk's capacity was huge — it easily held the source code for the FORTRAN compiler I wrote at the time, which was about a year's worth of work!

    Now let's get modern. Drives have gotten smaller while holding more and more. Here's a good visualization of the progression:


    [Photo: six hard drive form factors, large to small]

    We're now at the point where truly small drives (1 to 2.5 inches) hold massive amounts of data; 1TB or more is common.

    How much is that? Remember, it would take 100 hours of continuous typing to fill up the large disk pictured earlier. How much space would the old disks fill if you had 1TB to store? That's about 1 million of the older disks; if you packed them tightly, they would fill a room about 100 feet long, 100 feet wide and 10 feet high. And I would have to type for 100 million continuous hours to fill them up. Now, that's big data.
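
    If you want to check the arithmetic, here's the back-of-envelope version, using round illustrative numbers:

    ```python
    chars_per_hour = 30 * 5 * 60      # 30 wpm * 5 chars/word = 9,000 characters/hour
    old_pack = 1_000_000              # the ~1MB removable pack
    terabyte = 1_000_000_000_000

    print(old_pack / chars_per_hour)  # ~111 hours of typing fills the old pack
    print(terabyte // old_pack)       # 1,000,000 old packs per terabyte
    print(terabyte / chars_per_hour)  # ~111 million hours to fill 1TB by typing
    ```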

    Now that we've got a sense of how big a TB is, let's get real.

    On a good day, this blog might have 100 page views, each generating a server log record. Such records vary in length, but let's say they average 100 bytes in length each, or 10K bytes a day. Not much.

    Let's say I caught up to the Washington Post, a site which is in the top 100 in the US. It gets about 1 million page views a day. That would be a mighty 100 million bytes a day of raw server log data. 10 days would add up to a GB of data, which means that ten thousand days, about 30 years' worth of data, would fit on one of those physically little drives pictured above that holds just 1TB of data.
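
    The same back-of-envelope arithmetic, using the traffic and record size assumed above:

    ```python
    views_per_day = 1_000_000     # rough traffic for a top-100 site
    bytes_per_record = 100        # assumed average server log record
    daily_bytes = views_per_day * bytes_per_record    # 100 million bytes/day

    days_per_terabyte = 1_000_000_000_000 // daily_bytes
    print(daily_bytes, days_per_terabyte, days_per_terabyte / 365)  # 10,000 days, ~27 years
    ```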

    The Washington Post is a major site; top 100. Their web transaction logs are the biggest data for analysis they've got. And here's what 30 years' worth of their data will fit on:

    [Photo: an ultra-thin Hitachi Travelstar laptop drive]

    That's what they call "big data." This is why I instinctively drop into cynical mode when the subject of "big data" comes up. It just isn't usually very big!

    How much data do you need?

    It depends on context. If you're a website like Facebook offering a free service holding users' data, the answer is simple: you keep as much of the users' data as you feel like. You can (and if you're Facebook, regularly do) throw out data any time you feel like it, or just drop it on the floor and lose it because your programmers weren't up to dealing with it.

    If you're a money-making business that depends on data, you could probably run your business better if you

    1. Kept all the data
    2. Analyzed it
    3. Came up with useful observations, and
    4. Changed your behaviors accordingly.

    But most businesses don't do this very well, if at all. And they are feeling increasingly guilty about it. Thus the marketing drum-beat for selling everything that can possibly be labelled "big data."

    Sarcasm aside, the fact is that most businesses don't need much data in order to perform wonderfully useful analyses. The reasons are simple:

    • The things that matter the most are things you're not doing yet. The data you've got is historic. It's like if you're a comedian and the audience doesn't laugh much; no amount of big data analysis of audience reaction will help you come up with better laughs.
    • The impact of big potential changes will be seen in lots of your data. Go back to Statistics 101. How much data do you need to see that the coin you're flipping isn't a fair one? Only enough to prove that the 2 out of 3 times it comes up heads isn't a fluke (the little simulation after this list shows how little that is).
    • In the end, how many changes can you realistically make? Hundreds? How about rank ordering them, finding the most important ones first, then moving on from there? You'll quickly get to diminishing returns.
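
    Here's that simulation, as a minimal sketch: flag a run of flips as "unfair" when the head count is beyond what a fair coin would produce 97.5% of the time, and see how often a two-thirds-heads coin gets flagged.

    ```python
    import random

    def looks_unfair(n, p=2/3, z_cutoff=1.96):
        """Flip a biased coin n times; return True if a fair coin would rarely do this."""
        heads = sum(random.random() < p for _ in range(n))
        z = (heads - 0.5 * n) / (0.25 * n) ** 0.5
        return z > z_cutoff

    for n in (20, 50, 100, 200):
        rate = sum(looks_unfair(n) for _ in range(10_000)) / 10_000
        print(n, round(rate, 2))  # detection nears certainty within a couple hundred flips
    ```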

    Finally, more important than anything else, is getting into an experimental, data-driven, closed-loop system. This is always the key to success. It's how organizations become successful, get more successful, recover from trouble, and stay on a winning path.

    Conclusion

    For better or worse, "big data" is likely to be with us for a while, at least as a technology fashion trend. Like all such fashion trends, it's a useful occasion for checking whether we're putting our transaction data to its best use in keeping us on the track we're on and getting us onto improved ones.

