The overwhelmingly most important job of any tech company’s top architect is … [drum roll, please] … ensuring the reliability and responsiveness of his company’s product and/or service.
This may not be what you think. It often is not what the top architect thinks. If that is the case, all I can say is: think again, and think better this time.
I find that the technical staff in computer-enabled companies have all sorts of ideas about what things are “important.” Generally speaking, “importance” seems to be correlated with distance from day-to-day events, distance from the data center, distance from anything concerning “operations,” and distance from the concerns that existing customers have working with the product/service the company provides in day-to-day use. Anything “strategic” – that’s in; that’s a proper concern of the top thinkers in the company. Anything “tactical,” or anything that could conceivably be in the realm of the customer service or data center operations group – that’s out; that’s just a waste of time for the company’s best minds.
However common this way of thinking may be, I still find it to be not just bizarre, but perverse. The reason, simply put, is that it’s completely out of sync with what customers think the most important issues are.
If your product or service:

- just doesn’t work
- goes off-line unpredictably
- slows way down at key times of day
- has old, reliable features that suddenly stop working, or that change their behavior unpleasantly
What do you think will happen to your customer base? If you think you don’t care because you have such a great flow of new customers, think for a moment about what causes that flow: do you think reputation, references, or word-of-mouth might have anything to do with it? Do I have to mention Toyota to remind you how fast a great service record can get destroyed?
Now, let’s get to the crux of the issue: when those bad things happen, whose fault is it, and whose actions and decisions are most highly correlated with creating the conditions that led to the problem? To make this simple, let’s turn again to Toyota.
- Is the driver (user) at fault due to improper use? Sure, that makes sense – it was the users who made the site crash.
- Are the technicians (for example, in the data center) at fault? Sure, they can screw up; but the best-architected systems don’t depend on technician action to achieve reliability and response time.
- Are the customer service people at fault? Hmmm.
Here’s the reality: while anyone in an organization can screw up and cause problems for customers, the most serious issues I’ve seen in companies are the direct result of architectural decisions, or of a lack of architectural attention and involvement. This includes response time, flakiness, down time, and the other things that drive customers nuts – not to mention drive them to your competitors.
I’ll give a couple of illustrations.
A company’s web site was down for an hour or more at a time. Repeatedly. The only remedy was to rebuild a key component and its database on a new machine and bring it back on-line. Right away, the finger of guilt pointed at data center operations for failing to deliver. But the root cause of the problem was a key application component that had no fail-over capability. How could that happen? Simple. The company’s top architects had failed to make scalability and fault tolerance the non-negotiable, number-one priority when selecting this key component. Instead, they concentrated on all sorts of other things. Are the data center people at fault here? Hardly. They just had this crippled software tossed at them, and did their best to hold off the inevitable disaster.

A second company was in a vice grip of pain. Existing customers were complaining that the service was slow and faulty. New partners were putting on the squeeze for new releases of functionality that they felt were crucial for winning business. The more new code the company released, the more bugs and customer problems were created. The more the company tried to slow things down to stabilize the service, the angrier the new partners got, accusing the company of failing to meet its commitments to them. The data center staff exploded, the QA staff grew, consultants were crawling around, and life was miserable. The root cause? Again, simple. The company’s top architects had completely ignored the whole release and go-live process, and had built a software system designed for a set of unchanging requirements instead of for the fluid and constantly changing reality of the company’s customers and market. The whole nightmare was an architectural side-effect, and the solution was a change in architecture – the good, practical kind of architecture that encompasses everything about the company’s product/service, including releases and the data center.
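To make the first illustration concrete: the story doesn’t say what the failed component was, so here is a deliberately tiny Python sketch (all names hypothetical) of the property those architects skipped – a client that fails over to a standby replica automatically, so that one machine’s outage never requires operators to rebuild anything by hand:

```python
class Replica:
    """One instance of the key component; 'healthy' stands in for a real outage."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class FailoverClient:
    """Try the primary first; on failure, move on to the next healthy replica."""
    def __init__(self, replicas):
        self.replicas = list(replicas)

    def request(self, payload):
        last_error = None
        for replica in self.replicas:
            try:
                return replica.handle(payload)
            except ConnectionError as err:
                last_error = err  # this replica is down; try the next one
        raise RuntimeError("all replicas down") from last_error


primary, standby = Replica("primary"), Replica("standby")
client = FailoverClient([primary, standby])
client.request("ping")     # served by primary
primary.healthy = False
client.request("ping")     # same call, now served by standby -- no rebuild
```

The point isn’t this particular pattern; it’s that the capability has to be selected into the architecture up front, because no amount of heroics in the data center can bolt it on afterward.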
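The second story is about release and go-live architecture, and the post doesn’t describe the eventual fix. One common mechanism for reconciling impatient partners with a stable service (my illustration, with hypothetical names, not the company’s actual solution) is to ship new functionality dark behind per-account feature flags: a release goes out without changing behavior for existing customers, and a misbehaving feature is switched off with a flag flip rather than a code rollback:

```python
# feature name -> set of accounts the feature is live for
FLAGS = {"new_pricing": {"partner_a"}}

def enabled(feature, account):
    """Every release ships the code for every feature; flags decide who sees it."""
    return account in FLAGS.get(feature, set())

def quote(account, amount):
    if enabled("new_pricing", account):
        return round(amount * 0.9, 2)  # new partner-requested behavior
    return amount                      # unchanged behavior for existing customers

quote("partner_a", 100)          # takes the new code path
quote("longtime_customer", 100)  # stable path, untouched by the release
FLAGS["new_pricing"].clear()     # feature misbehaves: turn it off, no rollback
quote("partner_a", 100)          # back to stable behavior
```

The design choice being illustrated is the architectural one: decoupling *deploying* code from *activating* it is what lets a release cadence speed up without destabilizing the customers already on the service.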
I think the message is a clear and simple one: if your top minds are not already focused on the company’s most important issues – viz., those that are most important to your customers – get them focused on those seemingly mundane, tactical, near-term, nuts-and-bolts concerns. Now.