Don’t scale: 99.999% uptime is for Wal-Mart David 06 Dec 2005


Jeremy Wright perpetuates a common misconception about new companies doing business online: that you need 99.999% uptime or you’re toast. Not so. Basecamp doesn’t have that. I think our uptime is more like 98% or 99%. Guess what, we’re still here!

Wright correctly states that those last few percentage points are incredibly expensive. To go from 98% to 99% can cost thousands of dollars. To go from 99% to 99.9%, tens of thousands more. Now contrast that with the value. What kind of service are you providing? Does the world end if you’re down for 30 minutes?

If you’re Wal-Mart and your credit card processing pipeline stops for 30 minutes during prime time, yes, the world does end. Someone might very well be fired. The business loses millions of dollars. Wal-Mart gets in the news and loses millions more on the goodwill account.

Now what if Delicious, Feedster, or Technorati goes down for 30 minutes? How big is the inconvenience of not being able to get to your tagged bookmarks or do yet another ego-search with Feedster or Technorati for 30 minutes? Not that high. The world does not come to an end. Nobody gets fired.

Alistair Cockburn taught me a great name for this in Agile Software Development: criticality. The criticality of your average “Web 2.0” application is loss of comfort when something goes wrong. The criticality of Wal-Mart’s credit card processing, on the other hand, is probably at the level of essential money.

So the short summary is that it’s not a profitable decision to shoot for 99.999% availability for tagging bookmarks. But that’s not nearly as important as the real lesson:

Before you have users, it’s a waste of time ensuring that they can always get to the service

A project that spends a lot of time upfront on scalability is the one that can’t afford to fail. And a project that can’t afford to fail is an inherently uninteresting idea for a new growth business. You can’t carry around the label of Zero Risk (TM) and expect to be the next big thing. It will focus your energy on all the wrong things.

What you need is to embrace the goal of getting someone to care enough about your product that they’ll actually complain when it’s down. Once the first complaints start to trickle in, you know you’re riding something right, and then you can start caring about adding another percentage point or two.

Om Malik thinks that the running-with-scissors approach of most start-ups is a sign of a bubble. Awahh? The bubble was when people thought they needed to spend $3 million buying Sun servers and Oracle databases to build a site for wedding invitations.

The business smarts is not betting the farm before the crapshoot has turned into a sure bet. Fail cheap. Because odds are you’re going to. And you need to have your shirt for the second round.

So. Don’t scale. Don’t worry about five 9’s or even two. Worry about getting something to a point where there’s reason to worry about it.

UPDATE: Dare Obasanjo from Microsoft talks about how even the big guys have these issues from the position of a company that launched MSN Spaces and grew it to 3x LiveJournal in 1 year:

The fact is that everyone has scalability issues, no one can deal with their service going from zero to a few million users without revisiting almost every aspect of their design and architecture.

Spot on.

99 comments so far

David Heinemeier Hansson 06 Dec 05

The preemptive strike: Not worrying too much about scaling doesn’t mean building Basecamp on an Access database with a spaghetti soup of PHP making 250 database calls per request. It simply means having a profitable price per request, chilling the f*ck out, and dealing with problems as they arise.

Rube 06 Dec 05

Okay, this is spooky. You post this on the very same day that, for the first time in more than a year, I was unable to access my gmail for a majority of the day. I don’t know where that falls between 99.99% vs 98%, just thought it was an amazing coincidence.

Tory 06 Dec 05

I disagree. Technorati had a lot of downtime and developed a reputation for being unreliable. When something like that happens, people find a substitute, and then they stick with it.

I’d go even further and say that uptime is more important if you’re the small guy. If Wal-Mart’s credit card processing pipeline stopped when I was in the middle of trying to buy something I’d probably be pissed off. They’d lose a lot of money and sales (until the system came back online), but they’ve got a good reputation. It wouldn’t stop me from coming back to them.

That’s not true for small business. When you’re the small guy people are taking a chance by coming to you. If the first impression they get is that you’re unreliable then why would they bother coming back?

Ayse 06 Dec 05

And even Walmart doesn’t need 99.99% uptime. Most of their stores close, or if they’re open 24 hours, there are hours during the night when they are not processing credit cards, or when a 15-minute wait for the machines to come back online is acceptable. There’s plenty of time in there for rebooting a server, upgrading a switch, or what have you.

What I have found is that most people who want 99.99% uptime are completely unwilling to pay for it. Instead, they try to badger their staff into donating their time, free of charge. I totally get that friction in business causes innovation, but I see things like VPs yelling at sysadmins because a switch died at 2am and the site was offline for 15 minutes.

On the other hand, I (have no option but to) regularly use a web service that for mysterious reasons has to go offline every night at eleven and stay offline until around 7am. If your servers need that much maintenance, you have a problem.

(BTW, Rube, I could get to gmail all day, no problems.)

David Heinemeier Hansson 06 Dec 05

We’ve generally found that people are quite generous about forgiving the occasional downtime if you explain what happened, apologize, and do it quickly. Be honest, be human.

Sure. You can’t have an uptime of 90% and still expect people to love you. Nobody is calling that into question. Just that the primary objective is to get people to care for as little money as possible (since if they don’t, it’s all wasted anyway). That doesn’t leave room in the budget to ensure 99.999% on day 1, 2, 3, or 90.

Mike D. 06 Dec 05

A project that spends a lot of time upfront on scalability is the one that can’t afford to fail. And a project that can’t afford to fail is an inherently uninteresting idea for a new growth business.

Ok, I agree with the great majority of this blog entry but looped through the above paragraph several times and still can’t quite parse it. The second sentence in particular. Are you saying:

a) And a project that can’t afford to fail is *by definition* an inherently uninteresting idea for a new growth business.

or

b) And *an example of* a project that can’t afford to fail is an inherently uninteresting idea for a new growth business.

Or neither? B seems more likely but the language is a bit ambiguous and is throwing me off. A would seem to make no sense at all.

With regards to the rest of the entry, which is great, we had a similar saying at ESPN when a server would die for a few seconds or some other fleeting problem would come up:

“This isn’t a hospital. Nobody’s dying here. The worst that’s going to happen is people won’t be able to get their sports for a few minutes.”

Always helped to quell tension around the office…

Charge... 06 Dec 05

DHH and the …

Jamie Tibbetts 06 Dec 05

Interesting post, but I guess I’m failing to see why uptime should be such an expensive issue. Shouldn’t a site theoretically have 100% uptime? (Yes, I know, there’s no such thing as a perfect app.) Unless you have a poor web hosting provider or your code is somehow flawed, your site should never go down on its own. You may have to bring down the server for maintenance or to push a new update once in a while, but that’s almost a negligible amount of downtime in the big picture.

DHH, you mention that it costs a lot of money to get that extra few decimal places over 99%, but why? Where does the money go? A few examples would be enlightening I think.

And you say that Basecamp has around a 98% or 99% uptime, but I find that hard to believe. Is Basecamp really down for 15-30 minutes a day? I doubt it.

To me, >99% should be standard and fairly easy to attain, while anything less is cause for alarm.

Am I missing something?

Phil Nelson 06 Dec 05

In response to Jamie and others:

Speaking entirely technically, 99% uptime would be roughly 42 minutes of downtime per month, which as we all know is the answer to life, the universe, and everything.

24 hours in a day, 60 minutes in an hour, 30 days in a month, 4200 minutes, 99% of which is 42.

So, that’s about 1.4 minutes of downtime per day. Roughly.

You see this same kind of logic in lots of polls, mainly political ones. Ever wonder why there’s a 3%-5% margin of error? The money and time investment goes up exponentially for those last percentages.

Don Wilson 06 Dec 05

Indeed, exceeding the 99% mark would be quite easy. If it isn’t, then you’re obviously on a bad server/network/what have you.

“Is Basecamp really down for 15-30 minutes a day? I doubt it.”

15-30 minutes per month is what I thought while reading the article. The only time I’ve seen it down is when they’re upgrading, and it was for a lot longer than 15 minutes. But then again I rarely/never use Backpack/Basecamp/etc.

Phil 07 Dec 05

Ooer, ignore the “month” part up there.

My point being that when it’s spread out over a period of time, 99% is virtually indistinguishable from 99.99% to the casual observer, and only marginally noticeable to the observer with a vested interest.

Don Wilson 07 Dec 05

To throw in another comment in regards directly to the main post…

“The bubble was when people thought they needed to spend $3 million buying Sun servers and Oracle databases to build a site for wedding invitations.”

Actually, the bubble was when investors invested too much into companies that couldn’t deliver. Most of the companies back in the late ’90s were developed to be acquired.

It seems like most “Web 2.0” ‘companies’ are geared towards being acquired/invested in, but it also just happens to be the trend to say that they don’t want to/it’s not their goal. Do we see a trend here? :)

Anonymous Coward 07 Dec 05

Jamie, I think the point is *guaranteed* 99.99999% uptime. To guarantee that you need a pretty significant server cluster with full redundancy at every level (load balancers, switches, web servers, app servers, database servers, file servers, SSL accelerators, etc).

You may even want to have this set-up at multiple hosting locations around the country to hedge against a major connectivity issue at one farm.

And that’s just the hardware side of it. You need to build software to deal with all the redundancy (even across entire server farms). And the sys admins to handle all this in the event of an emergency. And…

So, the guarantee of 99.99999% uptime is a pretty serious financial undertaking and completely out of line for a new business just getting started.

Luke P 07 Dec 05

>The bubble was when people thought they needed to spend $3 million buying Sun servers and Oracle databases to build a site for wedding invitations.

That one had me laughing for a while.

>DHH, you mention that it costs a lot of money to get that extra few decimal places over 99%, but why? Where does the money go? A few examples would be enlightening I think.

I’m guessing that server clusters and other things of that sort that heighten redundancy produce the higher cost.

Jamie Tibbetts 07 Dec 05

Phil-

Your math is a bit off. ;) 24 * 60 * 30 = 43,200. 99% uptime is 432 minutes of downtime per month.
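If anyone wants to check the arithmetic for other windows, here’s a quick back-of-the-envelope script (Ruby, since that’s the house language around here; it assumes a 30-day month and a 365-day year):

# Rough downtime budget for a given uptime percentage.
# Assumes a 30-day month and a 365-day year for simplicity.
def downtime_budget(uptime_percent)
  down_fraction = 1 - uptime_percent / 100.0
  { "day"   => 24 * 60,
    "week"  => 7 * 24 * 60,
    "month" => 30 * 24 * 60,
    "year"  => 365 * 24 * 60 }.each do |window, minutes|
    printf("per %-5s %8.1f minutes down\n", window, minutes * down_fraction)
  end
end

downtime_budget(99.0)    # per month => 432.0 minutes
downtime_budget(99.999)  # per year  =>   5.3 minutes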

Jamie Tibbetts 07 Dec 05

A couple people have mentioned server clusters as a major cost, but that’s really only if you want to guarantee some obscene uptime like 99.999999%. You don’t need server clusters to achieve 99.5%. You’d just have to make sure you don’t have more than 216 minutes of downtime a month. Sounds doable to me.

Anonymous Coward 07 Dec 05

Jamie, your number sounds doable if nothing catastrophic happens, but you could easily be down 216 minutes *in a day* if your one box blows up or your database blows up or your switch melts down or your load balancer goes bust or your host’s net connection goes down or something on that level. That’s just 3.6 hours of downtime which would be a blessing if you had a serious problem. You’re likely down for over a day should you have no redundancies in place for serious problems.

David Heinemeier Hansson 07 Dec 05

Jamie: Remember that if you’re down for half a day for an upgrade, you’ve spent your downtime for a couple of months. So any additional downtime is going to make you dip below the 99.

Chris 07 Dec 05

Jamie, what happens when your non-redundant server’s drive crashes? If you think you can get a new drive in place and recover all the data off the busted drive or from a backup in a couple of hours and then be back to normal as if nothing happened then you haven’t really been through such an event.

Jamie Tibbetts 07 Dec 05

DHH-
Right. I get that. But if upgrades are included in your downtime calculations, how could you ever attain 99.99% uptime? The only solution would be to not have upgrades, and that’s not a very good solution. ;)

You mention it costs “tens of thousands of dollars” to get to the 99.99% mark, but I guess I’m still confused as how money can limit downtime. It certainly can’t make upgrades take less time. What are these thousands of dollars spent on?

Jamie Tibbetts 07 Dec 05

Chris-
Setting up a hardware RAID on your box isn’t the type of major cost DHH is talking about. He’s talking about tens of thousands of dollars. Having a redundant setup with an extra HD or two on a server would only be a few hundred bucks.

Fred 07 Dec 05

Use IIS and you’re set.

pwb 07 Dec 05

Totally agree with the post. At the outset, the vast majority of a startup’s resources should be expended on figuring out how to get customers and make money. Energy spent on scalability is mostly wasted. Providing scalability on a just-in-time basis is fairly easy. Few systems or services fail due to scalability but scores fail due to inability to attract customers or make money. Most startups can only dream of having scalability issues.

Tommy 07 Dec 05

As a guy with some “traditional” telecom background, five nines reliability (99.999%) was what a top-of-the-line Lucent PBX offered. I don’t know near as much about web servers and ISPs, but a high-end Lucent PBX could run you a ton of money, into the millions depending on the number of lines.

Also, think about it a second. When was the last time you picked up a land line phone and you didn’t have a dial tone?

Now think how often you notice your cable or DSL connection is down, DirecTV has lost a signal, a web site won’t load, or you’ve lost your cell phone signal.

I’d have to agree there is a huge difference between what it takes to have 98 percent uptime versus something like 99.99.

But that is just my two cents …

Tomas Jogin 07 Dec 05

I agree, 98% or 99% is a very nice uptime for a typical web app; ensuring 99.999% or whatever costs lots more money and is really quite unnecessary.

So many times have I heard people tell me of their ideas for a web app or site; they outline what it’s about, what it’s going to do, and what it needs. Almost every time, they spend 90% of their planning solving problems that they don’t have to address yet, or perhaps ever. Most of the time, these plans don’t even result in a single line of code, either.

Just-In-Time Problem Solving is what it’s about (again, without being stupid about it, 90% uptime is not going to cut it).

Spike 07 Dec 05

I think it is so important for a fledgling company to have good uptime. You’re building your reputation. If Tesco 404s me on a credit card page I will refresh with faith. I’ll come back tomorrow. If that happens with a company I’ve never heard of I’ll go someplace else.

Sam 07 Dec 05

Yeah, umm, interesting post, but your 10 minutes daily downtime comes at 6pm in New Zealand. Right when I’m trying to finish things off before I leave for the day.

It’s f**king annoying, and really, bearable but not altogether acceptable. So don’t try and pull it off as some kind of “less is more” shenanigan.

dusoft 07 Dec 05

PHP spaghetti soup? When was the last time you worked with PHP?

Ah, I forgot, Ruby saves the world. How pathetic.

Sean Brunnock 07 Dec 05

Do you take the same approach to security as you take to reliability?

oojee 07 Dec 05

and people wonder why there is so much criticism of web 2.0

Leonya 07 Dec 05

Does Textdrive (which you appear to be a part of) share the same approach? They surely seem to, that’s why I’m on the verge of dropping their service in favor of something more reliable…

Mark 07 Dec 05

The 98% of the time that you’re up is NOT what your customer experiences. They experience (most likely at the most inopportune time) the 2% slice of time you are not there.

It’s because Wal-Mart can offer 99.999% uptime (in the context of this post) that the customer expectation is that you will, as well.

It’s common to all of us. Case in point, have you ever gone back to surfing the Internet on a dial up after being on broadband for a while? It’s maddening! Not earth shattering, nor world ending, but frustrating none the less.

What if it takes a minute longer than you think it should to get your McFluffy breakfast sandwich and coffee in the morning when you’re about to start your very busy day?

It may not be the end of the world. But, it can very well be an earth shattering experience for your client at the moment.

In a world where the experience is more and more important, I say scale as much as you can.

Bill Preachuk 07 Dec 05

David - I’m glad your comment mentioned:

“We’ve generally found that people are quite generous about forgiving the occasional downtime if you explain what happened, apologize, and do it quickly. Be honest, be human.”

It’s sad that many Customers/Clients are so amazed when you give them truthful notification of what happened, why the site/service/app was down, and for how long. In the short-term… Goodwill.

Here’s something to add to the honest communication: let the customers know what your plan is to make sure that the problem is minimized (or entirely removed) in the future. Otherwise that goodwill evaporates really quickly.

Mark John 07 Dec 05

Scaling is not just about adding more hardware and setting up redundant databases and load balancers, but rather more about re-designing your code. Code redesign can only do so much to improve your site’s performance, but at least it’s the “cheapest” and easiest option, rather than investing millions in hardware.

Tom Dolan 07 Dec 05

The importance of being up is a variable of who the service is for. If it’s something for you, you can always say, “Oh, darn, not working just when I need it, but I still love you.” If it’s a service for your customers, and you’re happy with 98% uptime, that means you’re okay with losing the customers who encounter the 2%. For some businesses, like Wal*Mart, that makes perfect sense. For other businesses, particularly small consultancies or critical care providers, losing anyone can’t be so easily tolerated.

I completely agree that you need a product people want before you need a product that always works. But once you are lucky enough to *have* a product that customers want, you best ensure it works reliably (or price it accordingly). What if your phone or TV or car keys only worked 98% of the time? One would think there would be real market opportunity for a competitor that could provide a more reliable product.

joelfinkle 07 Dec 05

My wife and I run a small e-tail shop.
Downtime is deadly.
If Wal-Mart, Amazon, BestBuy, etc. are inaccessible for an hour or three (like CompUSA was in the first after-midnight hours of Black Friday), it’s forgivable — I know I can get back there later.

But for a mom-n-pop shop, where finding the site is the result of expensive clicks, random searches and just plain luck, being gone for even one page fetch is the kiss of death. Customers will not come back, thinking you dead.

This concept has been lost on my ISP, which only wants to credit me for the itsy-bitsy fragment of time I’ve been out of service. In my opinion, you break the 99% SLA, you refund my month at the very least.

It’s also astounding the number of ISPs I’ve called (including mine) that AREN’T using RAID on their disks and clustering of their servers on shared hosting plans: it would — at tiny additional cost — increase reliability and load balancing.

Michal Migurski 07 Dec 05

But if upgrades are included in your downtime calculations, how could you ever attain 99.99% uptime? The only solution would be to not have upgrades, and that’s not a very good solution.

Jamie, I think that companies who spend the $$$ to get that kind of reliability do factor in upgrade time. Part of the money goes toward buying two (or three) of everything so that services stay available during upgrades (think Klingon anatomy). This is why mainframes are so expensive - they’re designed to never, ever be turned off.

Paul Watson 07 Dec 05

Actually Technorati seems a good example of this, Tory. They are slow, they go down often, people complain everyday about them and yet… somehow people use it everyday and talk about Technorati everyday and as much as we dislike it, it survives and thrives. Weird.

Jeremy Wright 07 Dec 05

I never said people needed to be up for 5 9s in order to successfully run a business. But many of the companies in my list have had months where they weren’t even up for 2 9s overall on the month.

The reason was simple, they didn’t plan for growth.

Danny 07 Dec 05

I think there’s probably merit in both sides of this argument, but I would suggest that it’s a very retro kind of analysis based on the assumption that all functionality will be local to the particular system. The Web is the Platform. What’s the uptime of that?

since1968 07 Dec 05

Having just gone through a project launch on a server experiencing 98% uptime, I can personally attest that there’s a world of difference in perception between 98% and 99% uptime. Doesn’t sound like a critical difference, but it is.

Peter Cooper 07 Dec 05

FeedDigest needs ultra uptime, since a lot of people use it by adding Javascript includes onto their pages.. if FeedDigest isn’t operating properly, their pages never load.. and when you’re serving over 2 million page views a day, that gets pretty intense. In retrospect, it’s one thing I dislike about providing a service that has to be so rock solid.

Dan Ciruli 07 Dec 05

I think David makes a good point, but I think he mistook Jeremy’s original point. Jeremy wasn’t saying you need 99.999% uptime. What he said was that you should think about scalability from the time you design your software, not later. If you don’t think about scalability at design time, it’s going to be much more expensive to scale it later because you’ll be rewriting the damn thing.

David Heinemeier Hansson 07 Dec 05

I totally disagree with that: that you have to think about scalability on day 1 or it’ll be more expensive later. I’m of the reverse opinion. Thinking about scalability before you have real people using real data in real ways is kinda like fortune telling. You might be lucky and guess right, but it’s much more likely that you’ll think the bottlenecks will come in places they won’t, and miss the bottlenecks that do become troublesome.

In other words, premature optimization is the root of all evil.

But. Realize that I’m making these statements on the assumption that you’re using a sane tech stack. So not an Access database with a spaghetti ball of ASP, but more like any good LAMP stack (our choice being Ruby on Rails).

Michal Migurski 07 Dec 05

It’s hard to overstate how important experience is. In most cases I’m familiar with, uptime is a factor of the developers’ experience. IOW, planning for scale is not at all like fortune telling if you have a few big projects under your belt, and ideas simply “smell right” if they are any good — premature optimization won’t even be a consideration, because the idea is optimized at the design stage. The database is where most bottlenecks end up anyway - choice of dev frameworks is less of a factor in post-launch uptime than one would think.

Don Wilson 07 Dec 05

“PHP spaghetti soup? When was the last time you worked with PHP?

Ah, I forgot, Ruby saves the world. How pathetic.”

I agree. How many forum, cms, etc packages are available for Rails vs. PHP? What is the percentage of Rails hosting servers vs. PHP? Why is SvN, the blog of the company that started Ruby on Rails, written in PHP?

Because Rails is a designer’s “programming” language, and always will be.

Being Frank 07 Dec 05

So you guys are offline for no more than 30 minutes throughout an entire 24 hour cycle. Most likely you’re going off line when you have the least amount of users (1:00AM or something).

So what’s the big deal again?

pwb 07 Dec 05

This discussion always fascinates me. Even the handful of folks who agree, hedge.

And the examples given are fun, too. FeedDigest is a completely non-critical service that could go down for hours and it wouldn’t matter. In fact, the FeedDigest customer might not even find out about the outage since it’s his customers, not he, who might witness it. Technorati is another one. The site is frequently barely live and remains the leader.

The truth is, not only do you not need to do any planning for scalability, you compromise your chances for success by doing so.

No service ever failed from an inability to scale. Zillions of services fail from an inability to attract customers and make money.

Anonymous Coward 07 Dec 05

Why is SvN, the blog of the company that started Ruby on Rails, written in PHP?

Oh, Don. Understand what you are talking about before you comment. SvN runs on Movable Type, not PHP.

JF 07 Dec 05

How many forum, cms, etc packages are available for Rails vs. PHP?

And your point is? First off, Rails has been around only a little more than a year. Secondly, what does the quantity of packages available have to do with anything? Are you suggesting Subway makes the best sub sandwiches because they have the most locations?

Tomas Jogin 07 Dec 05

One thing to note here is that when they read “99% uptime” some people seem to think 99% a day (14.4 minutes) or 99% a week (100.8 minutes), distributed evenly.

Others are thinking more along the lines of 99% uptime over six or twelve months, and not distributed evenly at all; instead, caused by one or two nightly upgrades and perhaps one 20 minute crash recovery.

Being down 100 minutes every single week or 15 minutes every single day gives your users a hell of a lot worse experience than one or two scheduled nightly upgrades and a 20 minute crash recovery during the course of six to twelve months.

Don Wilson 07 Dec 05

“And your point is?”
People don’t write shared packages of software in Rails because it’s much harder to share with other people than just uploading files like you do with PHP. That’s my point.

“First off, Rails has been around only a little more than a year. Secondly, what does the quantity of packages available have to do with anything? Are you suggesting Subway makes the best sub sandwiches because they have the most locations?”
It’s not quantity that I’m after — it’s that there simply aren’t any out there. Aside from todo list tutorials, of course.

Don Wilson 07 Dec 05

“Oh, Don. Understand what you are talking about before you comment. SvN runs on Movable Type, not PHP.”

Instead of using Movable Type, why don’t they use their flagship programming language instead of PHP? Is there not a package available in Rails? Interesting…

JF 07 Dec 05

Yes, Don, there is a great package in Rails called Typo, but we don’t have time right now to completely change the back-end of this blog. MT is fine for our purposes at this time. Moving to Typo would not be time well spent right now.

Don Wilson 07 Dec 05

Typo has a migration script from MovableType that wouldn’t take too long at all, and you wouldn’t have people like me harping on your back because you use a competing programming language’s application.

Anyway, this isn’t the place for a PHP vs Rails debate anyhow. Thanks for the link.

Son Nguyen 07 Dec 05

It depends on the usage and type of business but definitely it’s an important element that should be considered in all steps of the development process. “Think big, start small”

pwb 07 Dec 05

wouldn’t take too long at all
Famous last words!!

So now you have to always use the same language for all your projects? That’s kind of a dumb idea.

JMasterson 07 Dec 05

Here’s an article that explores the 100% uptime question nicely.
http://www.codesta.com/knowledge/management/uptime%5Frealities/

Summary: Stuff breaks. DDOS attacks happen. And without completely automated recovery systems and completely redundant hardware, replication, etc, you’re doing awfully well at even 99.1% over time.

Alex Burton 07 Dec 05

I agree that you shouldn’t go overboard, but that said, it is not that hard to architect a system that has built-in redundancy (which might not be used in the initial deployment) and can scale horizontally!

Jeremy Wright 07 Dec 05

Again, to repeat, if your company lives or dies by its uptime, you’d damned well better be planning for it, building for it, coding for it, architecting for it, etc.

If 37signals doesn’t live or die by its uptime (as David claims), then this doesn’t apply to him.

But it goes beyond uptime, as Ian pointed out. If you’re up for 5 9s, but unusable due to slowness, that’s bad. Similarly, if you’re only up for 1 9, but all your downtime is when less than 1% of your users are online then that isn’t so good.

The point of the rant was directed at companies who DO need their uptime. Who DIDN’T plan for it. And who are now complaining that they couldn’t have foreseen this.

Nobody’s asking a small 3 man company to achieve massive uptime on Day 1. But if that 3 man company plans to deliver a service to millions, at least plan and be ready to implement well architected solutions. Don’t just think adding more iron’ll solve your problem because it introduces just as many issues as it solves.

Adria 08 Dec 05

I’m not a user of basecamp. We use a wiki (twiki.org) as a project-management system at work. A bare 98% uptime for this type of application is unacceptable. Discounting weekends, this would mean we would have almost 2 hours of downtime every couple of weeks. This makes working annoying and breaks employee trust in the project-management system. 98% uptime is ok for slashdot, not for tools used to earn a living.

Hi 08 Dec 05

There’s something in common with delicious, feedster, and technorati that you might not have thought about…

None of them have a business model that would cause them a lot of heartbreak if they had downtime. They’re all ad-based, IIRC.

Your average online store has a much larger problem, if they go down they are basically abandoning almost all orders that were in progress in that time period…. A lot of people will simply go to a competitor, complete the order, and forget about the site that went down on them.

Businesses who provide a service have it much easier in this regard, as for a variety of reasons, perhaps they paid for an account, perhaps the features are killer, etc, the people will come back and not miss a beat, perhaps bitching a little in the process.

I just don’t think that this (as, unfortunately, many things I see on this web log) was thought about after the initial strike of emotion arrived. It doesn’t really help anyone in the long run to make sensationalist remarks.

Jens Meiert 08 Dec 05

I claim that the impact of non-availability follows something like a parabola: it has a high impact for smaller sites (if somebody links to you, for example, he might not hesitate to remove this link if your site is down once - you never know), a lower impact for medium sites (the ones you mentioned, Delicious, Feedster, or Technorati, for example, since nobody expects them to be gone), and high impact again for large sites (like Wal-Mart, which suffers high losses in money and, more importantly, trust).

mj 08 Dec 05

joelfinkle wrote: This concept has been lost on my ISP, which only wants to credit me for the itsy-bitsy fragment of time I’ve been out of service. In my opinion, you break the 99% SLA, you refund my month at the very least.

Joel, I’m sure you will find a service that will refund you a whole month for 1% downtime. Hell, I’m sure you’ll even find a service that will offer you a million dollars for every 0.1% downtime you experience.

… but you will have to pay for it: arm, leg, and first several born.

So now the question is: how much does that blink of downtime really cost you? Are you ready to make that kind of investment in your infrastructure? Will you see an ROI on those last fractions of a percent uptime?

That is the point of this article. I’m reading a bunch of comments along the lines of “well, if your backend is reliable, then how is 99.99% hard?” Under normal circumstances that shit is easy: hell, I’ve had a 10-year-old PC running OpenBSD as a firewall; it went a year and a half without a reboot. Did I do anything special to ensure 100% uptime? No, I was damn lucky — and weather, the power company, local hoodlums, and everything else under Murphy’s belt just happened to miss me for those 18 months. When you want to be a breath away from always always on, you are paying for multiple data centers spread throughout the country on different backbones.

Joel, does your e-tail business make enough money to pay for that kinda uptime?

John D. Mitchell 08 Dec 05

Hmm… I posted my (fairly long) comment yesterday afternoon and it hasn’t appeared yet. Unless I’m forgetting something, my comment certainly fit your criteria for a posting. If you don’t wish to publish it, please email me the contents and I’ll blog it instead. Thanks, John.

Geoffrey Sneddon 08 Dec 05

I think Google Analytics is a good example. They’re hardly a minor company, yet they had scalability issues.

phpkerouac 08 Dec 05

the five nines always refers to “unscheduled” downtime. this does not include standard maintenance, upgrades, updates, et al…

It is only when the service is unavailable and you didn’t anticipate it.

JSR 08 Dec 05

“Hmm… I posted my (fairly long) comment yesterday afternoon and it hasn’t appeared yet. Unless I’m forgetting something, my comment certainly fit your criteria for a posting. If you don’t wish to publish it, please email me the contents and I’ll blog it instead. Thanks, John.”

You must have fallen into that 2%. :)

Just one comment having NOT read the other comments. This article is an easy argument to make, and I tend to agree with the sentiment. But speaking as someone who is now maintaining a very inefficient website, it’s MUCH harder to go back and fix things than it is to do things right in the first place. I think there’s more of a balance to strike here between speed to market and time spent building properly than you allow.

nate 08 Dec 05

“A few extra servers, and our reliability grows, how expensive was that?”

Well, now you’ve removed one of the single points of failure.

What if power goes out at your site? Throw in a generator.

What if a network cable goes down? Throw in two NICs.

What if the switch the NICs are plugged into goes down?
Throw in two switches.

What if the router dies? Throw in a pair of routers in a highly available scenario.

What if the link between you and your ISP goes down?
Throw in two links.
To two separate ISPs.
Peering BGP, so those routers you bought better be capable.

What if the cable from the telephone room to the telco is cut? Better put in two cables, on different sides of the building.

What if the site you’re hosting in is hosed by an earthquake/hurricane/tsunami/meteor?
Take everything you just doubled AND DOUBLE IT AGAIN.
And then add in a crapload of bandwidth between the sites so you can effectively backup and synchronize data between the two sites.

And really, you’re going to need another highly available third party to keep track of when your sites are ‘active’ or not: if both your sites think they’ve become the only available site, you end up with messy issues once they start to talk to each other again!

Once you start to comprehend the scales involved (and how quickly things get exponential), you realize just how expensive this stuff can get.
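To put rough numbers on that: when every one of those pieces is a single point of failure, the availabilities multiply, so the whole chain is weaker than any single link. A toy calculation (the component figures below are made up, just to show the shape of the problem):

# Availability of a serial chain of single points of failure:
# the service is up only when every component is up, so the
# individual availabilities multiply. All numbers are made up.
components = {
  "server"     => 0.999,
  "switch"     => 0.999,
  "router"     => 0.999,
  "power"      => 0.998,
  "uplink/ISP" => 0.995,
}
chain = components.values.reduce(:*)
puts "chain availability: #{(chain * 100).round(2)}%"   # ~99.0%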

miscblogger 08 Dec 05

I totally agree. Unless you are a site dealing with millions of credit card transactions, at least 98% uptime is alright.

Chris 08 Dec 05

“Once the first complaints start to trickle in.”

It’s no longer appropriate to wait until it starts to break to fix it. With the boom of social bookmarking, sites go from unknown to famous in minutes. If you don’t invest in a mature architecture, your site could be killed by its own popularity and quickly return to obscurity.

When you build with redundancy in mind, you often get the ability to scale. Redundant web servers can more easily become a farm of servers, etc.

Don’t strive for 98%. Would you publish an application if you knew 2% of the lines of code were buggy?

James Thomas 08 Dec 05

Don’t put all your eggs in one basket, and don’t promise something you can’t provide.

Random Shmuck 08 Dec 05

It’s pretty obvious who here has worked or thought around a high-reliability environment and who hasn’t. I work at a doctors’ office, and while we can afford to have about a 90% uptime (aka things are dead on the weekends), we can’t afford a critical failure. Everything is moderately redundant, backed up, etc.

Upgrades are rolled through the clusters - when you upgrade you do it only a few nodes at a time, working your way through using your hot spare capacity temporarily as you go. When you build clusters you make sure you have N+S redundancy at least, N being the number you actually need to do the job, and +S being your hot spare(s) - no less than 1. Automated processes transparently switch things from the primary to a spare in a relatively quick amount of time - from mere seconds on the citrix cluster, to just shy of 5 minutes on the SQL cluster (it takes time to cycle 4gb of tables into RAM from an 800GB raid 5 array, once the array has failed over to the backup server.)

The main reason the SQL failover is slow is because we didn’t invest the extra 10s of thousands it would have taken to actually make the SQL server N+1 load balanced. If we had invested in the redundant licensing, the uprated array controllers, etc., we could have near-instant failover… as it is, the 5 minutes is good enough for that odd weekend when one of our doctors is called in to do an emergency knee reconstruction - if it’s down when he gets there it will be back up by the time he’s out of surgery.
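The rolling-upgrade loop itself is nothing exotic; a minimal sketch looks something like this (the three helpers are stand-ins for whatever your load balancer and deploy tooling actually expose):

# Stand-in helpers; in real life these would talk to your load
# balancer and deployment tooling.
def take_out_of_rotation(node); puts "draining #{node}"; end
def upgrade(node); puts "upgrading #{node}"; end
def put_back_into_rotation(node); puts "re-enabling #{node}"; end

# Upgrade the cluster a small batch at a time so the remaining
# nodes keep serving traffic throughout.
def rolling_upgrade(nodes, batch_size = 1)
  nodes.each_slice(batch_size) do |batch|
    batch.each { |node| take_out_of_rotation(node) }
    batch.each { |node| upgrade(node) }
    batch.each { |node| put_back_into_rotation(node) }
  end
end

rolling_upgrade(%w[web1 web2 web3 web4], 2)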

RMX 08 Dec 05

You make things sound harder than they really are.

Setting up redundant, load balanced web servers and replication for a database is *extremely* cookie-cutter - and any reputable hosting provider will provide you with outlets connected to different circuits.

That alone will get you from 98% to 1 - (0.02 * 0.02) = 99.96% or so.
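Back-of-the-envelope, assuming the two halves fail independently:

# A redundant pair is down only when both halves are down at once
# (assuming independent failures).
single = 0.98
pair   = 1 - (1 - single) ** 2
puts "redundant pair availability: #{(pair * 100).round(2)}%"   # 99.96%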

The freedoms this minimal redundancy buys you:

* The ability to do zero-downtime upgrades by upgrading half your cluster at a time.
* The ability to test new releases on production machines.
* The ability to ignore 98% of downtimes until the next work day, which practically lets you ignore your pager (if your cost/benefit of downtime vs. overtime makes this a good tradeoff).

These alone make it well worth doing even for companies who don’t strictly need the uptime.

Heck, the productivity gains of being able to visit our colo at our leisure made it well worth it at my previous company — and we were “only” doing about $700 in revenue per hour (a half million per month).

In my new company, we cloned the architecture simply because the couple thousand dollar hardware costs pales compared to the advantages.

Luke Stiles 08 Dec 05

Hey - I used to work for that wedding site! Or at least one that spent nearly that much on the hardware and software as described.

But, I agree with the poster who said it was the investors’ fault. They gave the money with the understanding that it would be spent more or less as it was, often demanding that it be spent that way.

And besides, we were actually taking in significant data feeds from partners and had a pretty big database as a result. Still probably not quite worth it. But big iron is much more fun than (at the time) non-existent clusters.

Cable Guy 08 Dec 05

I was part of the teams that were bringing redundant networking to the stores. I know that Wal-World was keeping us busy, doing at least 15 to 20 stores a month… and that was myself, an electrician, and a cable grunt — paying for our meals, hotels, so on, and so forth.

The only part of Wal-Mart’s system that wasn’t redundant was their in-store controller (SMART/SAMS). Some of their database servers (JW1/JW2), their wireless network controllers (ABS01 and ABS02), and misc other things were redundant. They were also pushing out blade servers for the pharmacy while I was doing that work… they might be pushing out close to an entire store on blade servers in the near future.

Scott Partee 08 Dec 05

I’ve been banging my fist on tables about this for years. People simply don’t understand what’s involved with those last few decimal points. The expenses go up nearly exponentially to attain that level of uptime, so the business value had better be significant.

I have experience on both ends of the spectrum — from bootstrap ISP and low-rent hosting to Fortune 100 IT architecture and at all ends of the scale spectrum, people have very little understanding.

At the low end, you get people like the above-mentioned “mom and pop” guy with the “deadly” downtime. I don’t, for a second, doubt that downtime for him is extremely painful and costly, but the cost has to be understood in terms of the expense to eliminate that downtime cost. If this guy is paying for a small hosting account and DSL for a total of, say, $100/month, and that’s the main portion of his operating budget, then it will seem enormously important to him that his money guarantee what he perceives to be reliability. But to the phone company and providers, who have millions of customers and pay billions for infrastructure, his $100 does not justify highly redundant infrastructure.

Just about every modern midrange OS, from Windows Server 2003, OS X, Linux flavors, Solaris, HP-UX etc to the midrange networking equipment (i.e. not carrier grade) in non-redundant configurations running on decent hardware in a decent data center can get you to the mid 90s inclusive of planned downtime. Heck, my little Linksys router I use at my home goes 3 or 4 months before I have to reset it for a total of about 12 minutes of downtime/year.

To go from the mid 90s to 99% requires doubling of infrastructure and load balancers/clusters and a data center with redundant connections, backup power and redundant network hardware.

To guarantee anything above 99.0%, with more nines after the decimal, requires redundant sites.

As the commenter noted above “do everything you just did, and DOUBLE it”.

And, as others have noted, that’s just from the infrastructure perspective. Application software and custom application code has to be written to take advantage of the hardware redundancy or to route around failures. Even the Big Guys have a lot of issues with this, despite the millions if not billions spent to attain the next 9s to the right of those decimal points.

I run a low rent hosting service that is basically for friends and accomplices. One friend placed an ecommerce site he did for a client on there, and, while it was “fine” in terms of uptime (at about 98.9%), the client and he perceived the downtime to be too high given the fact that they did numerous hundreds of thousands in revenue through the site. I wasn’t offended when they left our service, but was sort of slightly amused to see they simply went to a comparable but more expensive single server service at another host and experienced slightly *more* downtime. I still think it was a good move because they get better support, but the fact is that for the vast majority of sites in the world, one does not need more than one server and one does not usually require support. And to bump past that first 99 is one hell of a thing.

Another misconception is that people often think of uptime or downtime in relation to a host. This is bad for a couple reasons. First, a host that has been up for a long time is a host that hasn’t been patched — and this is an invitation to unplanned downtime of the worst kind. Second, downtime should be measured at the “service” level. If your application is available and servicing requests, it doesn’t matter that an individual host is down, which is why horizontal scale and clusters are good.

In short, I agree with this post 99.999 percent, although I think you vastly underestimate the amount of expense required to go from 95% to 99% to 99.9% and beyond. If you knew the true cost (which I do because I spec these things out and design them all the time) your argument would sound even more persuasive ;)

I also agree that the cobbled together approach of Web 2.0, where every user’s personal experience is accomplished through an integration of diverse web services is going to result in the user noticing outages a lot more frequently.

Mat Atkinson 08 Dec 05

This uptime/downtime discussion is now running through several blogs. Lots of fun and giggles debating 98%, 99%, five nines and so on.

I worked at a dotcom in 1999/2000. We raised VC and spent $gazillions with Hewlett Packard on terabyte-this, failover-that, triangulated whatsitcalled. Great fun. Lots of happy techies with big smiles creating a BULLETPROOF web application.

Only problem was someone forgot to get the customers! We got it all wrong, should have started fast and low cost, figured out what the world wanted, then scaled.

Folks, plan to scale by all means, but don’t spend the cash until, and I mean JUST until you have to.

“Actually, the bubble was when investors invested too much into companies that couldn’t deliver. Most of the companies back in the late ’90s were developed to be acquired.”

The whole web 2.0 thing is starting to feel spookily dotcom. The only difference is that lower start-up costs mean that there could be even more new companies competing for attention since raising VC is not a barrier to getting a business off the ground.

sean coon 09 Dec 05

well, i worked at datek and ameritrade for three years. our bonuses were greatly affected if we had more than a few red days per quarter (red = slight system hiccups, not even downtime).

we weren’t selling product like wal-mart; we provided access to capital for timely trades in a virtual market. try designing/developing under those pressures.

Tyler Prete 09 Dec 05

Perhaps the best site that gets away with shitty uptime and scalability is Myspace.com. I would honestly be surprised if it has better than 95% uptime, and these downtimes and failures come often and at annoying, unacceptable times, but the site still continues to grow every single day. Users don’t seem to care. Perhaps this is just because they mostly cater to the teenage crowd, but the point still stands that just because something crashes doesn’t mean people immediately run somewhere else.

Peter Cooper 09 Dec 05

Don Wilson: One of the reasons there might not be many CMS, forum, etc, apps available on Rails just yet is because there’s so much money in Rails work right now that we’re (mostly) working on closed source, proprietary software to keep food in the cupboards.

When Rails coders are a dime a dozen like PHP coders, then perhaps more open source will follow.

Dan Ciruli 09 Dec 05

Hasn’t this discussion gotten fairly off-track from Jeremy Wright’s initial point? He certainly wasn’t advocating startups buying $500,000 worth of hardware for their alpha testing. He wasn’t really even talking about availability at all—he was talking about scalability. His post was about designing software properly so that, if and when you succeed, your software can scale.

Yes, you’re all right about uptime and availability—each “9” is another order-of-magnitude in cost, and you need to balance the need for that availability with your business needs.

But if you didn’t write your software to be scalable at the beginning, you’re going to end up spending money to rewrite software—and that is a waste of money (money that would be better spent on increased availability!).

Sourabh Niyogi 09 Dec 05

A lot of you are ignoring that a key reason for downtime has nothing to do with the number of servers or the claims of the data center; it has to do with how many people you have watching your server farm for attacks, and having them update it appropriately to minimize the chances of an attack. Having guys who know what is going on here is not a trivial problem, and the myth the “expensive” data center would like you to believe is that they will do 100% of it for you. Not true.

Another obvious issue is version upgrading. Everyone has to update their web servers all the time — database changes and algorithm design. Sometimes the migration path is achieved at 2am, where some engineer is awake for 2 hours doing that change (I do this myself occasionally for my SOAP/REST web services). Designing a system so that things work between 2am and 4am as well is way way harder than a system that goes down for those 2 hours.

In either case, it has nothing to do with the number of 9s that the data center claims or the cost of hardware. It’s about the cost of people who have to take the time to know what is going on (security) or have to go through (usually unnecessary) pain to get the extra 9s in there.

I think the first cost is inescapable given how much unpredictability happens on servers (not $hardware related). The second cost is very, very calculable, and your post is on the money about not going for the extra 9s.

Vinnie Mirchandani 10 Dec 05

I help companies negotiate outsourcing contracts and you are right - the high uptimes come at higher costs. But believe it or not, our small businesses have more global reach than the average $1B US corporation. My blog is read 40% outside the US. With that perspective, 99.9% uptime is often not good enough.

My question to you (and I asked the Six Apart guys the same question when Typepad had problems) - why not buy hosting services from a global outsourcer and tie them to a demanding SLA? The capacity you need should be a minor addition to their global infrastructure and people capacity… and such hosting has become so much cheaper compared to 5 years ago.

vinnie mrichandani 16 Dec 05

Six Apart, which hosts my Typepad blog and those of many others, has been down for at least 12 hours now and it appears they may not be able to recover the last few days’ posts of thousands of their subscribers. The cost of this ill will far, far exceeds the cost of high uptime. They had maintenance problems 3 months ago and offered customers a service extension to make up for the problems. Those refunds hurt revenues… this time around it will be far worse for them, I believe.

Dennis Howlett 16 Dec 05

Irony - the URL is 6A TP. Don’t expect anything too recent.

I’m with Vinnie on this. Scalability has always been an issue for Internet Whatever. That’s why IBM, TIBCO, Accenture, EDS and others earn billions of dollars doing the heavy lifting that’s integral to making the Internet a safe and reliable place to do business.

Today - 6A forgot that.

So now i’ll make a prediction. 90 days and counting - you work out the question.

JK 17 Dec 05

All I know is, it’s a lot easier to fix your website when you have two servers. It’s a lot easier to grow your website if you plan ahead to make it run on two servers, with a third database server, as well as remote development sandboxes. I’m maintaining a couple sites with code that doesn’t do this, and it’s a real pain in the ass.

Brett Leber 19 Dec 05

The timeliness of this post is striking, since del.icio.us has gone down for the second time in the past week due to an ugly power outage. (I’m sure their customers will understand once they come back up.)

Nazz 28 Dec 05


Back to the article, just because a company like Walmart loses credit card processing for 30 minutes does NOT mean they lose millions of dollars. It means they cannot process those millions of dollars yet. Not every single person will drop their items and walk out the door. Even those that do may come back and buy something later. It would be bad, yes, but many IT people overestimate the end result.

Next, 99.999% uptime is a total myth, nobody has it. If they say they do, you are being lied to. If you think you have 5 nines (~5 min. downtime per year), then I’m quite sure you are not patching your production databases, applications and operating systems and you probably have more security holes than you can count.

Nick 03 Jan 06

actually, 99.999% availability of a business service is not a total myth. that’s not uncommon at all.
what is uncommon are systems built with no redundant components but having 99.999% availability. as you say, it’s unlikely you can always perform routine preventative maintenance on single points of failure without affecting the service to the end users.
if you are taking out redundant components for some maintenance (scheduled or unscheduled), this may cause lower uptime figures for those components - but that’s irrelevant to the service user if the outage is transparent. this is in fact the reason this ‘5 nines’ availability business is so expensive - you’ve bought more than you need and developed a framework or process to deploy, manage and maintain them.

i’ve seen in some mid-sized retail stores, if the live connection to the bank goes down, they reach for those old fashioned carbon copy imprint credit card things. that’s because in some environments, people won’t shop twice - fashion clothing is one situation where I have seen it.

note 27 Mar 06

I was searching the web and found your entry . I really like your site and found it worth time reading through the post.
Thanks all.

e.senguttuvan 13 Apr 06

pls give a details to my id.i interested to business and sales benifits.and now live in india.so,pls give to money earning particulrs..thanks