Developing Perspective

#137: Catastrophic Failure


00:00:00   Hello and welcome to Developing Perspective. Developing Perspective is a podcast discussing

00:00:04   news of note in iOS development, Apple, and the like. I'm your host, David Smith. I'm

00:00:08   an independent iOS and Mac developer based in Herndon, Virginia. This is show number 137,

00:00:13   and today is Thursday, August 8th, 2013. Developing Perspective is never longer than 15 minutes,

00:00:18   so let's get started.

00:00:19   All right, so I'm going to be talking about catastrophic failure today. It's a thrilling

00:00:25   tale involving much sadness, much weeping, much gnashing of teeth that I had yesterday.

00:00:31   So that was Wednesday, August 7th.

00:00:34   If you're a Feed Wrangler user, you would have noticed that the service was down for

00:00:37   about half an hour; 35 minutes, I think, is where the actual outage

00:00:41   ended up being.

00:00:44   And so I'm going to kind of walk through what happened.

00:00:46   It was certainly a first for me.

00:00:48   And it's one of those things that you prepare for, you plan for, but you

00:00:53   really, really hope that this day never comes.

00:00:58   And so when it does come, it's kind of an event.

00:01:00   And I thought it'd be interesting

00:01:03   to kind of walk through what happened

00:01:04   and to talk about how I prepared for it,

00:01:05   what happened, what I've done,

00:01:09   and just kind of hopefully just put that out there

00:01:11   and be transparent.

00:01:14   Obviously, a lot of you, I think, are also customers,

00:01:15   so maybe it's interesting to hear what happened.

00:01:17   All right, so a bit of background first.

00:01:19   In addition to doing iOS and Mac,

00:01:20   I've done Ruby on Rails, which is a web framework.

00:01:25   I've worked with it professionally for,

00:01:28   it's coming up on, I think, seven or eight years.

00:01:30   I started learning and working with Ruby on Rails

00:01:33   back just before it went 1.0.

00:01:35   So I think it's quite a few years ago.

00:01:37   And before I was an iOS guy,

00:01:40   I was a professional Ruby on Rails consultant

00:01:41   or I worked at a company working as a Rails guy.

00:01:43   And I'm pretty familiar with it,

00:01:47   and I've been doing it for a long time.

00:01:46   I've worked through all the different changes to the language and to the way that it works.

00:01:51   I would say I know Rails pretty well.

00:01:55   I'm not the scary expert in Rails, because I have a bit too many fingers in too many pies.

00:01:57   I focus a lot more on iOS and Mac kind of stuff now.

00:02:04   So it's hard to-- I'm not quite as completely in that world as I used to be.

00:02:07   But I'd say I'm pretty good, more than competent.

00:02:12   If I was looking at a technical book, you know, in the back, they have the novice, intermediate, advanced,

00:02:15   a little thermometer.

00:02:17   I'd probably be in the advanced section on the thermometer.

00:02:20   So I know Rails pretty well.

00:02:22   So yesterday I was doing some work,

00:02:24   and part of what this work did is it included

00:02:27   a couple of what Rails calls database migrations.

00:02:29   And database migrations are always something

00:02:31   you want to be a little careful with,

00:02:32   but essentially what they do in Rails is,

00:02:35   every time you want to make

00:02:37   a database change, you describe it in a file

00:02:40   that's designed to be run in sequence.

00:02:42   And then when you run that migration, it goes through, works out what the most recent one

00:02:47   that you've run is, and then runs through all the new migrations that come after

00:02:53   that.

00:02:54   So this is typically things like adding columns to a database table, adding an index, that

00:02:58   kind of thing.
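To make that concrete, here's a rough plain-Ruby sketch of the bookkeeping being described. This is not Rails' actual implementation, and the version numbers and migration names are made up, but Rails does track applied migrations by a version prefix in much this way:

```ruby
# Sketch: Rails records each applied migration's version (a timestamp
# prefix on the filename) in a schema_migrations table. Running
# migrations means: find everything on disk newer than what's recorded,
# and run it in order.

# Versions already recorded as applied (normally read from the database).
applied = ["20130801120000", "20130803093000"]

# All migration files on disk, keyed by their version prefix.
migrations = {
  "20130801120000" => "create_users",
  "20130803093000" => "create_feeds",
  "20130807140000" => "add_index_to_articles",
}

# Pending = everything not yet recorded, run in version order.
pending = migrations.keys.sort - applied

pending.each do |version|
  puts "running #{version}_#{migrations[version]}"
  applied << version # record it so it never runs again
end
```

If the recorded state is lost or misread, the runner's idea of what is "pending" can be wildly off, which is part of why migrations deserve the caution described here.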

00:02:59   And so this is kind of what we had.

00:03:01   The work I was doing included one of those.

00:03:04   And then essentially what I do when I have a database change like this that's part of

00:03:08   some logic or some code that I'm updating

00:03:11   is I run my deploy script, and off it goes,

00:03:16   and it does this migration as part of the rollout process.

00:03:20   And so yesterday I've been doing this change,

00:03:24   kind of something I've done millions of times,

00:03:26   take it to my kind of staging database, run it there,

00:03:27   everything looks good, no problems,

00:03:31   things are looking good in terms of the code

00:03:33   and the performance, I'm like, "Yeah, this is great."

00:03:35   So I hit deploy.

00:03:37   Things are going along like normally.

00:03:40   The code cycle part of it kind of happens first,

00:03:39   kind of a two-phase process, and it rolled out the code, things looking good, and it

00:03:42   gets to the database part. And that's where things got interesting. So at that point,

00:03:50   I've watched probably thousands of these migration scripts being run over my years as a Rails

00:03:55   developer, and immediately I knew something was funny. Something wasn't going quite right,

00:04:03   because it was telling me about how it was creating automatic primary keys.

00:04:08   And based on what I was doing, that shouldn't have been part of it.

00:04:13   What I was doing, it was also affecting tables that weren't related to the changes that I was making.

00:04:18   So I'm like, "Wait a second. What's going on?"

00:04:23   And I immediately Alt-Tab over to a browser where I had Feed Wrangler running,

00:04:28   and I try to refresh the page and I get a 500 back,

00:04:33   like this system is down, what is going on?

00:04:37   I look over to the migration, it's finished, it's done.

00:04:40   I look at it and I'm like, what happened?

00:04:42   So I open up my production logs,

00:04:45   I look in the database area, and I realize that

00:04:47   rather than simply applying the small subset of changes

00:04:52   to the dataset that it was supposed to do,

00:04:54   the migration had actually gone through,

00:04:57   and, well, actually, maybe a better way

00:05:00   to say it is to tell the story.

00:05:02   So I open up my production log,

00:05:03   and I see the two words that anyone who's ever done

00:05:06   any database work, or any web work,

00:05:08   never wanted to see, and it's "drop table."

00:05:12   So yeah, what it had actually done,

00:05:14   is it had gone through and dropped and recreated

00:05:18   every table in Feed Wrangler,

00:05:20   which is awkward, because having dropped and recreated every table in the database

00:05:26   means that there was no data.

00:05:28   All the data is gone.

00:05:30   All the data was completely gone from the system.

00:05:36   My user table, my articles, user statuses, everything, just gone.

00:05:41   And so at that point, I had a moment of panic.

00:05:45   I think the only experience I can think of that was a similar kind of-- similar to this viscerally,

00:05:50   if you're a parent of a small child, you probably experienced this, where you see your kid about to

00:05:56   do something that's going to hurt them slightly. So the classic one is if you see them, they're

00:06:01   about to sort of fall out of a chair, or they're about to trip over something, and you have that

00:06:06   moment of just like, "Ah!" And you just freak out. I mean, what are you going to do? And then you go--

00:06:14   So I have that moment of freak out, and I'm like, "What just happened? It's all gone."

00:06:18   So immediately I go in and shut down the database, in the hopes

00:06:24   that the drop hadn't yet propagated off to my hot standby slave.

00:06:29   So I go in there, and of course it had already propagated; the slave had no data either.

00:06:34   And so now you have an interesting situation.

00:06:39   A good, reasonable backup strategy is kind of like an onion,

00:06:46   I guess you could say. It's got many layers. You're dealing with different failures and

00:06:50   different probabilities and different intensities of failures at different levels in your backup

00:06:55   strategy. So at your first level, you're kind of worried about things

00:07:01   like disk errors, disk failures, those types of things. And that's what something like

00:07:05   having a hot standby read slave is really helpful for.

00:07:10   So if my master database died for whatever reason,

00:07:13   like just the hard drive's dead, something happened,

00:07:15   which hopefully isn't that likely given that it's on a VPS,

00:07:20   on a RAID disk array, but these things happen.

00:07:23   If some crazy disk error happens

00:07:26   and it isn't a logical failure, it's a physical failure,

00:07:28   then the read slave would have been fine.

00:07:31   I could have just told it to become a master

00:07:30   and turn it into the primary.

00:07:32   So that's great, I would have been able to do that.

00:07:35   So that's that first level of backup I have.

00:07:37   Then you have a thing in Postgres where you can do

00:07:39   point-in-time recovery with write-ahead log archiving.

00:07:44   You have that kind of thing.
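For context, that Postgres feature is configured with a handful of settings: you archive each completed write-ahead-log segment somewhere safe, and on recovery you restore a base backup and replay WAL up to a target time. A rough sketch, where the paths and the timestamp are hypothetical and the exact mechanics vary by Postgres version:

```
# postgresql.conf -- archive each completed WAL segment as it finishes
wal_level = archive
archive_mode = on
archive_command = 'cp %p /var/backups/wal/%f'

# recovery.conf -- placed on a restored base backup; replay WAL, then
# stop just before the bad statement ran
restore_command = 'cp /var/backups/wal/%f %p'
recovery_target_time = '2013-08-07 14:05:00'
```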

00:07:45   And then I have on top of that,

00:07:48   I have the just like system level backups,

00:07:52   which is, I essentially run something that is

00:07:55   taking whole-machine snapshots on a regular basis,

00:07:59   and then it archives them at regular intervals.

00:08:04   I think there's a couple that run every couple hours,

00:08:06   then it keeps the most recent ones for a week,

00:08:12   and then it keeps one of them for each week for a month

00:08:17   or something like that.

00:08:21   It's kind of a cascading system.
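A cascading retention scheme like that can be sketched in a few lines of Ruby. The windows here (keep everything for seven days, then one snapshot per week out to a month) are assumptions for illustration, not the actual policy:

```ruby
require "date"

# Decide whether a snapshot taken on snapshot_date should be kept,
# given a cascading policy: keep everything recent, thin out older
# snapshots to one per week, and prune anything past a month.
def keep?(snapshot_date, today)
  age_days = (today - snapshot_date).to_i
  return true if age_days <= 7    # keep every recent snapshot
  return false if age_days > 31   # older than a month: prune
  snapshot_date.sunday?           # one per week: keep only Sundays
end

today = Date.new(2013, 8, 7)
snapshots = (0..40).map { |n| today - n } # one snapshot per day, going back
kept = snapshots.select { |d| keep?(d, today) }
```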

00:08:22   And so now, after the initial panic wears off,

00:08:26   I go into this-is-what-I'm-trained-for kind of mode

00:08:31   of, like, what do I do?

00:08:33   How do I recover this situation

00:08:36   in the way that is least impactful to users?

00:08:38   And obviously, in parallel to this,

00:08:42   I'm sitting there on Twitter and in other ways

00:08:44   trying to communicate with customers and saying,

00:08:46   yeah, there's an outage, things gone wrong.

00:08:49   And at first, I didn't really know

00:08:51   how bad it was going to be,

00:08:53   and so my messages are kind of more vague.

00:08:54   It's just like, sorry about the outage,

00:08:52   I'm working on it.

00:08:54   The next one's like, it's gonna be a while,

00:08:55   it's gonna be at least a few minutes

00:08:56   'cause I need to recover from a bad database state.

00:09:01   And then, while I'm doing that kind of

00:09:04   customer-side communication,

00:09:06   I'm also focused on: what should I do?

00:09:09   So logical recovery is out of the question

00:09:13   because like I said, the read slave

00:09:15   had also had this happen to it.

00:09:16   So for some reason it had propagated off

00:09:20   into the read slave as well.

00:09:22   And so then I had to do some kind of lower level recovery.

00:09:27   So the first thing I could do was kind of a point-in-time

00:09:28   thing with Postgres, where you take the write-ahead logs

00:09:31   and replay them to a certain point,

00:09:34   right before the failure happened.

00:09:36   And I looked at that and I could have tried to do it.

00:09:40   And so I was like, okay, well, I'll start looking at that.

00:09:44   And then Linode, which is where I host,

00:09:46   manages the backup system for me that, I suppose, ended up

00:09:49   actually being the thing that I used.

00:09:50   It also lets you take a backup

00:09:52   and spin up a whole new machine

00:09:54   that's a clone of that snapshot point in time.

00:09:56   So I go into Linode and I say,

00:09:57   "Hey, rebuild this, you know, my most recent database backup,

00:10:01   "as a whole new machine," and just sort of start that process.

00:10:05   'Cause it takes a few minutes while it copies a lot of data

00:10:08   and gets the machine set up.

00:10:09   So I was like, "Okay, I'm gonna kick that off,

00:10:11   "I'm gonna have that going as my fallback,

00:10:13   "and I'm gonna immediately start jumping

00:10:14   "into the actual recovery."

00:10:18   So I started looking at the point-in-time recovery.

00:10:23   And long story short, it ended up not working

00:10:24   quite as quickly as I would have liked.

00:10:27   I think I probably could have gotten it to work,

00:10:29   but it's a very fiddly process,

00:10:31   at least for me, not being a professional Postgres DBA,

00:10:33   to get it working exactly right.

00:10:37   And from a data integrity perspective,

00:10:38   I was a little bit concerned about leaving the database

00:10:40   in an inconsistent state or an unstable state,

00:10:42   or things that would ultimately come back and bite me.

00:10:45   And so instead, when the cloned Linode came back up,

00:10:48   I fired up the database,

00:10:53   and it wasn't as bad as I feared.

00:10:55   It was only a couple of hours,

00:10:58   or maybe several hours, probably,

00:11:00   I think it was about four hours-ish,

00:11:01   four or five hours of data that had been lost.

00:11:03   So that's not great.

00:11:07   The nature, at least, of RSS syncing

00:11:08   is that most of the data it contains

00:11:11   is widely distributed across the internet,

00:11:13   and I'm just caching it. So once the system

00:11:14   comes back up, I can go and re-download all the articles.

00:11:15   And users are mostly just missing their annotations:

00:11:19   whether they've read an article or not, stars, those types of things.

00:11:23   I had, I think, two users who created accounts

00:11:25   during that period that I had to go in and manually fix.

00:11:27   But generally, it wasn't too bad.

00:11:30   Got that spun up.

00:11:31   Actually, as a random bonus, I ended up--

00:11:33   it was a bit of a silver lining.

00:11:34   I upgraded the hardware that the machine runs on,

00:11:36   which I think I've been meaning to do for a while.

00:11:38   But I hadn't done it because it would have required

00:11:40   a few minutes of downtime.

00:11:41   So I figured, hey, while it's all systems down,

00:11:46   let's go ahead and run the upgrade.

00:11:49   So I did, so it's running on faster hardware now,

00:11:51   and it went pretty well.

00:11:54   You never really want to have to use your backup regimen for real.

00:11:57   It's the kind of thing that you,

00:11:59   well, you never want to have to go beyond testing.

00:12:01   I've played with these before.

00:12:04   I've done restores of my backups,

00:12:05   and I've kind of gone through this process a little bit.

00:12:06   But to actually have to go and rebuild the entire dataset

00:12:07   from backup is terrifying.

00:12:10   And I'm just glad it worked.

00:12:13   Things are back up and running now.

00:12:14   Things are actually faster than they were before the outage

00:12:16   because, like I said, I upgraded all the hardware

00:12:19   that it's running on.

00:12:20   But I definitely don't want to do that again.

00:12:23   And so I definitely won't be going anywhere

00:12:25   near database migrations in Rails for quite some time.

00:12:28   I still haven't worked out what happened.

00:12:31   I'm sure there was just something in the way

00:12:32   that it was configured.

00:12:34   I'd recently done a Git merge into that repository

00:12:37   that I think may have messed something up.

00:12:40   But actually, I think I know somebody

00:12:42   who can say this better than I can.

00:12:43   And just replace the word "season pass"

00:12:47   with "database" in this clip.

00:12:48   And I think you'll have a pretty good sense

00:12:50   of what I'll be doing for the next while.

00:12:53   - Okay, let's never touch that website again.

00:12:55   The season pass migration thing

00:12:58   obviously is pure evil.

00:13:00   We can never go to that page again.

00:13:01   Don't even load that page,

00:13:02   because if we accidentally load that page,

00:13:03   for all we know, it's going to scramble

00:13:05   our season passes again.

00:13:05   And then I had to start the whole process all over again of rearranging everything.

00:13:09   Not a good user experience.

00:13:13   Exactly. So we're not going anywhere near migrations for a while.

00:13:17   We're just going to stick with what we know. I'm going to be running some SQL.

00:13:21   You need to add a field to a database? I will be going into Postgres. I will be running exactly

00:13:25   the ALTER command that I know is safe and won't destroy things.
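The kind of single, explicit statement being described looks something like this; the table and column names here are hypothetical:

```sql
-- Add one column, by hand, in psql; nothing else is touched.
ALTER TABLE articles ADD COLUMN starred_at timestamp;

-- Add an index without blocking concurrent writes.
CREATE INDEX CONCURRENTLY index_articles_on_feed_id ON articles (feed_id);
```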

00:13:29   I'll never be typing the words "drop table" again in my life,

00:13:33   hopefully, if things are going well.

00:13:38   One thing that I think that was really encouraging

00:13:41   coming out of this, and I know this is certainly part

00:13:43   of just having a fairly geeky customer base,

00:13:45   is I tried to be very honest about what was happening

00:13:47   during the process and kind of what happened

00:13:49   after the process, and being very explicit about it.

00:13:51   It's not just kind of like, "something happened."

00:13:53   It's not like the dev center outage,

00:13:55   where it's this very vague thing, like no one really knows

00:13:56   what's going on.

00:13:58   I try to be very specific.

00:13:59   I didn't give quite as many details as I did

00:14:00   just now on the show, but I tried to be very explicit.

00:14:01   I think a lot of people, customers, really appreciate that.

00:14:06   Saying like, "Yep, sorry man, that happens.

00:14:08   Hope tomorrow's a better day for you."

00:14:11   Which is just really encouraging to me.

00:14:14   And I think it's a good example of how being as honest

00:14:15   as you can is often just what customers want.

00:14:17   They understand that you're human at the end of the day

00:14:21   and you're just going to do your best.

00:14:23   All right, that's it for today's show.

00:14:24   As always, if you have questions,

00:14:26   comments, concerns, or complaints,

00:14:27   I'm on Twitter @_davidsmith,

00:14:28   or by email at david@developingperspective.com.