Podcast Search

00:00:00 ◼ ► Hello and welcome to Developing Perspective. Developing Perspective is a podcast discussing

00:00:04 ◼ ► news of note in iOS development, Apple, and the like. I'm your host, David Smith. I'm

00:00:08 ◼ ► an independent iOS and Mac developer based in Heron, Virginia. This is show number 137,

00:00:13 ◼ ► and today is Thursday, August 8th, 2013. Developing Perspective is never longer than 15 minutes,

00:00:18 ◼ ► so let's get started.

00:00:19 ◼ ► All right, so I'm going to be talking about catastrophic failure today. It's a thrilling

00:00:25 ◼ ► tale involving much sadness, much weeping, much gnashing of teeth that I had yesterday.

00:00:31 ◼ ► So that was Wednesday, August 7th.

00:00:34 ◼ ► If you're a feed wrangler user, you would have noticed that the service was down for

00:00:37 ◼ ► about, I think it was about a half an hour, 35 minutes, I think is where I ended up with

00:00:41 ◼ ► the actual outage being.

00:00:44 ◼ ► And so I'm going to kind of walk through what happened.

00:00:46 ◼ ► It was certainly a first for me.

00:00:48 ◼ ► And it's one of those things that you kind of, you prepare for, you plan for, but you

00:00:53 ◼ ► you really, really hope that this day never comes.

00:00:58 ◼ ► And so when it does come, it's kind of an event.

00:01:00 ◼ ► And I thought it'd be interesting

00:01:03 ◼ ► to kind of walk through what happened

00:01:04 ◼ ► and to talk about how I prepared for it,

00:01:05 ◼ ► what happened, what I've done,

00:01:09 ◼ ► and just kind of hopefully just put that out there

00:01:11 ◼ ► and be transparent.

00:01:14 ◼ ► Obviously, a lot of you, I think, are also customers,

00:01:15 ◼ ► so maybe it's interesting to hear what happened.

00:01:17 ◼ ► All right, so a bit of background first.

00:01:19 ◼ ► I'm a, in addition to doing iOS and Mac,

00:01:20 ◼ ► I've done Ruby on Rails, which is a web framework.

00:01:25 ◼ ► I've worked professionally for,

00:01:28 ◼ ► it's coming up in I think about seven or eight years.

00:01:30 ◼ ► I started learning and working with Ruby on Rails

00:01:33 ◼ ► back just before it went 1.0.

00:01:35 ◼ ► So I think it's quite a few years ago.

00:01:37 ◼ ► And before I was an iOS guy,

00:01:40 ◼ ► I was a professional Ruby on Rails consultant

00:01:41 ◼ ► or I worked at a company working as a Rails guy.

00:01:43 ◼ ► And I'm pretty familiar with it,

00:01:47 ◼ ► and I've been doing it for a long time.

00:01:46 ◼ ► I've worked through all the different changes to the language and to the way that it works.

00:01:51 ◼ ► I would say I know Rails pretty well.

00:01:55 ◼ ► I'm not the scary expert in Rails, because I'm a bit too much fingers in too many pies.

00:01:57 ◼ ► I focus a lot more on iOS and Mac kind of stuff now.

00:02:04 ◼ ► So it's hard to-- I'm not quite as completely in that world as I used to be.

00:02:07 ◼ ► But I'd say I'm pretty good, more than competent.

00:02:12 ◼ ► If I was looking at a technical book, you know, in the back, they have the novice, intermediate, advanced,

00:02:15 ◼ ► a little thermometer.

00:02:17 ◼ ► I'd probably be in the advanced section on the thermometer.

00:02:20 ◼ ► So I know Rails pretty well.

00:02:22 ◼ ► So yesterday I was doing some work,

00:02:24 ◼ ► and part of what this did is it included

00:02:27 ◼ ► a couple of what Rails calls database migrations.

00:02:29 ◼ ► And database migrations are always something

00:02:31 ◼ ► you want to be a little careful with,

00:02:32 ◼ ► but essentially what they do is in Rails,

00:02:35 ◼ ► you describe a, every time you want to make

00:02:37 ◼ ► a database change, you describe it into a file

00:02:40 ◼ ► that's designed to be kind of running sequence.

00:02:42 ◼ ► And then when you run that migration, it goes through, works out what the most recent one

00:02:47 ◼ ► that you've run is, and then runs them through all the new migrations that happen beyond

00:02:53 ◼ ► that.

00:02:54 ◼ ► So this is typically things like adding columns to a database table, adding an index, that

00:02:58 ◼ ► kind of thing.

00:02:59 ◼ ► And so this is kind of what we had.

00:03:01 ◼ ► The work I was doing included one of those.

00:03:04 ◼ ► And then essentially what I do when I have a database change like this that's part of

00:03:08 ◼ ► some logic or some code that I'm updating.

00:03:11 ◼ ► I hit run my deploy script, and off it goes,

00:03:16 ◼ ► and it does this migration as part of the rollout process.

00:03:20 ◼ ► And so yesterday I've been doing this change,

00:03:24 ◼ ► kind of something I've done millions of times,

00:03:26 ◼ ► take it to my kind of staging database, run it there,

00:03:27 ◼ ► everything looks good, no problems,

00:03:31 ◼ ► things are looking good in terms of the code

00:03:33 ◼ ► and the performance, I'm like, "Yeah, this is great."

00:03:35 ◼ ► So I hit deploy.

00:03:37 ◼ ► Things are going along like normally.

00:03:40 ◼ ► The code cycle part of it kind of happens first,

00:03:39 ◼ ► kind of a two-phase process, and it rolled out the code, things looking good, and it

00:03:42 ◼ ► gets to the database part. And that's where things got interesting. So at that point,

00:03:50 ◼ ► I've watched probably thousands of these migration scripts being run over my years as a Rails

00:03:55 ◼ ► developer, and immediately I knew something was funny. Something wasn't going quite right,

00:04:03 ◼ ► because it was telling me about how it was creating automatic primary keys.

00:04:08 ◼ ► And based on what I was doing, that shouldn't have been part of it.

00:04:13 ◼ ► What I was doing, it was also affecting tables that weren't related to the changes that I was making.

00:04:18 ◼ ► So I'm like, "Wait a second. What's going on?"

00:04:23 ◼ ► And I immediately just have alt-tab over to a browser where I had feeder-angler running,

00:04:28 ◼ ► or I try to refresh the page and I get 500 back,

00:04:33 ◼ ► like this system is down, what is going on?

00:04:37 ◼ ► I look over to the migration, it's finished, it's done.

00:04:40 ◼ ► I look at it and I'm like, what happened?

00:04:42 ◼ ► So I open up my production logs,

00:04:45 ◼ ► I look in the database area, and I realize that

00:04:47 ◼ ► rather than simply applying the small subset of changes

00:04:52 ◼ ► to the dataset that it was supposed to do,

00:04:54 ◼ ► the migration had actually gone through,

00:04:57 ◼ ► and, well actually, maybe a better way to start,

00:05:00 ◼ ► to say, to tell a story.

00:05:02 ◼ ► So I open up my production log,

00:05:03 ◼ ► and I see the two words that anyone who's ever done

00:05:06 ◼ ► any database work, or any web work,

00:05:08 ◼ ► never wanted to see, and it's a drop table.

00:05:12 ◼ ► So yeah, what it had actually done,

00:05:14 ◼ ► is it had gone through and dropped and recreated

00:05:18 ◼ ► every table in Feed Wrangler,

00:05:20 ◼ ► which is awkward because having dropped and recreated every table in the database, it

00:05:26 ◼ ► means that there was no data.

00:05:28 ◼ ► All the data is gone.

00:05:30 ◼ ► All the data was completely gone from the system.

00:05:36 ◼ ► All my user table, my articles, user status, everything, just gone.

00:05:41 ◼ ► And so at that point, I had a moment of panic.

00:05:45 ◼ ► I think the only experience I can think of that was a similar kind of-- similar to this viscerally,

00:05:50 ◼ ► if you're a parent of a small child, you probably experienced this, where you see your kid about to

00:05:56 ◼ ► do something that's going to hurt them slightly. So the classic one is if you see them, they're

00:06:01 ◼ ► about to sort of fall out of a chair, or they're about to trip over something, and you have that

00:06:06 ◼ ► moment of just like, "Ah!" And you just freak out. I mean, what are you going to do? And then you go--

00:06:14 ◼ ► So I have that moment of freak out, and I'm like, "What just happened? It's all gone."

00:06:18 ◼ ► So immediately I go into the database, and shut down the database, and then in the hopes

00:06:24 ◼ ► that it hadn't yet propagated off to my hot standby slave.

00:06:29 ◼ ► So I go in there, and of course the slave had already propagated, the slave had no data either.

00:06:34 ◼ ► And so now you have an interesting situation.

00:06:39 ◼ ► thing. You kind of have, like a good reasonable backup strategy is kind of like an onion,

00:06:46 ◼ ► I guess you could say. It's got many layers. You're dealing with different failures and

00:06:50 ◼ ► different probabilities and different intensities of failures at different levels in your backup

00:06:55 ◼ ► strategy. So at your first level, you're kind of worried about things like database fit,

00:07:01 ◼ ► like disk errors, disk failures, those types of things. And that's what something like

00:07:05 ◼ ► having a hot stand by a read slave is really helpful for.

00:07:10 ◼ ► So if my master database died for whatever reason,

00:07:13 ◼ ► like just the hard drive's dead, something happened,

00:07:15 ◼ ► which hopefully isn't that likely given that it's on a VPS,

00:07:20 ◼ ► it's on a rated disk drive, but these things happen.

00:07:23 ◼ ► If some crazy disk error happens

00:07:26 ◼ ► and it isn't a logical failure, it's a physical failure,

00:07:28 ◼ ► then the read slave would have been fine.

00:07:31 ◼ ► I could have just told it to become a master

00:07:30 ◼ ► and turn it into the primary.

00:07:32 ◼ ► So that's great, I would have been able to do that.

00:07:35 ◼ ► So that's that first level of backup I have.

00:07:37 ◼ ► Then you have a thing in Postgres where you can do

00:07:39 ◼ ► point in time recovery with right ahead log archiving.

00:07:44 ◼ ► You have that kind of thing.

00:07:45 ◼ ► And then I have on top of that,

00:07:48 ◼ ► I have the just like system level backups,

00:07:52 ◼ ► which is, I essentially run something that is

00:07:55 ◼ ► taking whole machine snapshots on regular basis,

00:07:59 ◼ ► and then it archives them at regular intervals.

00:08:04 ◼ ► I think there's a couple that run every couple hours,

00:08:06 ◼ ► then it keeps the most recent ones for a week,

00:08:12 ◼ ► and then it keeps one of them for each week for a month

00:08:17 ◼ ► or something like that.

00:08:21 ◼ ► It's kind of a cascading system.

00:08:22 ◼ ► And so now after the initial panic works out,

00:08:26 ◼ ► this is what I'm trained for kind of mode,

00:08:31 ◼ ► of like, what do I do?

00:08:33 ◼ ► How do I recover this situation

00:08:36 ◼ ► in the way that is least impactful to users?

00:08:38 ◼ ► And obviously, in parallel to this,

00:08:42 ◼ ► I'm sitting there on Twitter and in other ways

00:08:44 ◼ ► trying to communicate with customers and saying,

00:08:46 ◼ ► yeah, there's an outage, things gone wrong.

00:08:49 ◼ ► And at first, I didn't really know

00:08:51 ◼ ► how bad it was going to be,

00:08:53 ◼ ► and so my messages are kind of more vague.

00:08:54 ◼ ► It's just like, sorry about the outage,

00:08:52 ◼ ► I'm working on it.

00:08:54 ◼ ► The next one's like it's gonna be a couple of,

00:08:55 ◼ ► it's gonna be at least a few minutes

00:08:56 ◼ ► 'cause I need to recover from a database state.

00:09:01 ◼ ► And then at one point, so then I kind of,

00:09:04 ◼ ► while I'm doing that kind of customer side,

00:09:06 ◼ ► I'm also focused on what should I do?

00:09:09 ◼ ► So logical recovery is out of the question

00:09:13 ◼ ► because like I said, the read slave

00:09:15 ◼ ► had also had this happen to it.

00:09:16 ◼ ► So for some reason it had propagated off

00:09:20 ◼ ► into the read slave as well.

00:09:22 ◼ ► And so then I had to do some kind of lower level recovery.

00:09:27 ◼ ► So the first part I could do was kind of a point in time

00:09:28 ◼ ► thing with Postgres where you take the write ahead logs

00:09:31 ◼ ► and replay them to a certain point

00:09:34 ◼ ► and right before the failure happened.

00:09:36 ◼ ► And I looked at that and I could have tried to do it.

00:09:40 ◼ ► And so I was like, okay, well, I'll start looking at that.

00:09:44 ◼ ► And then line node, which is where I'm,

00:09:46 ◼ ► manages the backup system for me that I suppose ended up

00:09:49 ◼ ► actually being the thing that I used,

00:09:50 ◼ ► also lets you, you can take a backup

00:09:52 ◼ ► and create a spin up a whole new machine

00:09:54 ◼ ► that's a clone of that snapshot point in time.

00:09:56 ◼ ► So I go into Liner and I say,

00:09:57 ◼ ► "Hey, rebuild this, you know, my most recent database

00:10:01 ◼ ► "as a whole new machine, and just sort of start that process."

00:10:05 ◼ ► 'Cause it takes a few minutes while it copies a lot of data

00:10:08 ◼ ► and gets the machine set up.

00:10:09 ◼ ► So I was like, "Okay, I'm gonna kick that off,

00:10:11 ◼ ► "I'm gonna have that going as my fallback,

00:10:13 ◼ ► "and I'm gonna immediately start jumping

00:10:14 ◼ ► "into the actual recovery."

00:10:18 ◼ ► So I started looking at the point-in-time recovery.

00:10:23 ◼ ► And long story short, it ended up not working

00:10:24 ◼ ► quite as quickly as I would have liked.

00:10:27 ◼ ► I think I probably could have gotten into work,

00:10:29 ◼ ► but it's a very fiddly process,

00:10:31 ◼ ► at least from me not being a professional Postgres DBA,

00:10:33 ◼ ► to get it working exactly right.

00:10:37 ◼ ► And from a data integrity perspective,

00:10:38 ◼ ► I was a little bit concerned of leaving the database

00:10:40 ◼ ► in an inconsistent state or an unstable state,

00:10:42 ◼ ► or things that will ultimately come back and bite me.

00:10:45 ◼ ► And so instead, the main line node came back up.

00:10:48 ◼ ► I sort of fired up the database,

00:10:53 ◼ ► and it wasn't as bad as I feared.

00:10:55 ◼ ► It was only a couple of hours,

00:10:58 ◼ ► or maybe several hours, probably,

00:11:00 ◼ ► I think it was about four hours-ish,

00:11:01 ◼ ► four or five hours of data that had been lost.

00:11:03 ◼ ► So that's not great.

00:11:07 ◼ ► The nature, at least, of RSS syncing

00:11:08 ◼ ► is that the data that most of what it's containing

00:11:11 ◼ ► is widely distributed across the internet,

00:11:14 ◼ ► and I'm just caching it.

00:11:13 ◼ ► comes back up, I can go and re-download all the articles.

00:11:15 ◼ ► And users are mostly just missing their annotations,

00:11:19 ◼ ► whether they've read it or not, stars, those types of things.

00:11:23 ◼ ► I had, I think, two users who created accounts

00:11:25 ◼ ► during that period that I had to go in and manually fix.

00:11:27 ◼ ► But generally, it wasn't too bad.

00:11:30 ◼ ► Got that spun up.

00:11:31 ◼ ► Actually, as a random bonus, I ended up--

00:11:33 ◼ ► it was a bit of a silver lining.

00:11:34 ◼ ► I upgraded the hardware that the machine runs on,

00:11:36 ◼ ► which I think I've been meaning to do for a while.

00:11:38 ◼ ► But I hadn't done it because it would have required

00:11:40 ◼ ► a few minutes of downtime.

00:11:41 ◼ ► So I figured, hey, while it's all systems down,

00:11:46 ◼ ► let's go ahead and run the upgrade.

00:11:49 ◼ ► So I did, so it's running on faster hardware now,

00:11:51 ◼ ► and it went pretty well.

00:11:54 ◼ ► You never really want to test your backup regimen.

00:11:57 ◼ ► It's the kind of thing that you,

00:11:59 ◼ ► or you never want to have to go beyond testing.

00:12:01 ◼ ► I've played with these before.

00:12:04 ◼ ► I've done restores, my backups,

00:12:05 ◼ ► and I've kind of gone through this process a little bit.

00:12:06 ◼ ► But to actually have to go and rebuild the entire dataset

00:12:07 ◼ ► from backup is terrifying.

00:12:10 ◼ ► And I'm just glad it worked.

00:12:13 ◼ ► Things are back up and running now.

00:12:14 ◼ ► Things are actually faster than they were before the outage

00:12:16 ◼ ► because, like I said, I upgraded all the hardware

00:12:19 ◼ ► that it's running on.

00:12:20 ◼ ► But I definitely don't want to do that again.

00:12:23 ◼ ► And so I definitely won't be going anywhere

00:12:25 ◼ ► near database migrations in Rails for quite some time.

00:12:28 ◼ ► I still haven't worked out what happened.

00:12:31 ◼ ► I'm sure there was just something in the way

00:12:32 ◼ ► that it was configured.

00:12:34 ◼ ► I'd recently done a Git merge into that repository

00:12:37 ◼ ► that I think maybe have messed something up.

00:12:40 ◼ ► But actually, I think I know somebody

00:12:42 ◼ ► who can say this better than I did.

00:12:43 ◼ ► And just replace the word "season pass"

00:12:47 ◼ ► with "database" in this clip.

00:12:48 ◼ ► And I think you'll have a pretty good sense

00:12:50 ◼ ► of what I'll be doing for the next while.

00:12:53 ◼ ► - Okay, let's never touch that website again.

00:12:55 ◼ ► Does the season pass migration thing?

00:12:58 ◼ ► Obviously is pure evil.

00:13:00 ◼ ► We can never go to that page again.

00:13:01 ◼ ► Don't even load that page,

00:13:02 ◼ ► because if we accidentally load that page,

00:13:03 ◼ ► for all we know, it's going to scramble

00:13:05 ◼ ► our season passes again.

00:13:05 ◼ ► And then I had to start the whole process all over again of rearranging everything.

00:13:09 ◼ ► Not a good user experience.

00:13:13 ◼ ► Exactly. So we're not going anywhere near migrations for a while.

00:13:17 ◼ ► We're just going to stick with what we know. I'm going to be running some SQL.

00:13:21 ◼ ► You need to add a field to a database. I will be going into Postgres. I will be running exactly

00:13:25 ◼ ► the alter command that I know is safe and won't destroy things.

00:13:29 ◼ ► I'll be retyping the birds drop table again in my life.

00:13:33 ◼ ► hopefully things are going well.

00:13:38 ◼ ► One thing that I think that was really encouraging

00:13:41 ◼ ► coming out of this, and I know this is certainly part

00:13:43 ◼ ► of just having a fairly geeky customer base,

00:13:45 ◼ ► is I tried to be very honest about what was happening

00:13:47 ◼ ► during the process and kind of what happened

00:13:49 ◼ ► after the process, and being very explicit about it.

00:13:51 ◼ ► It's not just kind of like, something happened.

00:13:53 ◼ ► It's kind of like the dev center outage,

00:13:55 ◼ ► where it's this very vague, like no one really knows

00:13:56 ◼ ► what's going on.

00:13:58 ◼ ► I try to be very specific.

00:13:59 ◼ ► I didn't get quite as many details as I did

00:14:00 ◼ ► just now on the show, but I try to be very explicit.

00:14:01 ◼ ► I think a lot of people, customers, really appreciate that.

00:14:06 ◼ ► Saying like, "Yep, sorry man, that happens.

00:14:08 ◼ ► Hope you hope tomorrow's a better day for you."

00:14:11 ◼ ► Which is just really encouraging to me.

00:14:14 ◼ ► And I think it's a good example of just being honest

00:14:15 ◼ ► as best you can is often just what customers want.

00:14:17 ◼ ► They understand that you're human at the end of the day

00:14:21 ◼ ► and you're just going to do your best.

00:14:23 ◼ ► All right, that's it for today's show.

00:14:24 ◼ ► As always, if you have questions,

00:14:26 ◼ ► comments, concerns, or complaints,

00:14:27 ◼ ► I'm on Twitter @_davidsmith,

00:14:28 ◼ ► david@developingperspective.com.

PodSearch

Developing Perspective

#137: Catastrophic Failure