#137: Catastrophic Failure
00:00:00
◼
►
Hello and welcome to Developing Perspective. Developing Perspective is a podcast discussing
00:00:04
◼
►
news of note in iOS development, Apple, and the like. I'm your host, David Smith. I'm
00:00:08
◼
►
an independent iOS and Mac developer based in Heron, Virginia. This is show number 137,
00:00:13
◼
►
and today is Thursday, August 8th, 2013. Developing Perspective is never longer than 15 minutes,
00:00:18
◼
►
so let's get started.
00:00:19
◼
►
All right, so I'm going to be talking about catastrophic failure today. It's a thrilling
00:00:25
◼
►
tale involving much sadness, much weeping, much gnashing of teeth that I had yesterday.
00:00:31
◼
►
So that was Wednesday, August 7th.
00:00:34
◼
►
If you're a feed wrangler user, you would have noticed that the service was down for
00:00:37
◼
►
about, I think it was about a half an hour, 35 minutes, I think is where I ended up with
00:00:41
◼
►
the actual outage being.
00:00:44
◼
►
And so I'm going to kind of walk through what happened.
00:00:46
◼
►
It was certainly a first for me.
00:00:48
◼
►
And it's one of those things that you kind of, you prepare for, you plan for, but you
00:00:53
◼
►
you really, really hope that this day never comes.
00:00:58
◼
►
And so when it does come, it's kind of an event.
00:01:00
◼
►
And I thought it'd be interesting
00:01:03
◼
►
to kind of walk through what happened
00:01:04
◼
►
and to talk about how I prepared for it,
00:01:05
◼
►
what happened, what I've done,
00:01:09
◼
►
and just kind of hopefully just put that out there
00:01:11
◼
►
and be transparent.
00:01:14
◼
►
Obviously, a lot of you, I think, are also customers,
00:01:15
◼
►
so maybe it's interesting to hear what happened.
00:01:17
◼
►
All right, so a bit of background first.
00:01:19
◼
►
I'm a, in addition to doing iOS and Mac,
00:01:20
◼
►
I've done Ruby on Rails, which is a web framework.
00:01:25
◼
►
I've worked professionally for,
00:01:28
◼
►
it's coming up in I think about seven or eight years.
00:01:30
◼
►
I started learning and working with Ruby on Rails
00:01:33
◼
►
back just before it went 1.0.
00:01:35
◼
►
So I think it's quite a few years ago.
00:01:37
◼
►
And before I was an iOS guy,
00:01:40
◼
►
I was a professional Ruby on Rails consultant
00:01:41
◼
►
or I worked at a company working as a Rails guy.
00:01:43
◼
►
And I'm pretty familiar with it,
00:01:47
◼
►
and I've been doing it for a long time.
00:01:46
◼
►
I've worked through all the different changes to the language and to the way that it works.
00:01:51
◼
►
I would say I know Rails pretty well.
00:01:55
◼
►
I'm not the scary expert in Rails, because I'm a bit too much fingers in too many pies.
00:01:57
◼
►
I focus a lot more on iOS and Mac kind of stuff now.
00:02:04
◼
►
So it's hard to-- I'm not quite as completely in that world as I used to be.
00:02:07
◼
►
But I'd say I'm pretty good, more than competent.
00:02:12
◼
►
If I was looking at a technical book, you know, in the back, they have the novice, intermediate, advanced,
00:02:15
◼
►
a little thermometer.
00:02:17
◼
►
I'd probably be in the advanced section on the thermometer.
00:02:20
◼
►
So I know Rails pretty well.
00:02:22
◼
►
So yesterday I was doing some work,
00:02:24
◼
►
and part of what this did is it included
00:02:27
◼
►
a couple of what Rails calls database migrations.
00:02:29
◼
►
And database migrations are always something
00:02:31
◼
►
you want to be a little careful with,
00:02:32
◼
►
but essentially what they do is in Rails,
00:02:35
◼
►
you describe a, every time you want to make
00:02:37
◼
►
a database change, you describe it into a file
00:02:40
◼
►
that's designed to be kind of running sequence.
00:02:42
◼
►
And then when you run that migration, it goes through, works out what the most recent one
00:02:47
◼
►
that you've run is, and then runs them through all the new migrations that happen beyond
00:02:54
◼
►
So this is typically things like adding columns to a database table, adding an index, that
00:02:58
◼
►
kind of thing.
00:02:59
◼
►
And so this is kind of what we had.
00:03:01
◼
►
The work I was doing included one of those.
00:03:04
◼
►
And then essentially what I do when I have a database change like this that's part of
00:03:08
◼
►
some logic or some code that I'm updating.
00:03:11
◼
►
I hit run my deploy script, and off it goes,
00:03:16
◼
►
and it does this migration as part of the rollout process.
00:03:20
◼
►
And so yesterday I've been doing this change,
00:03:24
◼
►
kind of something I've done millions of times,
00:03:26
◼
►
take it to my kind of staging database, run it there,
00:03:27
◼
►
everything looks good, no problems,
00:03:31
◼
►
things are looking good in terms of the code
00:03:33
◼
►
and the performance, I'm like, "Yeah, this is great."
00:03:35
◼
►
So I hit deploy.
00:03:37
◼
►
Things are going along like normally.
00:03:40
◼
►
The code cycle part of it kind of happens first,
00:03:39
◼
►
kind of a two-phase process, and it rolled out the code, things looking good, and it
00:03:42
◼
►
gets to the database part. And that's where things got interesting. So at that point,
00:03:50
◼
►
I've watched probably thousands of these migration scripts being run over my years as a Rails
00:03:55
◼
►
developer, and immediately I knew something was funny. Something wasn't going quite right,
00:04:03
◼
►
because it was telling me about how it was creating automatic primary keys.
00:04:08
◼
►
And based on what I was doing, that shouldn't have been part of it.
00:04:13
◼
►
What I was doing, it was also affecting tables that weren't related to the changes that I was making.
00:04:18
◼
►
So I'm like, "Wait a second. What's going on?"
00:04:23
◼
►
And I immediately just have alt-tab over to a browser where I had feeder-angler running,
00:04:28
◼
►
or I try to refresh the page and I get 500 back,
00:04:33
◼
►
like this system is down, what is going on?
00:04:37
◼
►
I look over to the migration, it's finished, it's done.
00:04:40
◼
►
I look at it and I'm like, what happened?
00:04:42
◼
►
So I open up my production logs,
00:04:45
◼
►
I look in the database area, and I realize that
00:04:47
◼
►
rather than simply applying the small subset of changes
00:04:52
◼
►
to the dataset that it was supposed to do,
00:04:54
◼
►
the migration had actually gone through,
00:04:57
◼
►
and, well actually, maybe a better way to start,
00:05:00
◼
►
to say, to tell a story.
00:05:02
◼
►
So I open up my production log,
00:05:03
◼
►
and I see the two words that anyone who's ever done
00:05:06
◼
►
any database work, or any web work,
00:05:08
◼
►
never wanted to see, and it's a drop table.
00:05:12
◼
►
So yeah, what it had actually done,
00:05:14
◼
►
is it had gone through and dropped and recreated
00:05:18
◼
►
every table in Feed Wrangler,
00:05:20
◼
►
which is awkward because having dropped and recreated every table in the database, it
00:05:26
◼
►
means that there was no data.
00:05:28
◼
►
All the data is gone.
00:05:30
◼
►
All the data was completely gone from the system.
00:05:36
◼
►
All my user table, my articles, user status, everything, just gone.
00:05:41
◼
►
And so at that point, I had a moment of panic.
00:05:45
◼
►
I think the only experience I can think of that was a similar kind of-- similar to this viscerally,
00:05:50
◼
►
if you're a parent of a small child, you probably experienced this, where you see your kid about to
00:05:56
◼
►
do something that's going to hurt them slightly. So the classic one is if you see them, they're
00:06:01
◼
►
about to sort of fall out of a chair, or they're about to trip over something, and you have that
00:06:06
◼
►
moment of just like, "Ah!" And you just freak out. I mean, what are you going to do? And then you go--
00:06:14
◼
►
So I have that moment of freak out, and I'm like, "What just happened? It's all gone."
00:06:18
◼
►
So immediately I go into the database, and shut down the database, and then in the hopes
00:06:24
◼
►
that it hadn't yet propagated off to my hot standby slave.
00:06:29
◼
►
So I go in there, and of course the slave had already propagated, the slave had no data either.
00:06:34
◼
►
And so now you have an interesting situation.
00:06:39
◼
►
thing. You kind of have, like a good reasonable backup strategy is kind of like an onion,
00:06:46
◼
►
I guess you could say. It's got many layers. You're dealing with different failures and
00:06:50
◼
►
different probabilities and different intensities of failures at different levels in your backup
00:06:55
◼
►
strategy. So at your first level, you're kind of worried about things like database fit,
00:07:01
◼
►
like disk errors, disk failures, those types of things. And that's what something like
00:07:05
◼
►
having a hot stand by a read slave is really helpful for.
00:07:10
◼
►
So if my master database died for whatever reason,
00:07:13
◼
►
like just the hard drive's dead, something happened,
00:07:15
◼
►
which hopefully isn't that likely given that it's on a VPS,
00:07:20
◼
►
it's on a rated disk drive, but these things happen.
00:07:23
◼
►
If some crazy disk error happens
00:07:26
◼
►
and it isn't a logical failure, it's a physical failure,
00:07:28
◼
►
then the read slave would have been fine.
00:07:31
◼
►
I could have just told it to become a master
00:07:30
◼
►
and turn it into the primary.
00:07:32
◼
►
So that's great, I would have been able to do that.
00:07:35
◼
►
So that's that first level of backup I have.
00:07:37
◼
►
Then you have a thing in Postgres where you can do
00:07:39
◼
►
point in time recovery with right ahead log archiving.
00:07:44
◼
►
You have that kind of thing.
00:07:45
◼
►
And then I have on top of that,
00:07:48
◼
►
I have the just like system level backups,
00:07:52
◼
►
which is, I essentially run something that is
00:07:55
◼
►
taking whole machine snapshots on regular basis,
00:07:59
◼
►
and then it archives them at regular intervals.
00:08:04
◼
►
I think there's a couple that run every couple hours,
00:08:06
◼
►
then it keeps the most recent ones for a week,
00:08:12
◼
►
and then it keeps one of them for each week for a month
00:08:17
◼
►
or something like that.
00:08:21
◼
►
It's kind of a cascading system.
00:08:22
◼
►
And so now after the initial panic works out,
00:08:26
◼
►
this is what I'm trained for kind of mode,
00:08:31
◼
►
of like, what do I do?
00:08:33
◼
►
How do I recover this situation
00:08:36
◼
►
in the way that is least impactful to users?
00:08:38
◼
►
And obviously, in parallel to this,
00:08:42
◼
►
I'm sitting there on Twitter and in other ways
00:08:44
◼
►
trying to communicate with customers and saying,
00:08:46
◼
►
yeah, there's an outage, things gone wrong.
00:08:49
◼
►
And at first, I didn't really know
00:08:51
◼
►
how bad it was going to be,
00:08:53
◼
►
and so my messages are kind of more vague.
00:08:54
◼
►
It's just like, sorry about the outage,
00:08:52
◼
►
I'm working on it.
00:08:54
◼
►
The next one's like it's gonna be a couple of,
00:08:55
◼
►
it's gonna be at least a few minutes
00:08:56
◼
►
'cause I need to recover from a database state.
00:09:01
◼
►
And then at one point, so then I kind of,
00:09:04
◼
►
while I'm doing that kind of customer side,
00:09:06
◼
►
I'm also focused on what should I do?
00:09:09
◼
►
So logical recovery is out of the question
00:09:13
◼
►
because like I said, the read slave
00:09:15
◼
►
had also had this happen to it.
00:09:16
◼
►
So for some reason it had propagated off
00:09:20
◼
►
into the read slave as well.
00:09:22
◼
►
And so then I had to do some kind of lower level recovery.
00:09:27
◼
►
So the first part I could do was kind of a point in time
00:09:28
◼
►
thing with Postgres where you take the write ahead logs
00:09:31
◼
►
and replay them to a certain point
00:09:34
◼
►
and right before the failure happened.
00:09:36
◼
►
And I looked at that and I could have tried to do it.
00:09:40
◼
►
And so I was like, okay, well, I'll start looking at that.
00:09:44
◼
►
And then line node, which is where I'm,
00:09:46
◼
►
manages the backup system for me that I suppose ended up
00:09:49
◼
►
actually being the thing that I used,
00:09:50
◼
►
also lets you, you can take a backup
00:09:52
◼
►
and create a spin up a whole new machine
00:09:54
◼
►
that's a clone of that snapshot point in time.
00:09:56
◼
►
So I go into Liner and I say,
00:09:57
◼
►
"Hey, rebuild this, you know, my most recent database
00:10:01
◼
►
"as a whole new machine, and just sort of start that process."
00:10:05
◼
►
'Cause it takes a few minutes while it copies a lot of data
00:10:08
◼
►
and gets the machine set up.
00:10:09
◼
►
So I was like, "Okay, I'm gonna kick that off,
00:10:11
◼
►
"I'm gonna have that going as my fallback,
00:10:13
◼
►
"and I'm gonna immediately start jumping
00:10:14
◼
►
"into the actual recovery."
00:10:18
◼
►
So I started looking at the point-in-time recovery.
00:10:23
◼
►
And long story short, it ended up not working
00:10:24
◼
►
quite as quickly as I would have liked.
00:10:27
◼
►
I think I probably could have gotten into work,
00:10:29
◼
►
but it's a very fiddly process,
00:10:31
◼
►
at least from me not being a professional Postgres DBA,
00:10:33
◼
►
to get it working exactly right.
00:10:37
◼
►
And from a data integrity perspective,
00:10:38
◼
►
I was a little bit concerned of leaving the database
00:10:40
◼
►
in an inconsistent state or an unstable state,
00:10:42
◼
►
or things that will ultimately come back and bite me.
00:10:45
◼
►
And so instead, the main line node came back up.
00:10:48
◼
►
I sort of fired up the database,
00:10:53
◼
►
and it wasn't as bad as I feared.
00:10:55
◼
►
It was only a couple of hours,
00:10:58
◼
►
or maybe several hours, probably,
00:11:00
◼
►
I think it was about four hours-ish,
00:11:01
◼
►
four or five hours of data that had been lost.
00:11:03
◼
►
So that's not great.
00:11:07
◼
►
The nature, at least, of RSS syncing
00:11:08
◼
►
is that the data that most of what it's containing
00:11:11
◼
►
is widely distributed across the internet,
00:11:14
◼
►
and I'm just caching it.
00:11:13
◼
►
comes back up, I can go and re-download all the articles.
00:11:15
◼
►
And users are mostly just missing their annotations,
00:11:19
◼
►
whether they've read it or not, stars, those types of things.
00:11:23
◼
►
I had, I think, two users who created accounts
00:11:25
◼
►
during that period that I had to go in and manually fix.
00:11:27
◼
►
But generally, it wasn't too bad.
00:11:30
◼
►
Got that spun up.
00:11:31
◼
►
Actually, as a random bonus, I ended up--
00:11:33
◼
►
it was a bit of a silver lining.
00:11:34
◼
►
I upgraded the hardware that the machine runs on,
00:11:36
◼
►
which I think I've been meaning to do for a while.
00:11:38
◼
►
But I hadn't done it because it would have required
00:11:40
◼
►
a few minutes of downtime.
00:11:41
◼
►
So I figured, hey, while it's all systems down,
00:11:46
◼
►
let's go ahead and run the upgrade.
00:11:49
◼
►
So I did, so it's running on faster hardware now,
00:11:51
◼
►
and it went pretty well.
00:11:54
◼
►
You never really want to test your backup regimen.
00:11:57
◼
►
It's the kind of thing that you,
00:11:59
◼
►
or you never want to have to go beyond testing.
00:12:01
◼
►
I've played with these before.
00:12:04
◼
►
I've done restores, my backups,
00:12:05
◼
►
and I've kind of gone through this process a little bit.
00:12:06
◼
►
But to actually have to go and rebuild the entire dataset
00:12:07
◼
►
from backup is terrifying.
00:12:10
◼
►
And I'm just glad it worked.
00:12:13
◼
►
Things are back up and running now.
00:12:14
◼
►
Things are actually faster than they were before the outage
00:12:16
◼
►
because, like I said, I upgraded all the hardware
00:12:19
◼
►
that it's running on.
00:12:20
◼
►
But I definitely don't want to do that again.
00:12:23
◼
►
And so I definitely won't be going anywhere
00:12:25
◼
►
near database migrations in Rails for quite some time.
00:12:28
◼
►
I still haven't worked out what happened.
00:12:31
◼
►
I'm sure there was just something in the way
00:12:32
◼
►
that it was configured.
00:12:34
◼
►
I'd recently done a Git merge into that repository
00:12:37
◼
►
that I think maybe have messed something up.
00:12:40
◼
►
But actually, I think I know somebody
00:12:42
◼
►
who can say this better than I did.
00:12:43
◼
►
And just replace the word "season pass"
00:12:47
◼
►
with "database" in this clip.
00:12:48
◼
►
And I think you'll have a pretty good sense
00:12:50
◼
►
of what I'll be doing for the next while.
00:12:53
◼
►
- Okay, let's never touch that website again.
00:12:55
◼
►
Does the season pass migration thing?
00:12:58
◼
►
Obviously is pure evil.
00:13:00
◼
►
We can never go to that page again.
00:13:01
◼
►
Don't even load that page,
00:13:02
◼
►
because if we accidentally load that page,
00:13:03
◼
►
for all we know, it's going to scramble
00:13:05
◼
►
our season passes again.
00:13:05
◼
►
And then I had to start the whole process all over again of rearranging everything.
00:13:09
◼
►
Not a good user experience.
00:13:13
◼
►
Exactly. So we're not going anywhere near migrations for a while.
00:13:17
◼
►
We're just going to stick with what we know. I'm going to be running some SQL.
00:13:21
◼
►
You need to add a field to a database. I will be going into Postgres. I will be running exactly
00:13:25
◼
►
the alter command that I know is safe and won't destroy things.
00:13:29
◼
►
I'll be retyping the birds drop table again in my life.
00:13:33
◼
►
hopefully things are going well.
00:13:38
◼
►
One thing that I think that was really encouraging
00:13:41
◼
►
coming out of this, and I know this is certainly part
00:13:43
◼
►
of just having a fairly geeky customer base,
00:13:45
◼
►
is I tried to be very honest about what was happening
00:13:47
◼
►
during the process and kind of what happened
00:13:49
◼
►
after the process, and being very explicit about it.
00:13:51
◼
►
It's not just kind of like, something happened.
00:13:53
◼
►
It's kind of like the dev center outage,
00:13:55
◼
►
where it's this very vague, like no one really knows
00:13:56
◼
►
what's going on.
00:13:58
◼
►
I try to be very specific.
00:13:59
◼
►
I didn't get quite as many details as I did
00:14:00
◼
►
just now on the show, but I try to be very explicit.
00:14:01
◼
►
I think a lot of people, customers, really appreciate that.
00:14:06
◼
►
Saying like, "Yep, sorry man, that happens.
00:14:08
◼
►
Hope you hope tomorrow's a better day for you."
00:14:11
◼
►
Which is just really encouraging to me.
00:14:14
◼
►
And I think it's a good example of just being honest
00:14:15
◼
►
as best you can is often just what customers want.
00:14:17
◼
►
They understand that you're human at the end of the day
00:14:21
◼
►
and you're just going to do your best.
00:14:23
◼
►
All right, that's it for today's show.
00:14:24
◼
►
As always, if you have questions,
00:14:26
◼
►
comments, concerns, or complaints,
00:14:27
◼
►
I'm on Twitter @_davidsmith,
00:14:28
◼
►
david@developingperspective.com.