00:00:00 ◼ ► Welcome to Under the Radar, a show about independent iOS app development and server problems.

00:00:05 ◼ ► I'm Marco Arment.

00:00:06 ◼ ► And I'm David Smith. Under the Radar is never longer than 30 minutes, so let's get started.

00:00:10 ◼ ► I do not like the sound of that, Marco. Server problems, that is not a topic that I enjoy talking about.

00:00:17 ◼ ► So, but hopefully we can work through this and you can unload some of your frustration because I'm all too aware of that pain.

00:00:25 ◼ ► See, right now things seem pretty stable on my servers, but that was not the case during parts of the last week or two.

00:00:33 ◼ ► The thing is, like, running servers, it's this big scary thing to a lot of developers who have never done it before.

00:00:39 ◼ ► And they, you know, a lot of apps could be made way better with a server component, even a fairly light server component.

00:00:47 ◼ ► One that doesn't take a lot of servers or a lot of maintenance.

00:00:50 ◼ ► So many apps can benefit from that. So many types of features or, you know, other implementation details almost require a server.

00:01:00 ◼ ► And if you don't have your own server, then you have to go to external services for certain things, like sending push notifications to people or, you know, processing subscription payments or whatever else.

00:01:10 ◼ ► And so if you can do a lot of those things yourself on servers, you can simplify things with, you know, billing and privacy, you can usually save money because, like, doing things like push notifications yourself costs, like, nothing.

00:01:24 ◼ ► But paying a service usually costs something. And so there's a lot of advantages to running your own servers if you're an app developer.

00:01:32 ◼ ► But what many people say is, "Oh, I don't want to deal with it," or "I don't know how to do it and I'm scared of downtime or upgrades or security issues or whatever else."

00:01:42 ◼ ► And most of the time I'm able to look at those people and say, "Hey, you know what? It's pretty easy. It's not a big deal. It's not as hard as you think. Don't worry about it. You can do it."

00:01:53 ◼ ► And usually I and we encourage developers, if your app could benefit from a server component, just try to do it, try to set it up yourself, you know, set it up on frequent sponsor Linode, that's what we do, and it's usually fine.

00:02:07 ◼ ► And there's good documentation, there's good, you know, help from search engines out there on the web and Stack Overflow and, you know, the hosts like Linode and DigitalOcean, they all have their own documentation also to help people run Linux servers.

00:02:18 ◼ ► So usually it's pretty easy, but that usually is not always.

00:02:25 ◼ ► And this is when I'm good at running servers. Like for the most part, this is a qualification I have from past jobs that now running Overcast servers, it's a smaller scale than what I have operated in the past.

00:02:46 ◼ ► And so it's kind of easy for me most of the time. But again, most of the time.

00:02:52 ◼ ► And the vast majority of days, I don't have to think about my servers at all. They run themselves, I have a lot of things monitored and scripted and set up in such a way that they mostly run themselves.

00:03:09 ◼ ► And a few times a year I have to go intervene, or there's some upgrade that I want to do that I have to do manually.

00:03:19 ◼ ► Or something's just getting really, really old and it's starting to become irresponsible to run software that old because it's no longer maintained to the level I want it to be maintained.

00:03:28 ◼ ► And so certain times you kind of have to go update stuff for some reason. Or maybe your database needs more disk space, so you have to upgrade it to the next size up thing.

00:03:43 ◼ ► And in many cases there is downtime associated with whenever you migrate a server to a bigger server. And so you have choices like, well, I can upgrade the server in place, but it might be down for an hour, and then my whole service is down for an hour.

00:03:59 ◼ ► Or I can do a more complicated kind of switchover thing that minimizes or eliminates downtime for my customers.

00:04:05 ◼ ► And so there's all sorts of things like that, that running servers entails decisions like this, having to deal with problems like this very occasionally.

00:04:13 ◼ ► And for the most part I'm pretty good at not having major server problems most of the time. However, this past week that was different.

00:04:22 ◼ ► Because I've been, over the last few weeks, I've been doing a kind of rolling server upgrade. I discovered a problem I wanted to fix that was related to like, I was using an old version of PHP and there were a couple other like old things.

00:04:38 ◼ ► And I'm like, you know, this would be better if I was using the newer things. And I had already done some servers on like a newer distro and stuff, and it was weird having a mixed environment, because I came across like weird bugs that would depend on like which server had processed something, because they would process things slightly differently, because they had different versions of things.

00:04:59 ◼ ► So I'm like, alright, let me consolidate, let me upgrade everything, I'll replace all of my old servers with new servers, I'll do it all live and with graceful failovers so most people shouldn't notice.

00:05:11 ◼ ► And I did and it went great. I did almost everything, totally flawlessly, and it was a great upgrade. Until it came time to move one of my databases. Now I run MySQL, MySQL does, it lacks in certain tooling areas that all the Postgres people always tell me all about.

00:05:35 ◼ ► But for the most part it's pretty solid. And I know MySQL very well, I don't know any of its alternatives at all, and MySQL has been so reliable for me over the years that I stick with it because, again, it's what I know, the tooling out there is pretty good for it, it's well understood, I know exactly what it can handle and what it can't.

00:05:56 ◼ ► And so that's what I usually use. It came time to move the MySQL server. And there was a deadline on this because there is one thing about Linode that occasionally gives me a snag.

00:06:12 ◼ ► And I'm saying that, you know I'm being honest because they're a sponsor and I wouldn't, they're not sponsoring this episode, but I wouldn't candy-coat it just because they're a sponsor. The only thing about Linode that ever creates work for me that is not my fault is if you run a server there for a very long time, at some point they might say, we need to move your virtual server to a different physical host because the one it's on is too old.

00:06:39 ◼ ► And we're retiring this old fleet or whatever. And my database servers are the ones I keep the longest because it's the biggest pain in the butt to move them.

00:06:47 ◼ ► So they finally, it came time that starting, I believe, tonight, or tomorrow night, they had a scheduled forced migration that my database server was going to be migrated at this time whether I want it to be or not because it had been so long and they had to retire its old hardware.

00:07:05 ◼ ► To move my database server, it's so big and the time it takes to move to a new server depends on disk size, it's so large that it was going to take like multiple hours of downtime to move it.

00:07:18 ◼ ► And I'm like, I don't really want Overcast to be forced down by multiple hours. So my solution when this problem comes up is always, okay, now is the time to migrate to a new server.

00:07:28 ◼ ► Because they give you like two weeks notice usually. So you have a while.

00:07:36 ◼ ► And it's probably fair to say you can also schedule it yourself, in that they give you a deadline but you can activate it at any point between now and the end of your window.

00:07:44 ◼ ► So if there is a convenient time to do it, you always could. And say like, okay, I'm going to do this at 2 a.m., be down for half an hour if you had a small enough disk size or whatever it is.

00:07:52 ◼ ► Like you can choose that but in your case that doesn't really work.

00:07:54 ◼ ► Right. And for almost all of my servers, they could take them down for a few hours whenever they want.

00:07:58 ◼ ► Because most of my servers have some kind of redundancy. And so like I have load balancers up front, obviously that's kind of a single point of failure.

00:08:05 ◼ ► But the load balancers spread the load between, right now I have eight different web servers.

00:08:10 ◼ ► And so that's, if one of the web servers goes down forcibly for some reason, nobody would even notice.

00:08:17 ◼ ► I would know because our sponsor, Pingdom, would tell me about it but nobody else would notice because it would just spread the request among the other ones and maybe one request might drop and that's it.
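
As an illustration of the redundancy described here, the following is a minimal sketch of a round-robin balancer that skips a backend marked as down. The class name `RoundRobinBalancer`, the `web1`..`web8` names, and the manual `mark_down` call are all assumptions for illustration; the transcript does not say which load balancer software or health-check mechanism is actually used.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        # In a real setup a health check would call this; here it is manual.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical pool of eight web servers, one of which has gone down.
balancer = RoundRobinBalancer([f"web{i}" for i in range(1, 9)])
balancer.mark_down("web3")
picks = [balancer.next_backend() for _ in range(14)]
```

With one of eight backends down, the remaining seven simply absorb its share of requests, which is why an outage like that tends to go unnoticed by users.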

00:08:27 ◼ ► Sure.

00:08:28 ◼ ► And databases, I use MySQL replication and so I have one, each database cluster has one primary and one to two, or right now one to three replicas that just replicate whatever the primary does.

00:08:41 ◼ ► And then, you know, they, and the replicas can serve read queries that don't need to necessarily be guaranteed to be up to date.

00:08:48 ◼ ► So you know, lots of like kind of large bulk tasks, like if you're counting number of subscribers to a podcast periodically, like if it's off by one it doesn't matter.

00:08:58 ◼ ► Right. So you can do stuff like read those off the replicas and save a lot of that load off the primary.

00:09:03 ◼ ► If a replica goes down, it doesn't matter. It doesn't, like the app will automatically connect to other replicas or the primary if it needs to.
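
The read path described above could be sketched roughly as follows: reads that tolerate slightly stale data try the replicas first and fall back to the primary, while writes always go to the primary. This is only a sketch under stated assumptions; `ReplicaRouter` and the `execute(host, query)` callable are hypothetical names, not part of any MySQL client library, and the real application code is not shown in the episode.

```python
import random


class ReplicaRouter:
    """Route stale-tolerant reads to replicas, falling back to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def run_read(self, query, execute):
        """Try replicas in random order, then the primary as a last resort.

        `execute(host, query)` is a hypothetical callable that raises
        ConnectionError when the host cannot be reached.
        """
        candidates = random.sample(self.replicas, len(self.replicas))
        for host in candidates + [self.primary]:
            try:
                return execute(host, query)
            except ConnectionError:
                continue  # this host is down, try the next one
        raise ConnectionError("no database host reachable")

    def run_write(self, query, execute):
        # Writes must go to the primary so the replicas can replay them.
        return execute(self.primary, query)
```

Shuffling the replica order on each read spreads bulk queries across replicas, and a replica outage only costs one failed connection attempt before the next host is tried.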

00:09:12 ◼ ► And so if a replica goes down, no big deal. So there's only a few servers in my setup where if they go down I need to care and primary databases are right up there.

00:09:22 ◼ ► They are the most important ones besides maybe the entry load balancers.

00:09:26 ◼ ► So I had to do something. I had to do a migration and as part of this I thought, well let me upgrade to the latest version of everything.

00:09:35 ◼ ► Because I do these upgrades so infrequently that I do use a semi-conservative Linux distro.

00:09:42 ◼ ► I use Ubuntu LTS for long term support. LTS releases I believe are guaranteed to have ten years of software upgrades after them.

00:09:49 ◼ ► And I was using LTS 18 on my most recent ones which came out in 2018. LTS 20 which came out in 2020 is now available.

00:09:59 ◼ ► So I thought, well let me set up all the new servers with that.

00:10:01 ◼ ► If you set that up you get MySQL 8 which is brand new, well not brand new but newish.

00:10:07 ◼ ► New to you.

00:10:09 ◼ ► Yeah, new to me for sure. I upgrade databases so infrequently that even a one or two year old release to me is new.

00:10:17 ◼ ► And I thought great, I'm on my way, this will be great. I've never had a problem with a MySQL version upgrade before.

00:10:24 ◼ ► Like it's never been worse than the previous version or broken in any way I noticed.

00:10:29 ◼ ► The old database server could handle this load. It was a high load.

00:10:34 ◼ ► Like it would be under load average 16 serving like 12,000 requests a second.

00:10:40 ◼ ► It was a heavy load for a database but the old server handled it.

00:10:44 ◼ ► And the new server should theoretically be faster because it had slightly higher specs.

00:10:50 ◼ ► That's a good time to upgrade those as well.

00:10:52 ◼ ► Slightly higher specs and it was on Linode's newer infrastructure and it had all these newer like,

00:10:57 ◼ ► LTS 20 is supposed to be faster than LTS 18 because of Linux kernel changes and all this other stuff.

00:11:02 ◼ ► So it's supposed to be faster. This my friends is when our sponsor Pingdom comes into play.

00:11:12 ◼ ► Because Pingdom alerted me at like 6:30 in the morning that morning. Something's wrong. Things are down.

00:11:20 ◼ ► Your day just got a lot worse.

00:11:22 ◼ ► Yeah, so we are brought to you this week by Pingdom. Do you have a website?

00:11:24 ◼ ► Does it have things like a shopping cart or registration forms or contact pages?

00:11:28 ◼ ► You need Pingdom. Nobody wants their website or its critical transactions to fail.

00:11:34 ◼ ► That means a bad experience for users, could mean lost business.

00:11:37 ◼ ► The good news is you can set up not only regular monitoring with Pingdom to say, "Is this page up? Tell me."

00:11:43 ◼ ► You can also have transaction monitoring.

00:11:45 ◼ ► This will alert you when things like cart checkout or forms or login pages fail before they affect your customers or your business.

00:11:52 ◼ ► Pingdom will let you know the moment any of them fail in whatever way is best for you.

00:11:56 ◼ ► You can customize how you're alerted, who is alerted depending on outage severity or conditions.

00:12:02 ◼ ► Pingdom really cares about your users having the smoothest site experience possible.

00:12:07 ◼ ► If disaster strikes, you will be the first one to know.

00:12:10 ◼ ► It is super easy to get started.

00:12:12 ◼ ► Go to Pingdom.com/RelayFM right now for a 30-day free trial with no credit card required.

00:12:19 ◼ ► When you sign up, use code RADAR at checkout to get a huge 30% off your first invoice.

00:12:25 ◼ ► Thank you to Pingdom from SolarWinds for their support of this show and RelayFM.

00:12:29 ◼ ► So, I get all these alerts from Pingdom in the way that I want it to be alerted, exactly depending on the severity of the outage, which was severe.

00:12:38 ◼ ► I get all these alerts at six in the morning.

00:12:40 ◼ ► At least it's six in the morning. I will say that, having done a lot of server stuff.

00:12:44 ◼ ► I will take the 6 a.m. alert rather than the "I'm about to go to sleep" alert and you're just like, "Oh, great. I'm not sleeping anymore."

00:12:51 ◼ ► That's fair. Yeah, I had a lot of those in the Tumblr days.

00:12:54 ◼ ► But in this case, it was the old databases that I was moving off of but were still active had run out of disk space.

00:13:00 ◼ ► Because part of MySQL replication is the source main server writes a log of everything it changes, called the binary log or bin log.

00:13:10 ◼ ► And then the replica servers read the bin logs and apply all those changes to them.

00:13:16 ◼ ► That way, as long as you start them both with a consistent data state, whatever changes in the primary will change in the replica.

00:13:21 ◼ ► And the replica sometimes falls behind a little bit, so the primary has to keep a certain amount of logs around,

00:13:27 ◼ ► like back a certain number of hours or days or whatever.

00:13:30 ◼ ► You can say, "Alright, retain 12 hours or 7 days of logs."

00:13:34 ◼ ► And that way, the replica can fall behind by up to that far and still have the data there to catch up if it can.

00:13:40 ◼ ► Well, in a server migration, I extend the amount of time I keep logs because it takes a while,

00:13:47 ◼ ► it takes a couple hours to copy all the data from one server to a new one that you're setting up

00:13:53 ◼ ► before it's in that replication state where it's just copying the changes.

00:13:56 ◼ ► Like the initial data set takes a few hours to copy over, so I needed a few hours of retention.

00:14:01 ◼ ► The old version of MySQL that the old servers were running didn't support specifying this value in hours.

00:14:07 ◼ ► It specified it in days.

00:14:09 ◼ ► And this database, not all my databases, but this one, which stores all the episodes and all the changes

00:14:14 ◼ ► anybody makes in their RSS feeds that apply to all of their episodes, this one, it has a massive write load

00:14:21 ◼ ► because it's just every RSS feed change that exists, right?

00:14:24 ◼ ► So it's a huge number of writes.

00:14:26 ◼ ► And the binary logs, I had temporarily turned off the thing that automatically prunes them faster than MySQL would,

00:14:34 ◼ ► some shell script I wrote, because I needed more time than that to do the initial copy to the new servers.

00:14:41 ◼ ► And after the copy, I forgot to re-enable it.

00:14:44 ◼ ► So it went back to its default of one day retention, which is the smallest value that that version of MySQL would allow me to set.

00:14:51 ◼ ► One day binary logs in this table is too much, it turns out.
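
To make the sizing problem concrete, here is a small back-of-the-envelope sketch: retained binary log volume is roughly the write rate multiplied by the retention window. The write rate and free-space figures below are made-up assumptions chosen only to show the arithmetic; the episode gives no actual numbers.

```python
HOUR = 3600
DAY = 24 * HOUR
MIB = 1024 ** 2
GIB = 1024 ** 3


def binlog_bytes_retained(write_bytes_per_sec, retention_seconds):
    """Approximate disk space held by binary logs for a retention window."""
    return write_bytes_per_sec * retention_seconds


def retention_fits(write_bytes_per_sec, retention_seconds, free_bytes, headroom=0.2):
    """True if the retained logs still leave `headroom` of the free space unused."""
    needed = binlog_bytes_retained(write_bytes_per_sec, retention_seconds)
    return needed <= free_bytes * (1 - headroom)


# Hypothetical numbers: 5 MiB/s of binary log writes, 200 GiB free.
rate = 5 * MIB
free = 200 * GIB

one_day_gib = binlog_bytes_retained(rate, DAY) / GIB   # 421.875 GiB
fits_one_day = retention_fits(rate, DAY, free)         # False
fits_six_hours = retention_fits(rate, 6 * HOUR, free)  # True
```

Under these assumed numbers a one-day minimum retention would need more than twice the free space, while a six-hour window fits comfortably, which is the kind of gap a separate pruning script has to cover when the server can only be configured in whole days.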

00:14:55 ◼ ► So it's 6 a.m., disk filled up.

00:14:58 ◼ ► And that's one thing you really don't ever want on a Linux server, or a Mac for that matter.

00:15:03 ◼ ► When the disk is full, nothing good happens.

00:15:06 ◼ ► You have a really bad state on your hands. Lots of things break in really weird ways.

00:15:12 ◼ ► Things that aren't written to handle that can corrupt themselves, like corrupt databases and stuff.

00:15:16 ◼ ► Fortunately MySQL usually doesn't corrupt itself, but you're going to have to do some stuff and reboot, probably.

00:15:21 ◼ ► You're going to have to clear some space, that instance of the server is gone.

00:15:25 ◼ ► If you have replication set up, you're going to have to probably realign the replicas with the primary

00:15:31 ◼ ► because something will have happened, like the log will be corrupt or it will do a half write and then stop.

00:15:38 ◼ ► And then you'll have to realign it and say, "Alright, go to this log, this position, and all this stuff."

00:15:42 ◼ ► And so much of this stuff, by the way, is better with the new versions of MySQL, but we're not there yet.

00:15:46 ◼ ► So, old server was down hard because it filled up its disk.
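
One way to catch this failure mode before the disk is completely full is a periodic disk usage check. The sketch below uses Python's standard `shutil.disk_usage`; the 85% threshold and the idea of running it from a scheduled job are assumptions, and the episode does not describe what disk monitoring was actually in place.

```python
import shutil


def disk_used_fraction(path="/"):
    """Return the fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_alert(used_fraction, threshold=0.85):
    """True once usage reaches the (assumed) alert threshold."""
    return used_fraction >= threshold


if __name__ == "__main__":
    fraction = disk_used_fraction("/")
    if should_alert(fraction):
        print(f"WARNING: disk is {fraction:.0%} full")
```

Alerting well below 100% leaves time to prune logs or resize the volume while the database is still healthy.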

00:15:51 ◼ ► And I thought, "Okay, best way to fix this, I already have the new server ready to go."

00:15:56 ◼ ► The only reason I hadn't switched over yet was because I was waiting until the weekend when my traffic is low.

00:16:02 ◼ ► That way if anything went wrong, it wouldn't be a huge deal.

00:16:04 ◼ ► So I thought, "Alright, fine, I'll switch over now."

00:16:07 ◼ ► It was peak time, it's not a great time to switch over, but it's peak time, but my servers are down.

00:16:13 ◼ ► So, I can switch over to the new cluster way faster than I can fix the disk space issue and reboot and get everything realigned on the old cluster.

00:16:22 ◼ ► So, I'll just switch over now. And I did.

00:16:26 ◼ ► And that's when the new server fell over completely, because I don't know why.

00:16:33 ◼ ► And here, the symptoms were lots of connections were failing.

00:16:38 ◼ ► And it seemed like every fourth or fifth connection to the database would just time out.

00:16:42 ◼ ► But the rest would be served quickly, and it wasn't under heavy load.

00:16:46 ◼ ► The server resources were not being taxed heavily. It has tons of processors, tons of RAM, SSDs, the whole way through.

00:16:54 ◼ ► And so, it's a very fast server. I was hitting some kind of bottleneck somewhere that was causing connections to, sometimes, time out and drop, but other times be served really fast.

00:17:04 ◼ ► And this is where the dark side of server administration comes in.

00:17:08 ◼ ► Server administration, when you're an indie, is very lonely.

00:17:12 ◼ ► You kind of feel like you're stranded on an island, you're trying to fix your plane yourself, and you're not an airline pilot or mechanic.

00:17:22 ◼ ► So, you're kind of stuck.

00:17:24 ◼ ► And the only resources I had at my disposal were like, "Well, I can file a support ticket with the host."

00:17:32 ◼ ► And, you know, Linode has good support, but I also know that that's going to take probably an hour or two to resolve, and I want it fixed now.

00:17:38 ◼ ► Also, you can do things like search Google for answers of like, "What happens if this connection drops?"

00:17:44 ◼ ► And that's a mixed bag. You get a lot of answers that are not for the problem you have.

00:17:49 ◼ ► You get a lot of answers that are 12 years old and don't even apply anymore.

00:17:52 ◼ ► You get a lot of answers that are just bot-created, scraped web pages that are content farms that don't actually say anything.

00:17:58 ◼ ► And you get a lot of answers from actual good people who are trying their best but are wrong.

00:18:03 ◼ ► So, it's kind of a mess, and so you kind of have to figure stuff out on your own.

00:18:08 ◼ ► I even asked Twitter, which, you know, if I asked Twitter about a server problem, you know it's really bad.

00:18:13 ◼ ► Because I don't like to do that. I don't like going there for that, for lots of reasons.

00:18:20 ◼ ► But you know I'm kind of desperate at that point if I ever have to ask, "Okay, I'm at my wits end here. I can't figure this out."

00:18:26 ◼ ► And I did get a lot of good things to check, but even then, I still couldn't figure it out.

00:18:31 ◼ ► Eventually, Linode got back to me. They had a whole staff looking at it.

00:18:34 ◼ ► They had all these admins looking at it for days afterwards trying to figure out what the heck had happened.

00:18:38 ◼ ► I could not figure it out. And the only way I could get things to resume back to normal was after a few hours,

00:18:45 ◼ ► I think it was about three hours of the service being mostly down,

00:18:49 ◼ ► I eventually found, like, well, if I just change a memcache setting over here to be way more aggressive,

00:18:55 ◼ ► to cache reads to this table way more than they were before, then I dropped the query volume by a lot,

00:19:01 ◼ ► and then it can handle it just fine. So that's what I did.
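
The workaround described here is a form of read-through (cache-aside) caching: check the cache first, and only query the database on a miss. The following is a minimal in-process sketch of that pattern with a time-to-live. The `TTLCache` class and `load_from_db` callable are hypothetical stand-ins; the real setup uses memcache and application code that is not shown in the episode.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired, treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


def cached_read(cache, key, load_from_db):
    """Return the cached value, querying the database only on a miss.

    Note: a value of None is treated as a miss and is never cached.
    """
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value
```

A longer TTL removes more queries from the database but lets readers see older data, which is the trade-off behind the worry about introducing bugs with more aggressive caching.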

00:19:04 ◼ ► And I don't love this solution, in part because the old server handled this load just fine,

00:19:13 ◼ ► but the new server's different. It has a newer version of Linux, it has a newer version of MySQL,

00:19:18 ◼ ► it has different, probably a newer version of all sorts of stuff that might matter.

00:19:21 ◼ ► Things like the UFW firewall, I'm sure there's different connection limits somewhere.

00:19:28 ◼ ► I did verify with Linode that there's no network throughput limits or DDoS protection that might get in the way,

00:19:35 ◼ ► but I can't figure it out. And so my solution was work around it with aggressive caching

00:19:41 ◼ ► and hope that I don't introduce bugs by doing that.

00:19:45 ◼ ► But this is so often, like, the thing with running servers is sometimes you get great solutions to things,

00:19:54 ◼ ► and things just work just fine, and with all the other upgrades I did during this cycle,

00:19:58 ◼ ► like, I upgraded way ahead of my PHP version, I upgraded Nginx to a new version, PHP-FPM,

00:20:05 ◼ ► the other things about MySQL that have moved to MySQL 8, there's been no other changes

00:20:11 ◼ ► besides this weird connection dropping thing that caused me problems,

00:20:15 ◼ ► like, no changes to what's supported, what's deprecated, nothing like that.

00:20:20 ◼ ► I wasn't using any of the problem areas of any of these things, so it was all fine,

00:20:24 ◼ ► except this one big issue, and it's like, well, I can't really fix this.

00:20:28 ◼ ► I've exhausted any possible things I could check within my knowledge.

00:20:33 ◼ ► Google is, I've exhausted all of Google. I've seen every page in Google.

00:20:38 ◼ ► All of them. All, like, you know, a billion pages or whatever it is.

00:20:42 ◼ ► I've seen them all. It's not there. The answer's not there.

00:20:45 ◼ ► And this just happens sometimes with servers, but, you know, my solution is, like,

00:20:49 ◼ ► I kind of fumble through, you know, I figure it out with a combination of my knowledge,

00:20:54 ◼ ► whatever people tell me on Twitter, and whatever I can find on Google,

00:20:57 ◼ ► and whatever Linode can find on their end, and I just figure it out.

00:21:00 ◼ ► And it's not a satisfying answer, but it is pragmatic in the sense, like,

00:21:05 ◼ ► well, I changed two lines of code to enable another level of caching,

00:21:09 ◼ ► and it seems to work just fine. I don't think I introduced any more bugs,

00:21:14 ◼ ► because I'm using a caching layer of my stack that I use for lots of things.

00:21:18 ◼ ► It just wasn't enabled for this table. So I think it's fixed, asterisk,

00:21:23 ◼ ► but I don't know, and I've lost a ton of sleep over it, and I'm so tired.

00:21:27 ◼ ► And every time I do this, the question comes up, like, why am I running servers?

00:21:31 ◼ ► Why am I doing this? Should I get out of this business?

00:21:35 ◼ ► And the answer is, if I was writing something brand new today,

00:21:38 ◼ ► I probably wouldn't do as much server integration as I have,

00:21:42 ◼ ► but, you know, I wrote all this stuff, like, in 2013 is when I wrote all this backend stuff.

00:21:46 ◼ ► Back then, things were different, you know. CloudKit didn't exist yet.

00:21:50 ◼ ► A lot of these managed services didn't exist or were prohibitively expensive.

00:21:54 ◼ ► Many of them still are for my kind of volume.

00:21:57 ◼ ► So I made the best decision I could at the time. Now, I think I would still run some servers,

00:22:04 ◼ ► because, again, there are so many advantages to running servers for your app.

00:22:08 ◼ ► There's so many ways you can enable cool features or cut costs that you would otherwise have to pay to,

00:22:13 ◼ ► like, third-party services or whatever, but I think I would significantly cut back the amount of data

00:22:19 ◼ ► that I have to store, because that seems to be where the problem mainly is.

00:22:22 ◼ ► Like, if I was running almost entirely application servers that were just, you know,

00:22:26 ◼ ► take requests, compute some stuff, and spit some stuff out, like, that's easy.

00:22:30 ◼ ► Those are super easy to scale. You don't have to deal with that many problems.

00:22:34 ◼ ► It's when you have to store a whole bunch of data, that's when it gets hard,

00:22:38 ◼ ► and you have to deal with weird stuff like this.

00:22:40 ◼ ► And so I think I would modify things that way, but otherwise, you just, you gotta have days like this occasionally.

00:22:47 ◼ ► It's kind of part of the game, and it's, you know, you gotta deal with it,

00:22:51 ◼ ► but the result is you get a really cool app and really cool service that works most of the time without any intervention.

00:22:57 ◼ ► So, I don't know, it's a mixed bag, but I think I'm still coming out ahead.

00:23:00 ◼ ► Well, and it's ultimately, I think, just a question of trade-offs, right?

00:23:04 ◼ ► Like, it's the thing of, oh, well, if you had been using a managed database server,

00:23:09 ◼ ► this particular problem wouldn't have happened.

00:23:11 ◼ ► But there are many other problems that you could have had,

00:23:13 ◼ ► and there are also situations I've been in where it's like,

00:23:16 ◼ ► there are problems that you can have in a managed service that there is no solution for,

00:23:20 ◼ ► that you hit some limit or something bad happens, and you have no recourse in a way that is,

00:23:26 ◼ ► you know, one of the, like, the blessing and the curse of managing it yourself is that you have the ability to go in there

00:23:31 ◼ ► and be changing stuff and fixing things and, you know, adapting things to what you're doing.

00:23:36 ◼ ► And so it's always a trade-off. Like, if you were using a managed database,

00:23:39 ◼ ► most of the time it should probably be good, but things can happen that are problematic,

00:23:43 ◼ ► and then you can be in the weird place where, like, Overcast is just down for a day

00:23:47 ◼ ► because something bad happened at your host, and you're like, okay, well, I can't do anything about that.

00:23:53 ◼ ► And, you know, it's, and so I think there's certainly an element of just, like, there is no right solution here.

00:23:59 ◼ ► It is all a question of trade-offs and finding the trade-off that is best for you and your expertise and your background.

00:24:05 ◼ ► And, you know, it's like, I think, what also I think is very encouraging with this kind of stuff is,

00:24:10 ◼ ► as there are so many different levels of the stack that you can engage in as you're developing something,

00:24:16 ◼ ► is to find the one that feels best for you. And I think for you, it's like, it makes a lot of sense that you have, you know,

00:24:21 ◼ ► managed a lot of servers yourself, but at the same time, it's like you're feeling like, for right now, for you and your needs and so on,

00:24:27 ◼ ► you might be shifting one layer up slightly on the stack, and maybe you're using a managed database

00:24:32 ◼ ► rather than managing the database yourself, but you're interacting with it with application servers that you do manage yourself.

00:24:38 ◼ ► Like, I like that there's that flexibility, but it's just rough when these things happen.

00:24:43 ◼ ► And now you're in this funny place of, like, I mean, I've been there, where it's like, you get it working, and it's like,

00:24:48 ◼ ► you don't want to touch anything, because like, in some ways, what you should probably be doing is like,

00:24:53 ◼ ► you should set up another server and see if you're, you know, essentially starting again with all the learnings

00:24:58 ◼ ► that you've done to this point and do a migration, set up a replica, and then see if that would handle things better.

00:25:03 ◼ ► Like, because sometimes I've had the weird situation where like, it's just something went funny when you were setting up the server.

00:25:08 ◼ ► And so like, setting it up again, which shouldn't matter, magically makes it work.

00:25:13 ◼ ► But like, I don't know if you want to go down that road when things are working.

00:25:18 ◼ ► So it's a very painful thing, and I feel for you because I've been there, but at least it sounds like you are in a place where it's not on fire,

00:25:23 ◼ ► and that is a much better place to be in, and it's painful when you're on the way there.

00:25:28 ◼ ► And I feel like the worst thing in server administration is this feeling where there are certain actions that you take

00:25:33 ◼ ► that you can't undo.

00:25:38 ◼ ► Because as soon as you've moved, like the primary database is like the worst one,

00:25:43 ◼ ► or like you shifted from one database to the other, it's not like, "Oh, I can just go back to the old database."

00:25:48 ◼ ► It's like, no, you can't, because all the data, all the writes that have happened in the meantime are just gone if you did that.

00:25:53 ◼ ► And so like, you're done. You're just making this like, irrevocable movement.

00:25:58 ◼ ► And it's like, you hope it works, and you can plan, and you can do all the things that you want to do to make it hopefully work.

00:26:03 ◼ ► And like, a little pro tip, whenever you're doing this kind of stuff,

00:26:08 ◼ ► I find paper checklists to be like the way to do it, where you're making some of these like irrevocable movements.

00:26:13 ◼ ► I always like write down the list of, "These are the 10 things I need to do in order," and then I'll like write it down and mark it off,

00:26:18 ◼ ► because you miss a step and you will explode everything.

00:26:23 ◼ ► But man, I feel for you, and I hope you can sleep again for a little bit

00:26:28 ◼ ► and find a more stable solution down the road for all this.

00:26:33 ◼ ► Yeah, I mean the good thing is like, it is mostly done. Like, I'm certainly out of the woods, but it's not done.

00:26:38 ◼ ► Like, there's still like various maintenance jobs that broke in the process.

00:26:43 ◼ ► I need to update a few things here and there, but like for the most part, it's mostly done.

00:26:48 ◼ ► And now I'm going to be at the point soon where like I just won't have to touch this for a long time.

00:26:53 ◼ ► And that's fantastic. And that's most of the time is like that.

00:26:58 ◼ ► But yeah, sometimes you hit these bad days and you think, "Why am I doing this? Why?"

00:27:03 ◼ ► Is there some way I can write my app to just put all this stuff in CloudKit now or something, right?

00:27:08 ◼ ► And again, if I was starting fresh, I would revisit those kind of decisions.

00:27:13 ◼ ► But as you said, there's problems with anything you pick. Like, what if CloudKit's down?

00:27:18 ◼ ► What if it introduces a bug? AWS goes down all the time and takes down tons of websites.

00:27:23 ◼ ► There's so many issues. Or if you hit some performance bottleneck on one of these things,

00:27:28 ◼ ► a lot of times you just can't do anything about it. You're just stuck and you just got to write around it.

00:27:33 ◼ ► So there's lots of tradeoffs here, but I'm about to reach the plateau of peacefulness.

00:27:38 ◼ ► And I'm hoping to stay there for a long time.

00:27:43 ◼ ► I hope you stay there too. Thanks for listening, everybody. And we'll talk to you in two weeks. Bye.

00:27:48 ◼ ► Bye.

Under the Radar

213: A Server Disaster