00:00:00 ◼ ► Welcome to Under the Radar, a show about independent iOS app development and server problems.
00:00:06 ◼ ► And I'm David Smith. Under the Radar is never longer than 30 minutes, so let's get started.
00:00:10 ◼ ► I do not like the sound of that, Marco. Server problems, that is not the way, that is not a topic that I enjoy talking about.
00:00:17 ◼ ► So, but hopefully we can work through this and you can unload some of your frustration because I'm all too aware of that pain.
00:00:25 ◼ ► See, right now things seem pretty stable on my servers, but that was not the case during parts of the last week or two.
00:00:33 ◼ ► The thing is, like, running servers, it's this big scary thing to a lot of developers who have never done it before.
00:00:39 ◼ ► And they, you know, a lot of apps could be made way better with a server component, even a fairly light server component.
00:00:50 ◼ ► So many apps can benefit from that. So many types of features or, you know, other implementation details almost require a server.
00:01:00 ◼ ► And if you don't have your own server, then you have to go to external services for certain things, like sending push notifications to people or, you know, processing subscription payments or whatever else.
00:01:10 ◼ ► And so if you can do a lot of those things yourself on servers, you can simplify things with, you know, billing and privacy, you can usually save money because, like, doing things like push notifications yourself costs, like, nothing.
00:01:24 ◼ ► But paying a service usually costs something. And so there's a lot of advantages to running your own servers if you're an app developer.
00:01:32 ◼ ► But what many people say is, "Oh, I don't want to deal with it," or "I don't know how to do it and I'm scared of downtime or upgrades or security issues or whatever else."
00:01:42 ◼ ► And most of the time I'm able to look at those people and say, "Hey, you know what? It's pretty easy. It's not a big deal. It's not as hard as you think. Don't worry about it. You can do it."
00:01:53 ◼ ► And usually I and we encourage developers, if your app could benefit from a server component, just try to do it, try to set it up yourself, you know, set it up on frequent sponsor Linode, that's what we do, and it's usually fine.
00:02:07 ◼ ► And there's good documentation, there's good, you know, help from search engines out there on the web and Stack Overflow and, you know, the hosts like Linode and DigitalOcean, they all have their own documentation also to help people run Linux servers.
00:02:25 ◼ ► And this is one thing I'm good at: running servers. For the most part, this is a qualification I have from past jobs, and running Overcast's servers now is a smaller scale than what I have operated in the past.
00:02:52 ◼ ► And the vast majority of days, I don't have to think about my servers at all. They run themselves, I have a lot of things monitored and scripted and set up in such a way that they mostly run themselves.
00:03:09 ◼ ► And a few times a year I have to go intervene, or there's some upgrade that I want to do that I have to do manually.
00:03:19 ◼ ► Or something's just getting really, really old and it's starting to become irresponsible to run software that old because it's no longer maintained to the level I want it to be maintained.
00:03:28 ◼ ► And so certain times you kind of have to go update stuff for some reason. Or maybe your database needs more disk space, so you have to upgrade it to the next size up thing.
00:03:43 ◼ ► And in many cases there is downtime associated with whenever you migrate a server to a bigger server. And so you have choices like, well, I can upgrade the server in place, but it might be down for an hour, and then my whole service is down for an hour.
00:03:59 ◼ ► Or I can do a more complicated kind of switchover thing that minimizes or eliminates downtime for my customers.
00:04:05 ◼ ► And so there's all sorts of things like that, that running servers entails decisions like this, having to deal with problems like this very occasionally.
00:04:13 ◼ ► And for the most part I'm pretty good at not having major server problems most of the time. However, this past week that was different.
00:04:22 ◼ ► Because I've been, over the last few weeks, I've been doing a kind of rolling server upgrade. I discovered a problem I wanted to fix that was related to like, I was using an old version of PHP and there were a couple other like old things.
00:04:38 ◼ ► And I'm like, you know, this would be better if I was using the newer things. And I had already done some servers on like a newer distro and stuff, and it was weird having a mixed environment, because I came across like weird bugs that would depend on like which server had processed something, because they would process things slightly differently, because they had different versions of things.
00:04:59 ◼ ► So I'm like, alright, let me consolidate, let me upgrade everything, I'll replace all of my old servers with new servers, I'll do it all live and with graceful failovers so most people shouldn't notice.
00:05:11 ◼ ► And I did and it went great. I did almost everything, totally flawlessly, and it was a great upgrade. Until it came time to move one of my databases. Now I run MySQL, MySQL does, it lacks in certain tooling areas that all the Postgres people always tell me all about.
00:05:35 ◼ ► But for the most part it's pretty solid. And I know MySQL very well, I don't know any of its alternatives at all, and MySQL has been so reliable for me over the years that I stick with it because, again, it's what I know, the tooling out there is pretty good for it, it's well understood, I know exactly what it can handle and what it can't.
00:05:56 ◼ ► And so that's what I usually use. It came time to move the MySQL server. And there was a deadline on this because there is one thing about Linode that occasionally gives me a snag.
00:06:12 ◼ ► And I'm saying that, you know I'm being honest because they're a sponsor and I wouldn't, they're not sponsoring this episode, but I wouldn't candy-coat it just because they're a sponsor. The only thing about Linode that ever creates work for me that is not my fault is if you run a server there for a very long time, at some point they might say, we need to move your virtual server to a different physical host because the one it's on is too old.
00:06:39 ◼ ► And we're retiring this old fleet or whatever. And my database servers are the ones I keep the longest because it's the biggest pain in the butt to move them.
00:06:47 ◼ ► So they finally, it came time that starting, I believe, tonight, or tomorrow night, they had a scheduled forced migration that my database server was going to be migrated at this time whether I want it to be or not because it had been so long and they had to retire its old hardware.
00:07:05 ◼ ► To move my database server, it's so big and the time it takes to move to a new server depends on disk size, it's so large that it was going to take like multiple hours of downtime to move it.
00:07:18 ◼ ► And I'm like, I don't really want Overcast to be forced down by multiple hours. So my solution when this problem comes up is always, okay, now is the time to migrate to a new server.
00:07:36 ◼ ► And it's probably fair to say you can also schedule it yourself, in that they give you a deadline but you can activate it at any point between now and the end of your window.
00:07:44 ◼ ► So if there is a convenient time to do it, you always could. And say like, okay, I'm going to do this at 2 a.m., be down for half an hour if you had a small enough disk size or whatever it is.
00:07:54 ◼ ► Right. And for almost all of my servers, they could take them down for a few hours whenever they want.
00:07:58 ◼ ► Because most of my servers have some kind of redundancy. And so like I have load balancers up front, obviously that's kind of a single point of failure.
00:08:05 ◼ ► But the load balancers spread the load between, right now I have eight different web servers.
00:08:10 ◼ ► And so that's, if one of the web servers goes down forcibly for some reason, nobody would even notice.
00:08:17 ◼ ► I would know because our sponsor, Pingdom, would tell me about it but nobody else would notice because it would just spread the request among the other ones and maybe one request might drop and that's it.
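The failover behavior described here, where requests spread across the remaining web servers when one goes down, can be sketched roughly as follows. This is a toy round-robin balancer, not the actual load balancer software used for Overcast (the episode does not name it). The class, its methods, and the `web1` to `web8` names are hypothetical and only mirror the eight web servers mentioned.

```python
from itertools import count


class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._counter = count()

    def mark_down(self, backend):
        # A real balancer would learn this from periodic health checks.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none are left."""
        for _ in range(len(self.backends)):
            index = next(self._counter) % len(self.backends)
            candidate = self.backends[index]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


# Hypothetical pool of eight web servers, one of which has gone down.
balancer = RoundRobinBalancer([f"web{i}" for i in range(1, 9)])
balancer.mark_down("web3")
served = [balancer.pick() for _ in range(14)]
```

With `web3` marked down, the fourteen requests in `served` all land on the seven remaining servers, which is why a single web server outage is invisible to users.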
00:08:28 ◼ ► And databases, I use MySQL replication and so I have one, each database cluster has one primary and one to two, or right now one to three replicas that just replicate whatever the primary does.
00:08:41 ◼ ► And then, you know, they, and the replicas can serve read queries that don't need to necessarily be guaranteed to be up to date.
00:08:48 ◼ ► So you know, lots of like kind of large bulk tasks, like if you're counting number of subscribers to a podcast periodically, like if it's off by one it doesn't matter.
00:08:58 ◼ ► Right. So you can do stuff like read those off the replicas and save a lot of that load off the primary.
00:09:03 ◼ ► If a replica goes down, it doesn't matter. It doesn't, like the app will automatically connect to other replicas or the primary if it needs to.
00:09:12 ◼ ► And so if a replica goes down, no big deal. So there's only a few servers in my setup where if they go down I need to care and primary databases are right up there.
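As a rough illustration of the read routing described above, the sketch below sends reads that tolerate slightly stale data to a replica first and falls back to other replicas and then the primary if a connection fails. It is a simplified model and not Overcast's actual code: `FakeServer`, `Cluster`, the `allow_stale` flag, and the `subscriptions` table name are all hypothetical stand-ins for real database connections and schema.

```python
import random


class ConnectionFailed(Exception):
    pass


class FakeServer:
    """Stand-in for a database connection; `up` simulates availability."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def query(self, sql):
        if not self.up:
            raise ConnectionFailed(self.name)
        return f"{self.name}: {sql}"


class Cluster:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def write(self, sql):
        # Writes always go to the primary; replicas copy them afterwards.
        return self.primary.query(sql)

    def read(self, sql, allow_stale=False):
        # Stale-tolerant reads try the replicas in random order first,
        # with the primary as the last resort.
        candidates = []
        if allow_stale:
            candidates = random.sample(self.replicas, len(self.replicas))
        candidates.append(self.primary)
        last_error = None
        for server in candidates:
            try:
                return server.query(sql)
            except ConnectionFailed as error:
                last_error = error
        raise last_error


cluster = Cluster(
    FakeServer("primary"),
    [FakeServer("replica1", up=False), FakeServer("replica2")],
)
stale_ok = cluster.read("SELECT COUNT(*) FROM subscriptions", allow_stale=True)
must_be_fresh = cluster.read("SELECT COUNT(*) FROM subscriptions")
```

Here `replica1` is down, so the stale-tolerant read is served by `replica2`, while the read that must be current goes straight to the primary. Losing the primary is the case this pattern cannot hide, which is why it is the server that needs care.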
00:09:26 ◼ ► So I had to do something. I had to do a migration and as part of this I thought, well let me upgrade to the latest version of everything.
00:09:35 ◼ ► Because I do these upgrades so infrequently that I do use a semi-conservative Linux distro.
00:09:42 ◼ ► I use Ubuntu LTS for long term support. LTS releases I believe are guaranteed to have ten years of software upgrades after them.
00:09:49 ◼ ► And I was using LTS 18 on my most recent ones which came out in 2018. LTS 20 which came out in 2020 is now available.
00:10:09 ◼ ► Yeah, new to me for sure. I upgrade databases so infrequently that even a one or two year old release to me is new.
00:10:17 ◼ ► And I thought great, I'm on my way, this will be great. I've never had a problem with a MySQL version upgrade before.
00:10:44 ◼ ► And the new server should theoretically be faster because it had slightly higher specs.
00:10:52 ◼ ► Slightly higher specs and it was on Linode's newer infrastructure and it had all these newer like,
00:10:57 ◼ ► LTS 20 is supposed to be faster than LTS 18 because of Linux kernel changes and all this other stuff.
00:11:02 ◼ ► So it's supposed to be faster. This my friends is when our sponsor Pingdom comes into play.
00:11:12 ◼ ► Because Pingdom alerted me at like 6:30 in the morning that morning. Something's wrong. Things are down.
00:11:37 ◼ ► The good news is you can set up not only regular monitoring with Pingdom to say, "Is this page up? Tell me."
00:11:45 ◼ ► This will alert you when things like cart checkout or forms or login pages fail before they affect your customers or your business.
00:11:56 ◼ ► You can customize how you're alerted, who is alerted depending on outage severity or conditions.
00:12:12 ◼ ► Go to Pingdom.com/RelayFM right now for a 30-day free trial with no credit card required.
00:12:29 ◼ ► So, I get all these alerts from Pingdom in the way that I wanted to be alerted, exactly depending on the severity of the outage, which was severe.
00:12:44 ◼ ► I will take the 6 a.m. alert rather than the "I'm about to go to sleep" alert and you're just like, "Oh, great. I'm not sleeping anymore."
00:12:54 ◼ ► But in this case, it was the old databases that I was moving off of but were still active had run out of disk space.
00:13:00 ◼ ► Because part of MySQL replication is the source, the main server, writes a log of everything it changes, called the binary log or binlog.
00:13:16 ◼ ► That way, as long as you start them both with a consistent data state, whatever changes in the primary will change in the replica.
00:13:21 ◼ ► And the replica sometimes falls behind a little bit, so the primary has to keep a certain amount of logs around,
00:13:34 ◼ ► And that way, the replica can fall behind by up to that far and still have the data there to catch up if it can.
00:13:40 ◼ ► Well, in a server migration, I extend the amount of time I keep logs because it takes a while,
00:13:47 ◼ ► it takes a couple hours to copy all the data from one server to a new one that you're setting up
00:13:56 ◼ ► Like the initial data set takes a few hours to copy over, so I needed a few hours of retention.
00:14:01 ◼ ► The old version of MySQL that the old servers were running didn't support specifying this value in hours.
00:14:09 ◼ ► And this database, not all my databases, but this one, which stores all the episodes and all the changes
00:14:14 ◼ ► anybody makes in their RSS feeds that apply to all of their episodes, this one, it has a massive write load
00:14:26 ◼ ► And the binary logs, I had temporarily turned off the thing that automatically prunes them faster than MySQL would,
00:14:34 ◼ ► some shell script I wrote, because I needed more time than that to do the initial copy to the new servers.
00:14:44 ◼ ► So it went back to its default of one day retention, which is the smallest value that that version of MySQL would allow me to set.
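To make the disk-space risk concrete, here is a back-of-the-envelope calculation of how much space binary logs occupy for a given retention window. The setting being described is most likely MySQL's `expire_logs_days`, which only accepts whole days; MySQL 8.0 added `binlog_expire_logs_seconds` for finer control. The write rate and free-space figures below are invented purely for illustration and are not Overcast's real numbers.

```python
def binlog_disk_usage_gb(write_rate_mb_per_s, retention_seconds):
    """Approximate disk space held by binary logs at steady state."""
    return write_rate_mb_per_s * retention_seconds / 1024


def hours_until_full(free_gb, write_rate_mb_per_s):
    """How long the logs can grow before the free space is used up."""
    return free_gb * 1024 / write_rate_mb_per_s / 3600


# Hypothetical figures: 5 MB/s of binlog writes and 300 GB of free disk.
WRITE_RATE_MB_PER_S = 5
FREE_GB = 300

six_hours = binlog_disk_usage_gb(WRITE_RATE_MB_PER_S, 6 * 3600)
one_day = binlog_disk_usage_gb(WRITE_RATE_MB_PER_S, 24 * 3600)
time_left = hours_until_full(FREE_GB, WRITE_RATE_MB_PER_S)

print(f"6 hours of logs: {six_hours:.1f} GB")
print(f"1 day of logs:   {one_day:.1f} GB")
print(f"Disk fills after about {time_left:.1f} hours")
```

Under these assumed numbers, a few hours of retention fits comfortably, but a full day of logs needs more space than is free, so the disk fills before the first log is old enough to expire. That is the general shape of the failure described here, whatever the real figures were.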
00:14:58 ◼ ► And that's one thing you really don't ever want to happen on a Linux server, or a Mac for that matter.
00:15:12 ◼ ► Things that aren't written to handle that can corrupt themselves, like corrupt databases and stuff.
00:15:16 ◼ ► Fortunately MySQL usually doesn't corrupt itself, but you're going to have to do some stuff and reboot, probably.
00:15:25 ◼ ► If you have replication set up, you're going to have to probably realign the replicas with the primary
00:15:31 ◼ ► because something will have happened, like the log will be corrupt or it will do a half write and then stop.
00:15:38 ◼ ► And then you'll have to realign it and say, "Alright, go to this log, this position, and all this stuff."
00:15:42 ◼ ► And so much of this stuff, by the way, is better with the new versions of MySQL, but we're not there yet.
00:15:51 ◼ ► And I thought, "Okay, best way to fix this, I already have the new server ready to go."
00:15:56 ◼ ► The only reason I hadn't switched over yet was because I was waiting until the weekend when my traffic is low.
00:16:07 ◼ ► It was peak time, it's not a great time to switch over, but it's peak time, but my servers are down.
00:16:13 ◼ ► So, I can switch over to the new cluster way faster than I can fix the disk space issue and reboot and get everything realigned on the old cluster.
00:16:38 ◼ ► And it seemed like every fourth or fifth connection to the database would just time out.
00:16:46 ◼ ► The server resources were not being taxed heavily. It has tons of processors, tons of RAM, SSDs, the whole way through.
00:16:54 ◼ ► And so, it's a very fast server. I was hitting some kind of bottleneck somewhere that was causing connections to, sometimes, time out and drop, but other times be served really fast.
00:17:12 ◼ ► You kind of feel like you're stranded on an island, you're trying to fix your plane yourself, and you're not an airline pilot or mechanic.
00:17:24 ◼ ► And the only resources I had at my disposal were like, "Well, I can file a support ticket with the host."
00:17:32 ◼ ► And, you know, Linode has good support, but I also know that that's going to take probably an hour or two to resolve, and I want it fixed now.
00:17:38 ◼ ► Also, you can do things like search Google for answers of like, "What happens if this connection drops?"
00:17:44 ◼ ► And that's a mixed bag. You get a lot of answers that are not for the problem you have.
00:17:52 ◼ ► You get a lot of answers that are just bot-created, scraped web pages that are content farms that don't actually say anything.
00:17:58 ◼ ► And you get a lot of answers from actual good people who are trying their best but are wrong.
00:18:08 ◼ ► I even asked Twitter, which, you know, if I asked Twitter about a server problem, you know it's really bad.
00:18:13 ◼ ► Because I don't like to do that. I don't like going there for that, for lots of reasons.
00:18:20 ◼ ► But you know I'm kind of desperate at that point if I ever have to ask, "Okay, I'm at my wits end here. I can't figure this out."
00:18:26 ◼ ► And I did get a lot of good things to check, but even then, I still couldn't figure it out.
00:18:34 ◼ ► They had all these admins looking at it for days afterwards trying to figure out what the heck had happened.
00:18:38 ◼ ► I could not figure it out. And the only way I could get things to resume back to normal was after a few hours,
00:18:49 ◼ ► I eventually found, like, well, if I just change a memcache setting over here to be way more aggressive,
00:18:55 ◼ ► to cache reads to this table way more than they were before, then I dropped the query volume by a lot,
00:19:04 ◼ ► And I don't love this solution, in part because the old server handled this load just fine,
00:19:13 ◼ ► but the new server's different. It has a newer version of Linux, it has a newer version of MySQL,
00:19:28 ◼ ► I did verify with Linode that there's no network throughput limits or DDoS protection that might get in the way,
00:19:35 ◼ ► but I can't figure it out. And so my solution was work around it with aggressive caching
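The workaround described here, caching reads to a hot table more aggressively so fewer queries reach the database, follows the common cache-aside pattern. The sketch below uses an in-process dictionary as a stand-in for memcached so that it runs on its own. The `TTLCache` class, the 300-second lifetime, and the feed loader are hypothetical and not taken from Overcast's code.

```python
import time


class TTLCache:
    """Minimal cache-aside helper with a per-entry time to live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # Cache hit: no database query.
        value = loader(key)  # Cache miss: fall through to the database.
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


queries = []  # Records every simulated database query.


def load_feed_from_database(feed_id):
    # Stand-in for a real SELECT against the busy table.
    queries.append(feed_id)
    return {"id": feed_id, "title": f"Feed {feed_id}"}


cache = TTLCache(ttl_seconds=300)
for _ in range(1000):
    feed = cache.get_or_load(42, load_feed_from_database)
```

A thousand reads of the same row produce a single database query. The trade-off is that readers may see data up to `ttl_seconds` old, so a longer lifetime cuts query volume further at the cost of staleness.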
00:19:45 ◼ ► But this is so often, like, the thing with running servers is sometimes you get great solutions to things,
00:19:54 ◼ ► and things just work just fine, and with all the other upgrades I did during this cycle,
00:19:58 ◼ ► like, I upgraded my PHP version way ahead, I upgraded Nginx to a new version, PHP-FPM,
00:20:49 ◼ ► I kind of fumble through, you know, I figure it out with a combination of my knowledge,
00:21:42 ◼ ► but, you know, I wrote all this stuff, like, in 2013 is when I wrote all this backend stuff.
00:21:57 ◼ ► So I made the best decision I could at the time. Now, I think I would still run some servers,
00:22:04 ◼ ► because, again, there are so many advantages for your app to running servers.
00:22:08 ◼ ► There's so many ways you can enable cool features or cut costs that you would otherwise have to pay to,
00:22:13 ◼ ► like, third-party services or whatever, but I think I would significantly cut back the amount of data
00:22:40 ◼ ► And so I think I would modify things that way, but otherwise, you just, you gotta have days like this occasionally.
00:22:51 ◼ ► but the result is you get a really cool app and really cool service that works most of the time without any intervention.
00:23:16 ◼ ► there are problems that you can have in a managed service that there is no solution for,
00:23:20 ◼ ► that you hit some limit or something bad happens, and you have no recourse in a way that is,
00:23:26 ◼ ► you know, one of the, like, the blessing and the curse of managing it yourself is that you have the ability to go in there
00:23:31 ◼ ► and be changing stuff and fixing things and, you know, adapting things to what you're doing.
00:23:39 ◼ ► most of the time it should probably be good, but things can happen that are problematic,
00:23:47 ◼ ► because something bad happened at your host, and you're like, okay, well, I can't do anything about that.
00:23:53 ◼ ► And, you know, it's, and so I think there's certainly an element of just, like, there is no right solution here.
00:23:59 ◼ ► It is all a question of trade-offs and finding the trade-off that is best for you and your expertise and your background.
00:24:05 ◼ ► And, you know, it's like, I think, what also I think is very encouraging with this kind of stuff is,
00:24:10 ◼ ► as there are so many different levels of the stack that you can engage in as you're developing something,
00:24:16 ◼ ► is to find the one that feels best for you. And I think for you, it's like, it makes a lot of sense that you have, you know,
00:24:21 ◼ ► managed a lot of servers yourself, but at the same time, it's like you're feeling like, for right now, for you and your needs and so on,
00:24:27 ◼ ► you might be shifting one layer up slightly on the stack, and maybe you're using a managed database
00:24:32 ◼ ► rather than managing the database yourself, but you're interacting with it with application servers that you do manage yourself.
00:24:38 ◼ ► Like, I like that there's that flexibility, but it's just rough when these things happen.
00:24:43 ◼ ► And now you're in this funny place of, like, I mean, I've been there, where it's like, you get it working, and it's like,
00:24:48 ◼ ► you don't want to touch anything, because like, in some ways, what you should probably be doing is like,
00:24:53 ◼ ► you should set up another server and see if you're, you know, essentially starting again with all the learnings
00:24:58 ◼ ► that you've done to this point and do a migration, set up a replica, and then see if that would handle things better.
00:25:03 ◼ ► Like, because sometimes I've had the weird situation where like, it's just something went funny when you were setting up the server.
00:25:18 ◼ ► So it's a very painful thing, and I feel for you because I've been there, but at least it sounds like you are in a place where it's not on fire,
00:25:23 ◼ ► and that is a much better place to be in, and it's painful when you're on the way there.
00:25:28 ◼ ► And I feel like the worst thing in server administration is this feeling where there are certain actions that you take
00:25:43 ◼ ► or like you shifted from one database to the other, it's not like, "Oh, I can just go back to the old database."
00:25:48 ◼ ► It's like, no, you can't, because all the data, all the writes that have happened in the meantime are just gone if you did that.
00:25:58 ◼ ► And it's like, you hope it works, and you can plan, and you can do all the things that you want to do to make it hopefully work.
00:26:08 ◼ ► I find paper checklists to be like the way to do it, where you're making some of these like irrevocable movements.
00:26:13 ◼ ► I always like write down the list of, "These are the 10 things I need to do in order," and then I'll like write it down and mark it off,
00:26:33 ◼ ► Yeah, I mean the good thing is like, it is mostly done. Like, I'm certainly out of the woods, but it's not done.
00:26:43 ◼ ► I need to update a few things here and there, but like for the most part, it's mostly done.
00:26:48 ◼ ► And now I'm going to be at the point soon where like I just won't want to touch this for a long time.
00:27:03 ◼ ► Is there some way I can write my app to just put all this stuff in CloudKit now or something, right?
00:27:13 ◼ ► But as you said, there's problems with anything you pick. Like, what if CloudKit's down?
00:27:18 ◼ ► What if it introduces a bug? AWS goes down all the time and takes down tons of websites.
00:27:23 ◼ ► There's so many issues. Or if you hit some performance bottleneck on one of these things,
00:27:28 ◼ ► a lot of times you just can't do anything about it. You're just stuck and you just got to ride around it.
00:27:43 ◼ ► I hope you stay there too. Thanks for listening, everybody. And we'll talk to you in two weeks. Bye.