00:00:14 ◼ ► So I think as we mentioned in our last episode, we're going to be starting to go to publishing roughly every two weeks from now on,
00:00:22 ◼ ► largely just coming from a place of, I think the show is mature enough at this point that
00:00:31 ◼ ► And I think we were getting a little thin on topics, and we'd rather have fewer episodes that had richer content than to kind of
00:00:45 ◼ ► And we're going to start off this year kind of in some ways wrapping up a topic that we talked about on the last episode where
00:00:57 ◼ ► I was largely talking about dealing with bugs and challenges and sort of ways and means by which to work that out.
00:01:06 ◼ ► And I had an interesting update. Usually we don't do follow-up on the show, but I had a little interesting update on the bug that I talked about last week,
00:01:12 ◼ ► and then we're going to dive into some more different kinds of bugs and issues that we may have run into over Christmas.
00:01:18 ◼ ► So it turned out I was eventually able to find the bug that was driving me crazy in the episode where I was talking about
00:01:26 ◼ ► how to reproduce unreproducible bugs. And everything I had tried, there was an issue in Pentameter++
00:01:34 ◼ ► where all of a sudden the step counts would go crazy, and suddenly rather than having like 10 or 20 or 30,000 steps like people normally have,
00:01:45 ◼ ► Like every time they opened the app, the number would double. And the doubling, the fact that it was pretty much actually doubling,
00:01:51 ◼ ► was certainly a clue that there was some, like somehow this data was getting compounded on itself.
00:01:55 ◼ ► But I could not reproduce this. I tried all manner of things. I tried, you know, like, inventing bugs, trying introducing delays,
00:02:04 ◼ ► like over, you know, having the saving logic run thousands of times. Nothing I could do to get it to work.
00:02:12 ◼ ► Then, right before Christmas, I finally had the final clue that I needed to track it down.
00:02:19 ◼ ► And the clue came in the form of a screenshot that somebody had sent me where they were having this bug,
00:02:26 ◼ ► and they sent me a screenshot, like, of their phone and of their watch to say, like, you know, like showing me that the numbers were crazy in both places.
00:02:34 ◼ ► And what is interesting is that I noticed that the Apple Watch had a slight visual difference to the phone that shouldn't have been there.
00:02:46 ◼ ► So in a previous version of the app, the app had been updated to slightly change the font on the Apple Watch.
00:02:53 ◼ ► And so it changed the shape of the comma. And I noticed that the comma was still the wrong shape,
00:02:58 ◼ ► which meant that the Apple Watch was running the, an old version of the app, but it was talking merrily for days on end with the new version of the app.
00:03:09 ◼ ► And that ended up being the root of the problem and ended up being why it was impossible for me to recreate.
00:03:15 ◼ ► Because for some, like, that's not a situation that, like, I can just like go into Xcode and be like,
00:03:25 ◼ ► Like, that's not a mode that exists. This isn't something that I think should ever happen.
00:03:35 ◼ ► And so I was hitting this weird situation where the way that, when I changed the way the step counting worked,
00:03:40 ◼ ► suddenly the two versions of the app, the old and the new, would talk to each other and they would disagree about what the step counts should be.
00:03:48 ◼ ► And because I have all this logic in the app for making sure that I never decrease step counts.
00:03:53 ◼ ► If one device says the count was higher, I always let that device win because I never want to show someone that,
00:04:07 ◼ ► And because of those two things, it combined to create this situation where things just kept doubling out of control.
00:04:12 ◼ ► And it turned out that was why. And I coincidentally actually had fixed this because I had versioned one of the fixes in my, like,
00:04:25 ◼ ► One of the things I did was I introduced versioning to my watch communication scheme so that it would only,
00:04:33 ◼ ► like, versions of the app would only talk to themselves because I'd had, I'd never seen it, but I was like,
00:04:38 ◼ ► in my guessing of what could possibly be the case, I had said, "Well, maybe old, like, old messages from an old version of the app could be like pending.
00:04:51 ◼ ► So, like, I never expected that the two versions of the app could happily coexist for, you know, days or weeks on end.
00:05:01 ◼ ► So I added versioning. It turns out that I did actually fix it, and so everything was fine.
00:05:06 ◼ ► And that's the, I guess the moral of the story is don't trust watch connectivity, which I probably knew to start with.
00:05:13 ◼ ► That's a good life lesson in general. No, that's wild. I never would have even considered the possibility of a different version
00:05:27 ◼ ► And it makes sense, like, you can think of scenarios where that might happen, like, you know, where the phone app gets updated,
00:05:33 ◼ ► but it hasn't communicated with the watch app yet, and then they get back in range or whatever,
00:05:38 ◼ ► and then the watch app starts sending messages before the phone app has had a chance to upgrade it or something.
00:05:44 ◼ ► Like, you know, I don't know the full details of how that system works, but, like, it's the kind of thing I would just never have thought of.
00:05:50 ◼ ► Yeah, wow. So I pass this on as a cautionary tale. If you're doing anything with watch connectivity,
00:05:56 ◼ ► add versioning to your watch connectivity messages to make sure that you don't get bit by this as well.
00:06:02 ◼ ► Yeah, I'm gonna do that today. I'm gonna go add my build number to the message dictionary I pass back and forth.
00:06:13 ◼ ► But I hear I was not the only one who had a fun and exciting Christmas time with bugs and challenges.
00:06:21 ◼ ► You know, sometimes running servers, you can go for months or years without having any problems at all.
00:06:29 ◼ ► And then you have a problem at just the worst time, and it makes you think, "Man, I should never run servers again.
00:06:34 ◼ ► Like, I should get out of this business." But, you know, the reality is, you know, if you set things up well and you don't have any weird things happen to you,
00:06:42 ◼ ► it's generally pretty fine. Like, it's not a lot of work, as we've talked about before.
00:06:46 ◼ ► Like, you know, we've talked about how relatively easy and low effort it is to run servers these days.
00:06:50 ◼ ► But occasionally, something goes very wrong, and that's what happened to me over Christmas.
00:06:55 ◼ ► So, a lot, you know, a little bit of backstory. A lot of Overcast customers had been reporting crashes recently.
00:07:05 ◼ ► And so I've been really aggressively working on a series of updates and beta tests to try to find and fix as many crashes as possible.
00:07:14 ◼ ► And so one of the more obscure crashes that kept coming up, and I couldn't figure out why,
00:07:19 ◼ ► it was happening in some very complex, grand-sensual dispatch code. It was using dispatch timers to coalesce and throttle the rate at which the app makes sync requests to the server.
00:07:33 ◼ ► The point of this was to make syncs happen reasonably quickly after data changed, but to not make too many requests in a short time,
00:07:41 ◼ ► and to also dynamically throttle back the frequency if you had a low battery level or a low power mode was awesome, or whether you were on cellular.
00:07:53 ◼ ► Make things sync quickly when they're happening most of the time, but then throttle it back so it doesn't go too heavy and doesn't, like, hammer the server or use up too much battery or whatever else.
00:08:02 ◼ ► And so that code, it was using this complex dispatch timer thing that I had based on something forever ago in a Mike Ash blog post.
00:08:10 ◼ ► So, you know, pretty advanced code that I didn't really fully understand, if I'm honest.
00:08:16 ◼ ► Yeah, it's very advanced. It's a great, great resource, very advanced. A little over my head at times, but anyway.
00:08:23 ◼ ► So there was a crash happening with some of the dispatch calls in that code, and I couldn't figure out why, what conditions would lead to that crash.
00:08:30 ◼ ► So as one of these builds right before Christmas, I rewrote that code to just be simpler, to just use, like, NS timers and to try to avoid whatever condition was causing crashes there.
00:08:42 ◼ ► Because the crashes there were happening, like, with every build I would get reports of it. It would be one of my top crashes.
00:08:46 ◼ ► So I figured, I'll find a way to fix this. I'll just rewrite that code to be simpler so I can understand what's going on and to not be so sophisticated.
00:08:54 ◼ ► And right before the App Store Connect holiday shut down for Christmas week, that code shipped as version 505 with a bunch of other fixes.
00:09:07 ◼ ► They live somewhere with no cell coverage from any carrier, not Verizon. There's no cell coverage.
00:09:14 ◼ ► And they have the best internet connection they can buy up there, which is a 6 megabit DSL.
00:09:20 ◼ ► And that's only 6 megabits down, of course. It's 768K up. And this costs them $80 a month.
00:09:26 ◼ ► And this was being shared that week, because it was a house full of family, between 7 people and about 15 devices.
00:09:38 ◼ ► Now, to make a slow internet connection usable to a lot of people, the router can't let any one device monopolize the connection, especially the upstream.
00:09:46 ◼ ► If the upstream is saturated on your internet connection, all those TCP ACK packets from your downloads can't get back out.
00:09:52 ◼ ► So it basically kills downloads as well as uploads, and it makes the connection basically feel offline.
00:10:01 ◼ ► And when you have 7 people taking 12 megapixel photos and 4K videos on their iOS devices,
00:10:08 ◼ ► and iOS devices automatically upload those to iCloud whenever they feel like it or whenever they are plugged in,
00:10:19 ◼ ► So what routers do to manage this is they perform QoS, which stands for Quality of Service.
00:10:23 ◼ ► This limits how much of the downstream and upstream bandwidth any one device can monopolize.
00:10:28 ◼ ► And when you have less bandwidth than you need, QoS on the router is basically a godsend.
00:10:32 ◼ ► It makes the difference between the connection being usable for everyone and one person's iCloud upload making it go offline for everyone else.
00:10:38 ◼ ► So a few years ago I upgraded them to a Nerdy Ubiquity router, the same kind I use here at home.
00:10:44 ◼ ► It has pretty good QoS, it made a huge difference, and it would actually remain usable most of the time, even when you had iCloud uploads going.
00:11:00 ◼ ► So miracle number one, so I had this year to solve some range issues with their previous setup,
00:11:18 ◼ ► And it's always on, you don't have to set it up, it's typically Eero-style, I don't even think you can configure it, like it's just on.
00:11:32 ◼ ► Like my terminal windows were less laggy, like everything was way more responsive, even with all these people using the connection.
00:11:38 ◼ ► So that was Christmas miracle number one, and then I soon started needing a lot more miracles.
00:11:44 ◼ ► But first, we're sponsored this week by Linode, which, spoiler, ended up being one of my Christmas miracles.
00:11:51 ◼ ► So Linode gives you access to a suite of powerful hosting options, prices start at just $5 a month.
00:11:57 ◼ ► You can be up and running with your own virtual server in the Linode cloud in under a minute.
00:12:01 ◼ ► Even on that slow connection, if you're on your parent's DSL, you can still get it to go in under a minute.
00:12:14 ◼ ► They know how important it is to get the help you want, and they have amazing documentation that's open to the public.
00:12:19 ◼ ► You don't have to be a member, you can see all the documentation to give you a reference when you need it.
00:12:23 ◼ ► And let me tell you, their support saved my behind, and I'll tell you why in a few minutes.
00:12:28 ◼ ► But Linode is just fantastic. They have great service, great support, a great control panel.
00:12:37 ◼ ► Their plans start at 1GB of RAM for just $5 a month, and they offer lots of plans above that for whatever your needs may be, including high memory plans.
00:12:58 ◼ ► So go to linode.com/radar to learn more, sign up and take advantage of that $20 credit.
00:13:06 ◼ ► So the day after Christmas, I started getting reports of very slow syncs, sync errors, and searching not working.
00:13:15 ◼ ► And this is really a problem, like searching is especially, that's a huge problem, because if you can't search, you can't add podcasts.
00:13:21 ◼ ► And that's a big problem for a podcast player getting lots of new installs for Christmas.
00:13:30 ◼ ► Now, when you're on a really slow internet connection, it's really hard to verify if your servers are slow, because you can't really feel it on your own device.
00:13:44 ◼ ► But since I had installed that Eero system, and it improved the responsiveness so much, I could actually tell that other things did seem much faster.
00:13:58 ◼ ► Christmas Miracle #2 was a service that has been a past sponsor that I've also used forever called Pingdom.
00:14:04 ◼ ► So Pingdom can notify you when stuff goes down, but it also maintains performance logs from all these, it does status checks from around the world.
00:14:16 ◼ ► So I was able to log into Pingdom to get a more objective view of what is performance from my site from around the world, not just from my slow DSL connection I'm using right now.
00:14:25 ◼ ► And so sure enough, the response times most of the time were very constant, around 400 milliseconds.
00:14:31 ◼ ► And then at the time that everything was feeling slow, it had gone from 400 milliseconds to about 7 seconds.
00:14:54 ◼ ► And this was invaluable to get this info from Pingdom, because I was able to actually see what is the actual state, not just how does it work for me, and to show me a history of what was normal before.
00:15:09 ◼ ► And in fact, I was also able to see it wasn't because of the US workday, because that history also showed me that latency was pretty much the same all the time.
00:15:17 ◼ ► It didn't vary with the day cycle or the weekday or weekend cycle or anything like that. It was pretty constant.
00:15:23 ◼ ► Now, I was a little nervous, because Overcast's peak usage does line up with the US workday.
00:15:30 ◼ ► And so I was like, "Oh no." Right before Christmas, I changed the sync timing algorithm.
00:15:38 ◼ ► So I thought, "Oh no." This was the day after Christmas. This was the first full-blown US workday for a lot of people since the update had spread to a lot of devices.
00:15:47 ◼ ► So I thought, "Oh no, I messed up sync timing, and I'm like DDoSing myself from every copy of my app out there."
00:15:56 ◼ ► So the first thing I did, I very quickly rewrote the sync throttler, and I wrote it even simpler this time, even easier to understand, a little cleaner code separated out into its own class and everything.
00:16:07 ◼ ► And I had two main goals, to be less fancy so that I can more understand the conditions that can happen, and to have remotely configurable parameters that I could adjust with server-side code.
00:16:24 ◼ ► So I got that done very quickly. I sent it to beta testers immediately, because I knew I'd have to submit it basically that day, because it was so urgent.
00:16:32 ◼ ► Now, by the way, uploading builds to App Store Connect with a 768k upstream is painful and kind of hilarious.
00:16:40 ◼ ► Because it doesn't just send the app code. The app is whatever it is, like 20 to 30 megs.
00:16:45 ◼ ► But it also sends debugging symbols and bitcode and all the different assets for all the different size classes and everything.
00:16:55 ◼ ► So I uploaded one, I was all set to submit it, and then one of the beta testers reported that I broke the sleep timer.
00:17:23 ◼ ► And so I filed the F-sided review. But I'm like, there's probably a lot of people filing these right now for this, you know, problems that happened this week.
00:17:33 ◼ ► So I'm like, I'm last in line, I'm filing a request that lots of people have filed, this is probably not going to go well.
00:17:45 ◼ ► So the weird thing was that there was no indication that my servers were under heavy load.
00:17:58 ◼ ► So I first started out like, let me look at the web server performance, it doesn't seem like they're being bottlenecked by the databases, let's see what's going on with web servers.
00:18:06 ◼ ► And also that would explain why search requests were also a problem, because search hardly touches the databases.
00:18:16 ◼ ► Because Linode, you know, some problems you need to throw resources at and just see if that fixes it.
00:18:21 ◼ ► And with Linode, like, you know, it's virtual servers, you can resize them up and down with different, you know, hardware amounts.
00:18:29 ◼ ► So I thought, attempt number one, if my web servers might have performance problems, just double all their resources and see what happens.
00:18:36 ◼ ► So I did a rolling upgrade across them over about an hour, doubled all the capacities, and I knew, you know, no big deal, like, you know, they also do hourly billing.
00:18:44 ◼ ► So it's like, if I need to roll it back, I'm only out the cost difference of like the two hours I had the bigger sizes.
00:18:54 ◼ ► Like, I doubled everything on the web servers, the server upgrades didn't improve any of the, they didn't fix this problem.
00:19:04 ◼ ► So like, obviously, so like, if the servers are not under load, but everything's slow, there's some kind of bottleneck somewhere else.
00:19:14 ◼ ► And at first I thought, you know, let me look at my code, see like what requests are taking the longest in code, what functions are taking the longest, what database queries, maybe something.
00:19:32 ◼ ► I have my own like time logging thing built into time certain database queries, so I tried using that, but it wasn't revealing anything obvious.
00:19:40 ◼ ► And so I did some searches, and a mini Christmas miracle, Stack Overflow, which is always a miracle every day, led me to the two most recommended solutions, which were the New Relic service, which is a commercial service, it's all their thing and it monitors, and a PHP plugin that's been around forever called Xdebug.
00:20:00 ◼ ► New Relic has major advantages in theory, like they will aggregate things over multiple requests, and they have all sorts of metrics they collect and everything, they have context with historical data and everything, but I could not for the life of me figure out how much it would cost me.
00:20:13 ◼ ► Like their pricing for using it on servers is very confusing, and their website was totally broken. Like I couldn't actually, like the calculators kept breaking, and then at one point I'm like, fine, I'll throw money at it.
00:20:24 ◼ ► I'm desperate, I need to fix this now, so I'll pay whatever you want me to pay, and so I tried signing up, and their signup form was broken.
00:20:32 ◼ ► So I'm like, alright, well, I guess that's not gonna be an option, so I set up Xdebug on one of my servers.
00:20:38 ◼ ► It's very, very basic, it's a PHP extension, you enable it, you restart the server, so you let it run for a while, and it dumps logs to /temp, and it writes like one log file per request, and you can just go in and profile that request and see what happens.
00:20:52 ◼ ► There's no aggregation as far as I can see, as you can see, like what requests are taking up all the time in total?
00:21:02 ◼ ► But Christmas Miracle number four was a front end called WebGrind. It's a simple open source package that lets you view and navigate the Xdebug files in a web browser, instead of using some kind of weirder Linux tools.
00:21:15 ◼ ► So WebGrind was awesome, and it actually did let me find some slow functions and bottlenecks, so I took the next few hours, or most of that day, and just made tons of stuff faster.
00:21:25 ◼ ► Ridiculous amounts of things faster. Like one example is, I have a handful of string utilities that, you know, some of PHP's function naming and calling conventions are kind of unintuitive.
00:21:37 ◼ ► Like to see if a string contains a substring, you have to call strpos, and you can never remember what order the parameters go in, and then you have to test it against, does it not equal equal false?
00:21:49 ◼ ► So it's like, that's weird for me. And so I have a function in my standard utilities library that I put in on my project called contains, that contains a b, and all it does is call strpos does not equal equal false, but it makes the code more readable.
00:22:06 ◼ ► Turns out, I learned from profiling, that calling, like, when you call your own PHP user space function, that then calls a system function, it's way slower than just calling a system function directly.
00:22:18 ◼ ► It's not a little bit slower like it would be in C, it's a lot slower, because it's calling like one more like PHP space function first.
00:22:24 ◼ ► And calling the built-in functions, which are all implemented in C, are way, way, way faster, even just like the call differences.
00:22:31 ◼ ► So I found a few hotspots that used a few of my little utilities like this, and I converted them back to the ugly PHP code.
00:22:41 ◼ ► Fixing memcache misses that should have been hits and everything, but eventually I got to the point where my profiling was basically showing like, nothing in particular is taking a lot of time.
00:22:54 ◼ ► Then finally, I thought I'd check one more place, the load balancer. I use, it's Linode's managed load balancer called the node balancer, which it's basically equivalent to running like, you know, Nginx or HAProxy on a dedicated machine, but they just take care of all the configuration for you.
00:23:11 ◼ ► You do it all on the front end, and they take care of SSL decoding and everything, so it's really nice actually. That's why I use it instead of just doing HAProxy or something.
00:23:19 ◼ ► And two things jumped out at me when I looked at the load balancer. So it performs health checks on every back end web server, and I configure those health checks to fail if the web server can't be reached, or it can't reach the database, or it's above a certain load average.
00:23:34 ◼ ► And node balancer was not showing failed health checks any time recently, which suggested that the web servers hadn't really been slammed at all since the problem began.
00:23:44 ◼ ► Like, they hadn't had their high load average to trip that health failure. So the connection rate graph was the other interesting thing.
00:23:52 ◼ ► Normally it was about three or four hundred per second, and it was currently spiked and plateauing at around ten thousand.
00:24:00 ◼ ► I remember reading a while back that node balancers have a limit of ten thousand concurrent connections.
00:24:05 ◼ ► So somehow the load balancer was getting twenty-five times its usual traffic, but only during the US work day.
00:24:14 ◼ ► And it's like, this is weird, because at about eight pm it fell back down to its usual level. So even without the app update being in people's hands, in the evening it was back down to exactly what it was before.
00:24:26 ◼ ► Not higher than it was before, exactly what it was before. And then the following morning, eight am rolls around and it goes back up to ten thousand again.
00:24:34 ◼ ► And again, everything was the same. Web servers didn't appear to be loaded. Something was happening.
00:24:38 ◼ ► So anyway, Christmas miracle number three, Linode, came to my rescue again. I filed a support ticket, and I'm like, hey, can you look at this? Like, I don't know what to do here.
00:24:48 ◼ ► Something weird is going on with the load balancer, but it doesn't seem to be getting to the back ends.
00:24:51 ◼ ► And they gave me a full response in less than an hour. And it was a detailed, it wasn't just like a form, oh we'll get to it soon, or try this.
00:24:59 ◼ ► No, it was like, they looked at the packets that were going in, they suggested a few things, they said it was getting fifty thousand packets per second, but that it didn't appear to be like an obvious DDoS, but they said they couldn't rule it out.
00:25:13 ◼ ► So the tech suggested putting Cloudflare in front of Overcast, which is, it's like a reverse proxy acceleration service.
00:25:20 ◼ ► Having it send traffic to a new node balancer, whose IP would never be in a public DNS entry, so like people couldn't bypass Cloudflare and just go to the node balancer directly, and then shutting down the old one.
00:25:35 ◼ ► And I was a little nervous, honestly, about this option, because like putting a caching proxy service that I had never used in front of my entire API from an app that does SSL certificate pinning, and doing this in a hurry while on vacation with limited testing abilities on slow DSL, like this was very concerning to me.
00:26:06 ◼ ► It was, I can't say enough good things about how easy it was to set up, like it was incredibly easy to set up, they really have done a very good job there.
00:26:12 ◼ ► And I did their plan, they have a plan that lets you upload your own certificate for SSL, so I did that to avoid pinning problems with the app.
00:26:24 ◼ ► So as the DNS changed, so you know, you change your DNS to point to Cloudflare instead of you, and as that change propagated, within like a couple hours, the pattern was very clear that like the traffic going to the old load balancer was still the giant flood of packets for the next few hours.
00:26:40 ◼ ► And the traffic that was coming through Cloudflare, and that was getting to my second new load balancer, was totally normal.
00:26:50 ◼ ► And so it went from, you know, crazy, horrible, unusable, to over the next few hours, the problem was gone.
00:27:03 ◼ ► My fixed version was approved, and it spread through most of the user base over, you know, a couple of days.
00:27:16 ◼ ► I don't know whether it was my fault, whether it was the App Review, like was my sync throttling like going horribly wrong?
00:27:40 ◼ ► I didn't have time to scientifically test one factor at a time, you know, isolate everything else, control everything else, just test this one thing, you know, to figure out what the exact problem is.
00:27:53 ◼ ► Like, you have limited capacity, like I was on vacation, it was a weird place, I was urgent timing-wise.
00:28:14 ◼ ► But that's what happened, and it kind of inspired me to do a little bit more server work now.
00:28:25 ◼ ► Basically, this actually might end up being two weeks from now episode, because it's too late to fit it in here.
00:28:31 ◼ ► But just doing a whole lot of server updates, because like, one of the issues I faced during this big problem was,
00:28:45 ◼ ► I've been relying on some old versions of some of the packages and the distros that just like aren't supportive anymore,
00:28:50 ◼ ► and make it hard to start up new servers, and cloning is not something I like doing very much.
00:29:06 ◼ ► - Yeah, it was finally fixed, and I was able to enjoy the rest of the vacation after that, but
00:29:15 ◼ ► but the fact is, like, the problem is gone, I'll never know, and sometimes that happens.
00:29:25 ◼ ► I didn't know which thing I had done fixed it, but I didn't really care at a certain point, because it's like, it's fixed.