ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery

ShipTalk - Don't Let Efficiency Nuke Your Reliability - Matt Schillerstrom - Harness

February 16, 2023 Jim Hirschauer / Matt Schillerstrom Season 2 Episode 0

In this episode of ShipTalk (The SRE Edition), Matt Schillerstrom shares stories from his time working in reliability at a nuclear power plant and at Target. We've changed up the format of ShipTalk to include some sections that are just for fun...

  • Introductions
  • Just for fun #1 - 1 truth and 1 lie
  • Main topic - The impact of efficiency initiatives and how to avoid/minimize the negative consequences
  • Just for fun #2 - Matt's worst IT mess up
  • Closing

Jim Hirschauer (Host):

All right. Welcome to ShipTalk, the SRE Edition. I'm Jim Hirschauer, host of the SRE Edition, and a recovering techie who likes to talk to other people about technology. Now, ShipTalk is a podcast sponsored by Harness, the software delivery platform, and the SRE Edition is all about reliability topics. My guest today is Matt Schillerstrom, who also happens to be my coworker. The other day, Matt and I were talking about the impact that all of the current corporate efficiency initiatives might have on the reliability of applications, and Matt had some really great insights that I thought he could share with us. So Matt, welcome to the show.

Matt Schillerstrom (Guest):

Yeah. Hey Jim. Glad to be here for the first chat here. Yeah, we've been working together for what now? A little less than a year. We've had a lot of fun.

Jim Hirschauer (Host):

So yeah. And we get to spend a ton of time together and have some really interesting conversations. So I know a lot about you, Matt, but I'd love for you to take a minute to go ahead and share your tech background with our listeners.

Matt Schillerstrom (Guest):

Yeah, for sure. I've been in tech for about 20 plus years. I started out my career actually at a nuclear power plant. When I graduated college, I thought I was gonna get into the leading edge technology, but I found myself working on, like, 1980s General Electric, like, old computers, like, with VAX and VMS operating systems, like the true green screen old school. Yeah. But come on, I wouldn't take that back. Yeah, I was at the power plant for 13 years, just doing, like, IT system administration work. I had four or five years where I was actually in engineering doing reliability testing on pumps and motors. But then I got back into the power plant control system and I was an administrator of the actual controls that the operators would use to run the power plant. So that was cool.

Jim Hirschauer (Host):

Yeah. Really important reliability at the power plant.

Matt Schillerstrom (Guest):

Life or death. Yeah, exactly. You know, is it gonna leak radioactivity or not? Right, right, right. But yeah, then I was at Target, a big retailer out of Minneapolis, Minnesota, for five years as a lead engineer building out their chaos engineering program. So it was like an old school IT disaster recovery team, and then we revamped it into doing real-time disaster recovery, utilizing chaos engineering.

Jim Hirschauer (Host):

Awesome. That's quite a background you have, so some serious life or death, reliability work that you've done and then maybe not quite life and death, but serious business that you're involved with. So, great background. Yeah. So listen, before we get to our main topic, I'd like to just play a little, little game here to start out and have some fun. I'd love for you to, to tell me one truth and one lie, and I'll try and guess which one is which, based on what I know.

Matt Schillerstrom (Guest):

Okay, so truth and a lie. Let's see if I can fool you. I'm an avid saxophone player. Played all my life, marching band, jazz band, wind ensembles. Or, I've actually never had to do a Kubernetes production rollout by myself.

Jim Hirschauer (Host):

Okay. For some reason, I think you're a musician. I don't know what's jogging my memory about that. But remember, I know you, so this is gonna be a little bit easier for me, but that's my guess. My guess is you're a musician, you're a saxophone player. Yeah. Yeah.

Matt Schillerstrom (Guest):

It's nice. That's one quirky thing about me, I guess.

Jim Hirschauer (Host):

So. Cool. Well, it's great to have hobbies outside of IT, right? Yeah. Yeah. All right, cool. Well, let's jump into our main topic now. So the reason that I wanted to have this conversation is because companies are cutting back today, right? Right now, with the current economic environment, companies are looking for efficiencies, and those efficiencies can potentially come with compromises. And we were having a conversation about this and you shared a really interesting story with me. So I think this was back to your retail days. Why don't you go ahead and share that story one more time.

Matt Schillerstrom (Guest):

Yeah. I think what's interesting just with this story in retail, like, it relates a lot to my power plant days, because at a nuclear power plant, you have defense in depth. You have multiple redundancies built in. And as an example, like, if I were to take you on a tour of a power plant and you were to point, what is this? What is this, what is that? It'd all be for, like, failover or safety. Like, hardly anything in the power plant makes money. The big turbine itself is what makes money. Everything else is there to safely shut it down. And when I look at my experience, like at the retail stores, right, and environments, we would often, like, overbuy, like, over capacity, beyond what we really needed, leading up to peak seasons like Cyber Monday or, you know, the holiday season. And what that meant was, like, a month before, like, we expected, you know, a thousand x of traffic. We would actually just be willingly paying millions of dollars in extra infrastructure just to have that capacity and confidence that our system would be scaled appropriately to handle it. Right. And if you think back to, like, 2018, 2019, like, you know, the economy was really good and, you know, spending that, you know, millions of dollars extra on infrastructure didn't really matter, right? Yeah. Or maybe it mattered, but, like, people didn't know how to look at the granularity of it. Cuz FinOps wasn't as, like, you know, popular or well known back then either. Right? Yeah.

Jim Hirschauer (Host):

So how many times a year would you say that your company was over-provisioning like this?

Matt Schillerstrom (Guest):

Yeah. So at my retail, like, company, we would spin up, you know, multiple times a year, over-provisioned, over capacity. But again, we had that confidence that we would be able to perform and withstand, you know, disruption in an event, because, like, if we had network loss or, you know, an unexpected peak spike in traffic, we bought assets so that it should be resilient through a chaotic storm. Right. And it was, you know, we were rock solid, generally speaking. And what was interesting, like, leading into the pandemic and COVID, a lot of these retailers, especially, you know, like grocery stores, they were running peak load almost every day with, you know, delivery and on-demand shopping. Mm-hmm. And what was interesting about that too is that because they were running this well-oiled machine at peak capacity, like, over-provisioned, like, they were just rock solid and stable. Right? Reliability was there, like, a hundred percent. And again, they were paying more for infrastructure, but they were also making a lot more money at the time. Right. And they didn't really have to think about, like, that efficiency as much, just because it was working.

Jim Hirschauer (Host):

Yeah. And you can do, you can do that when business is good, but what happens when times get tough and lean?

Matt Schillerstrom (Guest):

Yeah. Yeah. And I think, you know, as you saw, you know, after the pandemic and more people getting out in the open and not doing as much online shopping, like, retailers in general had to scale back their infrastructure, right, to, like, normal capacity, like pre-COVID times. And what you saw, like, and you can see this on the, like, internet if you search outages too, the infrastructure that they had, that they provisioned, it didn't meet, you know, that reliability standard that they had throughout, like, you know, these high-provisioned time periods over COVID, right? Because they scaled back from an economic perspective, but also they hadn't really tested, you know, everything that they probably should have, going back to the right-sized infrastructure during that time period. So then you had, you know, massive outages that we saw even from AWS, right, and Google and other cloud providers.

Jim Hirschauer (Host):

Yeah. So ultimately, what is the, you know, what do you recommend as a good way to fix that, to address that and to deal with that?

Matt Schillerstrom (Guest):

Yeah, I think, I mean, when you look at 2023 here, you have to have the right amount of testing and, like, what does that even look like? Besides having the right tools, like, you need the right people with the right, like, checklist of how you're gonna, like, handle, like, your business logic. So taking a step back, like, what that looks like is working with your business, like your product leaders that are driving, like, those business initiatives, just to understand the service level objectives, right, that lead up to the service level agreements. Because at the end of the day, it's a business decision really, on how much you wanna spend for that, like, customer transaction, right? I would always joke back in the day about selling bananas, and you could spend a lot of money on infrastructure to support selling bananas online, right? But at the end of the day, like, the business can decide, you know, like, well, if X happens, if X disruption happens, either in the supply chain or from a technology perspective, let's do this, right? And an engineer might overscale that infrastructure or business process. But again, if you talk to the business and kind of learn about the consequences, then you can just make those logical decisions to kind of, like, you know, provision just enough to get through that transaction, but then fail, you know, otherwise.

Jim Hirschauer (Host):

Yeah. So it's really about understanding those failure modes, right? Like setting up the right targets, or you mentioned SLOs, so I'll just say reliability targets in general. Yeah. And then, you know, how do you test for all those different modes of failure? And that's kind of the whole purpose of the practice of chaos engineering, right?

Matt Schillerstrom (Guest):

Yeah. Yeah. So where it ends up, you know, like, so, like, chaos engineering today, it sounds like a daunting task or a cultural movement, but really, like, where we've shaped it to be at Harness is a simple task that you can run in your deployments, right? So you can run a series of tasks against specific use cases that you want to prove that you're, like, resilient through. Right. And if companies out there are provisioning new workloads to meet certain capacities, they can simply execute some of these tests to see how their system would, you know, handle that failure, but also how it would perform for their customer. Right. At the end of the day, what's that customer experience look like? And then they can choose, you know, is that good enough or not?

Jim Hirschauer (Host):

Yeah. So, you know, SLOs, reliability targets, those are supposed to be a good measure of customer experience. So is that what you would use? Is that a good measure for determining if your chaos experiments are, you know, passing or failing?

Matt Schillerstrom (Guest):

Yeah. You know, like, an ideal situation is if you have your service level objectives set up for your service correctly. And that's a daunting task sometimes, cuz sometimes you don't know what all those should be. But if you run a series of chaos experiments to understand, like, okay, this is my steady state of my customer experience, and then during these types of disruptions, this is how I expect the system to behave, you can basically just ensure that your service level objectives maintain that performance throughout that whole experience, through the steady state, the disruption, the recovery, the failure. And then you can decide if that was good enough, because not every transaction would fail. There's always, like, a percentage, right? And again, as a business leader, maybe it's okay if, like, 5% of people error out when trying to buy bananas, right? Because it's a low cost item and they likely won't get it at a competitor and they'll just wait a couple minutes, you know, for the system to recover, to buy that. Yeah.
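
The idea Matt describes here, checking an SLO across each phase of a chaos experiment, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `PhaseResult` and `evaluate_experiment`, the request counts, and the 5% threshold (borrowed from the banana example) are all hypothetical and are not taken from Harness or any specific tool.

```python
from dataclasses import dataclass


@dataclass
class PhaseResult:
    """Request counts observed during one phase of a chaos experiment (hypothetical shape)."""
    name: str
    total_requests: int
    failed_requests: int

    @property
    def error_rate(self) -> float:
        # Treat a phase with no traffic as having no errors.
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests


def evaluate_experiment(phases, max_error_rate):
    """Return (passed, violations): the experiment passes only if every
    phase stays at or below the error rate the business agreed to accept."""
    violations = [p.name for p in phases if p.error_rate > max_error_rate]
    return (not violations, violations)


# Made-up numbers for the phases mentioned above.
phases = [
    PhaseResult("steady state", total_requests=10_000, failed_requests=12),
    PhaseResult("disruption", total_requests=10_000, failed_requests=430),
    PhaseResult("recovery", total_requests=10_000, failed_requests=95),
]

# Assumed business decision: up to 5% of banana purchases may error out.
passed, violations = evaluate_experiment(phases, max_error_rate=0.05)
print(passed, violations)
```

With these made-up numbers the experiment passes at a 5% threshold, while a stricter 1% threshold would flag only the disruption phase. The threshold is the business decision; the check itself is simple.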

Jim Hirschauer (Host):

Yeah. I love your banana analogy here. I think, I think you had mentioned to me once that bananas were like the top selling product at, at your retail store. Wasn't that right?

Matt Schillerstrom (Guest):

Yeah. Bananas were the most popular and when they ran out of stock, everybody freaked out.

Jim Hirschauer (Host):

So that's insane. That's so funny. All right. Well, let's transition to a little more fun. So we covered the main topic that we wanted to talk about. Before we go, let's play one more game. And it's not necessarily a game, but it's just a little bit of fun. Every person who's worked in IT for any meaningful length of time, they've messed something up somewhere along the way. I know I've done it. You know, I'll admit right here that I actually rebooted a production system once during the middle of the production. Oops. That's a painful lesson to learn. But it happens, right? It happens to almost everyone. So Matt, what's your worst IT mess up?

Matt Schillerstrom (Guest):

Yeah, I think, I mean, we've always had those production issues. I mean, I've had countless database issues that I ran into where I didn't understand what I was doing, and I was overconfident and blew up a database. But the one that haunts me still, and it was a safe failure, but at the nuclear power plant we had rubidium time servers so that we could basically have exact measurement on anything that happened in the plant, just as far as timing. And when we were doing an upgrade of it, I accidentally, like, pulled out one of the cables so that it lost network connectivity, and all the other servers kind of, like, just went into this panic mode. And I had to reboot, and it, like, kind of had a disruption for, like, five minutes, but it was okay. Like, everything kind of failed safe. So it was a good, like, disaster recovery.

Jim Hirschauer (Host):

It was a great test, what happened?

Matt Schillerstrom (Guest):

But it was unplanned and, you know, at a nuclear power plant, you have to be completely honest all the time. So you had to report it, you know, talk through the, you know, the root cause of the issue that happened and, like, how you could learn from it. But it was a good way to validate, like, the procedures, processes and see how, like, people panicked or didn't panic.

Jim Hirschauer (Host):

Wow. That's intense. All right. All right. Well Matt, listen, thank you so much for playing along and sharing all of your wisdom with us. Really appreciate it. It was a pleasure getting to speak with you today. And to all of our listeners, if you're an SRE or if you're in a related role and you wanna be a guest speaker on ShipTalk, please go ahead and send us an email at podcast@shiptalk.io and we'll get back to you. So that's all for now. Until next time.
