ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery

ShipTalk - Heavy Rock, Netflix vs. Google, Continuous Resiliency, and a Lost City - a conversation with Adrian Cockcroft

March 09, 2023 Jim Hirschauer / Adrian Cockcroft Season 2 Episode 2

In this episode of ShipTalk (The SRE Edition), Adrian Cockcroft shares his thoughts on "the 5th accelerate metric" - reliability - and the State of DevOps, using error budgets, and how to achieve continuous reliability. He shares some interesting stories about the differences between Netflix and Google, as well as some personal stories to boot.

Introductions
Just for fun #1 - Adrian's favorite hobby
Main topic - State of DevOps reliability metric, error budgets, continuous resiliency
Just for fun #2 - Adrian's best travel destination
Closing

Jim Hirschauer:

Welcome to Ship Talk, the SRE edition. I'm Jim Hirschauer, your host for today. Ship Talk is a DevOps podcast brought to you by the Harness software delivery platform, and in the SRE edition we focus on reliability topics. My guest today is Adrian Cockcroft. Adrian, welcome to the show.

Adrian Cockcroft:

Hi there. Great to be here.

Jim Hirschauer:

Appreciate you coming on. Look, I'm sure most of our listeners already know who you are, but if you don't mind, could you please take a minute to share a little bit about your background and what you're up to these days?

Adrian Cockcroft:

Well, I've got a long career, so it's hard to fit it into a few seconds. But I retired last summer from Amazon, and I'm now doing advisory and consulting kind of work in a sort of semi-retired state. I was at Amazon as a VP for about six years. Did open source and sustainability and a whole lot around just helping customers move to cloud, basically. Before that I was at Battery Ventures, a VC firm, in the middle of all of the sort of containerization of computing and all that kind of stuff. And most people know me cuz I was at Netflix for seven years, and I kind of feel like for the last decade I've just been explaining Netflix to people over and over again, right? Cuz it seemed like we invented a whole bunch of stuff there, or figured out how to do things that people seem to find fascinating. So that's pretty much it. And then back in the long, long history, I was at Sun and eBay and a few other things.

Jim Hirschauer:

Well, congratulations on an absolutely amazing career. There are some fantastic highlights to your career for sure. Thanks. So before we get into the heavy, meaty topic that I'd like to talk about today, we have this section that we do called Just for Fun on the show, and I ask people interesting things about themselves. So Adrian, I'd love to know, I'm sure you have lots of interesting things going on in your life outside of just work and technology. What's your favorite hobby?

Adrian Cockcroft:

So I like music, and I wanted to have a hobby that wasn't sitting at the keyboard. So yeah, I could have done computer music, but that would've been more time sitting at the keyboard, and that was the thing I wanted to get away from. Back when I was in high school, I played bass guitar in a heavy metal band. Actually, a heavy rock band called Granite was my first band, so bad puns have been a feature of my naming schemes ever since. I then actually played a few gigs with a few of the bands. And then I stopped playing, basically, when I got a job and got married and got busy with life. More recently I've been playing a little bit more, and now that I'm, quote, retired, I started taking guitar lessons to actually learn the music theory that I never really knew, cuz I was just trying to keep up with people. They say bass players hang around with musicians, and drummers hang around with bass players and hit things. So it's pretty easy to play bass. You can just learn some patterns and try to copy stuff. But now, you know, I keep acquiring more random equipment and trying to do things, and I went through a bunch of old archives of stuff I was doing 40 years ago and put it on SoundCloud. So if anyone desperately wants to hear what weird things I put up there, there's a whole lot of archival stuff, and then a few more recent things that I've been playing around with, on soundcloud slash Adrian.Cockcroft or something similar, so you can find it. So that's kind of what I do when I'm not into that. And then the other stuff is mostly out playing around with cars and the usual things that people do. But that's been my thing: I'm actually taking lessons to try and learn what I'm doing rather than just stumbling around.

Jim Hirschauer:

Yeah, I can totally respect that. I've also played guitar, or tried to play guitar, for many years and have never really gotten any formal training, so I'm a little bit jealous right now that you're getting to head down that path. Do you have any favorite guitarists, by the way?

Adrian Cockcroft:

I like a pretty wide variety of things. You know, Jeff Beck passed away recently. I just went and listened to the entire back catalog to kind of realize just how much feel he had that he could put into the guitar. Yeah. There's a guitarist most people don't know called Eric Tessmer, who plays in Austin. If you're ever in Austin, you can usually find him playing. He's like a Stevie Ray Vaughan-like guitarist, but never kind of got to be hugely popular, but I like him a lot. And then more into the sort of prog rock space, sort of Robert Fripp and things that no one else can play, basically. Alright. Complicated stuff like that. That's kind of the scale of it, all kinds of different things.

Jim Hirschauer:

Awesome. Well, I actually live in Austin, so I'm gonna go see if I can find Eric and hear a show.

Adrian Cockcroft:

Yeah, we first saw him in 2005. He's been there a long time. He's played some great gigs.

Jim Hirschauer:

A great recommendation. Alright. So let's jump into the main topic, if you don't mind. Here's what I wanted to talk about. In the latest State of DevOps report that I read, they introduced reliability as, quote unquote, the fifth metric, their fifth important metric to go along with deployment frequency, lead time for changes, time to restore service, and change failure rate. And I'm gonna quote what they said in the report. I'm just gonna read it directly because I think it's really interesting. They said, when reliability is poor, software delivery performance does not predict organizational success. However, with better reliability, we begin to see positive influence of software delivery on business success. And I think that's a huge statement, because we've seen so many companies really laser-focused on these four metrics, and so many engineering organizations thinking that if they can achieve these four metrics and improve on these four metrics, that is going to directly impact their success as a business. But this statement seems to contradict that a little bit. It says, you know, those are important, but there's this other metric, reliability, that you need to focus on to really be successful. So I'd love to hear what you thought about those findings.

Adrian Cockcroft:

One of the things is they didn't really define what reliability is. Yeah. Yeah. It's sort of self-reported. Do you think you have good reliability or not seemed to be kind of what it was being run off of, rather than is there a particular metric. And, you know, everyone measures it differently, but it's a little bit self-reported, so it may be that there's some correlation in there that's affecting it. The thing I think is interesting is that if the system's reliable and it's resilient, it basically means it's able to absorb small shocks and things going wrong without failure, because failure's there all the time. The question is, is your system capable of masking that failure to the end users? If that's true, then you get to try things out more aggressively. If you're a developer and you have to go through a, you know, really long, complicated QA process to get something out because everyone's terrified that you're gonna break the site, it slows down the pace of innovation. Sure. If you've got something where you can deploy it, and if it breaks, the system sort of quarantines it out and, you know, maybe one customer got a retry, nobody notices. Right. That's a very different world, and what I think is that if you've got really solid reliability characteristics, then you can actually run faster. And the sort of analogy here is running with scissors, right? Yeah. They say, don't run with scissors, you'll hurt yourself. Mm-hmm. Right? Everyone says that. And then, okay, so if you wrap the scissors in bubble wrap or some kind of case so that they're safe, you can run with scissors. Yeah. Right. But you've got some compensating thing around it. So it's more like making it safe to go, and that's a lot of the value of resilience from a sort of business point of view. It just means that you spend less time firefighting, and things that in other environments would be dangerous are actually safe in environments that have good sort of resilience and containment properties.

Jim Hirschauer:

Right. You made a couple of statements in there I'd like to unpack a little bit. First one: you called out, rightly so, that this reliability metric isn't actually defined as a metric in the report. It's just called reliability. So do you have any recommendations for folks on what they should be doing to define reliability within their organizations?

Adrian Cockcroft:

Yeah, there's a long topic here, but the classical sort of definition is up and down. If you've got a monolith, right, it's either up or it's down. If it crashes, it's down, right? Mm-hmm. And so it's sort of this state where you can very clearly say this thing is either working or not working. Very clean division between the two. You can measure the amount of time it's in each state and say, that's your uptime, right? What actually happens in more sophisticated systems is they degrade slowly. So a little bit of it's slightly broken. Some number of customers aren't getting served, but most people are okay, so it's not down, right? Right. Yeah. But, you know, 0.1% of customers are currently impacted because that feature they were trying to use, or the route to their ISP, is down or something like that. So it's very difficult to measure that with a straightforward up-down availability metric, which is based on time. So what I like to do is base it on success rate. The classic thing we did at Netflix was streaming starts, right? Mm-hmm. The core metric they have is how many people tried to start a movie, and did they succeed in starting a movie? You hit play, did you get your movie, right? Right. And that was the metric. And it goes up and down. The rate is quite low at, you know, 3:00 AM in the US. Right. It's like security guards watching it on their Game Boys, whatever, which is actually, I think, pretty much what it turned out to be at the time. Yeah. And then, you know, Sunday evening was typically the peak time, which is when the site used to fall over. And that's annoying because everyone wanted to be at home watching TV with their kids on Sunday evening. Right. And that was usually when Netflix collapsed, cuz it was a new all-time record peak. You know, you get these kinds of cyclic effects, but what we cared about, what we generated, was a metric which was the number of times that we thought we were missing. And then we did that as a percentage. So it was sort of, you know, how many nines or whatever it was, some percentage of uptime, three or four nines or something like that. But it was of attempts to deliver value to the customer that succeeded or failed. So that's where I'd like to see people focus: reliability as based on a customer-facing value delivery metric. Typically there's some value being delivered by a website, and then there's customer signup, and that's the other flow that's very standard. You want to make sure customers successfully sign up, right, to whatever your service is. So those are probably the two canonical things. Typically you'd have two, and then there may be more if you've got a more complicated product or set of products. But it looks something like: the ability to get into the product and the ability to deliver value with the product are the things that need to be reliable.
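As a rough illustration of the success-rate idea described above, here is a minimal Python sketch. It is not Netflix's implementation: the `Window` type, the sample counts, and the helper names are all assumptions made up for this example. It pools attempts and successes across measurement windows, so busy periods naturally weigh more than a quiet 3:00 AM, and converts the resulting rate into a "number of nines".

```python
import math
from dataclasses import dataclass


@dataclass
class Window:
    """Counts for one measurement window (hypothetical shape)."""
    attempts: int   # e.g. customers who pressed play
    successes: int  # e.g. streams that actually started


def success_rate(windows):
    """Fraction of attempts that succeeded, pooled across all windows."""
    attempts = sum(w.attempts for w in windows)
    successes = sum(w.successes for w in windows)
    if attempts == 0:
        return None  # no traffic, so the rate is undefined
    return successes / attempts


def nines(rate):
    """Express a success rate as a 'number of nines' (0.999 is about 3)."""
    if rate >= 1.0:
        return math.inf
    return -math.log10(1.0 - rate)


# Illustrative numbers only: a quiet window, a normal one, and a peak.
windows = [
    Window(attempts=1_000, successes=1_000),
    Window(attempts=50_000, successes=49_950),
    Window(attempts=200_000, successes=199_850),
]
rate = success_rate(windows)
print(f"success rate {rate:.5f}, about {nines(rate):.2f} nines")
```

The same shape would apply to the signup flow: count attempts to sign up and count the ones that completed.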

Jim Hirschauer:

Got it. What I've seen companies do is use SLOs, assign SLIs and SLOs to these types of metrics, and ultimately the Google SRE handbook says you should end up with error budgets as well, and you can use error budgets to do interesting things. So I'm curious to get your take on error budgets. I've had conversations with a lot of folks who say they were never able to implement error budgets, they never actually got error budgets to function and provide the business value that was expected out of them. So, you know, have you ever implemented error budgets, or do you have any advice for anyone trying to implement error budgets and really get the true value out of those?

Adrian Cockcroft:

Mm-hmm. Yeah. The Google SRE book was interesting because about three quarters of it was the same as what Netflix was doing, and a quarter was like the opposite of what Netflix was doing. Okay. So Netflix at the time I was there didn't have a central SRE team. There wasn't this process of handing off to a central team and defining error budgets. Basically, the teams owned their own code through the entire life cycle. They owned a collection of microservices that were fairly stable and weren't changing very much, and they just had them running in the background. And then there were some that they were actively working on, but they owned the business function all the way out to the end, and they were on call for when it broke. And so the central team that was sort of like an SRE team didn't make changes to code. Their job was to find out what was broken and call the people that owned that and get them to fix it. Okay. And it's annoying to be on call, so people generally got really good at writing code that didn't break. So we created a pain in the right place. People built code and systems and tooling so that they didn't have to be woken up at 3:00 AM. If something broke, the system would manage it. So I think that's a good principle. It is difficult to get people to, you know, effectively carry a pager if you're a developer, right? There is a mentality around it, but if you can get that to happen, it causes the whole environment to be much more resilient. The other big difference between Google and Netflix: Google at that time had a lot of graduate hires, a lot of young developers, and we used to wait for them to spend five or 10 years at Google before we hired them at Netflix. Yeah. So Netflix was very much a more senior crew. Really, no graduate hires, no interns. So it was much more experienced people. And it was much smaller, sort of built differently from that point of view. So that's kind of some of the core differences in the way we did stuff. Now, we did sort of try to meet a quarterly goal of, whatever it was, four nines or something. If we'd had a couple of outages that quarter, we would occasionally say, well, let's bump this chaos region failover test to the next quarter in case it goes wrong. So we sort of used error budgets in a sort of informal way: if we'd had a series of outages, we'd maybe pull back on some of the chaos engineering, large-scale testing where we were potentially gonna cause an outage. Right. I saw that a few times, where we were just like, let's do it next month, cuz that's next quarter, and if we survive that then we're good. Right. We can sort of even out the customer pain in terms of nines of error rate. So that's a sort of informal error-budget-like thing. But we never had a formal error budget system. And I haven't seen it used that much in a formal way, but I probably haven't looked at many environments. I mean, plenty of people have SREs. I don't hear them talking about error budgets that much.
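The informal practice described above (postponing a risky failover test when the quarter has already had outages) can be sketched as simple arithmetic. This is a hedged illustration, not a description of any real system: the function names, the 50% threshold, and the request counts are assumptions chosen for the example.

```python
def error_budget(slo, total_requests):
    """Number of failed requests the SLO tolerates over the period."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the budget still unspent; negative means the SLO is missed."""
    budget = error_budget(slo, total_requests)
    return (budget - failed_requests) / budget


def should_defer_risky_test(slo, total_requests, failed_requests,
                            min_remaining=0.5):
    """Defer a test that might cause an outage when little budget is left.

    min_remaining is an arbitrary policy threshold assumed for this sketch.
    """
    return budget_remaining(slo, total_requests, failed_requests) < min_remaining


# A four-nines quarterly goal over 10 million requests tolerates
# roughly 1,000 failures.
print(should_defer_risky_test(0.9999, 10_000_000, failed_requests=800))  # rough quarter
print(should_defer_risky_test(0.9999, 10_000_000, failed_requests=100))  # quiet quarter
```

The math is the easy part; the decision the function returns still has to be backed by an agreed process for actually moving the test.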

Jim Hirschauer:

Yeah, and you know, quite honestly, the informal process that you just mentioned really gets to the heart of what error budgets are intended for and how they should work. You know, I get to talk to a lot of different companies, and it seems to me that one of the failings currently around error budgets is the difficulty in really implementing the process surrounding them. It's not so much, you know, implementing error budgets; after all, it's just fairly simple math. But there is a process that has to be initiated, like, hey, we're gonna push this off, in your case, we're gonna push off this test until next quarter. And that seems to be at odds with, you know, developers typically needing to meet timelines, meet deadlines, to release new features and functionality. So that becomes, yeah, you know, an interesting aspect to the error budgets.

Adrian Cockcroft:

The other problem you get is talking to product managers about error budgets. They have real trouble with it. Okay, say, well, what do you want it to do when the site's down? No, I don't want it to go down. Well, what do you want it to do when it's down? And they just say no. They don't ever want it to go down. They want it to be perfect. And you say, okay, what degraded states? So you have to kind of get them into a room and not let them out and force them to think about what degraded states there are and what you should be doing. If this backend service goes down, the system can do this and it can't do that. And you have to kind of plan that into the product, so the product itself has some resilience baked in, or some sort of progressive feature turnoff. And, you know, ultimately if the entire site's down, what are you going to do? One of the things we did at Netflix: we created a static site that was somewhere else, on a totally different set of infrastructure, that just had some movies you could watch that we had rights to. It had no personalization, and there was a mode where if everything was completely down, we could in theory just stand this site up and everyone would redirect to it and you'd be able to just watch some stuff. I'm not sure if it ever really got used very often, but that sort of mental model, the process of getting there, caused a lot of things to be sorted out. Like, how would you do that, and how would you fail over from the main thing to some sort of backup that's at least telling you, sorry, we're down, we're gonna be okay eventually. Yeah. And I think Datadog had an outage yesterday. I actually haven't read the report of why they went down, but people were saying their status page was actually still up and was working well. So it's the basic stuff. If your status page goes down when you go down, you know, you've done something seriously wrong. Right. You have to figure out how to think through what happens in these failure modes. Right.
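One way to picture the "degraded states" conversation above is as an ordered list of fallbacks that the product tries in turn. The sketch below is an assumption-laden illustration: `build_home_page`, the mode names, and the placeholder titles are invented here, and the static list merely stands in for the idea of a separate, non-personalized backup site.

```python
# Placeholder content standing in for a static, non-personalized catalog.
STATIC_TITLES = ["Title A", "Title B", "Title C"]


def build_home_page(fetch_personalized, fetch_popular):
    """Return (mode, rows), trying progressively less capable sources.

    Each argument is a zero-argument callable that returns a list of
    titles or raises if its backend is unavailable.
    """
    sources = (
        ("personalized", fetch_personalized),
        ("popular", fetch_popular),
    )
    for mode, source in sources:
        try:
            return mode, source()
        except Exception:
            # Deliberately broad for the sketch: any backend failure
            # moves us to the next, more degraded state.
            continue
    # Last resort: content that depends on no backend at all.
    return "static", list(STATIC_TITLES)


def personalization_down():
    raise ConnectionError("personalization service unreachable")


mode, rows = build_home_page(personalization_down, lambda: ["Popular 1", "Popular 2"])
print(mode, rows)
```

Writing the list of sources down forces the question the product managers tend to avoid: what exactly should the customer see at each step of degradation?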

Jim Hirschauer:

Okay. So that actually transitions into the last thing I really wanted to unpack that you've been talking about, which is resiliency, and chaos engineering as a way to work on the resiliency of your systems. A lot of companies that I've talked to or had experience with have used chaos engineering, but it seems to be a very tactical approach where they've had one or two engineers that work on the practice, and it's very siloed and segmented. Do you have any recommendations on how to make that more of a strategic approach within a company?

Adrian Cockcroft:

Yeah. I think it ends up being, like, who's responsible? Is it a developer concern or an operations concern, or have you sort of combined those together into a true tech DevOps combined organization? And I just go back to, now, everyone's had backup data centers for years, and if you talk to people about how often they actually fail over to them, it's vanishingly small. There are a few people, banks, who are regulated to at least show they could do it once a year. Other than that, it's pretty rare. People that have spent the money on a backup data center mostly don't get the value out of it, right? So if you look at bringing that into the modern world, what we really care about is that in the cloud, all of these backup data centers are just there. If you can figure out how to use them, it's much easier to provision a failover than with a physical data center. So how are you using that? The standard sort of canonical cloud setup is three zones in a region, and if a zone goes down, your site should stay up, but most people don't test that. So I would always start with, can you take a zone outage? And why can't you? Right. What goes wrong if a zone goes down in your environment? Are you routing traffic correctly? Do you know how to operate on a two-zone system in some degraded mode? That was something we sort of ran as a test every two weeks at Netflix, where we just shut down a zone to prove the site still worked on the other two, and every now and again it would catch something and we'd, like, rapidly back out of the test. Right. So the most basic thing is, if you can survive on two out of three zones, you're in good shape, and most people can't even do that. And then once you get that sorted, then you can think about multi-region. I would not even attempt multi-region unless you can show that you can do two-out-of-three zone resilience, right? Mm-hmm. You're not close to it, because you have to be able to disambiguate the failure that means you have to shut down a region from the failure that means you need to shut down a zone. And it's very difficult to figure out which of those is actually the problem you've got in the moment. So that's kind of one way of doing it. And the other way is to just sort of have a weekly meeting that says, well, what if everything goes down? And just have a bunch of scenarios and do it as a paper environment, and then start building chaos engineering tests into the test environment or staging environment. And eventually, once you've convinced yourself everything's looking good, I'd create probably a separate production environment in the cloud, you know, another account, and I'd set up the chaos engineering tools in that account. And this is for the more operations-focused sort of stuff, you know, killing machines, killing zones, cutting off networks, cutting off dependencies. But the thing that really, I think, is new and interesting, and it's something that Harness is working on now, is continuous resilience. So you take the CI/CD pipeline and you put chaos engineering into the pipeline. So every new build of your microservice goes through a series of chaos tests to say, like, what happens if, when it fires up, one of its dependencies isn't there or is slow? Yeah. Right, right. You've got a service running. What if the connection to the database is slow? That's a common thing, right? Most things keel over cuz their database got slow. Right? Right. Or the connection to some authentication service fails or something like that. You should have a graceful degradation. So this is more test-driven development, if you like, but what you're doing is you're setting it up as a chaos test in the pipeline so that you can continuously prove that this little piece of the system is resilient to failures around it. And I think that more developer-oriented approach is a really useful way of extending the chaos engineering principles down to something that's more continuously useful. And then when you get the failover of a region or a zone, whatever, you know that your individual microservice is going to behave sensibly during that process as things are failing and failing over, and recover from it quickly. That kind of stuff.
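To make the "chaos test in the pipeline" idea concrete, here is a small, self-contained Python sketch of a test that injects latency into a database lookup and checks that the service degrades instead of hanging. It does not use or represent any Harness API or any real chaos tool; `get_profile`, `with_latency`, the timeout, and the response shape are all hypothetical names and values chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_profile(user_id, db_lookup, timeout_s=0.05):
    """Fetch a profile, degrading gracefully if the database is slow or failing."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(db_lookup, user_id)
    try:
        profile = future.result(timeout=timeout_s)
        return {"user": user_id, "profile": profile, "degraded": False}
    except Exception:
        # Covers both the timeout and errors raised by the lookup itself.
        return {"user": user_id, "profile": None, "degraded": True}
    finally:
        # Do not block the caller waiting for a slow lookup to finish.
        pool.shutdown(wait=False)


def with_latency(fn, delay_s):
    """Chaos injection for tests: wrap a dependency so every call is slow."""
    def slow(*args, **kwargs):
        time.sleep(delay_s)
        return fn(*args, **kwargs)
    return slow


def healthy_db(user_id):
    return {"name": "example"}


def test_survives_slow_database():
    start = time.monotonic()
    result = get_profile("u1", with_latency(healthy_db, 0.3))
    elapsed = time.monotonic() - start
    assert result["degraded"] is True   # a sane, degraded answer
    assert elapsed < 0.25               # and it came back without hanging


def test_normal_path_unaffected():
    result = get_profile("u1", healthy_db, timeout_s=1.0)
    assert result["degraded"] is False
```

Tests shaped like this could run on every build with an ordinary test runner, which is what makes the resilience check continuous rather than an occasional exercise.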

Jim Hirschauer:

Yeah, it's an interesting approach because ultimately there's a cultural aspect here, right? Like in your example, if you're gonna pull out a zone or a region, there has to be a willingness to deal with the consequences of that, because you could have tremendous consequences as a business. So it seems like there needs to be buy-in culturally across the business, and making it part of a pipeline seems to kind of embed it within the culture. Like, there's almost no choice but for it to become part of the culture at that point.

Adrian Cockcroft:

Yeah, if you can basically say, this microservice, if you cut one of its dependencies, it will give you a sane error message, and if you restore that dependency, it will recover quickly. Right. And it won't fall over. You won't have to, like, restart it from scratch. Cuz systems become catatonic quite often when their dependencies fail. Yeah. Right. So that means you've built some things more resilient, but you still have to do that full-scale testing. And you know, if you go to upper management and you say, what do you want to happen when the site goes down? Right? Do you want it to continue to be down? Do you want it to escalate, or do you want it to recover quickly and be resilient to various types of failure? And if they're doing something that matters, either, you know, financially, or there's a high business value to it being up, then they should be asking the engineering teams, culturally, for this thing to stay up. Right? It's important that it stays up. Right? Or maybe it doesn't matter. Right. And there's other things that are more important, because what you're doing is, you know, people used to say, like, Netflix is just movies. Right. It's not the end of the world if Netflix doesn't work; you could use something else. The attitude we had was, your TV set, you should turn it on and it just works, right? You shouldn't have to worry about whether Netflix was up or down. It's like, you're just trying to watch TV, right? Yeah. And you shouldn't have to think in the least. You should be unaware of the fact that there's a whole bunch of random, weird technology stuff happening in the background. There should be no artifacts. It should never pause while you're watching something, and it should just work, right? So that was why it mattered. And the brand value of that was basically the value that we sort of assigned to resilience.
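The property stated at the start of this answer (a sane error while a dependency is cut, quick recovery once it is restored, no restart needed) can be written down as a test. The following Python sketch is purely illustrative: `FlakyDependency`, `Service`, and the status codes are assumptions invented for the example, not part of any real framework.

```python
class FlakyDependency:
    """Test double whose availability a chaos test can toggle."""

    def __init__(self):
        self.available = True

    def call(self):
        if not self.available:
            raise ConnectionError("dependency unreachable")
        return "ok"


class Service:
    """Tiny stand-in for a microservice with one downstream dependency."""

    def __init__(self, dependency):
        self.dependency = dependency

    def handle(self):
        try:
            return 200, self.dependency.call()
        except ConnectionError:
            # A sane, explicit error instead of a crash or a hang.
            return 503, "Temporarily unavailable, please retry"


def test_cut_and_restore_dependency():
    dep = FlakyDependency()
    svc = Service(dep)
    assert svc.handle() == (200, "ok")

    dep.available = False            # cut the dependency
    status, message = svc.handle()
    assert status == 503 and "retry" in message

    dep.available = True             # restore it
    # The same Service instance must recover without being restarted.
    assert svc.handle() == (200, "ok")
```

A real service holds connections, caches, and retry state, so the recovery half of this test is usually where the interesting failures show up.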

Jim Hirschauer:

All right. Adrian, I believe you're gonna be speaking soon at Chaos Carnival. Is that accurate? Yep, yep. What's your session gonna be about?

Adrian Cockcroft:

It's gonna be about this idea of continuous resilience and how that's kind of where we've got to in the whole chaos engineering world at this point. I've talked at various chaos events before on sort of the history of how we got here and a few different things. So I'm focusing a bit on where we go next now.

Jim Hirschauer:

Okay, and I just pulled up the website. It looks like Chaos Carnival is March 15th through 16th. So, yep, definitely, if you're listening in, go ahead and check Adrian out at that conference. All right, so Adrian, we've gotten through the meaty subject, and now I'd like to have one more little fun segment. You've been a great sport so far. For this one I'd love to hear: do you have a favorite travel story or a best travel spot that you've been to somewhere around the world?

Adrian Cockcroft:

So one of the more memorable places we went: when I was with AWS, I was doing keynotes at various AWS Summits around the world, and one of the ones I did was Tel Aviv. I think I did that twice. And the second time I went, it was, well, let's go a few days earlier and go and see Petra, which is in Jordan, but it takes like a day or two to get there and to get back. So we flew in a few days earlier, and my wife and I flew down from Tel Aviv to Eilat on the Red Sea and then got, like, a minibus. And it's a big hassle getting into Jordan at the border and, you know, it's pretty much developing-nation kind of stuff. So it's quite an interesting exercise just to get there. And you get there and then you go down this canyon, and then all of a sudden you see the Treasury, which everyone knows if they've watched, you know, Indiana Jones and the Last Crusade, right? I guess they faked the interior of it, but the exterior of it, it's this amazing thing. It's a 2,000-year-old building cut into the side of the rock, and we went all through that whole valley. My wife had hurt her foot at some point, so we got on a little buggy. They had these horse-drawn buggies, and they're just charging off. It's pretty dangerous: if you try walking down, there were these horse-drawn buggies sort of charging past you and people on camels going past you. We avoided the camels, we did the horse-drawn buggy, and it was kind of cool to go and see all these buildings. It's an amazing place, worth a visit. Very much a different world. And the story of the people that built it is pretty interesting. Literally 2,000 years ago, and then it was lost for a long time. Rediscovered, I guess, a hundred years, couple hundred years ago.

Jim Hirschauer:

Wow. What an amazing trip to be on. I have seen a few pictures online. It looks absolutely incredible. So it's definitely on my list now to go and check that out, apparently, if I can get there. So, should be interesting.

Adrian Cockcroft:

Yeah, we did, like, an organized tour of things starting in Tel Aviv. They managed us getting there and back. Ah, good to know. Yeah, it's not the best hotel experience or food experience. Well, the food's fine, but overall it was really about the ancient ruins. It was very cool.

Jim Hirschauer:

Amazing. Listen, thank you so much, Adrian, for joining us on the show today. We really appreciate you taking time out of your day. And to all of our listeners, if you are an SRE or if you're in a related role and you'd like to be a guest on Ship Talk, please go ahead and send us an email at podcast@shiptalk.io and we'll get back to you. That's all for now. Until next time, thanks. Thanks, Adrian.
