ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery

ShipTalk - A "Civ tech tree" for SREs, the M-word, and a new unit of measure called "Micro-Jacksons" - Steve McGhee - DevRel at Google

March 28, 2023 Jim Hirschauer Season 2 Episode 3

In this episode of ShipTalk (The SRE Edition), Steve McGhee tells us about a new open-source project that helps SREs understand which capabilities are required to achieve increasing levels of reliability (the nines). He also shares some interesting stories about misadventures with user files and how Google handled the news that Michael Jackson had passed away.

Introductions
Just for fun #1 - Steve's activities outside of work
Main topic - r9y.dev, what it is, how to use it, how to contribute
Just for fun #2 - Steve's worst IT mess-ups and creating a new unit of measure
Closing

Jim Hirschauer:

Welcome to ShipTalk, the SRE edition. I'm Jim Hirschauer, your host for today. ShipTalk is a DevOps podcast, brought to you by Harness, the software delivery platform, and the SRE edition focuses on reliability topics. My guest here today is Steve McGhee from Google. Steve, welcome to the show.

Steve McGhee:

Hey, nice to be here.

Jim Hirschauer:

Great to have you, Steve. Could you please take a minute to fill in our listeners on your background and what you're up to today?

Steve McGhee:

Sure. So I'm Steve. I work at Google. I've worked there for a long time. I was an SRE inside of Google for about a decade. I worked on things like Android and Google Fiber as well as YouTube. I actually then left Google. I joined a company here in California to help move them to the cloud, which is a super common thing for companies to be doing right now. And man, it was hard. It was really hard. And I learned a heck of a lot in those, you know, one to two years. I actually ended up going back to Google to kind of help more people. So for a while I had a role called Solution Architect, which is kind of a common role you see across a lot of providers. But now my job is in DevRel. So I'm a reliability advocate, which is a title that I just invented myself. So basically, I talk with customers and I do things like this, and I talk about SRE and reliability and DevOps and all kinds of stuff that all kind of overlap with each other. And I just try to help people understand it, cuz there's a lot to learn.

Jim Hirschauer:

Awesome. Thanks for that background. So Steve, I know you're familiar with the format of the show. We do a little bit of something up front just for fun, and so we'll start there. I'd love for you to tell us: what is your favorite hobby outside of work?

Steve McGhee:

That's a good question. My kids keep me pretty busy, but outside of that I like to describe myself as a former athlete. So I was a competitive swimmer growing up and I did triathlon in college. Stuff like that. So I try to ride a lot of bikes, mostly mountain bikes these days, cuz I'm scared of roads now. Don't blame you. And I lift weights, that's my latest departure from the original thing. And I also coach swimming a little bit, so that's pretty fun too. So my kids are in junior high and high school, and it's fun to kind of help with that athletic side of town, to, I dunno, become better competitive swimmers. It's kind of like my lifelong passion: I'm strangely good in the water, so I'm gonna keep doing that.

Jim Hirschauer:

Yeah. That's awesome. It's great also to give back and do a little bit of coaching at the same time.

Steve McGhee:

Yeah. It's really fun. Yeah.

Jim Hirschauer:

Yeah. Terrific. Well, I'm kind of excited about today's topic, our main topic. We're gonna talk about a website, or a project I should say, that has a website called r9y.dev. And for anyone who wants to check that out, that is the letter R, the number 9, the letter Y, dot dev. So please go ahead and jump out there to that website while we're talking about things. So, Steve, you and I talked about this when we met at SREcon in Amsterdam last year. Yep. And I had seen it before that and was excited that we got to talk about it in person. So why don't you share with the listeners: what is this project, and how did it get started?

Steve McGhee:

Well, it really just started out kind of as a joke, as all good projects start out, and the idea was we wanted to build the Civilization tech tree, if you're familiar with the game Civilization. Love it. But for reliability, for like SRE and DevOps and stuff like that. So we actually don't call it the SRE tree or the DevOps tree. We're specifically constraining it to reliability and not constraining it to the Google SRE form of reliability. So generically, it's just, you know, reliability stuff. And so the way that you want to think about it is, in Civilization, if you want to go to space, you have to learn pottery at the very beginning of the game. Right. And you have to learn that first. It's a hassle if you get all the way to, you know, the late technology tree and you have to go back and learn pottery. Right. That's a bummer. And so we felt that that was a good analogy to what we saw a lot of customers were struggling with, because they would say, yo, I want to do something super advanced, like space flight. You know, we want to do multi-cluster canary deployments and super complicated observability. And we're like, great, let's talk. And then we find out after, you know, a few hours or days or weeks of talking with the customer that they're actually struggling with something really simple, right? They're way further to the left, if you will, on the tree than we expected. And this was before the tree existed, so we were just kind of mentally picturing this. And so we just decided one day to sit down and try to write it down. And so the way that it's structured is, if you think of it in terms of nines, the far left is like one nine, a not very reliable system.
Like this is something that runs on your laptop, and it's not available when you close your laptop. You know, it's just not a production-grade system. And on the far right is something that has like five nines. So it's something really giant and, you know, robust. And I think of Google Ads as having all of these capabilities built into its system somewhere. And they do the super fancy stuff. So the trick is to kind of find where you sit, where you are on the map, from left to right, and then figure out where you want to go. And then the map helps guide your way, helps you discover what it is you should work on next.
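
As a rough aid to the "nines" shorthand Steve uses here, the minimal Python sketch below converts a count of nines into an availability percentage and an approximate yearly downtime allowance. It assumes the common convention that N nines means an availability of 1 - 10^-N, and a 365-day year. The function names are made up for illustration and do not come from r9y.dev.

```python
# Assumption: "N nines" means availability = 1 - 10**-N (e.g. 3 nines = 99.9%).
# Assumption: a 365-day year, ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(nines: int) -> float:
    """Availability as a fraction for a given number of nines."""
    return 1 - 10 ** -nines


def downtime_minutes_per_year(nines: int) -> float:
    """Approximate downtime allowed per year at that availability."""
    return MINUTES_PER_YEAR * 10 ** -nines


if __name__ == "__main__":
    # Walk the columns from the far left (one nine) to the far right (five nines).
    for n in range(1, 6):
        print(
            f"{n} nine(s): {availability(n):.3%} available, "
            f"about {downtime_minutes_per_year(n):,.1f} minutes down per year"
        )
```

Each step to the right on the map shrinks the allowance by a factor of ten, which is one way to see why the capabilities needed change so much between columns.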

Jim Hirschauer:

That's amazing. So one of the things that really drew me to this when I first saw it was, you know, the conversations that I get to have in my work life. I get to talk to a lot of different companies about reliability and how they're approaching reliability within their organization. And I've often found that many companies have some ideas on what they want to do for reliability, or they're at a certain maturity level, I'll use that term, in their reliability practices. But often they are looking for guidance on, you know, what do they do next? Or maybe their initiatives aren't producing the results that they really wanted out of them. And so they're really looking for some guidance. So when I saw this, I thought, hey, this actually looks like something that could help guide people in a vendor-agnostic way. You know, from a capabilities perspective, help guide them from a very low level of maturity, or wherever they stand today, through to as mature as they want to get or as it makes sense for their individual organization and applications.

Steve McGhee:

Yeah. Actually, when I talk to customers, I try to avoid the term maturity. I recognize it's a useful concept. Yeah. But if you're more reliable, that doesn't make you more mature. Those, I think, are, you know, not exactly the same term. Yeah. And also you can be extremely mature and not need reliability so much. Right. You might have a system that is totally background and can fail and it's no big deal. And, you know, not to pick on your wording, but you won't see a maturity matrix come out of this. That's not the point. I think we actually put this in this other little book that we wrote, but we've found that trying to apply a maturity matrix to reliability feels like such a good idea and it always backfires. So we're trying really hard to instead focus on just the building up of capabilities over time. And as you get more capabilities, you're simply more capable. You can just do more things. And it's not actually reflective of, you know, are you three out of five or are you four out of five, or whatever. And I've found that that does help quite a bit. But you're right that being able to discover, like you said, a vendor-agnostic way through this forest, right, across this map, is really important, because without something like this, all you're stuck with is the vendor non-agnostic ways, right? And you really have to interpret all the marketing material, and you have to interpret the historical context that everything was built in. And none of it really relates to what it is that you're trying to accomplish as a team. And you just sit there and squint and you have to figure out, how do you think this would actually work if we used it?
And that's pretty difficult to do without this kind of third-party system that we're trying to build.

Jim Hirschauer:

Yeah. And I love that term that you used. I love looking at it from a perspective of capabilities instead of maturity, because that's absolutely right. You know, maturity is a very different thing than putting together all of these capabilities and being able to accomplish or achieve certain things within an organization. On that line of thinking, having this matrix of capabilities, where you compared it to Civilization, where you have kind of a dependency map of capabilities that build upon each other: is the project in that state today, where all of those relationships are already built out? What is the status of the project? How far along is it? What could people expect if they go out to the website?

Steve McGhee:

Yeah, it's not that far along. Okay. So, all right. We tried to have, you know, a fully directed graph. That would be great. You just have to do this, and then this, and then this, and it would be perfect, but that doesn't work. Everything is so loosely related and loosely coupled, on purpose, which is great, that it's not that straightforward to have a fully connected graph. So what we have instead is a couple of edges, a few arrows, and mostly they were there to make sure that the arrows worked in the UI. Mm-hmm. But we haven't really found that it makes a lot of sense to have a strong opinion when it comes to edges. There are some where it makes sense, and some of them are in there, but really, at the end of the day, the part that's important is kind of the columns, if you will. So the vertical groupings of them, like I said, we call them the nines. So there's the one nine, the two nines, the three nines, the four nines. And those are pretty hand-wavy still. But the intent is that the capabilities that are in those columns relate directly to your ability to react to an incident. It has to do with reaction time. And so if a capability involves a human looking at a dashboard and running a script, that cannot be in the four nines or three nines category. Humans just aren't fast enough. And if we have something that is, you know, beginning this and intermediate that and advanced that, those are gonna be spread out sort of from left to right also. And they may have arrows between them. So yeah, if you go and look at it today, you'll find, you know, there's a lot of boxes with words on them, and some of the words seem a little ambiguous. And those we can get help on.
If you click in on them, you'll see a bit more text that kind of explains things there. I will admit there's plenty of lorem ipsum still in there that needs to be, right, kind of filled in still. So yeah, I mean, it's a work in progress for sure, still. I don't expect it to have a fully populated graph. I don't expect it to ever be done either. Also, you know, the nodes that are on the graph today are not the final set either. So we want to be able to take recommendations or, like, direct contributions from the community. Right now, actually getting it into the graph is strangely difficult. Adding words is easy. We can take PRs directly to add... actually, you did that. I remember you were one of our first PRs, right? Yeah. To add a bunch of text. But adding nodes, we have to do it, just because the tool we used is this hassle of a thing to use. But we do accept that in the form of issues. So we're using GitHub. If you want to add or modify the existing boxes that are on the graph, we ask that you just file an issue and just kind of explain what you think we should change, and then we will figure out how to do it. So that's the state of the game today.
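
Steve's point that a human watching a dashboard cannot sit in the three or four nines columns can be illustrated with a little arithmetic. The Python sketch below is an illustration under stated assumptions, not part of r9y.dev: it uses a 30-day window, the convention that N nines means an availability of 1 - 10^-N, and hypothetical response times of 30 minutes for a human-driven fix and 1 minute for an automated one.

```python
# Assumption: a 30-day measurement window.
MINUTES_PER_30_DAYS = 30 * 24 * 60

# Hypothetical response times, chosen only for illustration.
HUMAN_RESPONSE_MINUTES = 30.0      # someone reads a dashboard, runs a script
AUTOMATED_RESPONSE_MINUTES = 1.0   # automation reacts without a human


def budget_minutes(nines: int) -> float:
    """Downtime allowed per 30 days, assuming availability = 1 - 10**-nines."""
    return MINUTES_PER_30_DAYS * 10 ** -nines


def incidents_affordable(nines: int, minutes_per_incident: float) -> int:
    """How many whole incidents of that length fit inside the budget."""
    return int(budget_minutes(nines) // minutes_per_incident)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(
            f"{n} nines: budget {budget_minutes(n):.2f} min per 30 days, "
            f"human-handled incidents: "
            f"{incidents_affordable(n, HUMAN_RESPONSE_MINUTES)}, "
            f"automated: "
            f"{incidents_affordable(n, AUTOMATED_RESPONSE_MINUTES)}"
        )
```

Under these assumptions, three nines allows roughly 43 minutes per 30 days, so one 30-minute human response uses most of it, and four nines allows roughly 4.3 minutes, which a 30-minute response cannot fit into at all.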

Jim Hirschauer:

Alright. So it sounds like the project is off to a pretty good start. There's definitely some work that needs to be done. It's never gonna be finished, right? But you definitely are looking for people to participate in here, for people who are passionate about reliability, and can I add the word resiliency in there?

Steve McGhee:

Well, that's a... yeah, I mean, so there's reliability, robustness, and resiliency, and my impression is, humans are the ones who are resilient, generally, not systems. Mm-hmm. Systems can be made to be robust, and out of that you can gain resilience. Your system will have the quality of being reliable. So, yeah. Those words tend to be intermingled quite a lot, but there are some clear definitions. But I don't think the community needs it to be perfect every time. It's pretty tricky.

Jim Hirschauer:

Alright, so the call to action here is: if you are passionate about the subject, if you're interested in the subject, please go out, visit the website, look at the project, and see if you have information that you can add to the project where you can go ahead and help the community in general.

Steve McGhee:

Yeah. And if you represent a tool or a vendor of some kind, just like you guys do, we don't need to hide the tools and the vendors. We actually want to put them right in the list. So if you click on something like CI/CD, you'll see there's a list of tools, right? And so the intent is not to have a capability that is named after a tool, but when you click into that capability, you should be able to see which tools will get you there. Right? So I don't want to hide them. I don't want this to be a super abstract, you know, textbook. I do want the presence of vendors to be discoverable. And the idea is that we want people who are building out capabilities to be able to say, hey, how do I actually get this? Do I need to build it, or can I buy it? Or is there an open source solution? We want people to be able to click into a capability and make these decisions pretty quickly.

Jim Hirschauer:

Awesome. That's great to hear. I think there's definitely a need for that merge of capability with how do you get that capability, right? Need to help people find a way to get that capability. All right. So that's the main topic for today. At ShipTalk, we have another section following the main topic, another just for fun section. And one of my favorites to hear about is IT mess-ups, right? Because I've lived the life. I've had my fair share of mess-ups when I've been working. So, Steve, what is your worst IT mess-up that you'd like to share?

Steve McGhee:

Well, I have two that I'd like to share. One was definitely me. I definitely screwed it up. And the other one isn't really a screw-up, but it's just a fun story. Mm-hmm. And so for the first one, I was an intern at Sun Microsystems. And kind of the weirdest thing about this is that the place where I was sitting, my desk, is now a Google office. And I had a team, you know, 10 years from that point, which sat in the same exact spot. Oh, wow. So it was kind of this funny Silicon Valley moment. Yeah. But anyway, when I was an intern, my job was to move people's computers, their work computers, from the new office to the old, or from the old office to the new office, or whatever. And part of that was moving their data between filers and setting up their printers and all this stuff. And at one point, thankfully it was only a chmod, it wasn't an rm or a move or a chown or anything, I definitely recursively chmodded a lot of people's, you know, work files. Yeah. In a consistently incorrect way. Yeah. And I got so scared, and my, you know, intern host or whatever came over and he's like, this is fine. You didn't delete anything. We can easily change that back. Yeah, the owner of the files can fix this stuff fine. You didn't actually destroy anything. You just made it weird. Yeah. And that's fine. So that scared the heck outta me at the time for some reason. But that was when I was like 20, maybe. Yeah. And I just realized the other day that I've actually been carrying a pager for work since I was 19 years old, for computer stuff. Wow. I don't know, it was probably 19, and pretty much consistently. And by pager, I mean, you know, an actual pager. Okay. And then eventually a cell phone, I guess, but with paging capabilities built into it.
But the other story, which is maybe more interesting cuz it involves Google: I was on the team that ran mobile search when it was very small. Mm-hmm. Back when mobile wasn't a big deal. And I think it was actually pre-smartphone, so it was pre-Android, pre-iPhone, or maybe it was right around the time when they first came out. Yeah. So we had a way that you could search, you know, Google Search from your phone, and it was kind of this niche thing. And the moment that sticks in my head as really interesting was the day that Michael Jackson died. Mm-hmm. Because that was the first time where a thing happened in the worldwide visible culture where everyone immediately asked their phone if it was true. Yeah. And so we had this giant, giant, giant spike of requests. You think you've seen big spikes in your past, but I guarantee you this is bigger than anything you've ever seen. Right. This wave of requests coming in just cuz of, you know, one piece of very well-distributed news. Right? And we were not prepared for it at all. And so we ended up doing this thing where we had all of these, basically, web search front ends that were out in the world, and they were all at full capacity. They were just getting hammered. And we basically went rogue and took, you know, essentially the lowest, cheapest possible machines you could get within Google. They're, you know, virtual machines, not real machines. And we just got as many of them as possible. Yeah. And the deal with these is that if you get them, someone can take them from you, cuz they're not actually yours. And so we just got more and more and more, and we just kept stealing as many as we could just to bring up these front ends. And it worked pretty well for a while. And we kind of recovered from this massive wave.
But it was definitely an act of, you know, encountering a huge, external, uncontrollable event in a complex system. And it was pretty fun. I have a joke about it, and I hope it's not in bad taste, but the event was so big that I want to have a measurement for it. And I like to refer to it like, that was one Michael Jackson level event, right? And so whenever I have these huge events and I try to compare them in my head, I try to measure them, and the unit I use is Micro-Jacksons. So this was a 500 Micro-Jackson event, like it was half as bad as that day. I like that. It's a silly term, and I don't mean offense by, you know, using the name, but it's just too good to walk by.

Jim Hirschauer:

Yeah, totally understood. And it's incredible what sort of world events can, you know, trigger a massive wave that hits a website.

Steve McGhee:

Yeah. Why was it that one? Yeah, it was just timing. People had phones in their pockets and they're like, oh, I've heard you can search for things on these little pocket devices. Let's try it. Yeah. You know, it's like people at pubs going, oh, did you hear? And it used to be when you said, did you hear, you just kinda went, no. And that was it. But now you would go, no, let me look that up. You know, and that was a cultural shift. It was pretty neat to see that on a graph.

Jim Hirschauer:

Yeah. Amazing. Times have changed. I had to explain to my kids the other day what a pager was. Like, what the actual device was and how it worked. They had no idea what it was.

Steve McGhee:

Yeah, try explaining the name too. Like page, like a piece of paper, right. It's kinda weird. Yeah. I'm not really sure how to explain that actually. Yeah.

Jim Hirschauer:

Alright, Steve, well listen, thank you for sharing your stories with us. Thank you for telling us about r9y.dev. It's a great resource. So again, call to action: if you are interested in it, please jump out there and join the project and help the community by sharing your knowledge. And to all of our listeners, if you are an SRE or if you're in a related role and you'd like to be a guest on ShipTalk, please go ahead and send an email to podcast@shiptalk.io and we'll get back to you. Thanks again, Steve. That's all for now. Until next time.
