ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery

ShipTalk is the podcast series on the ins, outs, ups, and downs of software delivery. This series dives into the vast ocean of software delivery, bringing aboard industry tech leaders, seasoned engineers, and insightful customers to navigate through the currents of the ever-evolving software landscape. Each session explores the real-world challenges and victories encountered by today’s tech innovators.

Whether you’re an Engineering Manager, a Software Engineer, or an enthusiast with an interest in software delivery, you’ll gain invaluable insights and equip yourself with the knowledge to sail through the complex waters of software delivery.

Our seasoned guests are here to share their stories, shining a light on the do's, don’ts, and the “I wish I knew” of the tech world. If you would like to be a guest on ShipTalk, send an e-mail to podcast@shiptalk.io. Be sure to check out our sponsor's website - Harness.io

Site Reliability Engineering 101 - Bob Strecansky - MailChimp

November 03, 2020 • Bob Strecansky • Season 1 • Episode 0

In this episode, we talk to Bob Strecansky who is a Staff SRE at MailChimp. A packed podcast about all things Site Reliability Engineering (SRE). Learn about how to become an SRE, the rise of blameless culture, a clear definition of black-box vs white-box approaches, and much more!

Ravi Lachhman: 0:00

Well, hi everybody, my name is Ravi Lachhman. And today on this episode of ShipTalk, I'm really happy to be having one of my buddies here, Bob Strecansky, who is a Staff SRE at a company called MailChimp. Thanks for being on.

Bob Strecansky: 0:13

Hey, Ravi, thanks for having me. Excited to be here. Absolutely.

Ravi Lachhman: 0:17

I actually learned a lot from Bob. A little bit of history: Bob and I used to work together years ago, and then our careers took us to different places. But I always keep running into Bob at SRE-related events. So a lot of my learnings in reliability engineering actually come from Bob; he's very well grounded in the profession. But today, we're going to be talking about, if you don't know what an SRE even is, Site Reliability Engineering. Let's talk about what reliability is, why it's important, what these SREs do, how you can become an SRE, and anything in between. So, Bob, let me ask you the first question: how would you define Site Reliability Engineering, or maybe some of the lead-up to how SREs became popular today?

Bob Strecansky: 1:04

Sure, I'm happy to talk about that. So Site Reliability Engineering was an idiom defined by Google a couple years ago. And essentially what they did is they took a lot of the toil, the repeatable, boring work that just wore on engineers, and they put software engineers into positions that were classically defined as sysadmin or systems roles. So to give you a little background on that, many years ago, when you would have a large-scale website or a web app or whatever, you'd have individual servers that were artisanally crafted. So you would install the operating system by hand, you would add Apache and PHP and all sorts of other packages like this. And then each server was individual; we often call them pets in this particular idiom. So after a while, we realized that this wasn't tenable for a number of reasons: creating new servers is a process and it takes up a lot of time, and then keeping all these servers up to date, and making sure that they're on track for whatever you need them for, is bad. So things like Puppet and Chef and Ansible started coming out, where we can automate server platforms. Next, we came out with all sorts of other things like Mesos and Kubernetes that allow us to do more infrastructure as code. So this all ties into Site Reliability Engineering, because site reliability engineers tend to put code on paper for things that used to be manual, toilsome tasks. So the primary goal of a site reliability engineer is to automate away toil and to make a much better experience for the developers that work on your product.

Ravi Lachhman: 2:58

Yeah, that makes perfect sense. It's kind of a natural evolution. And for the listeners, Bob made a very important point there. I always like to talk about engineering burdens. Back in my day, it used to be like a one engineer to maybe 10 server ratio, right? Like, yeah, one engineer can maintain 10 servers. And that's asinine to think about today. But that time wasn't that long ago, because you had to go manually patch things, manually update things. And, you know, if you have to update the version of something in the operating system, imagine how long it took you on your Windows laptop, just one of them, right? Now, what if I had 10? That's how long stuff took. Versus today, Bob mentioned a few containerization technologies like Kubernetes or Mesos, and so we're dealing with tens of thousands or hundreds of thousands to one engineer. And so the approach that you have to take at scale is quite different. And I love what you said about artisanally crafted. I thought of a beer when you said that: a craft beer versus, you know, a keg of PBR, right? Both are good in their use cases. But that hit the nail on the head: just because of the nature of the beast, the firepower that we have, the focus is a little bit different. The approach is a much more software engineering based approach. Actually, I stole that from Bob a while ago. He said it beautifully: it's like software engineers facing systems engineering problems. And I think a natural question for our listeners might be, well, Bob, a lot of what you said sounds like a DevOps engineer helping out the engineers.
But let's talk about your specific skill set, more around reliability, and then we could talk about how you got there, because if I was still an engineer, I would want to be an SRE. But let's talk about some specific skills that an SRE brings to the table. Let's say I was an app owner. I had Ravi's Application. It's like, okay, you have enough traffic now, so you can have an SRE. What would be something Bob would talk to me about?

Bob Strecansky: 5:08

So SRE is very frequently defined differently at many different companies. The way that my company, MailChimp, handles Site Reliability Engineering is we help to enable the developers to continue their momentum with feature sets. And very frequently, this is done with something called the service level indicator and objective pattern. If you're curious about reading more about those, we wrote a post for the deliver better website where you can read about SLIs and SLOs.

Ravi Lachhman: 5:38

It's really good. Yeah, I learned a lot.

Bob Strecansky: 5:40

Thanks. But to wrap it all up with a nice neat bow: service level indicators indicate the current state of a service, and service level objectives are goals that you want to set for the service that you're working on. These both tie up into something called an SLA, which you may or may not be familiar with. That is normally the agreement, the service level agreement, that you give to your customer. So very frequently, people in software engineering will say, oh, I expect 99.9% uptime for this, or I expect to have very few errors, or I expect to have a good experience with duration. So these things are all measured and quantified and put into monitoring patterns so that developers can continue momentum until they recognize, oh, I need to be careful, because the service level indicator that I have for my product isn't up to snuff, and the objective that we set isn't being met. So I need to slow down my product velocity and ensure that we're giving our customers the best experience possible.
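
To make the SLI and SLO relationship concrete, here is a minimal Python sketch. It assumes an availability SLI measured as good requests over total requests; the function names and the 99.9% target are illustrative assumptions, not MailChimp's actual tooling.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    slo_target is the objective, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # the budget the SLO permits
    observed_failure = 1.0 - sli        # what was actually spent
    return 1.0 - observed_failure / allowed_failure


def should_slow_down(sli: float, slo_target: float) -> bool:
    """True once the budget is exhausted and stability should come first."""
    return error_budget_remaining(sli, slo_target) <= 0.0
```

With 99,950 good requests out of 100,000 against a 99.9% objective, roughly half of the budget is left, so feature work can continue; at 99.8% measured availability the budget is overspent and `should_slow_down` returns `True`.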

Ravi Lachhman: 6:43

Yeah, that makes perfect sense. So, to bring in an expert like Bob, I'll play devil's advocate. When I used to be a manager or app owner, we would focus a lot on the SLA. Like, we had the API, and we had to have a 500 millisecond response time, or at least some sort of response, right? I only had a few endpoints, so it was okay. But as we get more sophisticated and more complicated, we have to have other ways to track, right? So with SLIs and SLOs, you're leading up to the ultimate SLA, like subdividing things. So that's perfect. And actually, that was the first time I ever heard that explanation of an SLA. I like that it's developer focused, that type of championship, right? Like, hey, you're partners. Sometimes you need to focus on: is your feature velocity too fast, versus are we doing quality work or technical debt work to make sure that we're at least maintaining our mission? That's actually a really solid definition of it.

Bob Strecansky: 7:50

So that's how Google handles this with their SREs. They say, okay, we'll be happy to have an SRE work on your product with you, but we have to set the service level indicators and objectives to make sure that we're reaching our goals. Very frequently, it's so simple as a software engineer to go, oh, I just want to implement this additional feature, oh, I just want to add this new pinwheel, or I just want to add this next thing. But what you have to remember is, as the web continues to scale, both horizontally and vertically, you need to ensure that your service has the ability to run, first, and then second, run at a rate that's acceptable to your customers. If you start getting page load times that are 10 or 20 seconds, customers are going to leave your site. If you start spewing out a lot of errors, customers aren't going to want to come back, and they're going to lose faith in your service. So you have to balance product momentum and stability. And that's right in the site reliability engineer's wheelhouse.

Ravi Lachhman: 8:53

Yes, that's pretty awesome. I mean, I'm excited about SRE; you know, now I want to be one again. But for some of the listeners who are maybe dipping their toes in, like, I have a long-term goal of being an SRE: what about your journey, Bob? Let's say I was fresh out of school. What would you tell me? I want to be more like Bob. Can you give me some coaching on that, Bob?

Bob Strecansky: 9:21

That's a good question, Ravi. So right out of school, engineers are usually trained in a different manner than what happens in, quote unquote, the real world, right? You learn about data structures and algorithms, you might learn a couple different programming languages, you might learn memory management, or CPU utilization, or whatever. But when you get into industry, you start recognizing, okay, all of these things that we learned in computer science classes are important, and it's important to understand how they work, but we also need to make sure that we have the ability to implement these programming languages and these idioms on actual computers to serve actual customers. So the thing that I would tell a new software engineer that wants to move towards SRE is: make sure that you're monitoring everything. When you push out a new feature, make sure you understand how the deployment process works, make sure you understand how to monitor for errors and duration and request count and things like that. There is a very famous paper by Tom Wilkie about RED metrics: requests, errors, and duration. That's a really great way to understand the importance of monitoring those three things, the request rate, the error level, and the duration count. And so I think that that's the most important thing: understand how to implement a logging pipeline properly, and understand how to set alerts for when things aren't working as you expect. And that's a very large step towards being an SRE.
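
As a rough illustration of the RED idea, the following Python sketch summarizes a window of request records into a request rate, an error ratio, and a 95th-percentile duration. The record shape, the choice of HTTP 5xx as "error", and the nearest-rank percentile are assumptions made for this example, not details from the episode.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RequestRecord:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(records: Sequence[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Summarize one time window as Rate, Errors, and Duration."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(records)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for record in records if record.status >= 500)
    durations = sorted(record.duration_ms for record in records)
    # Nearest-rank percentile: ceil(0.95 * total), done in integer arithmetic.
    rank = (95 * total + 99) // 100
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }
```

In practice these three numbers would usually come from a metrics system rather than raw records, but the sketch shows what each one measures and why a single slow or failing request stands out.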

Ravi Lachhman: 11:01

Yeah, that's awesome. I think, in my software engineering career, a lot of what you just mentioned there was usually an afterthought, right? Okay, we need some sort of alerting; we're about to deploy new features, so if we violate some egregious SLA, let's send an alert to some sort of NOC or something. Versus now, it actually becomes very much at the forefront. If you look at what happened in the last five years, that afterthought has become the forefront: hey, you need to understand how to measure your application when you're building it. And a lot of times that expertise is not what they teach you in school. It's not what they teach you when developing software; it's usually like, ah, let's include some log statements. That's usually what I would do, right? Log it here, then we'll turn the verbosity off. But there's definitely a science to it, because you can impact a lot. The common argument around turning logging up is that it takes a lot of horsepower to log something. And so Bob can definitely dissect the science of when you do certain things and how you measure certain things. And this leads me to jump ahead to a more advanced topic. When you start talking to SREs, and let's say they're trying to gauge the resilience of the system, Bob, there's a term called black-box versus white-box monitoring. So why don't we talk about, in generic terms, what is a black box? What is a white box? And how does that change your approach and what you do?

Bob Strecansky: 12:34

Got it. So black-box monitoring is monitoring where you act as the end user; you don't have the ability to see inside the distributed system that you're attempting to monitor. And then white-box monitoring, you can think of it as like a clear cube: you can see all the different pieces of the puzzle that make up a request to the application that you're serving. These are both very important for monitoring the resiliency of a system. Black-box monitoring is very important because it gives you empathy into how the end user sees your application. You can say this particular endpoint takes, as Ravi mentioned, like 200, or 500, or 1000 milliseconds to report back, and then you go, oh, wow, that's way too long for this particular API endpoint, or that's way too long for this admin portal, or that's way too long for this XYZ. Same thing with errors. Same thing with request rate; well, actually, I guess request rate wouldn't really fit into that. But duration and errors are certainly very important for black-box monitoring. White-box monitoring gives you the ability to look at all the different pieces of the request, right? You may be making a request to an app server, or a database, or a key-value store, or some other distributed piece of technology that gives you the answer that you need so that you can respond to the client correctly. Being able to see all the different pieces of that distributed system helps you to determine where a problem might be arising. Black-box monitoring is really great for catching egregious errors and large-scale things. Black-box monitoring is also often completed from outside of your infrastructure. This is very important, because if you have black-box monitoring inside of your distributed system, your distributed system could fail, and then your monitoring for your distributed system will fail, which puts you in a very, very bad place.

Ravi Lachhman: 14:31

Don't monitor the system you're monitoring with the system, right?

Bob Strecansky: 14:35

Yeah. It's some Inception stuff, right? Yeah. Black-box monitoring is important to build that empathy and to understand where your system is failing from a customer perspective. White-box monitoring is more for understanding what part of the chain is broken in your request flow.
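
A black-box check of the kind described here can be sketched in a few lines of Python. The probe only sees what an end user would see: whether the endpoint answered, with what status, and how long it took. The `probe` function, its result fields, and the 500 ms "slow" threshold are hypothetical choices for illustration; the fetch function is injectable so the probe can run from anywhere outside the system being monitored.

```python
import time
import urllib.request
from typing import Callable, Dict, Optional


def http_status(url: str, timeout_s: float) -> int:
    """Fetch the URL the way a client would and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def probe(
    url: str,
    fetch: Callable[[str, float], int] = http_status,
    timeout_s: float = 5.0,
    slow_ms: float = 500.0,
) -> Dict[str, object]:
    """Black-box probe: no knowledge of internals, only the outside view."""
    started = time.monotonic()
    status: Optional[int] = None
    try:
        status = fetch(url, timeout_s)
        ok = 200 <= status < 400
    except Exception:
        # Timeouts, DNS failures, and HTTP error responses all look the
        # same to an end user: the request did not succeed.
        ok = False
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return {
        "url": url,
        "ok": ok,
        "status": status,
        "elapsed_ms": elapsed_ms,
        "slow": elapsed_ms > slow_ms,
    }
```

Running a probe like this on a schedule from a separate network or provider keeps the monitor alive when the monitored system itself goes down, which is the failure mode Bob warns about.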

Ravi Lachhman: 14:57

That's an awesome explanation. I think, like, at any sort of SRE event I go to, there's always some sort of talk about that, right? Like, hey, what do you have control over versus what you don't have control over? That's actually a very good way to put it. The way I usually heard it is: black box, imagine it's a piece of packaged software like Siebel. You don't own it, right? It's up to the vendor to kind of tell you what they are monitoring for. Versus an application, let's say, Ravi, Inc., Ravi's What's for Lunch application: I wrote it, so I have complete control over all the calls. So you have different ways of instrumenting it and different ways of measuring it, versus, you know, thou shalt not touch this JAR inside of Siebel, right? Oracle will come find you. But that's a very, very interesting way to put it. I think what would also be a very intrinsic question, and it's always great to hear how an SRE would answer this question, is about technology choice. Let's say we want to look at a new technology. Everybody has a different answer, like, oh, you know, feeds and speeds, this is a new, cool, hot, shiny penny. But as an SRE, and I'm very curious about this these days too, if we were investigating any new technology, like, hey, let's say I work at MailChimp like Bob, and I want to leverage Kubernetes. You might be using it there, but let's say I'm the first team to do it. I must have it, because I read an article and it's the latest and greatest. What would be some of your decision criteria? If you're advising people on any sort of new technology, what would be your train of thought for that?

Bob Strecansky: 16:48

So an ex-coworker of mine, Dan McKinley, wrote a great article called "Choose Boring Technology," which, as a software engineer, isn't very fun or exciting, right? That's true. Everybody wants to use the new cool piece of software. And everybody just always assumes, oh, you know, this new JavaScript framework, or this new Kubernetes thing, or this new logging pipeline, or this new pub/sub queue, or whatever the new technology is here, these are all the new

Ravi Lachhman: 17:20

technologies. Yeah, you must read. You read the InfoQ sir, I see.

Bob Strecansky: 17:25

Hackaday. I've been on Hacker News before. But what you have to remember is, at the end of the day, all of us that are working in the software field are getting paid to deliver some product to some customer, whether it be B2B, B2C, or internal developer tooling; it doesn't really matter. We are all working to deliver software for somebody else. And this is important to remember, because if you make the choice to choose boring software, or only spend your, we call them innovation tokens, so a new project should have one innovation token. If you only spend your innovation tokens on something that's going to help the business, then you have the ability to still iterate and choose new technology that will be fruitful for the project that you're working on. And it allows you to slowly iterate into new software, rather than just diving in the deep end and then floundering around for a while trying to make sure that you understand all this new software. If you implement new pieces of software slowly and methodically, with good logging, monitoring, alerting, documentation, rollout strategy, all of these things, then you can slowly introduce these new bits of technology in a meaningful way, rather than just rushing in and shoving them all in.

Ravi Lachhman: 18:51

That was like the most insightful piece of advice I've probably heard in the last year. It's actually very, very artistic, you know: how do you bring about change? Don't forget the fundamentals. Right. I was thinking in the back of my head, let's say I was starting a new project today. So I want to be using Istio and Kafka and Kubernetes, and I need to have those. And Fluentd. Like, I want my resume to be jam packed, you know, when I'm done. And something that Bob and I had side conversations on outside the podcast is: don't be troubleshooting on the bleeding edge. Imagine I came up with a minimum viable product using Kafka, Kubernetes, let's say even some sort of serverless like Knative. I'm using all the buzzwords; I have the buzzword app, or platform. You know, troubleshooting something on the bleeding edge is hard even when using one technology, and it compounds itself when using more than one technology that's on the bleeding edge. A lot of those operational fundamentals might not be there. Like, there's still people bickering about what's the best way to trace and metric a distributed containerized workload, right? We can sit here and talk for like an hour on that. But that's right: introducing one piece at a time, once you get the fundamentals right, like, hey, this is the minimum standard of an app. So yeah, very, very beautifully said there, Bob. Incremental success builds success.

Bob Strecansky: 20:13

As the owner of opentelemetry-php, I can tell you that a distributed tracing system is not easy, ever.

Ravi Lachhman: 20:23

Yeah, well, yeah, you do have a new package out there. Maybe you want to talk about it for a minute, like, hey, what's your OpenTelemetry PHP package like? Bob is bringing OpenTelemetry to PHP. Actually, you're an author, too. Let's plug Bob a little bit. Bob, I got it. It's over in the corner, though. I need to get you to sign it next time I come visit you. Yes.

Bob Strecansky: 20:44

Yeah, post-COVID book signings.

Ravi Lachhman: 20:47

I want you to write a letter.

Bob Strecansky: 20:51

Yeah, so I wrote a book. It's called Hands-On High Performance with Go, and it's available on the Amazon and Packt Publishing websites. It talks about how to implement Golang effectively in your distributed systems. And I'm currently working on the OpenTelemetry project, which is a distributed tracing library. I am slowly and surely working with others to build the PHP version of this library, and contributors are welcome. OpenTelemetry is a new-age tracing library that allows people to trace across distributed systems in a meaningful way and post the records to tracing aggregators that can help you determine where there is a fault in your system, sort of like the white-box monitoring that we were talking about earlier in the podcast.

Ravi Lachhman: 21:38

Awesome. Yeah, be sure to check out OpenTelemetry PHP, and I'll paste the link to Bob's book. I got a copy of it. I'm learning a lot about Go. That's what we go for in Hollywood gophers needs.

Bob Strecansky: 21:54

I'm glad that you brought that up, Ravi, because one of the really nice things in distributed systems now is these companies are starting to distribute more visible binaries. So I'm going to use your example that we were talking about. Yeah, so Ravi was talking about Siebel, which is an accounting software that is, that's Oracle, right?

Ravi Lachhman: 22:16

Yeah, that's Oracle. Yeah, it's like a, it's like a financial and like customer service, like, like CRM software, but yeah,

Bob Strecansky: 22:23

So, previously, Siebel would be a complete black box to Ravi, right? Like, you run that JAR, and then maybe it'll give you some logging output if you're lucky. And you just have to hope that the Java virtual environment doesn't explode and spew bits everywhere. Now, one of the big Site Reliability Engineering paradigms that's been very warmly welcomed is having something called an exporter that goes along with your particular binary or service or whatever. And the exporter exposes some of the internal metrics for a binary or a service or so on and so forth. And you can use services like Prometheus to view those exports and determine what to do with your closed system. This idea of an exporter, and using something like a time series, like Prometheus, yes, it's a really nice way to monitor what was historically seen as a black-box system.
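
The exporter pattern can be sketched with nothing but the Python standard library: a small process renders a service's internal numbers in the Prometheus text exposition format and serves them over HTTP for scraping. The metric names, the `collect` function, and port 9100 are hypothetical placeholders; a real exporter would read values from the closed system's logs, status commands, or management interface.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


def render_metrics(metrics: Dict[str, Tuple[str, float]]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect() -> Dict[str, Tuple[str, float]]:
    """Hypothetical collector: stands in for reading the closed system's state."""
    return {
        "app_queue_depth": ("Jobs waiting to be processed", 3.0),
        "app_open_connections": ("Open client connections", 12.0),
    }


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9100) -> None:
    """Serve /metrics until interrupted; a scraper would poll this port."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Because the exporter runs alongside the binary rather than inside it, the time series recorded up to the moment of a crash is still available afterwards.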

Ravi Lachhman: 23:31

That makes a lot of sense. There's a huge rise in that. Yeah, if folks are, you know, going back to Hacker News and InfoQ, seeing words like Prometheus, Fluentd, StatsD, you know, all these particular CNCF projects that are out there, there's kind of a meteoric rise in that. And it shows that, hey, there's different ways of thinking about monitoring, different ways of capturing metrics. So that exporter example is actually a great one, because you're spot on: if the JVM crashed? Not a lot, right? You might get a crash report from the JVM, but that's it; the metrics will stop at some point, versus having some sort of sidecar process. It's basically introducing software engineering excellence into problems that were always an afterthought, right? Like, if you and I sat down, take it back like a decade ago, Bob, actually, we might have been on the same team. You know, if instead of being an afterthought, we put it as a forethought, saying, we must make sure that we get metrics even in case of a crash, we would organize our logging, or organize the processes that produce it, in a different format, which would match the exporters of today, right? So it's been a lot of catch-up, but there's a lot of emphasis on being more proactive versus reactive, right? So as these SLAs become tighter, you know, required uptime, it's definitely a shift from being reactive, "I need to wait until there's a problem," versus, okay, we can kind of foresee that there's a problem. I'm giving you a slight plug here: consumer expectation is just, like, we expect things to be up all the time, right? So you just have to be proactive. Funny story. You know, we actually use Bob's company, MailChimp, to manage our contact list.
For some odd reason, last week, I think, Bob, I couldn't log into MailChimp. And I was like, what? I know exactly who I should go talk to. And Bob's like, "Yeah, I got the on-call alert for your particular record. It was you, Ravi." Oh, that's pretty funny. And then the problem was resolved very, very quickly, and I was like, hey, I can just try a different link, don't worry about it. But that was amazing, right? The amount of proactiveness shows that at MailChimp the SRE discipline is super strong. You were able to capture that I had a problem within, like, milliseconds of me even knowing I had a problem. What does that culture take? That's the basic stuff.

Bob Strecansky: 26:11

So that's a great lead-in, Ravi. I think this ties back to our service level indicators and objectives. I actually am disappointed in that incident, because of the way that you experienced downtime. But I am proud of the fact that we were able to quickly remediate the problem you had. So having these metrics like we were just talking about, it sort of all ties together now. Having exporters export metrics, and being able to monitor them over a long time, gives us the ability to see trends, right? So in your particular case, one server just started trending upward in CPU and disk utilization. And as we caught that, we noticed that it was getting to a point where it needed to be fixed immediately. So it was fixed immediately. But we actually set an alert based on that trend. So the next time we start seeing that slow, slow S-curve up, we can start going, oh, maybe we need to restart the service, or rate limit some egregious API users, or, you know, shed some load that may or may not need to be shed, right, like bot traffic, or malicious people making requests, or so on and so forth. So load prioritization is always super important. And the more that we can have visible insight into our systems, the faster we can react to these particular incidents. It's better than the yesteryear of, "Oh, man, our Apache instances are completely smoked, let's restart them all."
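
An alert "based on that trend" can be approximated by fitting a line to recent samples and asking how long until the metric crosses a threshold. This is a minimal Python sketch under stated assumptions: samples are (minute, percent) pairs, the trend is treated as linear via a least-squares fit, and the function names, the 90% threshold, and the 30-minute horizon are illustrative rather than MailChimp's actual alert rule.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (minutes since start, utilization percent)


def slope_per_minute(samples: List[Sample]) -> float:
    """Least-squares slope of utilization over time."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator if denominator else 0.0


def minutes_until(samples: List[Sample], threshold: float) -> Optional[float]:
    """Projected minutes until the threshold is crossed, or None if not rising."""
    if not samples:
        return None
    latest = samples[-1][1]
    if latest >= threshold:
        return 0.0
    slope = slope_per_minute(samples)
    if slope <= 0:
        return None
    return (threshold - latest) / slope


def should_alert(samples: List[Sample], threshold: float = 90.0, horizon_min: float = 30.0) -> bool:
    """Alert when the trend projects a threshold breach within the horizon."""
    eta = minutes_until(samples, threshold)
    return eta is not None and eta <= horizon_min
```

For CPU readings of 50, 52, 54, and 56 percent taken one minute apart, the slope is 2 points per minute and the 90% threshold is projected 17 minutes out, so the alert fires before the server is actually saturated.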

Ravi Lachhman: 27:51

The problem, please fix the problem

Bob Strecansky: 27:54

of Kubernetes to do that restart for us.

Ravi Lachhman: 27:56

Yeah. And we'll put that readiness probe to like one second later. Thank you. So, as we get into the last 25% of our podcast, we can talk about some intrinsic stuff. And Bob was super stellar: hey, this is the profession, and this is how we move forward. I want to talk about blameless culture for a second. I definitely got blamed for a lot of problems. I kind of missed the blameless culture portion of it; I was part of the blame culture. I don't know if Bob remembers the severe incident I had at our company when we worked together. I had to go to like a tribunal and explain the outage, right? There's always, what was the root cause of this? So one of the things is, if you're dealing with SREs, you're always brought in at the worst. There's this romantic idea that you're firefighting all the time. And that's super stressful; no one could survive if every time they're brought in, the metrics are red and we immediately have to make a revenue decision. But a lot of the stuff that Bob and his team do, and Bob is more of a staff SRE, he recently got a promotion, congratulations, is more about being proactive. But let's talk about that firefight, the incident. Can you tell the viewers or listeners a little bit about blameless culture, like, there's no root cause? And that's very true; it's complicated. What does that mean, there's no root cause, for blameless culture?

Bob Strecansky: 29:26

Sure. So blameless culture is an idiom that says, hey, don't blame this particular person for an incident that might have occurred. You know, I'll use myself as an example, because that's an easy one. Let's just say I'm clacking away, and I accidentally push code that does 1000 requests a second instead of 10, and I take down a service. So that's bad. It's very bad. And, you know, there could be business impacts for this. There could be emotional toil aspects of this. There could be all sorts of other things. So classically, in a software engineering setting, I would get blamed very hard for this, like, Bob, why did you do this? Why didn't you test this better? Why didn't you, you know, x, y,

Ravi Lachhman: 30:11

Yeah

Bob Strecansky: 30:12

Yeah, Bob! But what we have to remember is that blame doesn't really help anything. It just gives the software engineer more strife, and it makes them less likely to perform actions that are going to better the company. So a blameless culture is very important in software engineering these days to drive things forward. Very frequently, when we have incidents at MailChimp, we do something called a post mortem. So we go back, and we discuss all the things that happened leading up to the incident in order for the incident to happen. And we always do blameless post mortems. So rather than saying, Bob pushed out the code that made 1000 requests happen per second rather than 10, we say, an engineer made a push that made 1000 requests happen per second rather than 10. And that small change in words makes a big difference. It doesn't blame Bob for the incident that he caused. It allows people to be more objective about how to fix the problem. People aren't saying, oh, man, Bob broke this again. Really, it starts to help us realize, like, okay, well, maybe we should make a check in our continuous integration software that says, oh, there's no reason we should ever be making 1000 requests a second for this particular call, we should limit it to 100. And we should throw back an error in our continuous integration when somebody makes an egregious call like this in their software.
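
[Editor's sketch] The kind of continuous integration check Bob describes could look roughly like the following. This is a minimal illustration, not MailChimp's actual tooling: the function name `check_request_rate`, the `RateCheckResult` type, and the call name `example_call` are all hypothetical, and the limit of 100 requests per second is simply the figure mentioned in the conversation.

```python
from dataclasses import dataclass


@dataclass
class RateCheckResult:
    """Outcome of a single rate check (hypothetical type for this sketch)."""
    ok: bool
    message: str


def check_request_rate(call_name: str, configured_rps: int, limit_rps: int = 100) -> RateCheckResult:
    """Flag a call whose configured requests-per-second exceeds the allowed limit.

    The default limit of 100 is an assumption taken from the episode.
    """
    if configured_rps > limit_rps:
        return RateCheckResult(
            False,
            f"{call_name}: {configured_rps} req/s exceeds the limit of {limit_rps} req/s",
        )
    return RateCheckResult(
        True,
        f"{call_name}: {configured_rps} req/s is within the limit of {limit_rps} req/s",
    )


if __name__ == "__main__":
    # Compare the intended rate (10) with the accidental one (1000).
    for rps in (10, 1000):
        result = check_request_rate("example_call", rps)
        print("PASS" if result.ok else "FAIL", result.message)
```

In a real pipeline, a failing result would typically be turned into a non-zero exit code so the build stops. The error message names the call and the limit rather than the person who pushed the change, which is in keeping with the blameless framing.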

Ravi Lachhman: 31:42

That's huge. Yeah.

Bob Strecansky: 31:45

So having a blameless culture allows you to be a little bit more innovative. And it allows engineers to be a little happier.

Ravi Lachhman: 31:54

I really believe that, right? Like, you know, it helps you build more resilient and robust systems to prevent problems. Because it's one thing to say, like, yeah, blame Bob for that. Then you will think, like, oh, that's never going to happen again, we just need to blame Bob. In that situation, you know, only Bob would cause that. Versus, if you say an engineer did it, it could happen to anybody. So you're more in tune to take a more generic protection, right? Like, we're not just going to cut off Bob's, like, Git access. You know, that would be the ultimate protection. But you take a more generic approach, saying this can happen to anybody, and let's make it better for everybody. That was actually very specific to, like, blameless culture. It's very funny, like, you know, those post mortems are so important. In Agile software, there's a concept called a retrospective, or retro. So it's bringing software engineering rigor into an incident, right? Like, hey, we're engineers. Why did our engineering discipline stop at the incident? Like, it goes out the window. Sometimes it's like fight or flight, right? Like, blah, blah, to keep our job, blah, blah, you know, this person did that. Having that level of discipline, like, you had discipline through 99% of the SDLC. Why did that 1% stop? Why did pandemonium occur? You know, and I think that's very critical for blameless culture. Very funny aside, I was teaching a class Tuesday night. So we had about 30 users. So for Harness, like, we're a continuous delivery platform. So I had about 30 users in, like, Singapore and Australia. We were showing them how to validate deployments with Prometheus on a Kubernetes cluster. And so, with them as guinea pigs, I wanted to see how much firepower, like, I wanted to make a smaller cluster, because I had a very large EKS cluster.
And I thought that I opened up another terminal window. And instead of running, like, a top command on the nodes, I actually ran the delete command in eksctl. And it took the cluster down. So clearly, I wasn't one to blame, because everyone saw it, like, you can see it on the screen recording. I'm like, well, I guess class is over now, because the cluster... There's no post mortem on that one. It's just like, Ravi. Ravi, an engineer did that. Yeah. I like it. I want to see an engineer. Track. Yes. Yeah. So, like, in the last couple of minutes, while we're wrapping up, I always like to end on this one question. Like, Bob, if you met Bob, you know, 15 years ago, what is, like, one piece of advice? You know, J e and Bob coming out of his graduate degree program at Clemson. Like, what would you tell Bob?

Bob Strecansky: 34:31

Go Tigers, go Tigers. Now, I would tell young Bob to be brief. Make changes that are going to help your business. Don't worry about resume as a service, like, don't care. Don't try and implement the new shiny thing just because it's a new shiny thing. Be pragmatic in your delivery of new bits of software.

Ravi Lachhman: 34:59

Not like

Bob Strecansky: 35:00

You should always have some sort of monitoring tab open whenever you're deploying new bits of software. Look for your errors quickly. People don't mind if you make mistakes if you remediate them quickly. And be confident in your delivery, and make sure that you measure twice and cut once.

Ravi Lachhman: 35:19

That's awesome, Bob. Well, thank you so much for being on the podcast. Like, very pragmatic advice. You know, Bob is definitely very skillful in the profession. You can catch Bob at local events and national events around sov work. And yeah, you know, just thank you so much for being on the podcast today, Bob.

Bob Strecansky: 35:36

Thanks for having me around.

Ravi Lachhman: 35:38

Cheers.
