ShipTalk

DevOps and Education, Closer Than You Think - David - GoSpotCheck

January 05, 2021 David Sudia Season 1 Episode 3

In this episode of ShipTalk, we have a great conversation with David Sudia, a Senior DevOps Engineer at GoSpotCheck. Before becoming a DevOps engineer, David was an educator, then a software engineer for a large eCommerce site. Balancing long- and short-term goals: sometimes moving a mountain takes time, and sometimes incremental confidence can happen in a sprint. Learn from David and his vast experience how to move the DevOps needle forward in your organization.

Ravi Lachhman  00:06
Hey everybody, welcome to another episode of ShipTalk. Today I'm very excited to be talking to my buddy, David Sudia, who is a DevOps engineer at GoSpotCheck— or one of the DevOps wizards at GoSpotCheck. But for those of you who don't know David— David, why don't you introduce yourself to the listeners today?

David Sudia  00:24
Yeah. Hi, I'm Dave Sudia. I'm a Senior DevOps engineer at GoSpotCheck; I've been in the DevOps space, specifically in that role, for about four years. And prior to working in software at all, I was a special education teacher for seven years.

Ravi Lachhman  00:42
That's awesome. You know, for some of the folks who missed the side conversation David and I had before starting the podcast, I wanted to name the podcast "DevOps Therapy with David". [chuckles] I really like a lot of David's points of view; they're pretty pragmatic and pretty realistic. But what I kind of harp on is David's background. David started his career as an educator, and a lot of us in technology… what I see is that we're not very good at sharing information. There are a couple of reasons for that: we're on a project for a short period of time, or, you know, egos get in the way and whatnot. But I was fortunate to catch David's talk at Unscripted, and it was like, "Oh, this is very therapeutic; would love to get him back on the podcast."

In your background, what I [want to] start out with is something a little more concrete; then [let’s] get into the abstract. Your technical background—you started as a software engineer, and then made your way over into DevOps engineering. How was that transition? How did you start making that transition from cranking out application code to [becoming] a platform engineering-centric person?

David Sudia  01:55
Yeah, sure. So, it was out of necessity. I was working at a company called Fanatics, and Fanatics is the e-commerce platform for licensed sports merchandise. If you've ever bought anything from any of the major shops of any major sports team in the United States (and a lot of international stuff), that comes from Fanatics. Fanatics was transitioning from all on-prem, licensed Microsoft stuff, a C# stack, into all AWS, open source: React, Node, Python, and all that. When I got hired, as that was happening, the team I was on was one of the teams deploying into the cloud. I think they had three cloud engineers supporting 350 software engineers [or so]. The cloud engineers had no manager; they had no product person or task-organizing person. So they were basically like, "Look, we make VPCs and security rules, and y'all have to figure out how to put your stuff in them." They suggested that one person on every team learn it, and I volunteered to be that person, basically.

So that's how I got started. The first team I was on, we were writing a platform for running end-to-end tests, and we were writing an API that would launch the tests, so we containerized the thing. And I just had to learn Amazon Web Services, basically, from scratch. The cloud team was providing support, but they were also still figuring out what tools they wanted to support and their own best practices. So there was a lot of "write this really long CloudFormation thing, and finish it right as the cloud team decides we're going to be using Troposphere, which is a Python wrapper on CloudFormation. That's where they're going to provide their plugin and metadata about VPCs that we can grab, or whatever." And, "Okay, we'll ditch the whole thing. Let's start rewriting in Troposphere and learn how to run it on Elastic Container Service."

And yeah, so that was really it. I was just the one who raised my hand. And I think that's how anything in a career in software development goes: a new thing comes, you raise your hand, and you just start doing it.

I gave a keynote at KubeCon a couple weeks ago, and someone reached out to me and went, "Well, I feel like I have to learn this, and this, Kubernetes and containers, and this thing and that thing and the other thing." And I was like, "Look, I look really smart about this stuff because I've been doing it full time for three years. It's not because it's easy, or because you just go and pick it up, right?" I was like, "My main piece of advice is: don't pick it all up. Pick one thing. Make a through-line of a critical path or something for yourself, because you just get in and do it." And so that was my transition: I raised my hand.

Ravi Lachhman  05:01
That's really awesome. It's sometimes hard for folks in software engineering; really odd bottlenecks start to appear. Actually, the biggest outage I ever had (I had to fly out and apologize to the customer) was a VPC misconfiguration. I thought a CIDR was like a drink. And I figured a /16 was smaller than a /8, or something like that, like significant-digit notation. I was wrong. Any network engineer would have said, "Ravi, you're stupid."
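
For anyone fuzzy on the notation Ravi is joking about: the number after the slash counts network-prefix bits, so a /8 is a bigger network than a /16, not smaller. A quick sanity check with Python's standard ipaddress module (the example networks are arbitrary):

```python
import ipaddress

# CIDR suffixes are prefix bits, not divisors: a /8 leaves 24 host
# bits, a /16 leaves only 16, so the /8 is the *larger* network.
slash8 = ipaddress.ip_network("10.0.0.0/8")
slash16 = ipaddress.ip_network("10.16.0.0/16")

print(slash8.num_addresses)   # 2**24 = 16777216
print(slash16.num_addresses)  # 2**16 = 65536
```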

David Sudia  05:26
But I think you're asking me a question about failure later. And mine is also related to CIDR blocks.

Ravi Lachhman  05:32
[Laughs] I should crack open a cider right now. But yeah, that's funny (or not funny), but in traditional software engineering, say for the day-to-day developer, the metrics are kind of changing, right? Someone who has a sense of ownership feels like they want to see their stuff in production, versus "I committed code; now it's an external party's responsibility to take it forward." From your experience, is there a sense of ownership?

David Sudia  06:08
Yeah, they were definitely trying to inculcate that sense of ownership. Because even as that cloud team grew, the model was still "every team needs to own its own deployment." They helped, and they provided templating and resources and that sort of thing, but they weren't owning the deployments for any of the teams. And that continues through how things are at GoSpotCheck right now. I generally believe in people owning the lifecycle of their code, but to a certain extent and to a certain level of scalability.

There's a problem. Before we started, you were saying you're cynical about full-lifecycle development ownership, and I share that cynicism, in a sense. There's another direct metaphor to education here, which is that over the last couple of decades, classroom teachers (what you think of as "my third-grade teacher") have had to become, if we're going to get every single student into college, reading specialists and math specialists and intervention specialists… and they have to know how to support, to a certain extent, every sort of disability that a kid might have in their classroom.

The demands put on classroom teachers are a direct reflection, for me, of the demands put on software devs right now. There's a very similar trajectory of "we used to have the QA team, and we used to have the Ops team, and the SRE team… but now it's all you. You are the person who writes it, the person who QAs it, the person who deploys it, the person who makes sure it's running correctly, and the person who jumps in if the pager goes off, fixes it, and re-deploys. The whole thing." And I think there is an idealism in that. And I agree with the ideal that the people closest to it should own it.

At GoSpotCheck right now, I'm a big fan of acknowledging where you're not hitting the ideal (it's very easy in podcasts and talks to be like, "look at all the things we did right"). So: we have this big application that started on Ruby on Rails (and we have a lot of other stuff now), but we still have a pretty large Ruby on Rails monolith, and it talks to a Postgres database. Over time, we ended up building this distributed monolith around that monolith, and there's some funky stuff going on. We have definitely been subject to the pain of "there is a big Postgres database that a lot of things talk to, even multiple things writing into it." And my team still gets the pages for things going wrong with the database, which means we primarily get the pages for things going wrong with the monolith. Even though there's a team of developers who work on the monolith, we're not 100% of the way to this sort of ownership-of-code model.

[This is] largely what I mean about sharing, right? I hopped on a page the other day, and it was about the database and some issues going on with it. I was like, "I am going to put on a very performative show here, for everyone, that is going to look like we're responding to this until my boss gets on, because he's the one who really knows about it." What I'm trying to get at here, essentially, is that it's great to want to get there, but you can't have all of the knowledge about a system in one person. My boss is the person who has the most knowledge about that database. But when we get into the Rails code that is causing an issue, he's got to go talk to the devs, even though he has spent a lot of time in that Rails codebase. And then, heaven forbid, it's something to actually do with the infrastructure around it: I'm the person who knows the most about the clusters. Even though my boss is very knowledgeable about the clusters, if we get into an edge case with the horizontal pod autoscaling or with the probes or something, he's got to come talk to me and be like, "Dave, do you really understand this part of it?"

And so, where I've seen that ownership model be most successful is either in really small teams, where there are five people who own the thing, or where you acknowledge that a team can own the full lifecycle of something, and that team has a QA person and an SRE/Ops-type person. All together, they're very tightly knit; they can communicate very effectively, right? The DevOps model was never about "one person does everything for something." It was about breaking down communication barriers between Development and Operations. Developers should not be operating Kubernetes clusters; that should not be something they have to hold in their brain, you know? But can you put the entire chain of Dev and Ops on a team? Yeah, you could totally do that. So I get less cynical about full ownership when you talk about a team of specialists fully owning something, versus "we're going to force one person to know literally everything there is to know about operating software."

Ravi Lachhman  11:51
That was probably the best explanation I've heard, ever. Before the call, I was like, "I'm a super cynic about full-lifecycle developers…" [chuckles] I think there's this notion that it's not pragmatic; you hit the nail on the head. I came up with this term, "the fog of development." It's like the fog of war; it's all about situational awareness. And you painted a very concrete picture of that: someone knows the database, someone knows the infrastructure cluster, someone knows the application code in Ruby. It's totally unfair to expect one person to do all of that; software is so complicated. Making a change in something and understanding how it impacts something else… you can make your best guess, but until the fog clears (which it might never), you can't really know the impact until you actually try something.

David Sudia  12:37
Yeah. And the reason why I know very few people do that is because it's incredibly expensive to have a QA person, a developer, a UX expert, a fully integrated stack of people, and they don't all scale equally. A UX designer might only need to put in 20% of the time on a project that the software developer does, so then it becomes hard to… and here's Conway's Law coming back: it's very hard to structure teams exactly the way you would want to, to create the systems you would want to see. So that's certainly the tricky part. I'm not going to sit here and claim to have a really great answer for that, other than being aware of it and sort of harnessing it. That was the talk I gave at Unscripted, and I hit on it in my KubeCon talk: if you can harness Conway's Law around sharing and communication, and if you can get cross-disciplined teams together to make more centralized decisions, and inform those, then that can help as well.

Ravi Lachhman  13:52
Yeah, that's perfect, I think. Just being really realistic about "people are people," right? There's always Conway's Law at play, to some degree. We are skilled in certain ways, and we tend to gravitate to things that we're skilled at.

Another question (I like to call this DevOps therapy with David, so more therapeutic questions) is about something I call the "platform engineer's dilemma." As a platform engineer, you're switching over from, let's say, software engineering to platform engineering; you're starting to make platforms for other engineers or other people to use. So it's like any product designer's dilemma: how do you get your opinion across while taking in other people's opinions, and vice versa? How do you fend off opinions that are too strong or too weak? How do you build consensus on a platform?

David Sudia  14:57
Yeah, I think it comes down to getting your own ego out of the way and having patience. There's ego in everything, right? There's this constant discussion in this field; for software developers, it's the conversation of "you are not your code; comments on your PR are not a personal attack, nor is asking you to change things." Those kinds of conversations. This thing came across my Twitter feed the other day that was really funny. It was a person who was like, "How do you frustrate developers? Oh, wait, I solved it." And a lot of people missed the joke and tried to provide answers for how you frustrate software developers. But those were my favorite.

What I really loved about that, what I reflected on, is this ego that we have. The assumption there is that it's going to frustrate developers if you pose a problem and then say you solved it: partly because you didn't provide the solution, but the other piece of it is this thing of, "Wait, I wanted to solve that problem. I'm not going to be happy unless I'm the one who solved it." And often it's not even a malicious "because I would have solved it better than you," but just an innate "I'm driven to solve problems." I think that's one of the things that almost all of us truly, deeply share as a personality trait. "And I didn't get to solve that one." So, being able to step out of that, and not having to be the person who solves the problem… I think the most valuable thing my education background gives me in this field is that I spent the initial seven years of my professional career in a highly collaborative environment, where there has to be consensus on everything. Everything takes place over a number of years to make decisions, and there's bureaucracy.

And that's valuable because I got into very heated conversations; I started in that career as a very opinionated, sort of hot-headed person. I had to temper that back to, "Oh, what's really my goal? Is my goal to do it this way, or to have the thing at the end?" I had to really start reframing my approach to meetings and efforts around, "I'm emotionally attached to there being an outcome, not to the outcome being the one that I thought of."

And that goes into the thing I said in the Unscripted talk, which is approaching building platforms as a product. A lot of places put a bunch of bright engineers together to build a platform, and it's not driven by any sort of user needs or user input. Not everywhere, though: one of my favorite talks at KubeCon North America 2019 was from Pinterest, where they were building an abstraction layer over Kubernetes objects, like "deployments are really complex, but we know what we want 80% of it to be." So they made a CRD called a Pinterest Deployment, and it shrank down what developers had to do. They built their own UI over Spinnaker to make it a little friendlier for their purposes. So there are places doing this; it's not like I just came up with this brilliant idea myself. But it is, I think, rarer than not. So, treat it like a product, and if you're treating it like a product, that means you have to have user input, and the users are really going to make the decisions for you; you're building a thing that meets their decision-making, right?
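
The pattern David describes, a thin team-facing spec expanded into full Kubernetes objects with org-wide defaults filled in, can be sketched in a few lines. This is an illustrative sketch only; the field names, defaults, and the `expand_spec` helper are invented for this example and are not Pinterest's actual CRD:

```python
# Hypothetical "abstraction layer" sketch: developers supply only the
# fields they care about; the platform fills in a full Deployment-shaped
# manifest with organization defaults.
def expand_spec(spec: dict) -> dict:
    name = spec["name"]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": spec.get("replicas", 2),  # org default
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": spec["image"],
                        # sane default resource requests, overridable per team
                        "resources": spec.get("resources", {
                            "requests": {"cpu": "100m", "memory": "128Mi"},
                        }),
                    }],
                },
            },
        },
    }

manifest = expand_spec({"name": "checkout",
                        "image": "registry.example.com/checkout:1.4.2"})
print(manifest["spec"]["replicas"])  # 2
```

The point of the pattern is that developers state only the 20% they care about (name, image, maybe replicas), and the platform team owns the remaining 80% in one place.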

At GoSpotCheck, I don't have a product manager; we're not actually able to be that professional about it, because of resource constraints. But what I was able to do is say, "Alright, I'm going to be the library." The way I get my knowledge and my opinions into this process is to know that I am respected as a person who is very knowledgeable and good at these things… but no one's going to be like, "Dave, just build it however you want." The answer, then, is to get people together, be the library of information, let them collaboratively come to the decisions and the outcomes, and know that the value you had in that process was providing the information, not necessarily being the super smart person in the room who came up with all the best answers.

Ravi Lachhman  19:40
Yeah, that's awesome. And the point that I really enjoyed was: treat it like a product, right? A product is a two-way street. You need to get feedback, because I've definitely seen your anti-pattern example, where people go off and build something, and then that's it; this is the way it's going to be until the end of time, because it was too painful to build out the first time, so we're never going to touch it ever again. Another really salient point there is the internal-versus-external-customer argument. No matter who your customer is, they don't care how you did it; they care what you did. And so that really resonated with me: hey, the outcome is what's really important.

Another thing we can talk about is technology change. Something we talked about a little before the podcast is going through a technology transformation journey. I think something David brings that's amazing is a lot of patience, like seeing a lot of long-term goals. I'm prone to living one sprint at a time. I used to not be like that, and then with this agile revolution, or whatever you want to call it [chuckles], I've slowly become sprint-driven, to this day.

But let's talk about that. Do you see that there are things in technology that can't be sprint-driven, or where there should be some patience? Technology can go, go, go. But where do you pump the brakes and take your time?

David Sudia  21:15
Yeah, actions happen in a sprint. But there are so many actions required to make organizational change happen that you have to measure success differently. I think the difference is how you organize your tasks versus how you measure success. The way we organize our tasks in this field puts an innate pressure on measuring success in the short term: "what are we getting done this sprint," which you can extend out to "what are we getting done this quarter." It doesn't help that there are a lot of things you can do in software development that only take two weeks… and a lot of things you can build in a quarter. That makes it really frustrating to hit the things that take multiple years.

We're three years into a migration from a Platform as a Service onto Kubernetes, and we still have a couple of long-tail things left. This morning, I was working with a team; I helped them get the next service all prepped and ready to go, and we got it into staging. And I was like, "Cool, alright, I've got it on my sprint: get this thing migrated. We've got staging done, we've tested everything; we're pretty confident about production. You just let me know when you're ready to go for production." And they came back and said, "Oh, well, yeah, we're not doing that till January." And I was like, "Yeah, okay, uhh… I wanted to check this off, but now I guess I won't." I just want to check it off! But at the same time, one of the things that helps me get over my need to check it off is that we're at the end of year three. We're continuing to make progress, and we're going to get there.

I think another example is shifting from one observability stack to another, or anything where there's going to be a lot of code change across a lot of projects; anytime you're trying to change behaviors or habits. When I was a special ed teacher, I specifically worked with kids with behavioral disabilities; I was a behaviorist. And so the other big thing I carry over from that career is this: I'd get a kid in kindergarten, and I would really hope that by the time they left fifth grade, we'd have made good progress, measuring success across six years. And in the bureaucracy of the public education system, you come up with a really good idea for how people could do things, and maybe five years later it's getting implemented in a pilot. That's one extreme.

And then I think software has the other extreme of "I came up with a good idea; it's in prod tomorrow." You can do that! So the reality depends on what you're doing: there are things that can go into prod tomorrow, absolutely. I don't think everything has to be a five-year project approved by six committees. But there are things where, if you're trying to change the behavior of 300 people, it's hard. It has to be approved by lots of people; you have to get input; you have to get buy-in.

I think there's an inherent frustration in that process of trying to guide organizational change that will take years. But where the sprints come in is that all you can do for that change is keep chipping away at it, slowly, slowly, slowly: "this week we got this team to move that app." In doing so, we came up with a bunch of new best practices that we should really port backwards at some point, but I'm not going to try to go convince everyone to change again. It's about having the patience to acknowledge that things are going to take a long time, and to balance that against the instant gratification that's so inherent in this industry.

Ravi Lachhman  25:31
Ah, I think that's perfect… more therapy with David. Certain things take longer; certain things don't. I remember working on a project for the IRS, where we were building things to find tax under-reporters. That had never been done before, and it was one of the last Waterfall projects I've worked on; we rejected that approach because we were like, "We're doing everything, we're creating from scratch; we have no idea how long any of this is going to take." So that project got a lot of long-term goals, versus another project where we were on two-week sprints. It's like, "What?"

David Sudia  26:09
I think it probably is different by industry. When we talk about software engineering and software development, we tend to think about the startup world, right? The exciting, sexy part of software development. It's like all the Hacker News posts: you can't read them too much, because otherwise you start to think that everyone is greenfielding Rust everywhere.

Ravi Lachhman  26:32
Angry orange paper, you mean? Hacker News? [chuckles]

David Sudia  26:35
Yeah. There are a lot of comments there. If you read it too much, it tends to centralize startup stuff, and big, funded startup stuff, where you can afford to have a team of people who invent a database just for your specific use or something. And then there's open source. And yeah, I have a feeling that if you talk to your average .NET engineer who's been working for a bank for 20 years, they're probably going to have a different perspective on all the things I'm saying right now; and it's probably a healthier one. But yeah.

Ravi Lachhman  27:08
Yeah, definitely. Perspectives and organizations are all different. And especially in the industry I'm in, we can be greedy. We do a lot of that startup stuff: if it's not cranking out every couple weeks, you're wasting time. And there are different levels of what happens if you get something wrong. I was having a conversation with someone a while back; they were in charge of the check-cashing application for Bank of America, where they process over a trillion dollars in checks. Their change-control process is crazy compared to, "Oh, yeah, your picture's not going to get uploaded with the filter," right? But you know, who cares? It's not like the economy will stop tomorrow.

David Sudia  27:51
[laughs] Yeah. A lot of A/B analysis in that process, for sure.

Ravi Lachhman  27:58
Absolutely. As we're rounding out the podcast here, we're coming to the last questions, which are always the fun ones to ask: times when things didn't go right. We have some common tissue in that CIDRs caused us both pain, but can you talk about one or two times (you don't have to give too many specifics) when things didn't go right, and how you bounced back?

David Sudia  28:22
Now I'll give you specifics. So, I come from the software side; I wasn't coming from the operations side when I became a DevOps engineer. I went through that journey at Fanatics, and then I basically got headhunted over to be a DevOps engineer. On the team I joined, the other two people came from a more traditional "installing racks" sort of operations background. I was pure software, originally coming out of a boot camp, so I was trying to apply the best practices I'd learned from the cloud engineering team at Fanatics. I was like, "Right, so we're going to have dev, stage, prod; they'll all be completely separate; they're never allowed to talk to each other. And so that we know they're as identical as possible, we'll set up the networks all exactly the same, and the infrastructure will be pretty much exactly the same." And so, for our dev, staging, and production networks, all the CIDRs had 100% overlap. I don't want to say I walked in with an "I know that this is right" attitude. Memory's a fuzzy thing, but I'm pretty confident I said, "Please push back on me here." But also, I'm someone who gets invited on podcasts because I talk very confidently about the things I say. And I've learned I have to be like, "I could be wrong. Please tell me if I am; don't take what I say as gospel." I am happy to be told I'm wrong. And so I was allowed to go through with this plan, and I executed it.

Then I was talking with our data pipeline folks, and they were like, "Yeah, well, we don't have dev, staging, prod (well, we do, but it's all within one set of VMs, because of the vendor and stack we were using)… it's just one VPC." So our dev, staging, and prod were in different places, and the pipeline was all in one place. And I was like, "Oh. So if I give you an IP, if we hook these networks up together, that's not going to work. Because you have one VPC, and I have three VPCs, and the CIDRs are all exactly the same. Oh. Then we're going to have to rebuild everything."

"We're going to have to rebuild everything. We're going to have to start back at the network layer, completely rebuild our VPCs, then the subnets within them, then redeploy our clusters… which means that every single application is going to have to be redeployed to all three environments. Oh, I have created a lot of work for people."

And so that was, you know, me having to go and basically nutshell everything I just said to five development teams, and be like, "Yep, so basically, I'm going to get everything ready. Then we're going to have a day where everyone has to add all these new deployment targets and completely redeploy. Then we're going to have to roll over all the DNS. So basically, there are going to be one or two days where you can't do anything but help me with this. And it's my fault."

And yeah, I don't know. I mean, what are the lessons I take from that? One of them was: make sure people understand that you don't know everything, and that you would really appreciate it if they told you when you're wrong; that you would like pushback. I guess another was: really make sure you go and learn and understand things before you make really fundamental decisions. I really wish I'd read a book about networking before I did that. I don't know; I may still have made the same decision, based on the philosophical principle of "they'll be separate; they'll never talk."
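
The root cause Dave describes, identical CIDR blocks in networks that later had to be connected, is cheap to detect before building anything. A minimal pre-flight check with Python's standard ipaddress module (the environment names and CIDRs here are hypothetical stand-ins for his setup):

```python
import ipaddress
from itertools import combinations

# Hypothetical environment networks; in the story above all three used
# the same CIDR, which only fails once the VPCs have to be peered.
envs = {
    "dev":     ipaddress.ip_network("10.0.0.0/16"),
    "staging": ipaddress.ip_network("10.0.0.0/16"),
    "prod":    ipaddress.ip_network("10.0.0.0/16"),
}

# Flag every pair of environments whose address ranges collide.
for (name_a, net_a), (name_b, net_b) in combinations(envs.items(), 2):
    if net_a.overlaps(net_b):
        print(f"{name_a} and {name_b} overlap: {net_a} vs {net_b}")
```

Running a check like this against the planned networks would have flagged all three pairs before a single VPC was created.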

Ravi Lachhman  32:22
Yeah, they would never talk. I believe it. [chuckles]

David Sudia  32:25
I don't know. That whole experience has also made me a huge fan of maps (architecture maps, service maps), though I have had a difficult time convincing everyone else of the value of that documentation, and of constructing it before you start doing things. Because then I would have known that this thing over here would have to talk to all three, and that might have influenced my decision.

Ah, yeah, I don't know; I learned from it. That's the best I can say. I'm never going to make that particular mistake again!

Ravi Lachhman  32:54
Yeah, that's challenging. You had to go back to the different teams, but hey, a lot of great learnings came out of that. We've all been there sometimes: "Oh, boy, how do I tell this person this?"

David Sudia  33:10
As long as you're straightforward and own it, and you're not trying to blame other people, and you just take responsibility for your piece and your mistakes and move on, and you're in a healthy culture that accepts that, which I think is the last big piece… but yeah, everyone was okay. They understood.

Ravi Lachhman  33:27
We all make mistakes; we are all learning. That's a very important thing in modern software development: you're doing stuff for the first time, all the time. So you're going to get stuff wrong. It's a given that you're going to get stuff wrong, and iteration is key.

David Sudia  33:42
A whole podcast could be spent on all the bad decisions I've made in Kubernetes. Like, "Oh, well, now I know about that thing. It has always been there, but it was one of the 700 things you need to learn, and I just learned about it."

Ravi Lachhman  33:54
That's hilarious. One more thing (we'll probably spend one minute talking about this, and then I'll ask you the last question I ask everybody). What I've seen in the Kubernetes ecosystem (I'm coming from a very stringent Java development background in distributed systems, and then kind of platform engineering, the stuff I do now) is that with platforms like Kubernetes, the only constant is change. The approach you took a year ago might change this year.

What would be your advice to someone jumping into that? A lot of times before Kubernetes, the paradigm wouldn't shift so fast; but kind of a good example is service mesh. Like, we didn't implement it a year ago; now we have one.

David Sudia  34:45
Yeah, so I gave a whole talk at Deserted Island DevOps earlier this year; you can go find it. It's called "If You Can Wait Six Months, You Should", because I actually think the answer is to wait. It's to not adopt things. It's to shrink down the scope of what you need to the minimal set. And I think in that talk I said infrastructure is also an MVP. It's not quite as much of an agile MVP as a brand-new greenfield software project can be, but I think treating it like "what are the fewest things we need for this right now" is really valuable, because the rest of the world is treating infrastructure like an agile MVP type situation right now. And my favorite announcement from KubeCon is that they're going to be slowing down releases. They're slowing down!

But yeah, service mesh is a great example. Three years ago, I was writing my own Envoy API implementations, and I got about halfway through and they changed. They went to v2, and then I had to start over. And then by the time I was done with that, Istio was a thing. And then by the time Istio actually felt approachable and was ready to go, Linkerd was just "linkerd install", and then you have a service mesh.

So yeah, things are advancing so rapidly. We held off on adopting a service mesh for a long time. It was like: do we really need it yet? Do we want it? Yes. Do we need it? I really want it… but no, we don't need it. And those are sometimes hard, oftentimes emotional decisions to make, especially because it's really easy to feel like you need everything: if you actually want to be doing this right, you've got to be doing a service mesh right now. And like, "Maybe, maybe not." You've got to take it situation by situation. But by the time we really adopted one, yeah, we really needed it. Then we could make a better choice, and things had come along so far that it was incredibly easy, as opposed to "let's go spend six months writing our own Envoy API implementations."

So I think that's the real crux of it: just, like, slow down. I don't know what most of the projects in the CNCF are now. There was a time when I knew the names of all of them; now I don't even know the names, much less could I tell you what they are and what they do, because there's so much happening, and it's so fast.

Ravi Lachhman  37:18
There's 1500 cards now in the CNCF Landscape, so…

David Sudia  37:23
Yeah, and my qualification on that is that most of those are vendors. There aren't actually that many projects, but yeah, there's a lot of cards. And getting into vendor analysis is a whole ’nother discussion. But there are so many pieces, and you don't need all the pieces. You don't have to have all of them; you can just focus on what you need. I mean, autoscaling, just to zero in on one other thing: just yesterday, we were trying to— I don't remember what version of the autoscaling API we're on right now across a couple of different deployments— but we were talking about one of them, and I was like, "Oh, yeah, and we should look at what version of the autoscaling API lets you set elasticity settings like 'don't scale if you've already scaled in the last two minutes', because those settings for a long time were at the cluster level." Now it's in the autoscaling API, but only when you get to a certain API version, which only becomes available in a certain Kubernetes version… there are too many things to track.
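For anyone following along at home, the per-deployment stabilization settings Dave seems to be describing landed in the HorizontalPodAutoscaler's `behavior` field (available once you reach the autoscaling/v2 API, or autoscaling/v2beta2 on older clusters). As a rough sketch— the deployment name and numbers here are purely illustrative— a "don't scale down again within two minutes" policy looks something like:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api        # illustrative target
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      # Only scale down after 120 seconds of consistently lower
      # recommendations, per-HPA rather than cluster-wide.
      stabilizationWindowSeconds: 120
```

Before this field existed, the scale-down cooldown was controlled cluster-wide via the controller manager's `--horizontal-pod-autoscaler-downscale-stabilization` flag, which matches Dave's point about the settings living at the cluster level.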

So the number one thing is to shrink down the number of things you are tracking, and then you can add anything else later. Check in twice a year and be like, "Where's that at right now? Okay, it's at this stage. Is it easy enough for us to use? Yes? Great. No? Check it again in six months."

Ravi Lachhman  38:54
That should be the model everybody follows, but unfortunately there's a lot of chasing going on. Hey, I love that advice. Print it on the wall: wait, and every six months go back and check it again.

David Sudia  39:05
I get it, because the chasing is addictive. You know, I was the number one person at my work going, "But we need a service mesh, we have to have mutual TLS. It's a security thing." And everyone was like, "Yeah, but the things that are there aren't live yet. We don't have security concerns because there's no data." And I was like, "Oh, yeah, okay, I'll buy that. All right, you're right… but I really want to play with this!" It's addictive, and it's about overcoming that addiction. It's another reason to source decisions out to a lot of people, and to find the people who will talk back to you and be like, "No, you're wrong."

Ravi Lachhman  39:42
[Laughs] Well, awesome opinions, always solid advice.

Last question for you: let's say current David ran into young David, fresh out of school, on the street. What would you tell your past self? It could be any set of advice: technology, life, "don't eat that, wild berries are dangerous."

David Sudia  40:09
Life is longer than you think. I think that's the thing. And I mean, we could all go tomorrow, and there's that piece of approaching life. But the other one is just I could not have predicted then where I would be now. I couldn't have predicted where I am 5-10 years ago. I feel like when you're in your teens and in your 20s, every year is so much more of your life. This is really cliche advice, I don't know, but everything feels like if I don't do this now, if I'm not… I don't know, I don't care as much about Keeping Up with the Joneses now in my 30s as I did in my 20s, even though you think I would feel it more. It’s like there's time to shift careers and shift perspectives, and you will not believe the things in 10 years that you believe now because you'll learn new things. Yeah, that’s it.

Ravi Lachhman  41:14
That's good! Yeah, life slows down, you know, even in the engineering ranks. I used to be a lot more angsty; if I look back at younger Ravi, it's like, "Oh, really?" When new ideas come up now, you're not angsty, ready to jump on anybody. It's like, okay, let me hear it out. You've had enough experience to know you need to hear what people are saying.

Awesome. Well David, thank you so much for your time on the podcast today. Hope to see you around, and hope to catch everybody else in the next episode of ShipTalk.

David Sudia  41:42
Thanks for having me, appreciate it.