What Can DevOps Teach Us About DataOps? A Podcast with Laird Popkin
In the latest episode of the Designing Enterprise Platforms podcast from Early Adopter Research (EAR), EAR’s Dan Woods spoke with Laird Popkin, who most recently served as CTO of the Hearst Transportation Group, which is a group of B2B companies that provide data and data services related to cars, trucks, and airplanes. Popkin previously worked at Kaplan and also helped to create a company for large-scale peer-to-peer data synchronization called Pando. Their conversation explored two main concepts for CTOs, CEOs, or CIOs trying to reconstruct their data platforms: DevOps and DataOps.
Listen to the podcast or read an edited version of their conversation below:
* 5:35 – How Popkin implemented DevOps successfully
* 15:30 – How to break down silos
* 21:30 – The challenges of DataOps
Woods: If you look at what DevOps has accomplished, there are two levels that you have to talk about. One is the cultural level and one is the process level. The process level was all about taking the separation between the operations team and the developer team and bringing those together so that you had one team looking from the beginning of the process all the way to the end. It also focused on automation: streamlining the process and providing new tooling and new levels of automation. The cultural part of it was looking at the whole system from the perspective of the end user and what experience they were having. You would prioritize whatever you were doing to make sure that end user happiness was optimized. The most advanced DevOps situations I’ve seen are when somebody can press a button and the entire system is rebuilt from the source code repositories all the way to being deployed in production. What do you feel is the right way to define DevOps?
Popkin: I see DevOps as a transformation around culture. And that’s supported by a transformation around process and how people work together. All of that is enabled by tooling. The end goal is that you can operate without fear and blame, and you can move fast and at low risk. So it’s an interesting dynamic, where you can actually get stuff done faster, cheaper, and at higher quality, when traditionally the argument is you can only have two of those. But if you do it right, there’s no magic to it. It’s just that if you invest across the whole spectrum, the technology enables better processes, the better processes enable a better culture, and so they accelerate each other. The first business unit I focused on inside Hearst Transportation was a command and control, top down organization, where the technology processes were all very manual. Whenever we did a new version of software, the release process involved manually building stuff, manually copying stuff, manually editing config files, and so on. So every release was very scary and high-risk. Because everything was very fragile and high-risk, they did it very infrequently. You had small numbers of huge, scary releases. Because of the fear, before any release there was a meeting, and you can picture a bunch of people playing chicken, all looking at each other to see who would flinch and whether the release was okay to go out. They’d have that meeting and then over the next day, they’d do all the work to get prepped for the release, and before that, there was probably a two-week QA cycle. Then they would do the release, which was a big, long manual process, and afterwards they’d spend a day manually testing everything to make sure it still worked. The whole release process was terrifying and incredibly labor-intensive.
Woods: And infrequent.
Popkin: And very infrequent so the product velocity was vanishingly small because to build a feature was super expensive, to push the feature out was very hard, and so on. I couldn’t even say what the release cycle was. By the end, a year later, everything was completely automated with test coverage, for example. And so my success story there was somebody in customer service got a dealer call that the software was acting a little funny in a very old version of IE, the developer was in an open plan space where he was maybe five feet away from the customer service person, he heard that they were talking about a problem and said, “Oh wait, I forgot to test blah blah blah, ancient version of IE in my JavaScript.” He was embarrassed, added that to the test coverage scripts, fixed his bug in the software, committed it, which automatically triggered a build, did a Docker deploy to the test environment, ran the test coverage, passed the test coverage and they pushed it out and it was six minutes from the phone call to the fix in production and it wasn’t an emergency. Continuous integration, continuous deployment, and DevOps enabled that.
Woods: Well, this is exactly what we try to cover on the Designing Enterprise Platforms podcast, which is how you make transitions to higher states of operational quality and performance. At a high level, can you talk us through the steps you went through to introduce that, from starting from scratch and then getting to that end state that you just described?
Popkin: So the process I used at that company was that I split the dev team into two teams, track one and track two. Track one kept the current stuff working, bugs fixed, and so on, because you couldn’t ignore that completely. For track two, we actually looked at whether we could incrementally get there by going from .NET X to Y, et cetera, and ended up deciding that it was more efficient to do a ground-up rewrite.
Woods: Did you rewrite the services as well, or were those services wrapped and kept in .NET?
Popkin: Great question. The application was HTML frontend, C# service side, and the new version was NodeJS service side, Angular frontend. And we rewrote both. Part of the reason we picked NodeJS and Angular was that Angular is super popular and there was a lot of community support for it. But a big part was that this way everything’s JavaScript everywhere, which meant that I could have developers across the whole stack. So we used MongoDB, which is essentially JavaScript storage, you could think of it that way. NodeJS on the service side, Angular on the client side, and so therefore any developer could work on the entire stack, so I didn’t have to have a DBA team, a C# service side team, and a JavaScript client side team. It eliminated all the friction.
Woods: That broke the silos between dev and ops?
Popkin: Yes, exactly. One of the disciplines we had was that I collapsed a couple of silos together. One was that there was no such thing as database versus service side versus client side. If an engineer took a story to own, they owned the whole stack, all the way through. And that included tests. I eliminated the QA department because of the handoff. Instead I made all the developers responsible for writing their own test code or pairing up with another developer, because it was often healthy to have somebody else write test code, because that way you catch assumptions. That really reduced a lot of the friction. To force the issue, we worked on a weekly sprint cycle, so every week was sprint demo, delivery, push to production.
Woods: You were basically able to build a bit of the whole pipeline and then a bit of the new software and then eventually the whole software?
Popkin: Exactly, I’m a big believer in a tracer bullet. Rather than working on a little bit of service side code, I picked one use case all the way through database, server, client.
Woods: And then all the tooling that needed support.
Popkin: Yes, which meant automated builds and we used CircleCI, Docker, Amazon’s ECS, which is their elastic container service, and these days I would use Kubernetes. We brought in a couple of smart guys to help us get a lot of those decisions made and a lot of that set up. We spent the first couple of weeks getting people to wrap their head around the tech stack and getting automated build working. Most of that heavy lifting was by some smart consultants who were deeply experienced in that arena. But everything was paired. They never went off and did stuff on their own and handed us the results. And with the two teams that I talked about, every week I rotated people back and forth. Nobody was on the maintenance team forever. They might be on a maintenance team for one week and then the next week they’d be on the new code, new product team. What you don’t want to have is some guy to think, “Oh, I’ve been relegated to maintenance mode,” and therefore they’re doomed. And an interesting thing happened over the course of a couple months of ping-ponging back and forth all the time. Some people loved the new world and some people liked the old world. People sorted themselves out so I ended up having a self-selected maintenance versus new product team based on the engineers’ feeling comfortable and picking what they wanted to do.
Woods: If you start thinking about all the goodness of that DevOps, you get the breaking down of the silos, creating one unified team, dramatically increasing automation, thinking about the end result and what’s the effect on the end user and using that to prioritize the work that you do. Now we’re entering a world in which people are realizing that they have the same sort of problem in these data pipelines that they create and the data pipelines never were designed with your application in mind. It’s a messy environment. But there’s another aspect to DataOps that I think is important and that is, you’ve got not only the pipeline but you’ve also got the fact that what’s going through the pipeline is actually valuable information that needs to be governed and protected. And the governance and protection of that also has to be included in the entire DataOps. There’s not CircleCI for the DataOps world yet but what you want underneath are components that can be automated and respond to declarative automation so that you don’t have to be constantly having 10 DBAs running around changing databases all the time in order to keep up with your stuff. What’s your experience of what people are calling DataOps?
Popkin: There are definitely a couple of levels to it. One is that there’s the software that’s doing the data work, so the data science code that’s doing modeling and cluster analysis. But then there’s also the data flowing through that platform, and so you have to think about both levels. So what’s the lifecycle of my data science software? In my experience, at most companies that’s not very mature. I will say that data science can be amazing but it’s not consistently in source code control, and there’s not always versioning. In software development you have this maturity that’s been in place for a while, but around data science, a lot of times what you end up with is brilliant data scientists who have stuff on their computer that they run. And there may be a copy somewhere else that’s deployed operationally, but a lot of times it’ll be just on a box in the data science team that’s not really managed operationally in the same way as an external-facing thing. They’re sort of a little less managed. And so you need to manage lifecycle around the data science code, the ETL code, and so on, but then there’s also the flow of the data through all that code. And that’s not something that software development teams usually think about unless you’re building something like a content management system. You don’t press a button in DataOps and then you’re done. The metaphor is you’ve got an airplane in flight and you don’t get to just stop the airplane and swap the engine. You have to figure out how to put the new engine in there but leave the old engine running, because the data’s got to be flowing. In some sense, it’s a bit like how do I keep a cluster of 100 servers running when I’ve got new versions of software getting rolled out and deployed, and how do I keep the versions working together so that my application is running consistently?
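One way to picture the "new engine alongside the old engine" idea is a shadow run: the current pipeline step keeps serving results while a candidate version processes the same records on the side, and disagreements are logged for review. The Python sketch below is an illustration under that assumption, not a description of what Popkin's teams built. The function name `run_in_shadow`, the `miles` field, and the two toy pricing steps are all hypothetical.

```python
def run_in_shadow(records, current_step, candidate_step, tolerance=0.01):
    """Serve results from current_step while trialling candidate_step.

    Returns (served, mismatches). `served` always comes from the current
    step, so the data keeps flowing even if the candidate misbehaves.
    """
    served = []
    mismatches = []
    for record in records:
        live = current_step(record)
        served.append(live)
        try:
            trial = candidate_step(record)
        except Exception as exc:  # a crashing candidate must not stop the flow
            mismatches.append((record, live, repr(exc)))
            continue
        # Flag results that differ by more than a relative tolerance.
        if abs(trial - live) > tolerance * max(abs(live), 1):
            mismatches.append((record, live, trial))
    return served, mismatches


# Hypothetical steps: a fee in cents computed from mileage.
def old_fee(record):
    return record["miles"] * 10


def new_fee(record):
    rate = 10 if record["miles"] < 1000 else 12
    return record["miles"] * rate


served, mismatches = run_in_shadow(
    [{"miles": 500}, {"miles": 2000}], old_fee, new_fee
)
print(served)
print(mismatches)
```

Once the mismatch list stays empty (or contains only expected changes) over a representative period, the candidate can be promoted to the serving role.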
Woods: So the challenge of what DataOps is doing, namely how do you break the silos, how do you increase automation, how do you get a bunch more people working together, is a much broader problem because there are so many more people and systems involved.
Popkin: In the data world, yes, there are more silos because you’ve got, for example, data science and/or analytics and they use a completely different set of tools and they view the world completely differently from the operational people. What I see is that the data scientists come up with algorithms, they’ll validate them internally, but then they’ll hand the algorithms over to a completely different team to re-implement in a different language in order to deploy and operationalize, and there are huge friction costs there.
Woods: Have you run into a situation in which there was a very large data lake or data warehouse, that then had to have a pretty broad and distributed set of governance over it in order to allow people in a controlled manner to get access and inspect data?
Popkin: For sure. One of the real challenges I’ve run into is that you want to have a data lake for people to noodle around and try stuff easily. Like if you’re a data science team or analytics team and you want to get weather data and play around with it in combination with other stuff. That can’t turn into a three-month long ETL process. It’s got to be super easy because you’re just trying ideas out and you want to have an idea on Monday, try it on Tuesday and decide if it’s a good idea or bad idea on Tuesday, not put it in a queue for an ETL team to do next quarter, right?
But at the same time, if you’ve got core data that you’re delivering to your customers, like the valuation of a car that insurance companies can use for determining insurance pricing or that a bank’s going to use for lease valuations, you really have to have absolute lockdown governance around that data because the business impact of getting that wrong is very high. You have to have a data lake and have it be very fluid for some purposes but have the refined output product absolutely reliable and locked down. You have to be careful to maintain that, and I found it was very helpful to put a lot of governance around access control and process and also to build QA into the tooling. If a vendor changes their file format or they had a problem operationally or what have you, and all of a sudden I’m getting a bunch of zeroes when I should have been getting dollar values, I want to know that in minutes, not because some report looks a little funny two weeks later. Getting that early warning requires a little bit of effort to build these sorts of assertions in, but it’s not that hard and the value’s huge.
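To make the "bunch of zeroes instead of dollar values" example concrete, here is a minimal Python sketch of the kind of assertion that could run on every incoming batch. It is an illustration only: the `price_usd` field name, the 1% zero threshold, and the function name `check_valuation_feed` are assumptions, not details from the interview.

```python
from dataclasses import dataclass


@dataclass
class FeedCheckResult:
    ok: bool
    problems: list


def check_valuation_feed(rows, max_zero_fraction=0.01):
    """Run cheap sanity checks over one batch of a vendor feed.

    `rows` is a list of dicts; "price_usd" is a hypothetical field name.
    """
    if not rows:
        return FeedCheckResult(ok=False, problems=["batch is empty"])

    problems = []
    prices = []
    for row in rows:
        value = row.get("price_usd")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            prices.append(value)

    # Rows without a numeric price often signal a changed file format.
    bad_rows = len(rows) - len(prices)
    if bad_rows:
        problems.append(f"{bad_rows} rows lack a numeric price_usd")

    # A sudden run of zeroes usually means an upstream failure.
    zeros = sum(1 for p in prices if p == 0)
    if prices and zeros / len(prices) > max_zero_fraction:
        problems.append(f"{zeros} of {len(prices)} prices are zero")

    if any(p < 0 for p in prices):
        problems.append("negative prices present")

    return FeedCheckResult(ok=not problems, problems=problems)


good_batch = [{"price_usd": 18500.0}, {"price_usd": 22990}]
bad_batch = [{"price_usd": 0}, {"price_usd": 0}, {"price": 21000}]
print(check_valuation_feed(good_batch).ok)
print(check_valuation_feed(bad_batch).problems)
```

Wired into the ingestion job, a failing result would block the batch and raise an alert within minutes of arrival rather than surfacing in a report weeks later.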
Woods: How would you go about doing the same thing with a DataOps environment that you did when you implemented DevOps?
Popkin: It’s a lot of the same concepts: you want to have continuous integration and automation. For example, if a data scientist tweaks an algorithm and does a commit into source code, that can automatically build, deploy, and run through tests just like application code. Obviously some of the details are different, like you want to have test sample data sets to run on and validation around outputs and that sort of thing. But all the principles around source code control and test validations and automation and release process and so on are very similar. And then again, using things like Lambda certainly makes it a lot easier too because it eliminates a lot of the heavy lifting.
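As a rough illustration of "test sample data sets" and "validation around outputs," the Python sketch below shows tests that a CI job could run on every commit to a data science repository. The `estimate_value` function is a toy stand-in for a real algorithm, and the sample vehicles and the 15% depreciation rate are invented for the example.

```python
import unittest


def estimate_value(base_price, age_years, annual_depreciation=0.15):
    """Toy stand-in for a data scientist's valuation algorithm."""
    return round(base_price * (1 - annual_depreciation) ** age_years, 2)


# Small, version-controlled sample data set (hypothetical values).
SAMPLE_VEHICLES = [
    {"base_price": 30000, "age_years": 0},
    {"base_price": 30000, "age_years": 3},
    {"base_price": 45000, "age_years": 10},
]


class EstimateValueTests(unittest.TestCase):
    def test_outputs_stay_in_plausible_range(self):
        # Validation around outputs: positive and never above the base price.
        for vehicle in SAMPLE_VEHICLES:
            value = estimate_value(**vehicle)
            self.assertGreater(value, 0)
            self.assertLessEqual(value, vehicle["base_price"])

    def test_older_vehicle_is_worth_less(self):
        # Property check: value should fall as the vehicle ages.
        self.assertLess(estimate_value(30000, 5), estimate_value(30000, 2))

    def test_known_reference_case(self):
        # Regression check against a hand-computed answer.
        self.assertAlmostEqual(estimate_value(20000, 1), 17000.0)
```

Saved as a test module, this can be run with `python -m unittest`, so a build server can gate deployment on the result the same way it would for application code.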
Woods: What advice would you have for people who are thinking about trying to understand what DataOps is and trying to find their own ways to put it to work and see if it works?
Popkin: I’m a big believer in automating everything possible just because the end result is such a friction reducer. Obviously, if I’m noodling around on my laptop, just sort of trying stuff out, I’m not going to worry about automating everything. But once I’m doing kind of a real project, it’s worth having some test code, it’s worth having automated builds and deploys. Once you’ve done a couple of them, it’s very little extra work. If you do a little testing over obvious stuff that’s very, very easy to write, you still catch problems much earlier than if you didn’t do that. And then if you can align the data science and the operational environments and tooling, that’s a huge friction reducer because I’ve been at several companies where data science or analytics would work in tools like R and then the operational or the product people would have to go recode it all in Java or C# or something and that could take six months or a year before it actually works. That was super expensive and introduced a lot of bugs and took a long time to work through.