AWS Mumbai 2019 - Governance for the Cloud Age
Watch our talk at AWS Mumbai on Cloud Governance
Disclaimer: Automated Transcript
Welcome everybody and thank you for coming to the session. I'm Nathan from Turbot and I'm very excited to be here in India and in Mumbai. First I was excited to be in India so I could visit our office in Kolkata and meet our great team there. Then I saw the IPL final on the weekend with that amazing last-ball victory from Mumbai, so I was even more excited, and then finally today here at the summit. So many thousands of people are working through the question of how to move to cloud and how to get their enterprise ready. There have been some amazing conversations and it's great to be here.
So today I'm going to talk to you about Governance. And what we mean by governance is really this question of how you're going to look after your environment and move with great agility while keeping it in control. The agility of the cloud is very, very powerful, it's something we're trying to grab as much of as fast as we can. But we have to do it in a way where we're staying in control and keeping ourselves safe. And that's what governance does, just like the normal government you live under every day that tries to provide you with those services and those capabilities while also really enabling your freedom. That's what we are trying to provide for our customers, our clients, or our internal teams when we think about the cloud.
01:12 So what do we mean when we talk about governance? We're talking about how do we protect you, how do we educate our users, how do we set up that infrastructure and the rules, to give them freedom. Before we get into that though, it's important to think about why and how this is different than what we've done before.
Often we think of cloud as just being a virtual representation of things we've done before, and it's not. There are a number of things that make it uniquely different. The first is just the sheer pace of movement of the cloud. Amazon's adding a thousand new features a year. You simply can't compete with this with your own internal little IT team, building your own little capabilities. You need to think instead of how you're going to ride this rocket. How do you enable these capabilities for your business, not competing with them, but enabling them? How do you let them use RDS with all those capabilities, how do you let them use KMS, how do you let them go serverless? If you don't, you'll get run over.
02:15 The second thing we have to be aware of is the fundamental shift in power. Application teams are now starting to control the infrastructure. Previously infrastructure was something where a team held the budget, they built the servers, they had the physical assets and they controlled what went where, and it just doesn't work that way anymore. Application teams are provisioning their own infrastructure, their own serverless, and they're doing those things in real time. The idea that you're manually reviewing requests is gone. As we move forward that's becoming a software problem.
Coupled with all that agility, we have to think about control in new ways. Specifically, the first one is that expectations on getting control right are so high now that we can't miss. If you miss and your public-facing bucket is found, you're in the paper. If you miss, people ask: why didn't you encrypt your data? KMS is there. Why didn't you encrypt in transit? What, you're not running GuardDuty? You have to be doing every single one of these things. These new capabilities are amazing and powerful, but the responsibility we have to deliver them goes up with every single one that comes out. It comes back to that pace of change and how you're going to keep up as a team.
At Turbot we believe this means that the change from physical infrastructure to software-defined infrastructure is coupled with a change towards software-defined operations. If your application teams can change your infrastructure in real time, which is the goal of cloud, then you need operations that can change in real time along with those applications. You need enforcements that work in real time. So we have to think about how to move forward from a place where we're chasing tickets and handling incidents to a place where we're killing those tickets with automation. We never, ever want to see those issues again; we want automation and software to move them away. We start with scripts (most people start with scripts), but where you want to end up is running full software for your operational stack.
Now as we're trying to enable our business with agility and control, the next thing we have to think about is how we're going to accelerate them, give them best practices, and make them move faster. And for that, what we found is that a clear, consistent architecture is the fundamental building block. We used to hand-build networks, hand-configure a server, name things carefully, treat them like pets.
Now we need to think in much broader strokes. We need to think about how we're going to deliver on something like a 10-minute server, and to do that we need a full end-to-end capability for that stack. Who can log in? How can they log in? What's a valid server? How do we name that server based on its IP? By the way, it might get stopped and started with a new IP. So you've got to be able to handle every single one of those things with a clear, automated architecture design. That requires significant and difficult thought, particularly at large scale. When our customers are running hundreds of Amazon accounts with hundreds of applications, you need a consistent approach.
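To make the naming question concrete, here is a minimal sketch of deriving a server name from its private IP. The `hostname_for` function, the `srv` prefix, and the dash-separated scheme are assumptions for illustration, not a convention described in the talk. Because the name is computed from the current address, it would have to be recomputed whenever a stopped server starts again with a new IP.

```python
import ipaddress


def hostname_for(ip: str, prefix: str = "srv") -> str:
    """Build a hostname from an IPv4 address (hypothetical naming scheme).

    Raises ValueError if `ip` is not a valid IPv4 address.
    """
    addr = ipaddress.IPv4Address(ip)  # validates the input
    octets = str(addr).split(".")
    return prefix + "-" + "-".join(octets)


# Recompute the name each time the server starts, since the IP may change.
print(hostname_for("10.0.12.7"))  # srv-10-0-12-7
```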
05:55 The other very profound change that we're going through – particularly for those who are thinking about services – is that we're really moving from a pattern of support to a pattern of help. When things were done through requests we would create a request and then we try to fulfill that request. So we were used to the idea that we could ask for things, check on status. We end up with a project manager on the infrastructure side arguing with the project manager on the application team etc.
Now, when you rethink that in the world of a 10-minute server, what happens? You come to the meeting and say “I need a server”. Ok, start it. It's over, right? So that world changes from one of chasing down requests, chasing down support, to instead helping people to get it done. Help me set up a server. Help me provision storage. Help me get serverless running. The cloud teams, the support teams, are now helping those application teams. Application teams hold more of the capability, the budget, the journey, and now the cloud team's job is to help them.
So for us, as we've thought about that, we started to think about what governance means in the age of the cloud. We broke it down to a few key things. The first one of those is you've got to come up with your rules and regulations – just like any society. How do you want to work? You can start with compliance frameworks and standards – they're awesome. But if you think about how your enterprise really works, there's a policy here that says how we're going to store data. There's another policy over there that says how we're going to name a service. There's Joe over in the network team who knows that we create route tables and happens to name them in a certain way. There's a whole bunch of enterprise knowledge caught up in people as well as policies and procedures.
07:56 If you're going to automate that, you have to codify it, standardize it, think about it in well-named, well-architected, repeatable ways. Everything. Naming standards, IP ranges, server types, server naming schemes. The whole thing needs to be rules and regulations, and because you're an enterprise, you know there's going to be an exception. “Everything's encrypted! Except for that one there because it's a special project.” There are always exceptions. So not only do you need the rules, you need the ability to think about them with exceptions, and with the capability to adjust and change those.
Second, we have this concept of infrastructure. To us that means: what are the core capabilities we're laying down in the environment to accelerate our application teams? Networking, for example, is an obvious one. Logging buckets, turning on CloudTrail, setting up your different governance components: how do we speed you up? At Turbot we think of that in terms of automating all the things you don't care about but that are very important. If you create a DynamoDB table, backups should be turned on. If you create an S3 bucket, it should be automatically encrypted. You need that whole flow of infrastructure accelerating and speeding up your teams. That does not mean abstracting them – I'll come back to that in a minute.
As you speed them up, of course, and as we move to this world of helping, not supporting, education is critical. We're the cloud experts, right? Imagine that as we scale that to our application teams: how are they going to keep up with those 1,000 features, how do they think about encryption and all those capabilities? So what we need to do is find a way to help our juniors and our freshers and give them the ability to use the cloud with speed and scale without making mistakes. They learn from that, watching how it changes around them and fixes things around them, so they know why that's important.
And by the way, this is so much harder than it was even five years ago. If you had a young programmer five years ago, they'd be learning Java. Now what are they learning? Oh I've gotta learn Java, I need some serverless, I've gotta run a thing here, I need a KMS key talking to my SQS queue – that is a massive amount of information to absorb, but that's the standard architecture for something like a serverless app. So we have to work out how we're going to educate and speed these people up.
Now normally when we think about guardrails and governance we all go to security and protection. Protection from external forces and protection from internal problems; it's critical. We spend a lot of time thinking and talking about it, and it's the perfect use case. But what do we mean by that protection? What we mean is real-time automated fixes to your infrastructure. If the application team now has the power to create things – could be in their app, could be in their console – we need a way to react to that and automatically fix and help them in real time. We can't go and check it, send them a note later, and spend our whole life playing whack-a-mole on tickets. We have to have an automated response to that. We want to kill those tickets and automate those problems away.
At Turbot what we do is we watch what's being created in that environment through the software. If they create an S3 bucket, within 10 seconds we turn on the encryption, set the tags, turn on the access logging, and 12 other things. Real-time remediation means you don't have any chance to really do it wrong, and you don't have to then review something at the end. By the way, when you're stuck between a deadline and the business, that's the worst time to review something. Instead, if you're enforcing it from the start, those things are done in development, they move through QA, and if they don't work at the start they're not going to work. You can go through exceptions and work through that process earlier. Without automated real-time enforcement, you lose control of that sequence and that priority.
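A minimal sketch of the remediation idea described here: compare a newly created bucket's state with the required settings and list the fixes to apply. This is not Turbot's implementation. `BucketState`, `REQUIRED_TAGS`, `plan_remediation`, and the action names are all hypothetical, and a real system would react to creation events and call the cloud provider's APIs rather than return strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical tag policy, assumed for this example only.
REQUIRED_TAGS = {"owner", "cost-center"}


@dataclass
class BucketState:
    """Simplified, assumed view of a bucket's configuration."""
    name: str
    encrypted: bool = False
    access_logging: bool = False
    public: bool = False
    tags: Dict[str, str] = field(default_factory=dict)


def plan_remediation(bucket: BucketState) -> List[str]:
    """Return the fixes needed to bring the bucket into line with policy."""
    actions: List[str] = []
    if not bucket.encrypted:
        actions.append("enable-default-encryption")
    if not bucket.access_logging:
        actions.append("enable-access-logging")
    if bucket.public:
        actions.append("block-public-access")
    # Report missing tags in a stable (sorted) order.
    for tag in sorted(REQUIRED_TAGS - set(bucket.tags)):
        actions.append(f"set-tag:{tag}")
    return actions


# A freshly created bucket with default settings needs every fix except
# the public-access block, because it is not public.
for action in plan_remediation(BucketState(name="demo-bucket")):
    print(action)
```

Returning a plan rather than applying changes directly keeps the decision logic easy to test; the same plan could be logged, reviewed, or executed by a separate step.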
12:01 As we mentioned at the start, the sum total of all those things is to give freedom. What do we mean by freedom? In this context we mean business agility. The ability to create an application, to deliver on that need, the ability to try serverless, to use queueing. Why not? If you have the budget and we know it's encrypted and we know it's patched, why would I stop you? What incentive does any cloud team or central control team have to slow you down in that circumstance?
And as we automate more and more of those pieces away, we can give that freedom, we can give people the chance to use the console, we can give them the chance to use the API. In our mind, the fundamental point of that freedom is not to abstract them from the cloud experience. Your developers need CloudFormation, they need Terraform, they need the console. They need APIs. If you say everything runs through a pipeline and Terraform, you've abstracted them from the cloud and you've made their job slower and more difficult. I'm not saying you shouldn't use it – you should use it – but it's part of that learning experience. How will I develop? Creating things by hand. Making mistakes and having the chance to retry them. We need to give people the freedom to experiment in that way now.
13:25 When you get that combination of policies with exceptions with automated rules, that freedom gives you the chance to learn in whole new ways. In particular, you can work with an application team to do something like this: say, we've never really used DynamoDB. I'm not sure what our requirements are, but I'm going to give you an exception for 90 days. You learn with this, you work with us as the central cloud team or the services team, and we'll work this out together. That's a completely different experience from the ‘We don't support that' world we used to be in. If you have those automated guardrails you can speed up that whole process.
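One way to picture a policy with a time-limited exception is sketched below. It assumes a simple allow-list of approved services plus per-account exceptions that expire after a set number of days. `PolicyException`, `is_allowed`, and the account names are hypothetical and only illustrate the 90-day idea from the talk.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Set


@dataclass
class PolicyException:
    """A temporary exception for one account and one service (assumed model)."""
    account: str
    service: str
    granted_on: date
    days: int = 90

    def active_on(self, day: date) -> bool:
        # Active from the grant date up to, but not including, the expiry date.
        return self.granted_on <= day < self.granted_on + timedelta(days=self.days)


def is_allowed(service: str, account: str, approved: Set[str],
               exceptions: Iterable[PolicyException], today: date) -> bool:
    """Allow approved services, or services covered by an active exception."""
    if service in approved:
        return True
    return any(
        e.account == account and e.service == service and e.active_on(today)
        for e in exceptions
    )


approved = {"s3", "ec2"}
exceptions = [PolicyException("team-a", "dynamodb", date(2019, 5, 1))]

print(is_allowed("dynamodb", "team-a", approved, exceptions, date(2019, 6, 1)))  # True
print(is_allowed("dynamodb", "team-a", approved, exceptions, date(2019, 9, 1)))  # False
```

Because the exception carries its own expiry, the default rule comes back into force without anyone having to remember to revoke it.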
14:07 I would love to do a demo but I don't trust the internet so I'm gonna skip that. I'm happy to talk about that later at the booth and set you up if you'd like to see it. Instead what I'm going to do is continue talking about the concepts, which yeah, if I did the demo I'd show you why Turbot does it better than anything else but, for now the key thing is to show you why these concepts are important.
14:30 The first one is, if you get this right, the speed you can move at increases drastically. Now, speed has two dimensions to me. The first one is: what's the speed of my application teams? I give them an Amazon account, they can create resources, they can try services, they can run stacks, they can do all those things quickly and easily in that environment. The second dimension of that speed is, as that central team for controls, how do I enable those services and those rockets for my business? If you're trying to do all of that by hand, writing scripts, you're moving too slowly. With something like Turbot, we have 1,700 policies out of the box, so you go from zero to enterprise at the point of installation. Imagine starting to form a team and thinking about all the meetings you're going to have to have around identity access, networking, all those things, to solve those problems in a common, well-named way. You can do the first ten pretty easily and after that it gets exponentially more difficult.
15:30 The second benefit when you get this concept of governance right is safety. We had one customer, a DevOps-focused customer, who originally were trying to do everything through pipelines and the central team collapsed under the weight of all those requests. So they wanted to enable their application teams because they're banging on our door to do stuff, and through guardrails they are able to do that.
Now the first thing that happens: someone tries to create an S3 bucket, they try to make it public, and then they ask: why are you stopping me from making this bucket public? Then you ask, well, why did you want a public bucket? They say they have to publish the keys from their CI server. It reveals that the person was trying to use an external CI server to talk to the public bucket to get keys and credentials. Turns out, that is just a bad pattern. And we wouldn't know about this unless we [Turbot] were in the way of that user doing it, and reacting in real time. That's the type of safety you get across all these things. If they weren't doing anything stupid, they could just keep going, but instead it created the right point of friction, the right guardrail, to have that important conversation.
16:32 Accessibility. The key here is how do we make this cloud more available to more of our developers. If it's only available to our biggest experts then we're losing. We need our juniors to be able to use it, we need our freshers to be able to get started. We need the ability to give them that level of power without the fear of what they might do. That's making the cloud more accessible to all of our development teams and all of the people in our organization, which moves us forward faster. When you get this level of accountability and responsibility right, the productivity impact is incredible.
We've seen this at multiple organizations where you get that first ten-minute server that everyone agrees is an official server – it goes viral. Nobody can go back to the six-week server, there's no way back. Once you get automated encryption of your buckets, do you think your security teams are ever going to let you create something internally again? It speeds you up, moving you forward in the right ways, gives you better productivity.
17:42 One story I enjoyed: we had a team who was just using S3 in an account – very simple case – and it turns out their bill was $70 and they were so upset; it should have cost nothing. So we said, fine, let's take a look at the details. Now, they're making 100 million requests to S3, right? And this is just a small mobile app. It turns out they had a bug in their application that was constantly polling the server, so they went and fixed that bug and their bill dropped to zero. But what was the real impact? All those mobile phone batteries in the organization stopped draining so quickly! And they would never have known if they hadn't had the depth of knowledge and understanding of their account, sitting isolated with their own bill and their own responsibility. Normally, we'd just lose $70 in the noise. When you can bring that up and make it accountable, you can see that together.
18:39 The other thing that we think is super important as you're moving forward with the world of governance is this. With something like Turbot, yeah, we're covering 1,700 policies across all those different services, we're getting feedback from all those organizations, and we're bringing that together in a single place. That's the power of that software and that reuse. If you think this is a scripting problem, you're fighting against that. You're trying to hack some stuff in for yourself. That level of breadth and that level of depth, no single organization or team can compete with. That's the difference between trying to ride a rocket and tinkering to build one in your own garage. It's fun tinkering, I enjoy it, but that level of speed, breadth, and depth you can't reach unless you accept that operations has become a software problem. It doesn't matter how many humans you throw at it, you can't keep up with that level of consistency, pace and speed.
Finally, as you do all of that, you can get comfortable with the speed of cloud, because now you've got someone in your corner who's helping you do those things right from the start. And you've automated so many of the existing ones that you don't have to worry, because you've got the time to think about what's new. If you're constantly chasing the old tickets, you're going to be stuck.
I'm Nathan, I'm from Turbot. We have a booth over there if you'd like to talk to me or any of the team. We'd love to talk to you more about governance – we live, breathe and sleep this – so if you have any questions please come over and let us know. Thank you very much.