Case Study

AWS re:Inforce 2019 - governance for the Cloud Age

Watch Nathan Wallace define governance for the cloud.

Turbot Team
5 min. read - Aug 09, 2019

Disclaimer: Automated Transcript

[00:00:00] So thanks everyone for coming. I know it's the last session before... I'm between you and drinks. This is a dangerous place to be. But you know, we're going to see how we can go today. So thanks for coming along. What we're going to talk about is governance, and cloud governance in particular. Now, when I'm talking about governance, what I actually like to do is start by thinking about government. What do you want from a government? And what would that mean in the cloud environment? Generally, what most people want from government is three things: they want to be protected from both internal and external actors, they want services that make their lives better, and they want the freedom to live their lives and do the things they need to do. Now, we think the cloud is basically exactly the same. Your application teams need the freedom to use the capabilities and the agility of the cloud. But you have to wrap them in guardrails and protections to keep them safe. Right. Safe from each other and safe from the outside world. And then they need a whole range of services to help us move faster, more accurately and more securely through that environment. Now, the cloud really changes a number of our governance requirements compared to when we were internal.

The first one is it's just moving so fast. A thousand new features a year. New services and capabilities coming out all the time. This isn't your old data center anymore. And so when we think about governance we have to be ready for them. We have to think about it in a new and different way that's able to handle that speed. Second, application teams now have the power. It used to be that infrastructure teams had the power because they held the budget. And they could build large services and say it's not ready yet, or you have to use my thing. That doesn't work anymore. Application teams choose when servers are deployed. They choose which services to use. They deploy on demand. They turn off on demand. They're in control of the infrastructure now. And from a governance point of view we have to be ready for that, because that means we can't be proactive in our reviews or doing these things, because that means we're in the way and slowing them down. What we need instead is to work out how we're going to be real-time, to keep up with the pace of those application teams and move with them.

[00:02:15] Now, while all that action and excitement is going on, we also have higher expectations than ever before. I mean, when you're at a show like this it's kind of like: Oh my God, you didn't turn on flow logs, what are you thinking, right? Or GuardDuty, you haven't used that and got your findings yet? You're not encrypting everything? Again, can you imagine, in our old days of the data center, if everything wasn't encrypted? That was just kind of normal and expected, even though we kind of wanted it to be. But now you have to do those things; if you're not, you're missing the mark. And by the way, those things are moving forward at a thousand features a year. Those expectations are so high. And in addition, if you get it wrong you probably will end up in the news. So, we have to now think about that control framework and how we're going to be real-time accurate with it at that scale and breadth. Now at Turbot, what we believe is that that means we're really changing now to a world of software-defined operations. We've had custom data centers, custom infrastructure, and we've gradually standardized that into the cloud providers and AWS. We still also had all these custom processes and custom controls and the custom ability to get things in our data center, wrapping that custom data center. We have to rethink that for the cloud. That software-defined infrastructure needs software-defined operations. Nothing else will keep up with the speed, consistency, size and scale of it. It has become a software problem. It's not a scripting problem; it's not a process problem; it's not a reviewing problem. It's a software problem. And we have to start to think about and accept it in those terms so we can really prepare for that future. While we're doing all that we need these services to speed people up, so we have to start thinking about how we're going to automate all of this. And you can't automate it if you're not clear on what it is. 
Working at a large enterprise, one of the things we went through as we were trying to move to the cloud was, "Okay. You can go to the cloud, but it has to do everything we do now." And then you ask the question, "What do you do now?" After you've spoken to about 80 people and worked out that Joe in the such-and-such department names the subnets and someone else chose the AMI, and started to unravel all the processes and procedures, you realize that that's almost an impossible thing to replicate unless you actually use the same process. And that process takes six weeks. So the whole 10-minute server really ain't going to function in that environment. To be able to automate that you have to think about the whole. If you're going to launch a server in 10 minutes, what do you need from a security, compliance and architecture perspective? You need to understand your networking. You need to know where you're going to land, what subnets you're going in. Does everybody agree that's a valid network? Does that network have reachability? Once you're in there, what security groups is it going to use? Who designed those? What ports can it have? Is this a Java application? Does it have application ports open? Security groups need to be there. What AMI am I running? Is it currently approved? Is it old? Is it patched? What size is it allowed to be? Right. And even after I get in there, we might want to tag it for cost reasons etc., but I've got to get into that AMI and log into it. If I don't have authorized access in that 10 minutes, I didn't have a 10-minute server. I had a 10-minute server with a three-day wait for authorization. Right. So, what I need to think about is that 10-minute server. To me that's actually the ultimate challenge when you're going to cloud. Some people will want to do serverless, the cool things. But if you're sitting with a CIO, I think the challenge to state is: I want your support to get to a 10-minute server. 
Because to get to that server we have to solve so many questions about operations, network, monitoring, security, approval, processes... those things. That wraps up the whole thing in one question. That's measurable.
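As a rough illustration of how the "10-minute server" questions above could be turned into an automated check, here is a minimal Python sketch. Every name and value in it (subnet names, AMI identifiers and ages, instance sizes, ports, tag keys) is a hypothetical placeholder, not something taken from the talk or from Turbot.

```python
from dataclasses import dataclass, field

# Hypothetical organisational standards; every value here is illustrative.
APPROVED_SUBNETS = {"subnet-app-a", "subnet-app-b"}
APPROVED_AMI_AGE_DAYS = {"ami-base-2019-06": 45}  # approved AMI id -> age in days
MAX_AMI_AGE_DAYS = 90
ALLOWED_SIZES = {"t3.micro", "t3.small", "m5.large"}
ALLOWED_PORTS = {22, 443, 8080}
REQUIRED_TAGS = {"cost-center", "owner"}


@dataclass
class LaunchRequest:
    subnet: str
    ami: str
    size: str
    open_ports: set
    tags: dict = field(default_factory=dict)
    requester_has_login: bool = False


def launch_blockers(req: LaunchRequest) -> list:
    """Return the unanswered questions that would stop a 10-minute launch."""
    blockers = []
    if req.subnet not in APPROVED_SUBNETS:
        blockers.append("subnet is not an agreed network")
    age = APPROVED_AMI_AGE_DAYS.get(req.ami)
    if age is None:
        blockers.append("AMI is not approved")
    elif age > MAX_AMI_AGE_DAYS:
        blockers.append("AMI is too old")
    if req.size not in ALLOWED_SIZES:
        blockers.append("instance size is not allowed")
    extra_ports = req.open_ports - ALLOWED_PORTS
    if extra_ports:
        blockers.append(f"ports not allowed: {sorted(extra_ports)}")
    missing_tags = REQUIRED_TAGS - set(req.tags)
    if missing_tags:
        blockers.append(f"missing tags: {sorted(missing_tags)}")
    if not req.requester_has_login:
        blockers.append("requester has no authorized access to log in")
    return blockers
```

An empty list means every question already has an answer, which is the measurable version of the challenge: the launch is only as fast as the slowest unanswered item.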

[00:06:07] You could show that in a meeting, right. So you got to think about the whole as you do that architecture and the automation of it because it's very very repeatable. The other change that happens is that application teams actually start asking for help.

[00:06:23] When we were in a data center and we had budget differences, what we ended up with was: I need a server. Well, you've got to put in a request, and then I've got to track that request. So, then I get a project manager, and then you need a project manager to fight with my project manager, where you're really going to be out of control. So, now we've got project managers at 10 paces trying to get this stuff deployed. That completely changes, because once you go to the cloud with a model of self-service you're no longer asking, "Give me a server. Is it ready yet?" You're now saying, "Can I launch a server? How do I do that? Please help me do this." So, if we get our processes, procedures, security and controls right, we can stay in that world of self-service. Stay in the world where we're helping people be successful. And stay out of the world where we're responsible for every request, where we're the bottleneck for every approval. But to do that we have to think about our architecture and automation end to end.

[00:07:21] So for us, what it comes down to is a number of key principles you have to meet in the cloud to get this working. The first one is you have to understand your rules and regulations for how you want to operate. You can start with standards like CIS and other things, they are a great place to start, but actually you have quite literally hundreds and thousands of questions to answer, which is an awful thing to say but it's true. What's the name for every server; what's its hostname; what IP addresses are we using; how do we name Lambda functions; what's our tagging strategy; are we allowed to do those? How do you feel about cross-account access to Lambda aliases? Which, by the way, is separate from the versions and separate from the functions. You have to answer each of these questions so you can build that automation framework out and stay in control, while it's moving at a thousand features a year. And what we find is, in actual fact, you might make those decisions. And then, of course, everything is an exception. Right. Every project has one thing they need different. So, you can't think in terms of "what's my control policy at the top"; that just won't work. Once you have a thousand accounts and all these different services and buckets, you need to set rules like: S3 must be encrypted, except for this one bucket. Or except for this one account. So you have to think in terms of policies and exceptions. And you're gonna have a lot of them. You might start with 10 Lambda functions; you're going to end up with hundreds of policies in a software package running. The next thing we have to do is think about our infrastructure, and that's really the services we're providing. How are we laying that out and making sure we're always turning things on the right way? We want to make sure we always have GuardDuty, for example, or flow logs turned on, that we've always got CloudTrail on, laying out that common architecture and moving it forward consistently across those accounts. 
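The "policies and exceptions" idea above can be sketched as a lookup through a resource hierarchy where the nearest setting wins. This is only an illustration under assumptions: the path layout, the account and bucket names, and the setting values are hypothetical, and this is not Turbot's actual policy engine.

```python
from typing import Optional

# Hypothetical hierarchy paths: organisation -> account -> bucket.
# A setting at a deeper path acts as an exception to the one above it.
POLICY_SETTINGS = {
    ("org",): "enforce-encryption",
    ("org", "account-sandbox"): "skip",
    ("org", "account-prod", "bucket-public-assets"): "skip",
}


def effective_policy(path: tuple) -> Optional[str]:
    """Walk from the resource up to the root; the nearest setting wins."""
    for depth in range(len(path), 0, -1):
        setting = POLICY_SETTINGS.get(path[:depth])
        if setting is not None:
            return setting
    return None


def exceptions_below(root: tuple) -> list:
    """List every setting made below the root, so exceptions stay visible."""
    return sorted(
        path
        for path in POLICY_SETTINGS
        if len(path) > len(root) and path[: len(root)] == root
    )
```

With this shape, "S3 must be encrypted except for this one bucket, or this one account" is one rule at the top plus two entries further down, and `exceptions_below(("org",))` gives a security team the full list of places where the rule has been relaxed.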
As we move to cloud, of course, we're moving to that help model, and that means now it's way more about collaborating with and educating our users and working with them. Now, for us that involves two models. First, your juniors: you want to give them a lot of freedom to use it but also keep them safe. Let's just fix things, so they can learn by watching. And our seniors: we want them to move forward without having to do all of that grunt work. So let's make them more productive and our juniors more safe. The way you achieve that is with real-time guardrails and automatic remediation. When someone creates an S3 bucket, within seconds it should have encryption on, the tags should be right, versioning should be on, whatever you've chosen your posture to be. If you're playing whack-a-mole on tickets, you're going to spend the rest of your life chasing people and asking them to do things that they don't care about, but you do. If you automate these capabilities out, it changes the equation. Because what happens? "I created a bucket and it was public, and you won't let me do that and that's blocking me." It's like, yeah, that's blocking you. Come and ask for an exception. Right? We can have a conversation about why you're doing that. True story, I heard that one time: "It's so I can store the keys for the external CI server." Right? That's the sort of silly conversation you end up in, right? But what you want to do instead is start saying: no, yeah, it's blocked, but we can talk about giving you an exception. We can discuss how these services should work together. Right? By moving that conversation to the start through automatic remediation, you have a good chance for success. Traditionally we're sitting at the end. Right? And we're stuck in that: "I have to go live, my deadline's tomorrow. This unit needs it. And you're blocking me because you won't let this happen." All right. By flipping it around, it changes the conversation. 
Once we have those pieces in place we can give our teams that freedom. The freedom to create things, the freedom to work, because we can trust that it's in control. If we don't trust it's in control, we have to review everything before it happens or after it happens. We're constantly fighting that battle. Once we know we have a policy posture that's being enforced, we can move forward with way more speed and flexibility and consistency. So, what I thought I'd do is show you a little bit of how Turbot works to try and achieve that.

[00:11:37] So Turbot runs as software that basically allows each user to see the Amazon accounts they happen to have access to log into. We have a whole identity model for choosing that, which I'll talk a bit more about. But generally what you want is, you want your users going straight into the console or the APIs or Terraform or CloudFormation. You want to give them that freedom. Don't force them into a pipeline, don't abstract them into something that they can't use. Let them have that flexibility. Encourage them to do other things like pipelines. Don't force them there. Once they're in there, we just want to do something simple like creating a bucket.

[00:12:15] So it's a good sign when your demo bucket's called 13. Right. So we're gonna create a bucket and let that go. What's gonna happen now though, you know, fingers crossed, is in the next 10 to 20 seconds Turbot's gonna detect that new bucket. It's going to record it in the CMDB along with who made the change. It's gonna find all the problems with it and then automatically fix those.
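The detect, record, find and fix loop described above can be pictured as comparing a resource's recorded state with the chosen posture and producing a list of fixes. The sketch below works on plain dictionaries rather than real AWS calls; the field names, action names and default tag values are hypothetical, and it does not represent how Turbot implements remediation.

```python
import copy

# Hypothetical default tags to apply when a bucket is missing them.
DEFAULT_TAGS = {"owner": "unknown", "department": "unassigned"}


def remediation_plan(bucket: dict) -> list:
    """Compare a bucket's recorded state to the posture and list the fixes."""
    actions = []
    if not bucket.get("versioning", False):
        actions.append(("enable-versioning", None))
    if not bucket.get("default_encryption", False):
        actions.append(("enable-default-encryption", None))
    current_tags = bucket.get("tags", {})
    missing = {k: v for k, v in DEFAULT_TAGS.items() if k not in current_tags}
    if missing:
        actions.append(("set-tags", missing))
    return actions


def apply_plan(bucket: dict, actions: list) -> dict:
    """Return a remediated copy of the bucket; the input is left untouched."""
    fixed = copy.deepcopy(bucket)
    for name, payload in actions:
        if name == "enable-versioning":
            fixed["versioning"] = True
        elif name == "enable-default-encryption":
            fixed["default_encryption"] = True
        elif name == "set-tags":
            fixed.setdefault("tags", {}).update(payload)
    return fixed
```

Running `remediation_plan` again on the output of `apply_plan` returns an empty list, which corresponds to the controls moving from alarm to okay.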

[00:12:39] So, we're looking for that automatic remediation, fixing all of that bucket in real-time. Check it out. So, we come in here and you see the versioning has already been enabled.

[00:12:52] You see the default encryption has been turned on. We see the tags have been set. I didn't have to do anything all that happened magically around me. So, now my bucket's compliant. If I go back to the Turbot console.

[00:13:05] Just to refresh... it's not in the activity list yet.

[00:13:12] We'll see our bucket's appeared there in our activity list, created by me. And we can go in to see the detail about this. It's recorded the full CMDB information here, including every detail of that bucket. That establishes a baseline, and now changes will be tracked with differences. Below that, we can see the full activity for this one bucket, so if we go down we'll see when I created it. Turbot came along, established policies and controls, and then we can see that certain things went into alarm, and eventually got fixed here, and then resolved to okay. So, we have a five-second ticket close on those items. It's really good for your metrics, right, if you play that way. Now what's fun here is we've also pulled out things like tags and made them a top-level citizen. This is very helpful when you want to start going cross-provider, cross-platform. We can see we have one control, the approved one, that's in alarm. We go in here for detail on why it's unapproved: because it's in an unapproved region. Now when you have a control, the big thing to decide is what's the policy posture for how we made that decision. Right. We don't want to say it's always unapproved, because we need exceptions and stuff like that. So in Turbot, the way that we do that is with policies. So over here on the right, I can see policies like the approved one, and then things like the regions. A lot of these are very simple. So this one here is a check that it's approved. It's just in that checking mode. Others have more flexibility. For example, the regions one is a simple YAML list, like these, of wildcards. Now here we can see the policy is actually set higher up, on a folder. So this is applying to every bucket in every account, now and in the future. It's not one-off; we didn't have to say I want this bucket this time done this way. This is a posture we've taken as an organization. But we can now create an exception for that. 
If we have permission, we can say: actually, you know what, I'm OK for this bucket to happen to live in eu-west-1. And we'll just save that straight away. By creating that exception, we are now saying this bucket's okay to live in a different region to the others, but we've also created visibility of the differences in our environment. So if we go up, we can actually see in Turbot a list of all the exceptions to this rule. So as a security team you're kind of like: hey, I've set this rule, I've granted these exceptions, and you can see them all in that environment for that control. So that one now has gone to approved, because it's now in a valid region given that policy structure. Most of the time your policies and exceptions are pretty simple like that. But sometimes you want to go a little bit crazy. I want tags, and I want prefixes of the bucket name, and, you know, it's Thursday or whatever it is, to say you should have an exception for that rule. So in Turbot, what we do is we support the idea of calculated policies. Now the power of this is actually that we can query our CMDB in real-time for information about the resource that's being protected. So that's a GraphQL query that has now found all the information about the tags and stuff for that particular bucket. If we combine that with some templating, I'm just gonna grab a little snippet here. This is Jinja code, for those who like Python. So, this is a Jinja template saying let's always allow US. But if the department tag is sales, then they are allowed to create buckets in Europe. Right. If that tag is set that way. And we can see that the policy is evaluated to the YAML and eventually comes down to a list, and here we can set that policy. Now what's cool about that is we can do things like, for example, say this is a valid policy for the next 90 days and then your exception's over. And it'll automatically revoke. Right. But when we save that policy. 
What happens now in Turbot is it goes off and calculates that policy for every bucket in that scope. It's determining the policy value based on the context of the resource it's protecting. Right. This is wildly powerful and flexible. You can't even begin to imagine what's possible as you really get into it. Fingers crossed it's gonna calculate in a second. There we go. So it's gone and calculated that; it's set that to the regions, and that bucket's still approved because of that calculation. Now if we come back to that resource. Here. We can see a couple of things. I'm sorry, wrong place. So. This bucket here, let's go and browse it. Now what Turbot does as it puts these things into the CMDB is it actually arranges them all into a hierarchy, which is what we use for the policy engine structure. Right. So that hierarchy says this bucket actually has these controls on it; we can see which ones are green, which ones are skipped, etc. But now we can start moving up the stack. Before I do that, notice that we have very standard names for controls. Is this bucket active, as in was it last modified recently, has it been used recently? You might want to have a sandbox and just delete resources if they're older than 60 days. You can do that, right.
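The calculated policy in the demo (always allow US; allow Europe when the department tag is sales; optionally expire the setting after 90 days) can be approximated in plain Python. This is a sketch of the logic only: it stands in for the template shown on screen, which is not reproduced in the transcript, and the wildcard patterns, tag key and function names are assumptions.

```python
from datetime import date, timedelta
from fnmatch import fnmatch


def approved_regions(tags: dict) -> list:
    """Compute the region wildcard list from the resource's own tags."""
    regions = ["us-*"]  # always allow US regions
    if tags.get("department") == "sales":
        regions.append("eu-*")  # sales may also create buckets in Europe
    return regions


def is_region_approved(region: str, tags: dict) -> bool:
    """Check a concrete region name against the calculated wildcard list."""
    return any(fnmatch(region, pattern) for pattern in approved_regions(tags))


def setting_is_active(created: date, today: date, valid_days: int = 90) -> bool:
    """A time-limited exception lapses once its validity window has passed."""
    return today < created + timedelta(days=valid_days)
```

Because the result depends on each bucket's tags, the same policy setting yields different values for different resources, which is the point of calculating it per resource rather than storing a fixed list.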

[00:18:25] Stuff like turning on versioning, tagging, standardized default encryption, is this bucket approved to exist: these are standard categories of controls for us, standard language we've learned and built over time. Because once you start doing this for one hundred and twenty services with five resource types each, it gets pretty confusing if you've hand-built all of that stuff. So that standardization flows up. We have buckets arranged into regions. We can see the status of our controls for the region. We can go up to the account level and we're still calculating the status of all of our controls.
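One way to picture control status "flowing up" from bucket to region to account is a count of states for everything at or below a chosen level of the hierarchy. The paths and states below are hypothetical sample data, and the roll-up is a minimal sketch rather than a description of Turbot's internals.

```python
from collections import Counter

# Hypothetical control results keyed by hierarchy path: account, region, bucket.
CONTROL_STATES = [
    (("account-1", "us-east-1", "bucket-a"), "ok"),
    (("account-1", "us-east-1", "bucket-b"), "alarm"),
    (("account-1", "eu-west-1", "bucket-c"), "ok"),
    (("account-2", "us-east-1", "bucket-d"), "skipped"),
]


def rollup(prefix: tuple) -> Counter:
    """Count control states for every resource at or below the given level."""
    return Counter(
        state for path, state in CONTROL_STATES if path[: len(prefix)] == prefix
    )
```

Calling `rollup(("account-1", "us-east-1"))` summarises one region, `rollup(("account-1",))` the whole account, and `rollup(())` everything.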

[00:19:00] Now if I sort these by alerts instead, we'll see that actually the area with the most things in alarm is CIS. Now we can start drilling down into the CIS report for this one account. We could do it across all the accounts, across all the providers, it doesn't matter. But we can drill into this and see by area. Notice how it's all hierarchically arranged into these types so you can slice and dice in different ways. Now if I come down I can actually see the status here of, for example, flow logs for this one VPC. And we get the full details of why and all that stuff. Our CIS controls run real-time. You create a VPC; within 10 seconds, the status of that's in the CIS report. Now over here we can actually start to compare or see how this stacks up. This VPC has a few errors. All the VPCs have a few errors. This one's not much worse than them in general. Networking is roughly the same. But look at our categories here for the CIS report. You can see that it's... they're all... failing this right now. And so we can start to crosscut from that and get into that. The bottom one to me is the most interesting. Most people know about CIS benchmarks, the AWS CIS benchmark; there are other ones for other providers. These CIS benchmarks actually map to the CIS controls framework put out by the Center for Internet Security, across Linux and all those different things. So you can actually categorize these controls across... Turbot supports all those categories for that reporting. So we can actually see now, in a different view across the whole environment: what is the status of that one control? And then we can cut it by different things. For example, let's cut it by resource category, so we can see for this control here's all the networking resources, wherever they may live, cut in that different way.

[00:20:50] With that CMDB comes a lot of power. You can start searching for different things, like our bucket we just created. But you can also search for things like tags, and it will find all the resources matching that tag. You can combine that with searches against native fields that are determined by the provider.

[00:21:14] So you can start to find any sets of resources here and query different new pieces of information out of that CMDB. Governance. That was a bunch of examples of some cool governance. We have, by the way, seventeen hundred policies like that, so you can really kind of go insane trying to work through that and work out how to do it. But what we've found is actually that governance is a problem that exists not just in the cloud but across the organization. Right. We need the ability to do governance of certificate expirations, DNS servers, all sorts of different things. So Turbot actually supports complete extensibility of that policy and control framework you saw. You can write your own and have them appear in that UI with all the power of the inheritance, the calculations, etc. So the one we wrote a little bit earlier (not that one), this one. So this is the expiration check on the SSL certificate that we worked on. I'm just gonna refresh this because it was pre my network change.

[00:22:15] So we have custom policies here to determine things like the expiration check or the warning period or stuff like that. These are policies. When you edit these they have a nice enumerated list. Those are actually defined over in Turbot, here in the policy definition. So there's a JSON schema tying that together. You can build those and then push them up to your server and deploy those policies; then all your users are subject to that validation for those policy changes, the hierarchy of exceptions, etc., for whatever governance tasks you may have.

[00:22:49] That works in the normal way, like everything else we just saw. For example, if we set our expiration warning here to, instead of 90, be like 365, we'll see the control rerun using that new policy value. And we should see a change in the status of the control in like one second.

[00:23:10] So Amazon's got 180 days or so to fix this certificate, right. They're in good shape. Right. But that type of power, of being able to build those sorts of governance controls in this sort of platform framework, that's what we're talking about when we talk about software-defined operations. We're not talking about a few Bash scripts, and we're not talking about a few Lambda functions; we're talking about consistent language, at scale, supporting exceptions, with a clear model for how to build it out and run it in real-time. Right. And that's the sort of capability that Turbot provides.
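The certificate control in this part of the demo boils down to comparing the days left on a certificate with a configurable warning period. A minimal sketch, assuming the expiry date has already been fetched and recorded; the function name, state names and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def expiration_status(expires_on: date, today: date, warning_days: int) -> str:
    """Return 'ok' or 'alarm' depending on the configured warning period."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return "alarm"  # already expired
    if days_left <= warning_days:
        return "alarm"  # inside the warning window
    return "ok"


# A certificate with roughly 180 days left, as in the demo.
today = date(2019, 6, 26)
expires_on = today + timedelta(days=180)
status_at_90 = expiration_status(expires_on, today, warning_days=90)
status_at_365 = expiration_status(expires_on, today, warning_days=365)
```

Here the policy value is the only thing that changes between the two calls; the control logic stays the same, which is why rerunning it after a policy edit is cheap.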

[00:23:45] So when we think about what are the benefits of this type of governance in our organization, the first is obviously speed. When you bring in this sort of software you don't have to go and define all those things from scratch. Your cloud team is massively enabled. Right. Writing those functions from scratch, just like everybody else did, is not adding value to your business. For your development teams, they're moving faster because you can give them more freedom. The freedom to learn, the freedom to try, the freedom to do things. Second, our safety has gone way up because we're now in control of the environment, and we're fixing it in real-time.

[00:24:23] We've made it more accessible to our junior developers and more productive for our senior developers. And we actually have a model that will allow us to deal with the sheer pace, breadth, and depth of this cloud type of environment.

[00:24:39] So that's governance. Thanks for your time. I'm not standing between you and drinks, but if you have any questions I'm happy to talk about it for as long as you'd like. Thank you.

If you need any assistance, let us know in our Slack community #guardrails channel. If you are new to Turbot, connect with us to learn more!