AWS Sydney 2019 - Governance for the Cloud Age

Learn how Turbot Guardrails defines cloud governance and explains its role in achieving best practices.

Turbot Team
5 min. read - Sep 13, 2019

Disclaimer: Automated Transcript

Nathan Wallace: [00:00:00] What I am here to talk about is governance, and in particular governance for the cloud age: how we think about governing all of that usage, all that freedom and all that agility that cloud gives us, in a way that lets us move faster and be safer. [00:00:15][15.1]

[00:00:16] So the first thing we need to think about really is what we mean by governance. Now, we're not talking politics. I know we're in the middle of an election cycle here, and coming from the US we just don't talk politics right now, right? Instead we need to think about the rules and the regulations and the way we want to actually operate. That's what governance means. Having done this for a number of years now, working with different enterprises, from very, very large banks to pharmaceutical companies and education institutions trying to do a lot with DevOps, the thing that I've gradually come to realize is that, frankly, governance is about giving freedom: freedom to application teams and freedom to our business. When we get governance right it doesn't slow us down. It doesn't stop us from doing things; instead it accelerates us and enables us to have the freedom to use the cloud. The key is getting the right combination of those factors together: setting the right rules, working out how to teach people and accelerate them through good infrastructure, and working out how you can protect them, not only from the outside world but also from themselves. When we get that right, all of a sudden we enable this freedom: the freedom to innovate, the freedom to use all the incredible technology that AWS is giving us, and also freedom from internal regulation, silly processes and crazy meetings. So as you think about the governance piece for your organization, I encourage you to think a lot about how you're going to give those application teams that freedom.

[00:01:50] How are you going to give the business the agility they need while keeping the enterprise control you require? That agility plus control is the fundamental balance we're trying to get right. When we talk about governance, I think it's also helpful to think about it in terms of your everyday life: when you think about government, and good government, what does it mean to you? Typically what we think of is trains running on time, health services that take care of us and our family, education that's available for our children. It's lofty, but the goals we have for our organization here are really no different. We're trying to work out how we can teach those teams, how we can give them the support they require to be successful, and how we can keep them safe. Now, we've been governing things in enterprises for a while. But traditionally it was a whole bunch of servers, probably sitting in the basement somewhere, and traditionally they were protected by things like procurement processes: difficult to procure, difficult to set up, a whole bunch of gates and controls. I mean, we worked with organizations where even provisioning a VM takes six weeks, and that was after it was optimized to pre-create the VMs in advance just so that the process could be sped up, right? The amount of paperwork and things we were used to going through was astounding. Now cloud, of course, completely changes those norms. Sometimes we like to say it's just the same except maybe it's software, but frankly we're ripping down all of those previous ways we used to work and trying to think of new ways to really make it successful. The first thing we have to be aware of as we're building that governance is the fact that cloud is just moving so fast: a thousand new features a year.

[00:03:31] Yeah, I hope everybody's ready for the S3 batch object handling that came out yesterday. If you've already got S3 enabled in your organization, what's the status of S3 now? Can people use that? What are you allowed to do with that? The pace here is crazy, it's exciting and it's wonderful, but our challenge is to work out how to not fight that, but instead embrace it and turn it into a positive for our organization. [00:03:56][24.2]

[00:03:56] We have to ride those rockets. If you're building services in competition with the cloud, you will gradually lose. ElastiCache, RDS, even just backups, things like that: they just work so well, and you'd be crazy to try competing with them. [00:04:10][13.5]

[00:04:10] The second thing, and this is often very difficult to accept, is that application teams really have a lot of the power now. Traditionally, infrastructure teams had the power. Why? Because they had the money, they had the provisioning process, they had the right to say yes or no to workloads. They got to say which services were approved and accepted through procurement. Application teams now control the infrastructure, whether it's autoscaling servers or serverless. The point is that this time application teams are really provisioning the things they need and scaling them in real time, and frankly they have taken a lot more control of that relationship. We have to work out how we're going to support and enable that. [00:04:48][37.4]

[00:04:49] From a control point of view, though, our life didn't get easier as we're trying to get this balance right. The expectations now are so high. Can you imagine being in a world where every disk that sat in your existing data center in the basement had to be encrypted? Oh, and by the way, it should be encrypted with a key that's probably sitting in an HSM. That's ridiculous to think about; we were nowhere near it. We were sending tapes off to a mountain somewhere, hoping that that backup would come back and work. The expectations in the cloud now mean everything must be encrypted; everything must use the correct key. You better be doing all the logging. Oh? You're not doing VPC flow logs to track traffic in your environment? Have you turned on GuardDuty? The number of services here to help you be secure and compliant and make sure that your environment is working well is increasing all the time, which is wonderful: it's giving us amazing capabilities we've never had before. But we have to work out how we're going to meet those expectations, because if we miss... we do not want to be that person with that exposed S3 bucket, or that one in the newspaper on a given day. So the expectations are high and we have to meet them. Of course, at the same time all of that physical infrastructure became software defined, and that means it moves in real time. So our software-defined infrastructure really now needs software-defined operations. If you have a manual approval process on that infrastructure, you are too slow. If you're using a spreadsheet, you're out of control. So we have to think about how we're going to move our posture from the old process to one where we're really controlling our infrastructure with those software-defined operations. Operations, security: these have moved from being people problems, process problems, to being software problems, and our challenge is to work out how we're going to embrace that and move with that, because if we get it right the speed, pace and accuracy we will have is way beyond anything we've ever had before. [00:06:37][108.4]

[00:06:38] To get automation right, of course, is actually incredibly difficult. We have to have such clear definitions of how things work. Every exception better be very, very well defined. Every configuration needs to be known. If you think about your existing organization and all the policies and procedures you have, it's probably like 35 documents (e.g. "Oh, we know how to name those because Bill over in networking always sets those up."). None of this works or scales once we get to a world of software-defined operations, so we really need very, very clear and consistent architecture. [00:07:10][31.4]

[00:07:11] When thinking about how to make cloud successful in large organizations, the single thing I would now ask for when going to management is: I would like your support to get to a place where we can provision one server in the next 10 minutes that everybody in your leadership team agrees is an official, valid, blessed server for our environment. Okay, it's a server, it's a little bit old school in the world of serverless, but here's why it's interesting: to do that we need the ability to self-provision. We need to know that the network we're provisioning into is well defined, managed, etc. We need blessed AMIs. We need patching. We need the ability to know we're monitoring it. We need the ability to log into it. The ability to log into it is a big statement in the world of a server. How do we know that we have permission for that new server? If your internal procedure requires you to have approval for access to every new server, that's not a 10-minute server anymore. That's 10-minute hardware with a three-day approval process. So getting that full end-to-end architecture well defined is critical to the success of that operational governance environment. [00:08:13][62.2]

[00:08:14] The good news is that it fundamentally changes a whole bunch of things about the relationships we have internally. We're used to a world where people request things and others deliver them. I need a server, I need access to this thing, I need storage. Oh, fill out this form... Let's do that, etc. Now when we move to cloud those things are instant. They're available. We can pretend they're infinite. And the relationship with those teams changes if we can move from one of support ("I want a server, you need to give it to me") to one where it's more like "I need a server, how do I do that again? Which button do I press to start that? Which API should I call?" If we can move the relationship between our central governance teams and our application teams from one of support and request-giving to one of help and teaching (and I've seen this happen, this is not fantasy land), that is what happens when you can provision that server so quickly: it changes the relationship. And why? Imagine when you had that application and you needed this infrastructure, so you'd come to a meeting, sit in a meeting room and say "I need this infrastructure". You bring your project manager, they bring their project manager, you put your project managers at 10 paces from each other and start arguing about timelines and when you're going to have it, all those things. [00:09:28][73.6]

[00:09:29] In a world of 10-minute servers you go to the meeting and you just start the thing right there, and all of that crap melts away. Now the discussion is: how big? When? Well, it's up to you, right? You can do those things provided you are secured, patched and stuff like that. So cloud, really, and the governance around it can completely change the way we have those relationships in the business, and it gives us so much opportunity to do it better. [00:09:51][21.9]

[00:09:51] So when we think about governance for the Cloud Age, how do we actually achieve that? There's a bunch of ways you can tackle it, but here are some thoughts on the key things to think about. The first one is you better define your rules and regulations. You can be an incredibly secure, uptight, "I'm going to approve everything" type of organization, for whatever reason: healthcare, financial services, you just care a lot. That's a very, very, very aggressive approach to governance. On the other end you might say "Hey, here's your Amazon account, here are some keys. Good luck.", and frankly we've seen both. Once you have 400 accounts that were in the "here's some keys, good luck" world and then you start trying to work out how to govern them, it gets increasingly complex, of course. But if you're at the other end, where everything has to be done exactly, you have locked up your business so much that they have none of that freedom we care about anymore. So while we're avoiding politics, there still is a discussion about whether you are a big government believer or a small government believer when you're thinking about your governance and how you want to tackle it. [00:10:49][57.3]

[00:10:49] So in Turbot, the way we think about that is you really need a policy engine. You need a hierarchy of policy settings. We have seventeen hundred, but basically the way it works is we have a hierarchy of those settings coming down. You set rules at the top, like "S3 must be encrypted" as a simple example, and that will then flow to every bucket, now and in the future, in real time. But of course you're not doing anything real if you don't have exceptions; every enterprise thrives on its exceptions and how special they are. [00:11:15][25.8]

[00:11:15] So in that policy engine you need the ability to say "this account does not require that", or "this should just be in check mode, not enforce mode", or even "this one bucket should be skipped for these purposes". So as you think about those rules and regulations, make sure you think about how you're going to manage those exceptions at scale and that flow of policy in the environment. [00:11:32][16.9]
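
The flow of policy down a hierarchy, with exceptions set lower down, can be illustrated with a minimal Python sketch. This is not Turbot's implementation: the `Node` class, the `resolve` method, the `s3.encryption` policy name and the `enforce` / `check` / `skip` values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One level of a hypothetical hierarchy: organization > account > bucket."""
    name: str
    parent: Optional["Node"] = None
    settings: Dict[str, str] = field(default_factory=dict)

    def resolve(self, policy: str, default: str = "skip") -> str:
        """Return the nearest setting for `policy`, walking up toward the root."""
        node: Optional["Node"] = self
        while node is not None:
            if policy in node.settings:
                return node.settings[policy]
            node = node.parent
        return default


# A rule set once at the top flows down to everything beneath it.
org = Node("org", settings={"s3.encryption": "enforce"})
prod = Node("prod-account", parent=org)
new_bucket = Node("new-bucket", parent=prod)

# Exceptions are simply more specific settings lower in the tree.
sandbox = Node("sandbox-account", parent=org, settings={"s3.encryption": "check"})
sandbox_bucket = Node("scratch-bucket", parent=sandbox)
legacy_bucket = Node("legacy-bucket", parent=prod, settings={"s3.encryption": "skip"})

for bucket in (new_bucket, sandbox_bucket, legacy_bucket):
    print(bucket.name, "->", bucket.resolve("s3.encryption"))
```

In this sketch the nearest setting wins, so a single bucket-level or account-level override does not disturb the rule that every other resource inherits from the top.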

[00:11:33] The second thing is really infrastructure, or services. This is what we want government to provide, right? Give me great trains. Give me great health care. Give me good services. Make sure you answer the phone quickly... And that's what we're thinking of for cloud as well. We want the ability to have really good networking out of the box: it's nice, it's set up, it works well, it's safe for me. One of the most fundamental decisions you're going to make is your account structure. I'm sure many of you have already been through it. Should we have one or two? Should we have four or five hundred? Generally we recommend you go to more accounts, isolating your workloads almost like mini data centers for each application you're running, or even each application environment. So really think about that core infrastructure you're providing as that governance team. What's your posture for that? How are you going to set that up, and what are your best practices? [00:12:16][42.8]

[00:12:17] After that we need to educate our users and make sure they know what they're doing and how to use this. And one of the interesting insights we've had working with lots of customers going through this process is actually, in an enterprise, you want to move more away from self-service for those accounts and more towards a high-touch onboarding process. Not because it's hard or it sucks but so you can teach people and review their architecture and think about that with them as they're coming into the environment. [00:12:42][25.4]

[00:12:43] By doing that in a high-touch way you start to create those relationships that lead to that world of helping them achieve their goals, not one of request > fulfill, request > fulfill... So, unintuitively, it's actually good to think about helping with that architecture onboarding: the high-touch process of bringing them in, and then letting them run with the self-service from there. Then, of course, protection: protecting those applications from the outside world, from the other applications in the environment and even from themselves. In our mind what that really means is real-time guardrails. Whenever an S3 bucket is created it better be encrypted, the tags better be correct, access logging better be on. When you create a DynamoDB table, backups better automatically get set. Every network should be configured and running the right way. [00:13:29][45.4]

[00:13:29] Real time responsiveness to that changing infrastructure. You don't really control the infrastructure anymore because the application teams are now in charge (they have that power). What you can control is the response to that, the automatic remediation and the posture you want to wrap them in and that's what we think of for the protection. [00:13:46][16.9]

[00:13:47] The key here is to move from a world of checking and reporting and playing whack-a-mole with tickets and running around going "oh, you really got to encrypt your bucket", and instead flip to a world where it's happening automatically. You want to kill those tickets, not just close them. If you prefer ITIL speak: instead of managing all those small incidents, think about the problem. What's the problem? They were able to create infrastructure in a way that is not configured to our standard. The root cause is we're not automating those fixes. So we need to bring in automation to automatically remediate and fix those things and make it happen in real time; that will speed them up and protect our environment. [00:14:22][34.8]

[00:14:23] Once we get all those things right, of course, the goal here is to get freedom: freedom for those application teams, and the ability to use the cloud to do what it's intended for. In our mind, the key to that freedom is not abstracting them from that cloud. You need to give your developers access to the console. They need access to APIs. They need the ability to use CloudFormation, Terraform, all those infrastructure-as-code tools. The combination of those things is how they will build their application, it's how they will learn, it's how they will follow tutorials on the Internet. Anything you do to abstract your users away from that... is making it harder for them. Harder for them to learn. Google searches are useless because they can't just follow the instructions. Even something as simple as saying "we do everything through a pipeline". That's a beautiful thing, by the way (infrastructure as code); we do it all the time, and that's a good goal. But if you say "everything must be done through a pipeline" you've just created an abstraction. You've now said "You must always use CloudFormation, there is no console for you", "You must always use Terraform", "There is no other way to do it". Now, that's really great, and it's a good goal. Maybe for production that's the perfect goal. But if you think your whole environment can work that way, with that level of rigidity, you're going to find that you're stifling the freedom and innovation of those teams. So some of those things are good, they're necessary, but they're not sufficient from a governance point of view. From a governance point of view we need to do things like react in real time to what's happening in that environment and make sure we're repairing it. [00:15:49][85.6]

[00:15:49] That gives the teams the freedom to work while knowing they're still secure, protected and covered in that environment properly. So freedom is our goal. That agility for those business teams is our goal. To do that, we need to give them access to the tools and services and stuff they need, and that means making as many of those available as quickly as we can, with as much direct capability as we can. [00:16:13][23.4]

[00:16:13] So what I was gonna do now was give you a quick demonstration of how Turbot looks at and thinks about that problem. This is one way to tackle it, and then we'll come back to talk about some more things. So Turbot, for us, runs as software in a customer's environment, not SaaS. We want our users to have direct access to those Amazon accounts, so each user would log in and see a handful of accounts they happen to have access to. The first and primary thing we want them to do, of course, is actually just use the Amazon console. We don't really want them sitting or abstracted away inside Turbot for that. So I'm just gonna do something as simple as creating an S3 bucket. Having done that... now, the bucket creation is not that exciting, I appreciate you bearing with me. What's cool, though, is the next 10 seconds. The CloudWatch Events are going to send that to Turbot. It's going to detect that new bucket. It's going to record that bucket and who created it into the CMDB. As a result of having that information in the system, it's now going to test it against all of our real-time controls. Does it have a valid name? Is it in an approved region? Is encryption on? All of those different controls that you care about. [00:17:13][59.3]

[00:17:14] It's gonna raise alarms for any problems it finds. It's then going to automatically remediate those and close the alarms. So if we go to that bucket we just created (and if the demo gods are shining on me today) we should see in the properties that Turbot's come along and started setting things: it turned versioning on, it set the default encryption up, it put the tags in place. So I spoke before about that policy hierarchy, that engine of policies, including things like setting metadata for cost centers, recording things like who created the bucket, that sort of information, knowing what the compliant tags are. All of that flows through the system here to create those things in real time on that resource after it is created. We see server access logging now also being enabled. Meanwhile, back in Turbot we should see the new bucket start to appear in our notification list. Turbot has detected that bucket and brought it into the CMDB; in the controls tab, we can see each of the things that was checked and run at that time, including, for example, things like the tags. As I mentioned, you want to be tracking all these changes in real time. So the first thing was to go from OK to alarm state. The second thing we did was fix those tags, and then the third was we closed the alarm. That's a five-second ticket close. That's difficult to do in a manual environment. Now, we can do some of these things with Lambdas and stuff like that, but what we've found is that you really need a lot of visibility into what's going on in that environment, so your developers and application teams know what's happening, and so you can, for audit purposes, really get to the heart of it. [00:18:38][84.6]
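
The sequence in the demo (event arrives, resource is recorded, controls run, alarms are raised, fixed and closed) can be sketched in plain Python. This is an illustrative model only, not Turbot's code and not a real AWS integration: the event shape, the `Control` class, `handle_bucket_created`, the in-memory `cmdb` dictionary and the default tag value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Bucket = Dict[str, Any]
AuditEntry = Tuple[str, str, str]  # (bucket name, control name, state)


@dataclass
class Control:
    """A check paired with the fix applied when the check fails."""
    name: str
    is_ok: Callable[[Bucket], bool]
    fix: Callable[[Bucket], None]


def _add_default_tags(bucket: Bucket) -> None:
    # Hypothetical default; a real system would derive this from policy.
    bucket.setdefault("tags", {}).setdefault("cost_center", "unassigned")


CONTROLS: List[Control] = [
    Control("versioning", lambda b: b.get("versioning") is True,
            lambda b: b.update(versioning=True)),
    Control("encryption", lambda b: b.get("encryption") is not None,
            lambda b: b.update(encryption="AES256")),
    Control("tags", lambda b: "cost_center" in b.get("tags", {}),
            _add_default_tags),
]


def handle_bucket_created(event: Dict[str, Any],
                          cmdb: Dict[str, Bucket],
                          audit: List[AuditEntry]) -> Bucket:
    """Record the new bucket, then check and remediate each control."""
    bucket: Bucket = {"name": event["bucket"], "created_by": event["user"]}
    bucket.update(event.get("config", {}))
    cmdb[bucket["name"]] = bucket  # record what was created and by whom

    for control in CONTROLS:
        if control.is_ok(bucket):
            audit.append((bucket["name"], control.name, "ok"))
            continue
        audit.append((bucket["name"], control.name, "alarm"))  # raise the alarm
        control.fix(bucket)                                    # remediate
        state = "ok" if control.is_ok(bucket) else "alarm"     # close if fixed
        audit.append((bucket["name"], control.name, state))
    return bucket


cmdb: Dict[str, Bucket] = {}
audit: List[AuditEntry] = []
handle_bucket_created(
    {"bucket": "demo-bucket", "user": "alice", "config": {"encryption": "AES256"}},
    cmdb,
    audit,
)
for entry in audit:
    print(entry)
```

The audit list keeps every state transition rather than only the final state, which is the kind of visibility the talk argues you need for developers and auditors.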

[00:18:39] So we track things like what event led that to happen. This was a "create bucket". And we track things like the context that we had at the time: what were the policy settings and so on at the time. So we have for posterity how that decision was made and how it acted in the environment. When you are taking aggressive actions, for example deleting a bucket in an unapproved region, you want to know why and how you made that decision. The way the controls work is determined by the policy engine, so in the policy engine we can see that this one was set to enforce setting tags on the bucket. I happen to have permission to create an exception; most users wouldn't be allowed to do that. We like to break our rules into "must" ("You must do it this way"; "required" is another word for it) and "should" ("Here's a recommendation", "this is how your posture should be"). So in this case I'm gonna create an exception for this bucket... I'll just set it to check. Of course you want expirations on those sorts of exceptions: 90 days approval... "Fix it up, and then you're back to normal like everyone else". When we create that exception in Turbot, it will now manage that bucket according to that. That's a single setting for one bucket in one account out of potentially hundreds. The cool thing in Turbot, of course, is that you can then see all of the exceptions below you in the environment. So for your security teams and your compliance teams, when they want to know what the posture is in this environment, they can start to see the full information about the exceptions and the settings in the environment. [00:19:53][74.3]
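
An exception that expires and then falls back to the inherited rule can be modelled in a few lines. The sketch below is an assumption-laden illustration, not Turbot's data model: `PolicyException`, `effective_setting`, the policy name `s3.tags` and the dates are invented for the example; only the 90-day period and the `check` / `enforce` modes come from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class PolicyException:
    """A temporary override of an inherited policy setting for one resource."""
    resource: str
    policy: str
    value: str
    expires_at: datetime


def effective_setting(inherited: str,
                      exception: Optional[PolicyException],
                      now: datetime) -> str:
    """Use the exception while it is still valid, else the inherited setting."""
    if exception is not None and now < exception.expires_at:
        return exception.value
    return inherited


granted = datetime(2019, 5, 1, tzinfo=timezone.utc)
exception = PolicyException(
    resource="demo-bucket",
    policy="s3.tags",
    value="check",                             # report only, do not fix
    expires_at=granted + timedelta(days=90),   # 90-day approval
)

print(effective_setting("enforce", exception, granted + timedelta(days=30)))
print(effective_setting("enforce", exception, granted + timedelta(days=91)))
```

Because the expiry is evaluated at lookup time, nobody has to remember to remove the exception: once the 90 days pass, the bucket is back under the same rule as everything else.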

[00:19:54] Turbot also lets you manage things like permissions for the environment. We simplify 3.5 thousand AWS permissions at this point down to simple levels like user, metadata, read only, operator and admin, and we do that per service. So if I go to add a grant, we can see things like metadata, read only, operator, admin, owner and superuser, because there are special times. For each service we break that up that way, and it gives us the ability to do that through the whole AWS stack and more. Of course you want your permissions to be time-based so you can do automatic expiration, and things like that. [00:20:26][31.7]
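
Per-service permission levels with automatic expiration might be modelled as below. This is a hedged sketch: the level names are taken from the talk, but treating them as a strict ordering (each level including those before it), along with the `Grant` class and `is_allowed` function, is an assumption for illustration and says nothing about how Turbot maps levels to AWS permissions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# Assumed ordering, lowest to highest.
LEVELS = ["metadata", "readonly", "operator", "admin", "owner", "superuser"]


@dataclass(frozen=True)
class Grant:
    """A time-limited permission level for one user on one service."""
    user: str
    service: str
    level: str
    expires_at: datetime


def is_allowed(grants: List[Grant], user: str, service: str,
               required: str, now: datetime) -> bool:
    """True if the user holds an unexpired grant at or above `required`."""
    needed = LEVELS.index(required)
    return any(
        g.user == user
        and g.service == service
        and now < g.expires_at
        and LEVELS.index(g.level) >= needed
        for g in grants
    )


start = datetime(2019, 5, 1, tzinfo=timezone.utc)
grants = [Grant("alice", "s3", "operator", start + timedelta(days=7))]

print(is_allowed(grants, "alice", "s3", "readonly", start))
print(is_allowed(grants, "alice", "s3", "admin", start))
print(is_allowed(grants, "alice", "s3", "readonly", start + timedelta(days=8)))
```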

[00:20:27] So that's a quick taste of what governance can look like in real time, at large scale, with policy settings and all the stuff we're talking about there, to give the application teams the freedom to use that. In terms of the benefits of this: when we get it right, what happens? [00:20:38][11.7]

[00:20:39] The first thing is we just get a massive amount of speed, for something like... but that's software, right? I said operations is now a software problem. It's not a people problem, it's not a customization problem, it's a software problem. If you install this software in an environment, or choose a different one: seventeen hundred policies coming out of the gate. Best practice for identity and access, networking. All those things are automatically configured in the environment. The speed you get from that type of change is astounding compared to building it yourself. The second benefit is safety. We have one customer who implemented Turbot because their development teams were like "Hey, we got this, we don't need anyone in the middle, we're going to create our own stuff". What's the first thing that happens? Someone creates an S3 bucket and makes it public; Turbot stops it. They start complaining: "Hey, my bucket is not working, it's meant to be public". Why is your bucket meant to be public? "Well, I have to store the keys in it." The keys in it? "Yeah. I'm doing an external pipeline using another SaaS tool that needs access to the keys, so I want to put those in a bucket so I can reach them from there." I mean, you're kidding me... right? But this is the sort of stuff people do when they're just trying to solve problems. And why? Because cloud is hard and they lack the experience, and that's why we need governance and guardrails to make it more accessible. By having good governance and good guardrails and good rules that happen automatically, we can drastically open up the number of people who can use services and the number of services they're allowed to use. If we have to approve it all by hand... check everything they're doing... restrict them... we're just locking down that freedom and creating a hell of a lot of work for ourselves, which we don't need to do once we're in that world of automation. For more experienced users, it's all about productivity: "Why would I go and recreate all those things?" "Why do I need to learn how to do that?" One of my favorite stories for that was... We had a customer, and this was an internal one, and they came and started complaining about their $70 bill for the month. "We're just using S3, this shouldn't be seventy dollars..." And we say "well, let's have a look", because it's governance, we're multi-account, we can see what's going on. They were making 100 million requests a month to that bucket, driving a lot of cost. What was the actual thing they needed to do? Fix the application to stop querying, which, by the way, was running on the phones of everybody in the company, draining batteries all over the place. They fixed that, the bill goes to zero, but more importantly everybody's phone stops draining battery. Governance drives productivity for senior devs as well. [00:22:58][139.0]

[00:22:58] Finally, the main thing is, as we have this breadth of 120 services and the depth we need to cover them, you need your teams to get more and more to a Zen-like state where they're comfortable with what the cloud's firing at them. That is really hard to do; the pace of change here is so high. It's scary. You go to re:Invent; you go in super excited, you find out all that and you come away a bit overwhelmed. I'm sure many of you have had that feeling, but with good governance and software doing that, we know we're well protected out of the gate, so we can look to experiment and learn from there, knowing we have that breadth and depth of coverage. There's just no way we can build it ourselves from scratch. [00:22:58][0.0]

If you need any assistance, let us know in our Slack community #guardrails channel. If you are new to Turbot, connect with us to learn more!