AWS re:Invent 2019 - beyond the scripts - governance automation master class
Learn how we use governance automation to accelerate teams
Disclaimer: Automated Transcript
Good afternoon or morning. We're kind of in the middle, right before lunch. It is awesome to be here. My name is David Boeke and I am really excited about talking today about governance. I was just talking to my colleague yesterday and mentioned that we're only two days into the conference, but I'm feeling all this energy. I've been at the booth for the last two days talking to existing customers and new customers. And it's just incredibly energizing to be around tens of thousands of people that love this stuff that we do. Normally, in your day-to-day life, you can't even explain what you do to the person that you're meeting at a dinner party. And here everyone is in the same field. The vast majority of us love this and we're doing it because we have a passion for it. And so it's really my pleasure to speak to you about the topic of governance today.
The main thing that I want to get across in this talk is the power of governance at scale for your large enterprise. Before I jump into the talk, though, I'll level set a little bit.
I had a little bit of a governance adventure on my flight here. As I boarded the plane, the flight attendant asked to see my boarding pass. And I said, OK, I know where my seat is, you don't need to look. She's like, no, I want to make sure that you're on the correct flight. After all the TSA checks and the gate checks and all that stuff, we have that one final check at the entrance of the plane. So I'm going to do a little gate check right now and make sure all of you are on the correct flight. We're going to be talking for the next hour about governance. I'm going to explain what it is. I'm going to then talk about that in the context of large-scale public cloud adoption and what that means for that use case. And then I'll share with you the maturity models that I see on a day-to-day basis. I have one of the coolest jobs in the world. Not quite as cool as Werner's, but I do get to meet, on a week-to-week basis, large enterprises and large government organizations that are highly regulated and really, fundamentally, want to look at governance automation as a way to enable their cloud strategy. I talk to them, I understand their problems, I identify their pain points, and I help them orchestrate solutions for that. And always in the mindset of how can we do this through automation and not through manual processes.
I then get to take all that information I learn and turn around and work with by far the best development organization I've ever worked with and share those requirements with them. And then they produce some amazing software for those companies to use. So it's a very cool job. I want to share with you a lot of the maturity models that I see in large-scale public cloud adoption as it comes to governance, and where some of the pitfalls are. And then a lot of us are here because we love cloud. We love building things in the cloud. And as governance architects, you probably want to build this stuff yourself. I have built this stuff myself before. A lot of members of our team have done that as well. We love it. It's a lot of fun. So I'm going to walk through a solution architecture for how you would go ahead and build your own automation for large-scale cloud governance. And then I'll show you a demo of the way we approach the problem, looking at our tools. So if everybody's on the right flight, we'll go ahead and get started. The first thing I like to do in describing governance is frame it in the context of something that we're all familiar with, and we're all familiar with governments. That's where the word governance came from. We are all citizens of some government, and those governments do a great or not-so-great job of actually providing governance for their citizens. So let's talk about what we expect from our government. The first thing we expect from our government is policy, right? We want them to set rules and laws around how the government is operated and what the people can and can't do. We also want them to provide oversight. So we want them to police what's going on. We also want them to have judges that can adjudicate the gray areas in that space.
And then from a management standpoint, we give the government tax dollars and then hopefully they build infrastructure for us: roads, bridges, schools. We need those things in order to operate as a society. And it's generally better if there is some amount of government oversight into how those services are provided. We can argue about whether it's big government or small government, but there needs to be some. Then, the thing that most people think about when they think of cloud governance is protection. And that's the same thing for governments. As a citizen of a country, you want to be protected from external threats and internal threats. You want to be protected from your neighbor doing bad things as well as from foreign governments doing bad things, right? Finally, Turbot fundamentally believes, and I do as well, that an element of good governance, whether it's governments or IT governance, is transparency. Transparency of what the rules are and how those rules are being applied. How are the people who are adjudicating the rules approaching that, and what's the cost for them to do that? Where are our tax dollars going, where is the spend going, etc.? So we believe fundamentally that transparency is a huge part of it. And, as we're going to talk about in the next section, all of these things apply to IT governance as well, especially in the cloud space. We have frameworks and policies in IT. We have auditors that come around and make sure that people are following them. We also create infrastructure in terms of VPCs, in terms of networking, logging, etc. We also protect ourselves from threats: from people trying to steal our data, from insiders trying to do bad things, or from insiders doing stupid things. And hopefully through that process you create transparency as well into what you're doing, how you're doing it and who's doing what. What is that in service to?
There are probably some governments that don't agree with this, but I believe the purpose of government, and the purpose of that government providing governance, is to provide freedom for citizens. If governance is working appropriately, people who are law-abiding have freedom and can do what they need to do without the government getting in their way. In the same way, IT governance should be designed to provide freedom for the people that are using your applications, building your applications and analyzing data within your environment.
So it's a data scientist, it's an end user, it's an application development team. They're the citizens of this environment, right? They're the ones that we're governing over. And our job as governance architects is to build things in a way that gives them freedom.
How can I give them freedom and have them do what they need to do? And that freedom is going to be best enjoyed if it happens quickly, right?
In the government scenario, if you're falsely accused of a crime and you're in jail for seven years waiting for your trial, and then you're acquitted at trial, that probably doesn't feel like good governance to you, right? You were finally acquitted. Everyone said you didn't do anything bad, but you were in jail for seven years awaiting adjudication. Speed of governance, especially in IT, is critically important. And that is fundamentally one of the major shifts when we talk about cloud.
So let's talk now about that same framing. But what are the unique challenges that Cloud has for governance practices?
The first one is agility. Your business is moving to public cloud, moving to AWS, moving applications there because they want to achieve agility. Maybe there are also cost plays there. But fundamentally, for every single enterprise I've ever spoken to, the primary reason for moving to cloud is to gain agility. When I used to work at a Fortune 50 healthcare services company, it took six to eight weeks to have a VM delivered to an application team. And this is not procurement time. There is already hardware sitting in the data center, and someone needs to provision a VM on top of that hardware and then give access to an application team member. Six to eight weeks. That's incredible. The reason that occurred is that there were multiple manual processes with queue times, and those tickets went around between the various teams that all had to interact with each other. So you basically have this ticketing system and a flow, and all of that happens. Now it feels different. Actually, I was just at a customer site and we were sitting there trying to do something live. It was like six minutes that we were waiting for this Windows instance to spin up, and it was like, this is taking forever, right? And now it's like six to eight minutes. You have an EC2 instance in less time than that. And then when you start talking serverless, it's instantaneous. You're starting to measure that in milliseconds. That agility equation, do not forget that. The second one is expectations. As governance professionals and security professionals sitting in this environment, you have a huge amount of expectations on you. Amazon is publishing best practices. They released brand new features this week that are groundbreaking around security policy that can be measured logically.
And then you've got others publishing standards. You've got NIST, you've got HIPAA, you've got all these different organizations that are essentially saying: here are the best practices for doing these things.
And when you don't follow them, when you don't have them in place and you do have an issue, then it's your problem. It's not that application team's. It's going to come back to the governance professional and say: you could have known about this. You were at re:Invent in 2019. You heard all these things. Why didn't you implement them? So the expectations become your requirements. Those best practices that are published, because they're known in the world, become your requirements for your organization. And then the last piece is control. Central IT has lost control of the infrastructure. We no longer have those long lead times to deploy things and queue them through the environment. The application code itself can actually spin up the infrastructure. And in that model, the control is now with the application team. How they design their application to scale within the cloud is going to control what infrastructure is deployed. So all three of those things are tightly coupled, and in some ways we feel like they're fighting against each other, but they're not really. Hopefully I can show you that. So let's talk for just a minute, and I'm going to compare what we used to do in IT governance to what we can do today. The first thing is the rules. We used to have control frameworks: big, thick documents, PDFs sitting in a document management system.
We trained everybody. We put it on their goals for the year that they had to be trained three times a year on our frameworks. They open up the document, they look at the first page, they close it and they check the box that they did the training. And if you were more advanced, maybe you gave them checklists. So now they have checklists they can follow to make sure that they're doing those things. That's not possible in the cloud. When you move to public cloud, in order to automate the infrastructure, you need thousands of policies. Not just policies like don't do bad things, like don't open up public buckets. You need policies like: what's your subnet naming scheme? What is the hostname naming scheme for a new EC2 instance that spins up? What are your acceptable CIDR ranges? How do we prevent those CIDR ranges from overlapping? You need a policy engine and you need policy as code that can look at all those things and fundamentally make decisions about how to automate. Without all of those rules decided on, you will never be able to automate your infrastructure. You need to decide on those rules up front and you need to capture them.
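To make the idea of rules captured as code concrete, here is a minimal sketch in Python. It is only an illustration, not the tooling described in this talk: the naming pattern, the approved 10.0.0.0/8 address space and all function names are assumptions made up for the example.

```python
import ipaddress
import re

# Hypothetical naming rule, for illustration only: <app>-<env>-subnet-<two digits>
SUBNET_NAME_PATTERN = re.compile(r"^[a-z0-9]+-(dev|test|prod)-subnet-\d{2}$")

# Hypothetical approved address space for the whole organization.
APPROVED_RANGES = [ipaddress.ip_network("10.0.0.0/8")]


def name_is_valid(subnet_name):
    """Return True if the subnet name follows the assumed naming scheme."""
    return SUBNET_NAME_PATTERN.fullmatch(subnet_name) is not None


def cidr_is_approved(cidr):
    """Return True if the CIDR block sits inside one of the approved ranges."""
    network = ipaddress.ip_network(cidr)
    return any(
        network.version == approved.version and network.subnet_of(approved)
        for approved in APPROVED_RANGES
    )


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in the list that overlap each other."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    overlaps = []
    for i, first in enumerate(networks):
        for second in networks[i + 1:]:
            if first.overlaps(second):
                overlaps.append((str(first), str(second)))
    return overlaps
```

With rules in this form, an automation could call `find_overlaps` on the ranges already in use before handing a new range to a team, rather than relying on someone reading a document.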
Then there's oversight, which we talked about earlier. Normally what I've seen in large enterprises is that people do an architecture review, generally a day before they have to deploy to production, and then they kind of use that as a mechanism to get the security teams to give them exceptions. And then if you're lucky, if you're a high-value application, you'll have an internal auditor come in maybe once a year or once every other year, look at your application and see if things are done as they were written.
That's not what you need to do in cloud. We need real-time event correlation. We need to be able to respond, in real time, to that creation of infrastructure that the application is controlling. From a management standpoint, we had those central services and IT operations that we could queue through. Now we've got automation around those platform services and the infrastructure services. And protection is interesting as well. We used to think about protection in terms of barriers, right?
Let's do the physical security around our data center. Let's do the perimeter security around our network. And we're in this nice bastion, and no one can get by our moat, right? In the world of cloud, and we're going to talk about this in a little more detail, you need preventative and corrective controls that are running in real time and looking at your systems, because you can't protect the perimeter. Your perimeter is now the entire Internet. What you have to do is protect each of the individual assets that are running within your environment. And then finally, transparency. My favorite topic, if you guys can't figure that out. We used to update the CMDB once, if you were lucky, when that application deployed or was updated. If you had a good architect, he checked to make sure that everyone checked their information, their architecture diagrams, into ServiceNow. Hopefully, someone captured all the host names of the servers that were launched. Maybe someone updated the list of software that was installed on those servers, if you were lucky, if you were good. Now you can programmatically query the API and find out what's in your environment. That idea of having a static CMDB that's updated upon change of your application is completely out the window. You need a real-time, cloud-scale CMDB. You need to know every configuration change that's happening, whether it's happening programmatically or via people that are taking actions within your environment.
So, I really love this quote.
One of the major things that you're going to have to decide early on in your governance process, and this is the major pillar, is going to decide whether you accelerate quickly to the cloud. We had a speaker here last year from a biopharma company. They made the decision to do a full data center migration and went from zero dollars of AWS spend to $500,000 of monthly spend in less than six months. And fundamentally, this next decision that you make is going to decide whether you are a fast adopter of the cloud or whether you take a really long time and have a slow ramp-up. Everyone's going to get there. It's just how fast you go. And what I'm talking about is multi-account isolation. The single most important decision that you're going to make as part of your governance strategy is: do you do multi-account isolation or not? And this is what I typically see. Early on in cloud adoption, you've got a cowboy, someone on your R&D team that goes in and swipes a credit card. They build an account. They build an application that gets a lot of buzz, a lot of interest from the business. And they become a hero, and they've got their AWS account out there. And it's a social thing now. They've created a lot of interest in it. They've interested some other application teams in doing that. And then they say: hey, come work with me. You can have some space in my account. I'll create you a new VPC, or you can come in my VPC. We can all share this together. We'll be happy roommates and we'll make that work. And then that starts to get old. You start to get two or three people into that account.
They start to do Lambdas and they start to cross over, or someone is trying to fix their application and they change the security group, and then they break someone else's application. And what happens when they break someone else's application? I'm sure I heard some laughter in the audience. As soon as they break someone's application, IT comes in and says: please stop. You guys are doing this all wrong. We know exactly how to do this. We have a playbook. It's called ITIL. And now we're going to roll it out and we're going to have services. You guys are going to say, this is what I need, and you're going to ask the service team, and that service team is going to go deploy it for you. Everything's going to be nice and neat and clean. Our centralized teams are going to keep everybody from stepping on each other's toes. As a governance architect, as a security professional, as a CTO, this is the primary point of decision-making that's going to decide whether or not your cloud accelerates, because if you do that model, you will not be in a good place. You will be here in three years, having moved only a few workloads, and you will not have the cloud adoption or the benefit of the strategy that you put in place.
We feel the right model is to lean into multi-account isolation, to gain a whole suite of benefits in that process. What multi-account isolation is, is you isolate workloads into different AWS accounts. So the AWS account, the same account structure that separates Corporation A from Corporation B, now separates application A from application B, even to the point of separating application A's test environment from their production environment. Even to the point of separating application A's microservice 1 and microservice 2 and microservice 3 into separate accounts, if you really want to get crazy. I think AWS has some customers out there that we've talked to that have like twenty thousand AWS accounts. Now, that's super scale in this model, but more accounts is better than no accounts. And there's a bunch of reasons for that. The first one is that you can group resources together. Grouping resources together into accounts is a great way to manage costs. I can have all of my application teams have separate accounts, and they can have the costs of those accounts charged back to them directly. You also limit the blast radius. So if something bad happens, if you have a bad actor come into the environment, if someone makes a change that cascades, then that blast radius is limited. Let me see a show of hands.
How many people today are currently under a change freeze, a year-end change freeze, in their on-premises environments?
Yeah, about half, for the people on the video. And the rest of you will probably be in your year-end change freeze in a couple of weeks.
So the reason that we have a year-end change freeze is because businesses close their books. Public companies need to provide financials to the street at the end of the year, and they need to close their books quickly. And it's Christmas time and it's New Year's. And no one wants to be in IT, working and trying to get an ERP system or another system up and running at that time of the year. So we implement these change freezes to prevent cascading changes in the environment. Now you have a developer working on a new application that's going to drive new business, basically told that they can't deploy anything new to their environment until January 5th or January 15th. This is fundamentally what we don't want to replicate in the cloud. If you put things together, if they're in the same environment, you are going to run into that exact same change problem, and you're going to have the same lack of agility in the cloud that you do on premises.
Another surprising thing that people don't think about is account limits. AWS has pretty sane account limits for new accounts. So if you're building an application, you can spin that up. If you're putting five hundred applications into a single account, you're going to hit every single AWS account limit that exists. Now, a lot of those can be raised. You can call support, you can get account limits raised, get your API limits raised, etc. But there are some things that are hard limits that you can't have raised, and then you have to work around them. So by separating applications into separate accounts, you reduce that noise by ninety-five percent. You'll still have some big applications that bump into those things, and they can manage that themselves. But by and large, all of your dev accounts and most of your production accounts will never need to ask for account limit raises anymore. Then user access. Accounts are great ways to segregate user access in your environment. How many people were at re:Inforce? OK, that's a pretty good number of people for re:Inforce. We had a speaker there, one of our customers, McGraw-Hill Education. They have like 80 different two-pizza teams. They have those two-pizza dev teams, 80 of them, and they're all working on different applications at different stages of the development lifecycle. Accounts are great ways to separate and provide user access only to the resources that those teams are working on. They can't see logs from other applications. They can't steal code from other applications. They can't break other applications. It's a great way to do user access.
When you think about this, it is fundamentally possible, if you're just lifting and shifting from on premises and bringing in a bunch of EC2 instances, to manage this through tagging and other things. But as soon as your teams start to build cloud-native applications, try to get four or five development teams working on a Lambda-based application, all working in the same account, and then come talk to me. It's possible, but it's a lot of overhead to manage it.
The last point I'll make on multi-account isolation, for all you regulated companies, is auditability. If you have multiple applications running in the same account, and an auditor comes in and starts looking at one application, and that application shares logs or shares data or shares change with other applications, they immediately can expand their audit scope to those other applications. So they find a change that occurred within the account, and then they start tracing that back and they find that it was application B. Application B now came into scope of that audit. So you just increased your audit risk by a factor of X, depending on how many applications you have in that environment. Those are all great things.
I fundamentally believe this. That's one key decision. We're going to talk about another one here, which is the maturity model and which way we go. So I see three basic models for companies that are adopting cloud. They generally look at control and agility as opposite ends of the spectrum: we're either going to have control or we're going to provide agility. A lot of companies are stuck in this, stuck in viability. They spend weeks or months or years actually trying to figure out how to do this stuff. They come to re:Invent. They get information. They go back. They put together PowerPoint decks. They share them with each other. They talk about what they're going to do. They talk about what they're going to do some more. And they never launch a single server. They never create a single account. Or maybe they create a dev account and they play around with it a bit, but they don't really do a lot. And then you have other organizations that actually go: yeah, now we're going to do this. We're going to roll out ITIL, and we're going to move all of our existing on-premises processes out to the cloud, because we know that's safe and secure. If we do that, there's not going to be any questions. We don't have to change anything. We just roll that out. And so what you end up with is an environment that's fairly stagnant, similar to on premises: same processes, et cetera. You don't have a lot of agility in that environment, and your business partners start to get antsy. When those business partners start to get antsy, you start to see those rogue actors come out. They start building things, swiping the credit card, doing stuff on the side. Eventually one of them will do something stupid and get in trouble, and then IT will come back in and start bringing those things in. They'll pull them out of there, and then they'll go even farther into the control frame and you'll have even less agility.
The other model isn't really good either. You end up having IT basically just publishing some best practices and then letting the cloud teams go and do whatever they want. I have a large multinational company that we work with that did this for five years. They ended up with four hundred AWS accounts, all different teams, no control, everybody using the same CIDR address ranges, using default VPCs. You had some teams that were super diligent and came and did all the research and did everything right. And you had other teams that basically just started spinning stuff up and not giving one care in the world about what they were doing. And now they're trying to bring all that back. And as soon as you start bringing that back, it doesn't feel good, right? It never feels good. There's something about the human condition: we hate having things taken away from us. And so no matter what the benefits side of that is, you as a governance team, if you're doing this late stage, are going to be the guys that are saying, no, stop doing what you're doing, et cetera. So if those aren't the right ways, what's the right way?
Well, fundamentally, I think this is the magic trick that nobody knows about or no one really thinks about, which is that we always think of governance and we think of this control and agility as being opposite ends of the spectrum.
But the reality is, when done right, when done through automation, the control aspects, the governance aspects, are the actual thing that sets your cloud free and accelerates it. Because think about it. If you do governance right, if you provide all five of those services through your governance platform and you automate it, those developers can get started immediately. They're not worried about the five-minute server anymore.
They're worried about the five-minute account. Everything is set up for them. Their VPC is set up. Their logging is set up. GuardDuty is set up. Controls are set up. We're going to talk about some of the architecture, but everything is in place and ready for them to start building their application. You've just accelerated their timeline by an order of magnitude. They can go try something, they can fail quickly, they can turn it all off when it's done. The magic trick here is that the governance that we've been afraid to put in place is actually the thing that sets you free and allows you to accelerate and execute your business strategy.
So what are the capabilities of that? You need to have central logging set up. You need to have identity federation set up. You need to start writing some scripts. You need to be tagging resources.
You need to be looking at the environment. How do I set up GuardDuty, and how do I do that consistently? How do I make sure that CloudTrail is sending to a central bucket, etc.? You need to create some boundary limits. What can people do and what can't people do? Are they allowed to use any region? Are they allowed to use a subset of regions? You've got to think about that. And then you want to start building some checking scripts. Let me check what they're doing. Let me see what they're spinning up and make sure that they've got encryption on. Let me make sure that they've got data protection set and they're backing up data.
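As a rough illustration of what one of those checking scripts could look like, here is a small Python sketch. It works on plain dictionaries rather than real AWS API responses; the resource shape, the allowed-region list and the violation names are all assumptions invented for this example.

```python
# Hypothetical boundary, for illustration: the only regions teams may use.
ALLOWED_REGIONS = {"us-east-1", "us-west-2"}


def check_resource(resource):
    """Return the list of policy violations for one resource description.

    `resource` is a plain dict in a made-up shape, for example:
    {"id": "vol-1", "region": "us-east-1", "encrypted": True, "backup_enabled": True}
    """
    violations = []
    if resource.get("region") not in ALLOWED_REGIONS:
        violations.append("region-not-allowed")
    if not resource.get("encrypted", False):
        violations.append("encryption-off")
    if not resource.get("backup_enabled", False):
        violations.append("backup-off")
    return violations


def check_inventory(resources):
    """Map each non-compliant resource id to its list of violations."""
    report = {}
    for resource in resources:
        violations = check_resource(resource)
        if violations:
            report[resource["id"]] = violations
    return report
```

In a real environment the inventory would come from the cloud provider's APIs or from change events; the point of the sketch is only that each boundary and each check becomes a small, testable rule.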
Also, let's record change. We've talked about that cloud-scale CMDB: everything that's changing in the environment. If we're checking it, then we can record that and we can see what's happening. And now we can start looking at those events and triggering off of them. So now when someone spins up an EC2 instance, we can kick off a Lambda, we can see what's going on and drive that. And then we can start notifying people: hey, I saw that you created this security group and you opened up SSH to the default route.
That's not our policy. The automation put it back. Hopefully that's an automated message that they get. Exception management is really interesting because it's the thing that most people don't think about when they're doing this. I talked about having thousands of policies. For every policy that you write, you're going to have multiple exceptions. There will be approved exceptions in your environment. The most common overarching policy for everybody in this room is probably no public access to S3. But we probably all have website accounts, and they have public S3 buckets to serve binary objects. It's a perfectly valid use case. It's a great use of the tool. That is an exception that needs to exist in your environment. And you will have thousands of those exceptions, and then you have to have a way to manage them. You have to know who created the exception, who approved the exception, and when it was created. What's the approval timeframe for that? Does that exception go away after 90 days? Was it a POC, where we allowed somebody to try something out and then we pull it back? Those are things that you have to manage. And then finally, the transparency equation. We need a dashboard. We need a way to share this information with our stakeholders, who are basically funding this effort, and be able to show them what we're doing. And we also need a dashboard for our citizens, for our application teams and our data scientists, so they know what's going on as well.
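The exception bookkeeping described above (who created it, who approved it, when, and whether it lapses after 90 days) can be sketched as a small data structure. This is a hypothetical Python illustration only; the class, field names and policy identifiers are invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PolicyException:
    """One approved exception to one policy for one resource (made-up shape)."""
    policy_id: str
    resource_id: str
    created_by: str
    approved_by: str
    created_on: date
    valid_days: int = 90  # assumed default approval timeframe

    def expires_on(self):
        return self.created_on + timedelta(days=self.valid_days)

    def is_active(self, today):
        return self.created_on <= today < self.expires_on()


def is_violation(policy_id, resource_id, exceptions, today):
    """A finding counts as a violation unless an active exception covers it."""
    for exc in exceptions:
        if (
            exc.policy_id == policy_id
            and exc.resource_id == resource_id
            and exc.is_active(today)
        ):
            return False
    return True
```

For instance, a public website bucket could carry an exception to a "no public S3 access" policy; once the exception expires, the same bucket shows up as a violation again and someone has to renew or remove it.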
So at this point in time, I think the second major thing that you have to think about is the pushback. I've maybe talked to three hundred people in the last two days at the booth. And the major pushback that I get is: we don't need that because we do everything through configuration. Right. So all of our architectures are designed. They're built into CloudFormation templates or Terraform templates. They're checked into version control. We have architects looking at them and building them. And so we don't really need to check all this stuff behind the scenes because no one can ever do anything in our accounts. Right. The only way to do anything in our account is to go through this whole process. And I don't dissuade anyone from that. That is a great use case. It's the nirvana of infrastructure as code that we all should aspire to get to. However, it doesn't work for everything. Right. So you have the default configuration of a brand new account. Right. A lot of that can be built as stacks, but some of it can't. You have preexisting accounts, right? So you'll have those rogue users that already had their AWS accounts. And then you'll also have mergers and acquisitions. When you get those M&As, you have another entire organization that comes into your environment, and you have a time crunch of 90 days to get the M&A done, and you have to bring them into your governance platform. Right.
You also have ad hoc configuration that happens, right. Someone's trying to troubleshoot a production issue and they jump into the console and they change something. You have to deal with that. You have infrastructure that's created by code, right? I have a company that's doing IoT. They are providing an IoT service for other businesses. So people are logging into their application on a minute-by-minute basis and signing up for the service. And if they sign up for the service and they check the box that says, I want my data isolated, these guys spin up separate VPCs, separate databases, separate instances, right? The application architecture is dynamically responding based on the usage of the application. There is no way that you can do that through configuration. There's no way that an architect could look at the configuration of that application at the beginning and say, yep, that looks good, right? You need to be looking at it at runtime and seeing how it's behaving with the variables that were built into it. And then finally, elevated access. Right. For most of our financial services customers, this is their main concern, which is: regardless of how much control you have, there is somebody in your organization that has access to your organization's account, that can create exceptions to your SCPs, that can log into those accounts and circumvent any control that you have in place and do something, right. And you need an auditable mechanism to be able to tell what they did, when they did it, and why they did it, and to give them that access for a limited amount of time.
Tie that specific access to a specific ticket that they were working on. And if it's not tied to a specific ticket they're working on, sit down and have a conversation with them as to why they were logged in as super user at 2:00 a.m. on Thursday night.
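As a rough illustration of that audit rule, the sketch below flags elevated sessions that have no ticket attached or that ran past a time limit. The record shape, the one-hour limit and the ticket format are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ElevatedSession:
    """A privileged login, as an audit record might capture it (hypothetical shape)."""
    user: str
    started: datetime
    ended: datetime
    ticket: Optional[str]  # hypothetical ticket reference; None when nothing was linked


MAX_DURATION = timedelta(hours=1)  # assumed time limit for elevated access


def review_findings(session: ElevatedSession) -> list:
    """Return the reasons this session deserves a follow-up conversation."""
    findings = []
    if not session.ticket:
        findings.append("no ticket linked")
    if session.ended - session.started > MAX_DURATION:
        findings.append("exceeded time limit")
    return findings


# A 2:00 a.m. session with no ticket that ran for three hours.
late_night = ElevatedSession(
    user="admin-user",
    started=datetime(2019, 12, 5, 2, 0),
    ended=datetime(2019, 12, 5, 5, 0),
    ticket=None,
)
```

Here `review_findings(late_night)` reports both problems, while a short session with a ticket attached returns an empty list.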
Right. So continuous compliance is required regardless of the maturity of your CI/CD pipeline or your architecture. You need to have both. Defense in depth.
So now that we've talked about the requirements of this architecture, of this governance platform that we're going to build, let's walk through the fun part. Let's walk through an architecture. So before we start putting a single line of code in place, I'm going to sit down and we're going to whiteboard out what the actual architecture will be to create this application. So the first thing that we're going to need is an AWS Organization. Right. AWS Organizations is the foundation of what we're building here. We're going to create the organization, create a billing account, create a central logging account.
Create a networking account, a central networking account for Direct Connects and VPNs. Create an audit account for the auditors to come into, where all that data goes. And if we're doing single sign-on, we'll probably have a single sign-on account. There might be some other core service accounts that you build into this as well. It just kind of depends on the architecture that you're doing. You then have to make a lot of decisions around IAM. Am I going to use programmatic users?
Am I going to use roles? Am I going to use combinations of groups and users and roles to achieve different control objectives? Right. The problem with doing one of those things is that if you go into roles, you're either going to have a lot of roles that you manage, which people don't like to do, or you're going to have roles that are very broad in scope. Right. And then you have users that have a lot more access than they need.
So you really have to fundamentally decide between fine-grained access for an individual person, or fine-grained access for a role and then having a lot of roles.
And we need to make sure that we're good. Obviously we want to track everything that we're doing. CloudTrail is a preeminent service. Every single API call that you make on AWS goes to CloudTrail. So we need to have that as well.
So now we're developers, right? We're a governance team. We used to write documents and WIP and frameworks, and now we're going to be writing code. So we need a code development environment. Let's start looking at the AWS SDKs. Pick your favorite one. I like Python. Our developers like the Node SDK. And then you need a place to store that code. So CodeCommit is a great Git repo for that. That's private within your environment. And then we can start orchestrating some of this stuff together. We put our code into CodeCommit. We then create a pipeline through CodeDeploy, and then we get some CloudFormation templates and some Lambdas out the other end. Let's start building the account structure. This is the table stakes of doing this on your own and building this infrastructure. Right.
So what are we going to do next? Well, we'll set up a place where we're going to test this automation that we're building. Right. So we need a canary account where we can actually build and test our governance automation. We're going to connect that to AWS Single Sign-On or our favorite single sign-on of choice.
We need to make sure that we're using KMS. Right. You know, Werner loves to wear the "Encrypt Everything" T-shirt. Encryption is one of the easiest, lowest-cost things that you can do that adds a measure of security. So you have to decide: hey, am I going to make people use customer managed keys? And if I do, do I make sure they're rotated, et cetera? So I make those decisions and then build the automation for it. Am I going to hook all my accounts up to GuardDuty and monitor all the API activity through GuardDuty? Then I'm writing the automation that invites those new satellite accounts into your organization and then sets up the GuardDuty master/member environment. We might hook all that up into Security Hub so we have visibility into it. We definitely need to turn on AWS Config and Config rules, and look at some of the existing rules, and probably write a lot of our own on top of that.
And then we start talking about EC2 instances. If you're a large enterprise, you have a lot of on-premises stuff you want to move over to the cloud, and you want to make sure that stuff is patched. SSM is a great tool for interrogating what's on your instances and then also patching them. And then you have all those default VPCs that are sitting out there.
So the interesting thing about default VPCs is that what happens is you leave those default VPCs in place and you build your environment all in us-east-1. You set up all your Direct Connects in us-east-1, you create all that. A new user comes in, they get their new account, they log into AWS for the first time and they start building stuff and they start putting stuff into their VPC. And two weeks later they figure out that they just built all that in us-east-2, because that was the default setting in the browser when they came in and they didn't switch. Right. So a really important thing to do when you're bootstrapping an account is deleting all the default VPCs, using a script for that. Right.
And then you need scripts to stand up your VPCs. What are your design patterns for your VPCs? What is the T-shirt sizing for that? Right. You probably have a small, medium and large VPC: maybe a /26 is a small, a /24 is a medium, a /22 is a large. You need to define the CIDR ranges that are acceptable for that use, because you don't want these application teams stepping on each other's CIDR ranges. So you need to build that in, and then you need to connect that VPC to your Transit Gateway. Right. You had your networking account. That's where you ran your Direct Connects and your VPNs. You're now going to connect that VPC to your Transit Gateway. So you have your networking in place. You have all your security control plane in place. You deleted all the default VPCs, and now you're really ready to get cooking.
So I'm going to onboard an account. What's the first thing I need? I need a database of accounts, right. What are the accounts that I have? What is the metadata around them? What are the account tags? What's the purpose of that account? Who owns it, et cetera. So I need an account table. DynamoDB is a great tool. You can quickly create an account table. It's NoSQL, schemaless. You can just kind of continue adding data to it over time as you find that you need new data. And you also need a place to store all of your logs.
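The T-shirt sizing idea described above can be sketched with Python's standard `ipaddress` module. This is an illustration under stated assumptions: the /26, /24 and /22 sizes follow the talk, while the `10.0.0.0/16` pool and the first-fit strategy are hypothetical choices made for the example.

```python
import ipaddress

# T-shirt sizes from the talk: small = /26, medium = /24, large = /22.
TSHIRT_PREFIX = {"small": 26, "medium": 24, "large": 22}


def allocate(pool: str, allocated: list, size: str) -> ipaddress.IPv4Network:
    """Return the first block of the requested size in `pool` that overlaps
    none of the CIDR ranges already handed out to other teams."""
    pool_net = ipaddress.ip_network(pool)
    taken = [ipaddress.ip_network(cidr) for cidr in allocated]
    for candidate in pool_net.subnets(new_prefix=TSHIRT_PREFIX[size]):
        if not any(candidate.overlaps(t) for t in taken):
            return candidate
    raise ValueError(f"no free {size} block left in {pool}")


# Example: one team already holds a large block, the next asks for a medium one.
next_block = allocate("10.0.0.0/16", ["10.0.0.0/22"], "medium")
```

In practice the `allocated` list would come from the account table, so every new VPC request is checked against what every other application team already owns.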
Right. So of all of these Lambda functions that are running, some of them will work and some of them will not work. Right. You never know.
You might show up on a Tuesday and Amazon announces a new type of availability zone in Los Angeles that has a different naming convention than every other availability zone. And it breaks some of your regexes and some of your Lambdas.
And so then you have to respond to that, right. So you need a place to store your log and configuration files so you can see when things are working and when things aren't working.
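To make the naming-convention problem concrete: a pattern written for names like `us-east-1a` rejects a Local Zone name such as `us-west-2-lax-1a`. The sketch below contrasts a strict pattern with a more tolerant one. Both patterns are illustrative assumptions, and neither handles every partition (for example, GovCloud region names have an extra segment).

```python
import re

# Strict pattern: region plus a single zone letter, e.g. "us-east-1a".
STRICT_AZ = re.compile(r"^[a-z]{2}-[a-z]+-\d[a-z]$")

# Tolerant pattern: allows an optional "-<location>-<n>" segment between the
# region and the zone letter, e.g. "us-west-2-lax-1a".
TOLERANT_AZ = re.compile(
    r"^(?P<region>[a-z]{2}-[a-z]+-\d)"
    r"(?:-(?P<location>[a-z]+)-\d)?"
    r"(?P<zone>[a-z])$"
)


def parse_region(az_name: str):
    """Return the parent region of a zone name, or None if the name is not recognized."""
    match = TOLERANT_AZ.match(az_name)
    return match.group("region") if match else None
```

Returning `None` for unrecognized names, rather than raising, lets the calling function log the surprise and carry on, which is exactly the kind of entry you want to find in those log files.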
So now, after all that's ready, we're ready to launch our first production account. We create that account and we're going to send that to the team. As soon as I've sent it to the team, they're going to start building things. Right. And so I want to see what they're building. Right. I want to have visibility. I want to have that real-time access to what's going on. So I need to build an event stream around that. So I take that CloudTrail information. CloudTrail is monitoring everything in their account. Every time something changes, an API call is made. I send that to CloudWatch. CloudWatch filters it. I don't care about everything. I care about certain events. I filter the events in CloudWatch, I send them to SNS, and I put them on a queue. Right. So I now have a series of events put on a queue. My Lambdas pick that up. And then when someone launches an RDS instance in a public subnet, I know about it, and I can write a rule that says that's not acceptable. Let's terminate that instance and send that user a message that they're not allowed to do that.
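The filter-then-evaluate step at the end of that pipeline might look roughly like the sketch below. It works on plain dictionaries shaped like CloudTrail records. The exact request-parameter names (`publiclyAccessible`, `dBInstanceIdentifier`) are assumptions about the event shape, and checking the public-access flag is a simplification of the "public subnet" rule from the talk, which would need a subnet lookup.

```python
# Events worth evaluating; everything else is dropped by the filter.
WATCHED_EVENTS = {("rds.amazonaws.com", "CreateDBInstance")}


def is_watched(event: dict) -> bool:
    """Filter step: keep only the API calls that some policy cares about."""
    return (event.get("eventSource"), event.get("eventName")) in WATCHED_EVENTS


def evaluate(event: dict):
    """Rule step: return a remediation decision for a violating event, else None."""
    if not is_watched(event):
        return None
    params = event.get("requestParameters") or {}
    if params.get("publiclyAccessible"):  # assumed field name
        return {
            "action": "terminate",
            "resource": params.get("dBInstanceIdentifier"),  # assumed field name
            "notify": event.get("userIdentity", {}).get("arn"),
            "reason": "RDS instances must not be publicly accessible",
        }
    return None


def handle_batch(events: list) -> list:
    """What a queue-triggered function would do with one batch of events."""
    return [decision for decision in map(evaluate, events) if decision is not None]
```

The decision carries both the action and the identity to notify, so the remediation and the message to the user come from the same evaluation.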
That is the fundamental architecture that you need. You can then start building bells and whistles on top of it. Right. The next set of data elements that you're going to record are: what are the events, what are the notifications, what are the policies that you've written, and then what are the exceptions that you've written? And you probably need a relational database for that. RDS is a great solution for that as well. You can also create some visibility into the log files that you're putting into S3 using Athena. You'll literally have millions of log files sitting there after a few months, because you're monitoring all these things in the real-time events. Athena is a great way to kind of research and get information. And then you can use something like SES for sending email notifications. This is that fundamental architecture that you need to build in order to do governance at scale. You can do this on your own. It is a lot of fun to build. You'll learn a lot along the way. If you don't have time to build that, or you don't have the team that has the expertise to build that, that can be tough, because Amazon is pushing a thousand changes a year. Just this week, we went from like one hundred sixty-five services to like one hundred seventy-five services. We'll probably be in the one-eighties before the week's over, right. They have 16 regions and growing. That is a huge amount of change to absorb.
You need a big team that's monitoring that stuff and doing it. When Amazon releases a new service, your job as a governance architect is to evaluate it, to say what is the enterprise configuration of that service and what are the guardrails that I need around that service, to decide if it's going to be used within my environment or not. So you need the team in place in order to do that. How Turbot thinks about this is we think about it in terms of creating that freedom for the application teams, making sure that the cloud team has automated guardrails, and giving the cloud team the ability to specify their policies and define them.
The application teams fundamentally need to have self-service. Those application teams, we want them to directly use the cloud. If they like using the console, let them use the console. If they want to use Terraform, let them use Terraform. If they want to use the CLI or the APIs, let them use those things. Do not abstract your users from the tools that they love. That will slow them down. Let them use whatever tool. But you've got to be cognizant that you've got to run alongside. You've got to watch what they're doing, and you have to put guardrails and boundaries in place on what they can do, to prevent them from creating risk for the organization. And that's essentially what Turbot does. We have the software that we were just talking about, a very similar architecture to this, that you can deploy within one of your environments, that monitors all the activity across all of your clouds and all of your accounts, and Kubernetes, and the OS level. We do that in real time. We ingest all that information, and then we do real-time matching of your policy sets against that data and tell you what's wrong and give you the tools to automatically remediate it. So let me just quickly drop over to a demo and kind of show you how we do that.
So this is the Turbot console. This is our web-based UI that we provide to our citizens to let them see what's going on. So this is a governance architect view, but it can also be an end-user or application team view, or a dashboard for your management as well.
So I'm logged in here to this re:Invent demo account and I have access to that account. So with that single sign-on, the federated identity that we were talking about before, I can simply choose a role that I want to log in as and then get an STS session out to AWS.
So with that single click I was able to authenticate to AWS with a time-limited STS token. I only have access for a limited amount of time.
What's important about that is that the users never have access keys or credentials that they can take with them and go home and sign into your environment. In order to log in, they need to be on your network, because the software, your cloud portal, is essentially running on your intranet. So when those users go to log in, they have to be on your network and they have to authenticate to that environment before they can actually log in and do anything. Right. And that's typically a federated identity. Right. So you're logging in with SAML, you're logging in with ADFS, you're logging in with your AD user credentials, et cetera. That's how you get access to the environment. I just showed access in the console, but this works the same way elsewhere. The same STS sessions are vended through the CLI as well as through the API. Right. So you can write automation around this. You don't have to give out access keys and you don't have to give out user credentials to any user.
That's really important in terms of securing your environment. So now when I'm in here as a user, I can do some things. I don't have a bunch of scripts on hand, so I'll just create a few buckets here. Reinvent demo, if I could spell. And I will create the bucket.
I'll take a look at it so you can see. When I created that bucket, I basically created what we call a naked bucket, right. So I didn't really do any configuration to it. I didn't add any tags to it. I didn't add any default encryption. I didn't add versioning. I was a pretty bad steward of my environment.
On the flip side, though, Turbot knows about that bucket. Right.
So Turbot already discovered that the bucket existed. How did it know that it exists? Because we use that architecture I was talking about earlier. CloudTrail sent an event through SNS to SQS, and a Turbot Lambda picked that up and basically said, oh, there's a new bucket that was just created. Right. And now if I look at the activity on that bucket, I can see the activity. I can actually see, all the way down here, less than a minute ago.
Myself, as this persona, was logged in and created the bucket. And then Turbot started working on it, right. And we have a lot of policies that we run on buckets. The first thing we do is we ingest all of the current configuration of the bucket. You can see that configuration right here in this details page. This is a YAML configuration of all of the things that exist on that bucket.
And then we started to alarm. Why did we alarm? So let's look at some of these alarms. So first, we had a couple of alarms here.
First alarm is that the tags weren't set correctly. So I have standards as an enterprise around how I tag information and how I make that work. I also have standards around versioning. Right.
So in this case, I have versioning not set correctly. So versioning wasn't enabled, and encryption in transit wasn't enabled, and probably encryption at rest wasn't enabled as well. So, yes, default encryption at rest wasn't enabled.
Now that triggers additional Lambdas that then go fix those things. Right. This is the key to compliance at scale. If you are a governance team and you have something that's scanning your environment once a day and sending you a list of a thousand things that are wrong, you are swamped. Right? You are never going to be able to write that application, because you're going to be spending your entire time in email, basically conversing with those application teams, begging them to please change their stuff and fix it. Right. You need automation that will automatically fix it at the time that they provision those resources.
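The detect-then-fix loop for the naked bucket can be sketched as two small functions over a configuration dictionary. This is only an illustration: the configuration keys, the required tag names and the choice to leave tag values for a human are assumptions, not how any particular product models a bucket.

```python
REQUIRED_TAGS = ("environment", "owner")  # hypothetical tagging standard


def find_alarms(bucket: dict) -> list:
    """Compare a bucket's recorded configuration against the enterprise standards."""
    alarms = []
    tags = bucket.get("tags", {})
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        alarms.append("tags: missing " + ", ".join(missing))
    if bucket.get("versioning") != "Enabled":
        alarms.append("versioning: not enabled")
    if not bucket.get("default_encryption"):
        alarms.append("encryption at rest: no default encryption")
    return alarms


def remediate(bucket: dict) -> dict:
    """Return a corrected copy of the configuration.

    Versioning and encryption have a single correct value, so they are fixed
    automatically. Tag values need a human decision, so that alarm is left open.
    """
    fixed = dict(bucket)
    fixed["versioning"] = "Enabled"
    fixed["default_encryption"] = "AES256"
    return fixed


naked = {"name": "demo-bucket"}
```

Running `find_alarms(remediate(naked))` leaves only the tagging alarm, which mirrors the demo: the settings are corrected automatically and the owner is asked to supply the environment.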
So all that happened within a minute. If you look at this, at 12:13 p.m. I created the bucket, and by 12:14 p.m. we did that. So if I go back into the S3 management console and I now refresh the bucket, we can see that Turbot has now set default encryption, AES-256. It set a bunch of tags on the bucket. Oh, look, it says this is a bad tag: environment can't be blank, right? You need to have an environment. But Turbot doesn't know what environment this bucket is. So let me tell it that it is a prod bucket.
So I just told it it's a prod bucket.
Whoops, I clicked on one thing. And then Turbot also enabled versioning for that bucket.
Now, let's say for the sake of argument that versioning is expensive. It's not really expensive, but let's just say it's expensive for my use case and I don't want versioning on development environments. Right. That's not something that Turbot has a built-in guardrail for. So how do we approach that problem?
I'll go back to my view here.
I'm going to take a look at policies now. So I'm going to go in here to policy types.
I go to AWS, go to S3.
Approved. Whoops, sorry. Bucket versioning.
And then I'll look at the policies around bucket versioning, so I can see that I have one policy here, and it's what we call a calculated policy.
A calculated policy is interesting because it gives your governance team the ability to extend the rules that we've already pre-built.
So I can have a simple rule, if I wanted to create a simple rule on this.
I could switch to standard mode and I could just say check that it's enabled, or enforce that it's enabled. I can do simple rules like that through the UI or through Terraform.
I can also go into this calculated mode. And this calculated mode is really cool because of what it can do.
Well, I don't need the example, but you guys can tell I can write a simple GraphQL query that pulls in the bucket tags for a bucket, and then I can look for a tag named environment. And if the environment is dev, then I'll set versioning to disabled. Otherwise, I'll set versioning to enabled. Right. So that's why versioning was initially enabled. The value of that tag was "bad tag", and so that didn't match dev. So now I know that that policy exists and is in place. I'll just go back out here and I'll update my tag to dev.
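The logic of that calculated policy is small enough to write out. The function below is a plain Python restatement of the rule as described, not the query or template syntax used in the demo, and the return values `"Enabled"` and `"Disabled"` are placeholder names for the two policy settings.

```python
def versioning_policy(tags: dict) -> str:
    """Calculated policy: derive the desired versioning setting from the bucket's tags.

    Only an exact 'dev' value of the 'environment' tag turns versioning off;
    a missing tag or any other value (including a placeholder) keeps it on.
    """
    if tags.get("environment") == "dev":
        return "Disabled"
    return "Enabled"
```

Because the default branch is "Enabled", a bucket with a missing or placeholder tag gets the safer setting, which is why versioning was switched on before the tag was corrected.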
And save it. So now the tag is dev, and let's see what happens on the bucket. So again, that whole event stream is still taking place. When I update the tag, that sends a tagging event back through that same toolchain and comes back, and you can see that the bucket was updated.
You can see that the bucket was updated here and I can actually see what was changed, so not only do I have visibility into what's going on in my account, but I have line item visibility into how that configuration is changing over time.
So I can literally see here, on this bucket updated entry, that I changed that bucket: the value of that individual tag went from prod to dev. And we have this granularity at every layer of the infrastructure, all the way down to the OS level. Right.
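Line-item visibility of that kind comes down to diffing two snapshots of a resource's configuration. The following is a minimal sketch over nested dictionaries. The snapshot shape and the dotted path notation are illustrative assumptions.

```python
def diff_config(before: dict, after: dict, path: str = "") -> list:
    """Return (path, old, new) for every value that differs between two snapshots.

    Nested dictionaries are walked recursively. A key that exists on only one
    side is reported with None for the missing value.
    """
    changes = []
    for key in sorted(set(before) | set(after)):
        here = f"{path}.{key}" if path else key
        old, new = before.get(key), after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            changes.extend(diff_config(old, new, here))
        elif old != new:
            changes.append((here, old, new))
    return changes


before = {"tags": {"environment": "prod"}, "versioning": "Enabled"}
after = {"tags": {"environment": "dev"}, "versioning": "Suspended"}
```

Storing each snapshot along with its diff gives you both views from the demo: the current configuration and the history of how it changed.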
And you can also see now that because I've changed it to dev, Turbot also responded to it and set versioning to suspended. So we'll just take a look at that real quick, so you guys can tell I don't have any magic up my sleeve here. You can see versioning has been disabled on the bucket without any action on my own other than updating the tag. Now, that's activity on a bucket. It's pretty simple to do. Let's look at something that's a little bit more advanced, a little bit harder to actually do. And that is EC2 instances at the host-based level.
So what I'm going to do is I'm going to go to SSM.
Systems Manager.
And I'm going to use this awesome tool from AWS called Session Manager.
In Session Manager, I'm going to start a session. This is going to allow me to start a session on one of these Ubuntu instances that are running here, and I'm going to get an SSH session.
What's really cool about Session Manager is that it's using my federated identity. So if you remember, I logged into Turbot with my enterprise federated identity. Turbot then logged me in with a time-limited STS token into AWS. And now I'm authenticating at the instance level in that same persona. So I know my federated identity from the enterprise is logged into that server, and every action that I take in here is recorded in CloudTrail.
Everything that I do is recorded in CloudTrail. And this is super small. So hopefully... There we go. All right. So the first thing I'm going to do is... I wonder why that's not working. I'm typing, but it's not doing anything. All right. I'm just going to launch Session Manager again. I'll do this demo instance 05. Maybe there was an issue there with 06. Actually, this might be my Internet connection as we hang out here. All right. So let's go to 05 and start a session. And we get a command prompt. All right, so now hopefully making this bigger doesn't screw it up. sudo su ubuntu.
So I'm going to switch over to the ubuntu user. So I was logged in as the SSM user. I'm now the ubuntu user. I can cd to my home, so I can see that a little bit better. So I'm on this Ubuntu instance. I am an administrator on this instance, so I have access to do things, right.
I can sudo. So I'm going to do something bad here. I'm going to sudo to /etc/ssh/sshd_config. Oh, there you are. Thank you. I'll do vim, make the people happy. All right.
So I'm now in sshd_config and I'm going to do a couple of bad things. I'm going to say I don't care, let's permit empty passwords and let's permit root login. So two things that I want to do. And I'm going to go to the extent of just saying, yeah, I don't care, just permit login. Right. So I'll save that. And now I'll pop back out to Turbot. So in the same way that we get that event stream from CloudTrail, Turbot has a way to bootstrap osquery and send an event stream of all actions that are taking place on your host-based OSes, right.
So let's drill in and find that. I'm in us-east-1. I'll go back here, and we've got a couple of instances. I didn't notice which instance I did that on. Well, we'll take a look. So here's all the information that Turbot knows about that EC2 instance, and then I can actually look at the Ubuntu-based host that's running on there and I can see all the activity. If I want to look at the resources and I want to see all the kernel modules that are installed in this environment, I can do that. But I can also set policies at this level, right? So I'm at this host level. I'll filter policies. AWS... I'm sorry.
All policy types: Linux, services, SSH, sshd, sshd config. And we can see here that I've got a bunch of things that I can do. So that permit empty passwords is a policy that we have built into the system. I can see what the setting is for that as well. If I do a new setting here, you can see that I can enforce it in one or more different modes or configurations within the environment. So Turbot's had enough time. It's discovered that we made that change. Let's go back and take a look at the activity here. We're at 10:10:51 and 10:10:38. So I'm just going to bounce back up. There you go. All right. So I have that at 10:10:31, so I can see that sshd was updated. I can see what was changed on that environment. And then I can respond to that change and correct it in real time. So you can see here that Turbot actually corrected that back. If I go back to my Systems Manager window and I reopen, you can see that, or maybe not. You can see PermitEmptyPasswords was set back to no, and if I actually scroll to the end of this document, I can go to PermitRootLogin.
Anyone know what the... There we go. At the very end of the file, Turbot added a PermitRootLogin no. So rather than overwrite that setting in place, we actually added the setting at the end, and that overrode the value. Right. So within 15 seconds or so after making the configuration change, Turbot identified the bad configuration and then rolled it back. And that's really what I wanted to show you. I'm going to switch back over to the slides and just wrap up.
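The detection half of that host-level guardrail can be approximated by scanning the configuration text for settings that break policy. The sketch below is a simplified stand-in, not the osquery-based mechanism from the demo: the `REQUIRED` table is an assumed policy, and the parser ignores `Match` blocks and included files.

```python
# Assumed policy: both settings must be "no". Keys are lowercased because
# sshd_config keywords are case-insensitive.
REQUIRED = {"permitemptypasswords": "no", "permitrootlogin": "no"}


def find_violations(config_text: str) -> list:
    """Return (line_number, keyword, value) for each active line that breaks policy.

    Blank lines and comment lines are skipped, so a commented-out default
    is never reported as a violation.
    """
    violations = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        required = REQUIRED.get(keyword.lower())
        if required is not None and value.lower() != required:
            violations.append((number, keyword, value))
    return violations


tampered = (
    "Port 22\n"
    "#PermitRootLogin prohibit-password\n"
    "PermitEmptyPasswords yes\n"
    "PermitRootLogin yes\n"
)
```

Each reported line number gives a remediation step something precise to correct, and the same tuple is a natural payload for the notification sent to whoever made the change.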
What I really wanted to share with you is that we really love this stuff, right? We love building it. We love organizations that are out there building it. Our best customers are organizations that tried to write this stuff themselves. We have a platform that we feel can accelerate your journey towards building this type of automation within your own environment. And that platform is deployed within your own account. That platform gives you robust capabilities to detect change, to prevent configuration drift, to correct configurations, as you saw in the demo, and to create transparency for your entire environment. You can search and find resources across everything. And it really enables business agility, to give back to your teams the agility that they sought when they moved to the cloud initially. So thank you so much for your time today. Hopefully you got something out of this. If you enjoy these topics and you love talking about them, if you're building it yourself, we would love to talk shop with you and figure out how you're tackling some of these same challenges. Please drop by the booth, or I'll be available here after the session for a few minutes to answer questions. Also, Bob Cordella and a couple of other team members can answer questions as well. Loved the opportunity to come talk to you today. Thank you all for your time and attention.