Lessons learned from a large implementation of cloud automation at scale (LASCON talk)
Learn how Turbot Guardrails enabled agility for a large scale education company while enforcing governance.
Disclaimer: Automated Transcript
All righty, so a little history. One of my recent jobs, I showed up the first day and found out we had over 80 development teams, a couple million production users, several different regions, a huge application stack, and one AWS account. Within a few weeks it was pretty apparent we were hitting hard limits that Amazon didn't know they had, to the point where EC2 just could not give us any more instances. One of my first projects there was splitting that stack, the giant lift and shift that most people do when they come to the cloud, into at least prod and non-prod. Let's start there and see where we can go.
Now we have a few problems. The developers are still slowed the hell down, and we can't control that. There's no control over who has access to what, because with 20,000 EC2 instances, how do we know who actually has access to those boxes? And how do we get the devs to follow best practices? Well, one of the things we looked at doing, instead of the traditional lift and shift, which they did, was to move this: we're going to build it as an artifact. In CI/CD, development goes from "I push this into a git repo," and then it builds to dev, runs some tests, builds to QA. What we brought in was the ability to move your entire environment that way. So now your production release is a whole AMI move. You push the entire stack, and all the resources get promoted through it. We now know that what's running in prod has been fully tested. There wasn't an "oops, there was a yum update, and now there's a new version of PHP in prod that wasn't in dev or wasn't in QA," and tracking weird bugs that way. We also have another benefit: we know exactly what's on those boxes. And as we thought about it a little more, we went, we need to split this down more.
We need to really bring this down, because we still have too many resources. Lambda, when it came out, could only handle 75 gigs of Lambda code; we had two revisions per function, and we hit those limits. So now we have to split it again. But we end up with another problem: how do we isolate the workloads? How do we isolate the access? Our tier one team definitely needs to be able to log into every single one of those boxes, but the developer from application A may need SSH access into an EC2 instance in their account, so how do I make sure they're not over in B's account or C's account? We're promoting things: how do we do our base image, and how does it go up from there? Where does this cluster actually turn into something that's manageable without needing to hire another 50 DevOps guys or ops guys just to try and manage it? We tried to bring it to something like this, where you have multiple accounts, and adding Azure accounts and Google Cloud, just so we have the redundancies and we have data centers close to where our users are. It's all the same issue.
How do I manage that access? How do I know that what's running in this account is the same release as what's in that account? It's very difficult without the correct tooling. So we wrote some, and it sucked. We weren't doing this as, Nathan calls it, "ride the rockets." We were re-engineering the cloud to try and make it work in people's data center mindset. In the data center, people look at it and go, okay, I have one firewall here, I put all my blocking on this firewall. Come to the cloud, it doesn't work that way. You have different defaults between Amazon and Azure and Google Cloud: if I don't have a security group, what access is there? On some it's wide open; on some it's only what you have opened. But we also have the issue of, when you're running a Redis cluster, why are you running your own Redis and managing the underlying resource when you could use it as a service? Use it as a service and it's always updated. You no longer have to manage it. You no longer have to worry about the underlying disk space or the hardware, or that it's got to rotate; Amazon will just rotate it underneath you and migrate the data, and you don't have to worry about that. That frees up my developers. Now, instead of three-month turnarounds from release to release, we're finally getting closer to a one-month turnaround.
As I was saying, most accounts start with this: we've got one account, we just took our data center and plopped it in. "It's a POC, don't worry about it, we'll fix it later." In my world, POC is "prod on completion." That never gets fixed; they just keep tacking onto it. So okay, we've now done a couple of POC applications and we're just moving more people in. Now there are new rules. How does the guy on the left keep his room safe from the guy on the right? You don't. You start having that weird "this shelf is for me, please don't touch it" sticky note in the refrigerator. Then someone comes in and you've got to start putting rules in. Let's enforce these, okay, guys? You know, don't put your milk on Bob's refrigerator shelf; he gets
really personal about that section; that's where his beer goes. But during that, and managing all those little rules, everyone else gets slowed down. Nobody wants to take the time to do that, especially when the rest of the development process is "let's commit to prod right away; I need to be able to do a hotfix now, not in three weeks." You can do that with your whole environment, with your accounts. We end up with this nice multi-tenant setup where, yeah, it's all a cloud account, but everyone has their own apartment. They can do whatever the hell they want in their apartment as long as they follow what the landlord says you can do in this apartment. Once we got that figured out and got people kind of moved in, they realized I wasn't really being a bottleneck anymore. The security guy wasn't stopping them, the ops guys weren't stopping them, which meant they actually started listening to us. We stopped getting blamed for "well, it worked on my box."
"It must be something operations is doing. It must be something security's doing because of a policy they put on." No, it's still a buffer overflow; your code still crashes. But it also allowed their developers to do public tutorials. They could follow something on how to do this in Lambda, and they didn't have to worry about trying to find time with an operations guy to get the permissions, to figure out what they actually need, to create their specific permissions to go in and do that in the stack. They had an account that they could play with. Then they come to me and say, "We'd like to get these permissions. We dialed it down, this is exactly what we need. Can we put it into prod for our app to work?" Absolutely, let's do that. You just saved three weeks of meetings and emails and people freaking out because "I've got a release that's got to go out." It's gone.
Now the app teams are their own infrastructure team. They're responsible for their software. They don't care about the hardware or the network; it's there, and if it's working, they don't care. They spin up whatever EC2 they want that falls within our policies of "you can do this, this, or this," and their software runs on it. The cloud team is there to teach, is there to help. It's no longer there to be the bodyguard saying, "No, I've got to keep you from doing that, I need to push this button over here." In most companies that cloud team is eight people for 30 or 40 dev teams; there's never time for that. Now we were able to work together, and we're able to seamlessly go onto Azure, onto Google, and they don't have to worry about it. But to do that, we need policies, because being in security, I don't feel comfortable with just saying, "Here you go, have an account, I'm not going to check on it again, let's hope everything stays good next week." So we have some "musts." S3 buckets must be encrypted, except if you're hosting a public website off of it. But, you know, there's always
these little exceptions. But every enterprise I've worked with has had these things of "you cannot run this version of Oracle," or "you just can't run Oracle, because we don't want to pay that licensing; run these other databases in the cloud, no problem." So that's a must, unless you get your upper-level
exec to say you can. The "shoulds" are stuff like: shut things down when you're not using them. We're going to help you on the way, we're going to give you some of the best practices. But sometimes the best practices that your cloud provider releases don't actually fit your solution. I've had people tell me, "Yeah, the best practice is to do this," and it doesn't actually work, or we spend more time trying to fix their best practice than if we had just let someone run it. Because of that, there are always those exceptions. But with the exceptions, you use those to learn. On the ops side, they learn from the developers: "this is what we're trying to go to next, this is the new hot language that we're going to use." The ops guys traditionally would have gone, "Crap, how do I support that?" Now it's, "We could do that, but if you do it slightly this way in your design, we could eliminate all this work for you by using this other thing as a service." Which turns out well: our requirements are good. We just say at the beginning of the project, "this S3 option must do this thing." As long as you check this off, your code review before you go into production with security and operations, and onboarding with tier one support, is a five-minute call. That's the longest I've had since we've moved into this model. And since most of those have been predefined, "this policy does this," we can just breeze right through it and only have to talk about the couple of things that are different, any of the custom stuff that falls just outside of that line.
And because of that, we end up with a whole bunch of tiny accounts, but their blast radiuses are like this big. Great, you broke into a sandbox account. It's not connected to anything but other sandboxes. Congratulations, you're not getting my corporate secrets. You broke into non-prod? Well, okay, now it's connected to the corporate network, maybe. Or is it connected to only that little section of non-prod, because that was some company we acquired and haven't actually rolled
it in. You know, before, when you broke into one box, we had to consider absolutely everything in that account compromised on the security side. If that happens now in one of these accounts, we have, you know, five hundred boxes, a thousand boxes to run an audit script against, instead of 40,000 or 50,000. So much more time has been saved, because the only people that have access to those accounts are the people that need it.
So where Mike got to is basically that we've got to a place where we've isolated all the different workloads into their own Amazon accounts. And when we say Amazon account there, we mean root accounts, like you're in a multi-tenant environment. So basically you're setting up each of the separate accounts on their own, just like they do with separate customers. We're effectively treating our own internal applications in a model of non-trust, basically, right? The dream, from a security point of view, has always been to isolate our internal apps from each other, to create those boundaries of separation like Mike was talking about, so we have that hard blast radius around each one. We don't want a shared SAN that can bring everything down. Infrastructure change control boards are evil and slow everything down so drastically that if you can start to remove those things by isolating those workloads, you get so much freedom for each of the teams, right? As you then discussed, we need to then start thinking about policies for how we're going to let them work.
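As an illustration of the "must" and "should" policies with exceptions that come up in the talk, here is a minimal Python sketch. It is not the speakers' tooling: the names `Level`, `Policy`, `evaluate`, and the account purpose labels are hypothetical, chosen only to show how a rule, its strictness, and its approved exceptions could be recorded as data.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    MUST = "must"      # enforced unless an exception has been approved
    SHOULD = "should"  # recommended practice: reported, not enforced


@dataclass(frozen=True)
class Policy:
    name: str
    level: Level
    # Account purposes with an approved exception (hypothetical labels).
    exempt_purposes: frozenset = frozenset()


def evaluate(policy: Policy, account_purpose: str, compliant: bool) -> str:
    """Return 'ok', 'exempt', 'violation' (a must) or 'advice' (a should)."""
    if compliant:
        return "ok"
    if account_purpose in policy.exempt_purposes:
        return "exempt"
    return "violation" if policy.level is Level.MUST else "advice"


# The two examples from the talk, expressed as data.
S3_ENCRYPTED = Policy("s3-bucket-encrypted", Level.MUST,
                      frozenset({"public-website"}))
STOP_IDLE = Policy("stop-idle-instances", Level.SHOULD)

print(evaluate(S3_ENCRYPTED, "payments", compliant=False))        # violation
print(evaluate(S3_ENCRYPTED, "public-website", compliant=False))  # exempt
print(evaluate(STOP_IDLE, "payments", compliant=False))           # advice
```

Keeping the exception next to the rule means a review only has to cover what differs from the predefined policy.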
If we're letting people have that direct access, what rules are we going to put around them for how they work? So we set policies like: S3 must be encrypted; you can't create networks; you can create EC2 servers, but they must use the appropriate and approved AMI; you can log in, but only with SSH keys that are tied back to an Acme Active Directory-type account. So we set policies, our rules for the environment. Now what's key for those rules is actually to be able to start enforcing them in real time. The fundamental shift we've just made with the change to cloud: we used to develop our code and deploy it to servers somewhere, right? But as we went to cloud, now our apps own the infrastructure. The infrastructure has become part of the software. Auto scaling, load balancers, Lambda functions, whatever it is you choose to name, those things now are just elements of the software. We can't trust a hardened infrastructure that we built once anymore and managed in a back room, right? Yell at people occasionally when they want to request something, put them through a six-month procurement cycle: all those traditional things don't happen anymore. We're now in "I want an S3 bucket for a petabyte of data," we'll just request it, and ten seconds later it's there. "I need 5,000 servers": just do it, right? The key question is, do you have budget? And then from a security point of view I don't care, provided they are hardened, set up, et cetera. And the key to doing that is putting guardrails around people. So how can I let you have all of that freedom, but within the rules that I really care about? Because there's a whole bunch of rules I used to care about, capacity management, capital, that are gone now. I just care that you've got the budget for your department and I can bill you, right? But now the rules are basically, how do I enforce those guardrails on you? And more importantly, I've got to be able to do that in real time. You can now change your infrastructure
in real time. The idea of a manual review, "I'm going to sit down and have a meeting with you to see if you can launch that EC2 server," that is ridiculous in this world, right? So we have to completely reframe how we thought about those security controls, those compliance controls, those operational behaviors, into a world of real-time software. It's not human anymore, it's software. That's why we need policies, that's why we need separation of those workloads, and that's why we then need to start implementing those rules as software-based guardrails. And so a guardrail really means detecting something's wrong, and it's beautiful
If we check, that's the classic joke about having a security guard, right, who's paid to say, "Oh, you're being robbed." It doesn't really help unless they go over there and actually stop the robbery, right? Detection is only good if you have correction, and more to the point, in real time. Second of all, if you actually do that, you get to a place where you're no longer running around asking, why are we having 4,000 meetings about how to secure an S3 bucket or an EC2 server, reviewing every project? It's ridiculous. We solved that problem a long time ago. And if we can set the right policies with the right guardrails, we know that every single one of those things, now and into the future, will meet that requirement, right? So by bringing in those detect-and-correct automated rules in real time, we get a massive amount more freedom for those teams. More importantly, we can give them access to their native tools, that learn-by-doing thing Mike was talking about. The critical part of that is, when a new person joins my team, I don't need to teach them how to use an S3 API; they've done that at their last three jobs at this point. But if I've abstracted all of that into my own concocted internal process, "oh yeah, you can use S3, but you must create this thing, use that policy here, do this form," you know, whatever
it is, even if it's like "push it through a stack" or something, you're still forcing them through a process that's slowing them down, requiring them to work your way, right? You've broken Google search for them as a developer. You've broken all the patterns they're used to. They can't use open source anymore, because they can only work the way you work in your org. So you've got to unlock those native tools for them and then tie that up with appropriate guardrails that know how to talk in that language, right? That's what we mean by being native to those services and tools. Use AWS as AWS, use Azure as Azure, use Atlassian as Atlassian. Don't try to abstract them, which just removes value and slows things down.
Now once you've separated all those people, given them those guardrails, created those environments, the other thing we want to do is actually help them, right? So it isn't just "you're on your own," which by the way most people are high-fiving you down the hallway about. You want to get to a place where you're actually speeding them up: not only can you create a bucket, but here's the best way to do it, right? Here's a good three-tier stack that we like in our organization. Here's patterns that you can use at scale across your accounts. Now you'll notice the big thing here that's different from traditional enterprises: I didn't say "let's create a common service." We don't do that anymore. Clouds do that first. Now sometimes we might, if we've really shown there's a reason why it makes such a difference to the org, but we're no longer saying "I'll create a common enterprise service for you and you'll use mine, and by the way, you're now subject to my change control," right? Those are the problems that you start to come into once you do that. Instead, the question is
Here's how you can deploy Redis in your account, right? Not how you can use mine. Here's how we can run thirty-five small RDS instances of Oracle rather than one big spanking Oracle in the middle, which is hard to charge back, hard to manage, et cetera, and now you've got schema control problems. So what you want to do now is start to automate out those patterns as guardrails, things accelerating you. You create an EC2 server, I'll create a CloudWatch alarm to monitor CPU for you, right, and tie it back to an alerting system, which means it will end up in your lap. There's a whole bunch of things we can do to accelerate those teams beyond just securing them, right? And it's all about having those real-time controls.
So what does a guardrail mean? How does it work? This is an Amazon example; of course this would work with Google and other things as well. Quite simply, it means let's watch what's happening in the environment. If you create a bucket, we should see that straight away. If you launch a server, if you change an IAM rule, we should see all these things straight away. Then we gather those up and move them through a process of dealing with them. By the way, doing this is not easy: every single region of a hundred-and-something accounts, catching all the events, wiring that up, tying it together, making sure nobody breaks it, right? Because if they broke it, your guardrails are broken, so now you're out of control. There's a whole bunch of steps you have to do to make that work. Let's assume you get all that going. The next thing you do is, of course, land that in SQS or something like that, so you can handle it. You could write Lambdas, but now you'd have to write hundreds and hundreds of Lambdas across hundreds and hundreds of accounts, so things get kind of insane that way, but you can do that. What we prefer to do is bring it to a central place, and one of the key reasons is that you really need context about what just happened, and decisions about the policy. Oh, you turned off your encryption on
your bucket, alarms start going off. But oh, you're the public website account.
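The flow described here (see the event, look up context about the account, consult the policy, then correct or leave alone, and log the decision) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: `BucketEvent`, `ACCOUNT_PURPOSE`, `decide`, and `run_guardrail` are hypothetical names, the account IDs are made up, and no real cloud API is called.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BucketEvent:
    account_id: str
    bucket: str
    encrypted: bool  # bucket state after the change
    actor: str       # who (or what) made the change


# Hypothetical context store: what each account is used for.
ACCOUNT_PURPOSE = {
    "111111111111": "public-website",
    "222222222222": "payments",
}


def decide(event: BucketEvent) -> tuple[str, str]:
    """Return (action, reason) for an event, using account context."""
    purpose = ACCOUNT_PURPOSE.get(event.account_id, "unknown")
    if event.encrypted:
        return "none", "bucket already encrypted"
    if purpose == "public-website":
        return "none", "exception: public website account"
    return "enable-encryption", f"policy: S3 must be encrypted in {purpose} accounts"


def run_guardrail(event: BucketEvent, audit_log: list) -> BucketEvent:
    """Detect, correct, and log; repeat until no further action is needed."""
    while True:
        action, reason = decide(event)
        audit_log.append({"bucket": event.bucket, "actor": event.actor,
                          "action": action, "reason": reason})
        if action == "none":
            return event
        # The correction is itself a change, so it produces a new event
        # that goes back through the same decision.
        event = replace(event, encrypted=True, actor="guardrail")


log = []
final = run_guardrail(BucketEvent("222222222222", "invoices", False, "alice"), log)
print(final.encrypted, [entry["action"] for entry in log])
# True ['enable-encryption', 'none']
```

The same unencrypted bucket in the public website account would be left alone, and the logged reason records why.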
OK, no problem, right? Without that context, I can't make the decisions I need to make for those guardrails, subject to the policies. So I need to know who you are, what bucket it was, and then what are the policies in that environment, right? With those pieces of information, including the event, we can now fire off the guardrail handler and take appropriate actions: turn on encryption, say "nothing to do here," whatever is appropriate. Of course, once we fire off that guardrail handler and change something, we create a new event, and around we go again, right? So we have this constant looping of humans injecting change and the system itself injecting change, to make sure that we're getting to an end state. We want to log all of that just so we know what happened; it makes security and compliance happy. But more to the point, we need to tell our teams and our developers what just happened, because otherwise they see all this stuff going on and they have no idea: why did that happen? How was that decision made? How did those guardrails take effect? So the automation is fantastic, but it's critical that you have good visibility out of those guardrails as they flow through.
That is definitely something that has caused a lot of meetings of "I'm trying to do this and it changes on me." I guess now we need to start informing them of the audit trail, because I'll point to you where it's at. Now we've got to get to the point of, let's make it more visible to the user. How else can we do it? Developers don't want that. The hard thing is when you have an automated tool, like your Jenkins job pushes out a new stack. Congratulations,
your automation tool does a bunch of changes to it and then goes, "Hey, this Jenkins user just did this, I'm going to email him on the changes," instead of "I'm going to email the dev team," or finding out who pushed the commit that kicked off the CI/CD pipeline. How do we get to that point?
Right, so beyond guardrails, the next thing we want to do is have patterns at scale. We alluded to this before, right? How do I have a repeatable idea of a Redis server? How do I make it easier for you to do these things? And more than patterns at scale, one of the things we've found most useful is creating simplified common language. If I can say to you "they're an admin," and you know what an admin means versus "they're read-only" or "they're a super user," we're in a very high-bandwidth conversation. Now, if I can say "that's a private subnet" versus "a DMZ," and we know exactly what we mean by DMZ, right? We all vaguely know what I mean by DMZ, but in a world of Amazon it's like, well, does that mean direct internet access? Does it mean a NAT gateway? Does it mean a public internet gateway? Does it mean public IP addresses? Are we using private endpoints? There are actually eight different subnet types you start to want to have in that environment. Creating common language allows you to drastically increase the speed at which you can do those reviews and those assessments, because the only alternative is you end up reviewing a hell of a lot of JSON code in a room together, and that doesn't work at scale. The next thing you want to be able to do is make those patterns move across very quickly. I want an IAM model that's consistent across my accounts. I want networking models that are consistent across my accounts. And by the way, if I improve that model,
I'd love to have it automatically fixed across all of them. If I change my default security group, it should change everywhere, right? These are the sorts of patterns you can do. You don't have to centralize the service, because now we're in a world of automation. The fundamental difference is: we used to just have a physical server, and then one day we realized that virtual machines make a lot of sense. We used to have a physical data center; now we've got a hell of a lot of virtual data centers, which we're automating out at more and more scale for our different apps, right? Once you get that going, you're now in a world where we are accelerating our application teams with patterns that help them. You're not in a world where you're saying "you must use my service, you must sit in my space." If the patterns don't fit them, they can work on their own anyway, right? But if they do, you drastically accelerate their speed in the environment.
And those patterns also allow the devs to share things across teams: "Hey, this is how we figured out how to use SQS the way that operations does it. Look at my git repo, here's the code for how to do this." At one of my previous positions we had a 60-gig repo of all of the things that the dev teams were able to share, of "this is the best practice for all these different situations." But again, it was such a large repo that we were breaking things for people trying to clone or doing a push, because you can't keep that up to date. So, patterns: great. If you're going to commit your patterns to a repo or share them within your org, slice them up.
This is a database repo: these are all the things you need to deal with an RDS instance. This is the repo for EC2 instances. Don't cross-contaminate it. Use that, and set your stack up that way. "Hey, here's an example three-tier situation, everyone likes it, it's great, it's groovy, let's go." But searching through some of these really large stacks is like searching through that giant account I was telling you about. When I was splitting those accounts, I found people who had left five years earlier who still had active accounts, still had active access keys to all of the things.
Yeah. So back to common language. This is just an example, right? If you think about identity and access, as I mentioned before, try to find ways to simplify those models down into repeatable ways. If you can come up with a definition like this one, like "owner," this is a super user, versus an admin, versus an operator, and, you know, "SQS operator," okay, they're going to do low-risk operations in an SQS environment, right? You can start to do powerful things. If you cross-section that with a hierarchy, you can do even more powerful things. You can start to say, well, the cloud team has metadata access to everything, which means they can't see any data but they can see the configuration of those accounts throughout. And then you can even go very, very low to give high degrees of access: in dev you get access at admin level, in prod you get access at read-only level, right? There's a whole bunch of models you can start to do with that to really simplify that language, because the only alternative is to start having very, very complex combinations of custom language in each of your environments, which is difficult to deal with. One of the things you'll find too, as you scale across clouds, is of course they each have different ways of handling identity. AWS is very flat: every account has IAM basically at a global level, with a bunch of customization. Google is highly hierarchical, with inheritance. Azure is half hierarchical,
I'd say, right, not at a tenant level but at the subscription level below. So there's different models there you can start to intersect. If you don't have common ways to talk about them, you're going to quickly unravel, and all your conversations are going to get drastically more complex, right?
The next lesson we mentioned before, with the ability to see what's going on, but this visibility idea is so critical. I mean, we're all used to audit trails. We did them for security and compliance for a long time: turn on logging, you know, it'll make everybody happy. That's a good thing. But now we've really got an application that's changing by the minute. "Oh, it ran 50 servers for the last hour and then it dropped to 25." Looks good now, but what was going on last hour? How big was it? Starting to understand how your application is changing its infrastructure in real time becomes something a developer needs, right? They need as much visibility as the auditors did. And that also applies to automation starting to affect it. As Mike mentioned before: I create an S3 bucket and the automation encrypted it. What happened? Why was the decision made? And how do I see that history for that automated infrastructure and that real-time change, right?
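One way to answer "what happened, and why was the decision made" is to keep a history of configuration changes as diffs, each stored with who made it and the reason. The sketch below uses only Python's standard `difflib` and `json` modules. The `ChangeRecord` type, the `config_diff` and `record_change` helpers, and the sample bucket configuration are hypothetical, not the speakers' tooling.

```python
import difflib
import json
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    resource: str
    actor: str   # a person, a pipeline, or the guardrail automation
    reason: str  # why the change was made
    diff: str    # unified diff of the configuration


def config_diff(before: dict, after: dict, name: str) -> str:
    """Unified diff of two configuration snapshots; empty if unchanged."""
    old = json.dumps(before, indent=2, sort_keys=True).splitlines()
    new = json.dumps(after, indent=2, sort_keys=True).splitlines()
    lines = difflib.unified_diff(old, new, f"{name}@before", f"{name}@after",
                                 lineterm="")
    return "\n".join(lines)


def record_change(history: list, resource: str, before: dict, after: dict,
                  actor: str, reason: str) -> None:
    """Append a record only when the configuration actually changed."""
    diff = config_diff(before, after, resource)
    if diff:
        history.append(ChangeRecord(resource, actor, reason, diff))


history = []
record_change(history, "s3/invoices",
              {"encryption": None}, {"encryption": "AES256"},
              actor="guardrail", reason="policy: S3 must be encrypted")
print(history[0].actor, "-", history[0].reason)
print(history[0].diff)
```

Sorting keys before diffing keeps the output stable, so a diff appears only when a value really changed.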
So basically what you end up with then is needing high-level visibility to understand what's going on, combined with increasingly deep levels of logging so you can really get through and troubleshoot that down. But the idea of just logging it is no longer sufficient. The idea of doing pre-approved changes, "I agree that you will use root to temporarily change this network, and I trust that probably that's what you did," is not really sufficient in a world of cloud, where we actually can do a full git-style log of everything that changed or happened, right? We're in a whole new world of power and visibility there.
Back in my network days, we used to use tools just to watch what happened on the switches during the night, and I was surprised when I moved over into operations that we didn't have that. Then we get into the cloud, and up until recently we still didn't have that, right?
So one of the things we happen to do is basically set up, like I said, that git-style change control of your infrastructure. If an S3 bucket changes, if an IAM rule changes, whatever, we actually have a diff on that over time. This is quite fun, actually, what we found recently. All of a sudden, all of the AWS accounts' IAM rules changed, and we're like, what's going on? Looked in the diff history: oh, Amazon just changed the limit for the number of groups you're allowed to have from 100 to 300. I don't think that's been announced still, but I'll tell you it's true, because we've seen it across hundreds of accounts now, with a diff to prove the change rolling out. We would never have known that change had happened in the environment without this type of tooling, right? We're constantly surprised by what's going on underneath us.
So after you've done all those things, you've isolated your workloads, you've created your policies, you've got this idea of teams working together to learn and do that stuff, right, try things with experiments under exceptions, come together, start to get visibility into what's going on,
you're in a place where you can really start to automate fast and at scale. You're standing on shoulders that allow you to move quicker. You're not stuck in constant review of the basics, right? And you can start to add more and more decisions.
It's like capital investment, in that you keep getting better and faster at the basics. And you'd better do that when Amazon is shipping a thousand new features a year, and by the way the company also wants Azure and Google chucked in the mix, and you're trying to deal with all that. If you're not automating out that core, good luck keeping up, right? Very, very difficult. So what we've seen is that the ability to automate really allows you to scale. In particular, we love to hear language from teams like "kill the ticket," not "close the ticket." I never want to see this ticket again, ever, in this environment. If I know how to respond to this type of event with an automated response, I should never see this ticket again. It should happen, it should be fixed, and it should be closed, right? And anyone who's worked with level 1 or 2, particularly in a large enterprise where it might be outsourced in different places, knows: if it can go to level 1 or 2, that basically means you've got a document that says exactly what they're allowed to do and what steps they're going to take, if it's done well. Normally it's not even done that well. But if you've got it working well, if you're at the place where an alarm goes off and the answer is, "Oh, you may be out of disk space, right, we have a temp drive, maybe do this, maybe grow that, check with the application over there," those steps actually can be completely automated, right? That's a relatively challenging example, but basically, if you get to the point where you're starting to know how you want to respond to those events, level 1 and 2 are automations, and everything else needs an application owner's input to be able to do it. So I just hit a point where my operations are automated and everything else goes to the app team, and that sounds a hell of a lot like good DevOps, right? If you're uncertain what to do, you go back to the development team; otherwise the operations team can work within the constraints they've been given about rebooting, logs, et cetera. But if you can
codify those rules, you can automate more and more of it out, right? So you're on the path to getting to that final place.
And I found a hidden benefit of that: my devs are on call, which means it gets fixed in a hurry. Because when things break in the middle of the night, okay, yeah,
I'll just fix it in the morning and look into what it really was; I've got the service back up. Next night, two o'clock in the morning, it breaks again. I guarantee there are going to be commits for that thing pushed into the build process in the morning. Now something hidden that tier one had been dealing with for years got fixed in two nights, because nobody wants to be on call and lose sleep. We want those magical on-calls where maybe I get a page at lunch on Friday and go, hey, it's two minutes, so it's your turn, right?
So when we bring all that together, as we said, what we really think of it as is software-defined operations. You've got your software-defined infrastructure; you've got to change your outlook to a world of software-defined operations. Nothing else will have the speed, nothing else will deliver the agility that's required for those teams. You have to start thinking of your operations differently, and you end up in a very simple loop, which is basically that your application teams are working directly with their applications on those cloud environments. They're happy, they can do their job, they've got the pieces they need, they can use their tooling, they can Google search results, right? They've got the native capabilities they wanted, and they have the agility. You're no longer stuck between them and the thing they're using, as we cloud teams often are, which makes the dev teams happy and puts the cloud team less in the middle.
Of course, with great power comes great responsibility. Some dev teams are happy; others start to panic. What will I do if I don't have a project manager arguing with your project manager about the procurement of that server for the next couple of weeks? I'll actually just have to do it in the next two minutes, and now all I need to do is deliver the app. There's a lot of, you know, water-level reduction, if you want to use a lean metaphor.
I've seen several different styles of dev teams. One is the great: get out of my way, let
me do this. And the other is: but without ops, where will my broken toys go to get fixed? Because prod is where our broken toys go to get fixed, right? And that's where you've got to come back to "teach, don't do". If you start doing, they'll quickly expect you to keep doing, and they won't learn. So you can start to get to this place: when your app's down because an availability zone was down,
I'm like, well, you chose to use one. This isn't really an operational problem; this is an application high-availability decision gone bad. You might want to review that and think about what you're doing next. It changes things. Now the cloud team's job is to help and teach that application team: know the cloud better than them, know those tools better than them, or at least be able to keep up with them as they're learning it. Take those learnings, bring them back, and share them with others through automations. You're in a world of helping, not request fulfillment. That means you're now standing side by side solving problems, which completely changes the alignment of the organization and the focus. It's really hard to explain until you've felt it and sat there, but there's a big difference between looking across the room going "when's my server arriving?" and sitting there going "how are we going to get this EC2 service spun up and this thing delivered that we need?", because it's your job to do it and my job to help you do it. Now we're in a different type of conversation and interaction. So you end up with a cloud team helping them, and then you've got to feed a lot into automation. Use a tool like Turbot to accelerate a lot of that automation, or do some of it yourself, but basically that allows you to start scaling that solution very quickly.
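The "kill the ticket" routing described earlier, where known events get an automated response and everything else goes to the application team, could be sketched roughly as follows. This is a minimal illustration and not how any particular product works; the event kinds, the `Router` class, and the `clear_temp_drive` handler are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    kind: str      # hypothetical event type, e.g. "disk_full"
    resource: str  # identifier of the affected resource


@dataclass
class Router:
    # Maps an event kind to its codified level 1/2 response.
    runbooks: Dict[str, Callable[[Event], str]] = field(default_factory=dict)
    # Events with no codified response are handed to the application team.
    escalated: List[Event] = field(default_factory=list)

    def handle(self, event: Event) -> str:
        runbook = self.runbooks.get(event.kind)
        if runbook is None:
            self.escalated.append(event)
            return "escalated to app team"
        return runbook(event)


def clear_temp_drive(event: Event) -> str:
    # Placeholder for real remediation steps (assumed, not from the talk).
    return f"cleared temp drive on {event.resource}"


router = Router(runbooks={"disk_full": clear_temp_drive})
print(router.handle(Event("disk_full", "i-0123")))
print(router.handle(Event("unknown_latency", "i-0456")))
```

Each runbook added to the table is one class of ticket that no longer reaches a human; whatever is left in `escalated` is the work that genuinely needs an application owner.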
If you don't do the automation you can never keep up, right, and you're stuck going back and forth. And then you end up being the guy who sits there changing users' passwords on 30 different Amazon accounts, because your tier-one guy's password expired while he was on vacation and now he can't unlock it, or he broke his MFA, and someone has to go into all these accounts and fix it. You haven't automated it before, so guess who gets to do it now, right?
So because this changes now to a world of software, I was just going to take a moment to actually show you what that can look like. This is one example; you can of course build your own tooling and try to do it differently. I'll highlight a couple of things we spoke about. Turbot basically sits inside one of the VPCs in the customer's Amazon accounts, so it's running in their environment. You log in with Active Directory, so you've really got identity hooked back. That's a key thing: you want your identity hooked back. You separate out all your accounts, whether they're Amazon, Azure, or Google, it doesn't matter. You create separate environments, and different developers see different lists of accounts. That's the workload isolation we spoke about. We're not sitting in one Amazon account anymore: I see my three servers, you see your four servers. I'm not stuck in a world of the most complex tagging scenario you could ever imagine. We just have really simple things: my costs are my costs, your costs are your costs. That workload isolation with multiple accounts is critical, and it can be enabled through your tooling. And most importantly, the ops teams can get access to all of those accounts by being given their user account permissions in one spot. Someone comes or goes, you're new to the team: two minutes later you have access to everything the rest of the team does. It's no longer this three weeks, maybe we've got all your access in
place; oh crap, that guy left, now we're going to be working all weekend because he rage-quit and we have to pull all his access and audit all the boxes, all the accounts, right?
Yeah. Or if you add a new team with a new Amazon account, within the first two minutes it's deleted all the default VPCs from around the world and set up, you know, IAM policies, permissions, the password policy, et cetera. You're not worried about that stuff anymore, because it's constantly enforced as a guardrail, right?
Now, after you're in that account, you start to see a couple of different things happen. The first thing is we don't really want developers sitting in that abstraction tool; you want them using their cloud. So they log straight into the cloud using their permissions, and we know who they are through that identity system. These are good things: they are themselves, in a native cloud experience. They can use the API, stuff like that. In this case, for example, they have permission to create an S3 bucket. This is basic stuff we're all used to, or if not, you should definitely get into it. So we create a bucket. I'm doing a lot of clicking here in that environment. Now notice I didn't go to a form, I didn't go to an abstraction tool; I'm using the tooling I'm used to. These are dumb, obvious things, but they are critical as you think about designing solutions for your teams at scale. Otherwise you're going to have to teach them and support their basic ability to do these things.
So I created myself that new bucket, lascon-2. Now what's cool is that once I did that, in the background we've already detected it and fixed everything. Turbot detected the new bucket. Versioning was not on, policy requires versioning to be on, so it turned on versioning. Server access logging wasn't enabled; now it is. So it's actually detected that, and the guardrails have acted in real time to give the configuration you want. Oh, we need tagging for the cost center and stuff: it's all just automatically implemented. You can start to get all those sorts of automations happening, so people aren't burdened with all of this junk anymore and you're not running 45 reports on all the S3 buckets. It's just done. Policies and permissions, the bucket policy: our policy required encryption at rest and encryption in transit, and those things are enabled immediately on that bucket.
And this is a thing my devs loved, because they don't have to track down the latest JSON
of what our policy is. They could just run aws s3 mb with the bucket name and it's done, ready for the application. It's not: how do I find all this other stuff? Oh, I've been using this script that pulled it from this git repo, but that hadn't been updated, or someone broke it somewhere. That's gone.
Now, to get to the visibility piece: Turbot detected that we made those changes and recorded them straight away. So we just created the lascon-2 bucket, and Turbot knows that Nathan created that bucket. Let me pull up that history here, just one second. It also knows that it updated it and made changes to that bucket in real time. There we go. So basically we created a bucket, it tied it back to the account, it knows it was me, and it also knows what it changed.
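The detect-and-correct behaviour shown in the demo can be pictured with a small model. The sketch below is only an illustration of the idea, not Turbot's implementation: the bucket is a plain dictionary, and the setting names, the `cost_center` tag, and the `apply_guardrails` function are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical policy: settings every bucket must have.
REQUIRED_SETTINGS = {
    "versioning": True,
    "access_logging": True,
    "encryption_at_rest": True,
}
REQUIRED_TAGS = ("cost_center",)


def apply_guardrails(bucket: Dict, default_tags: Dict[str, str]) -> List[Tuple[str, str]]:
    """Bring `bucket` in line with policy and return the change history."""
    history: List[Tuple[str, str]] = []

    def record_fix(name: str) -> None:
        # Each correction is logged as alarm opened, corrected, alarm closed.
        history.extend([("alarm_opened", name), ("corrected", name), ("alarm_closed", name)])

    for setting, wanted in REQUIRED_SETTINGS.items():
        if bucket.get(setting) != wanted:
            bucket[setting] = wanted
            record_fix(setting)

    tags = bucket.setdefault("tags", {})
    for tag in REQUIRED_TAGS:
        if tag not in tags:
            tags[tag] = default_tags[tag]
            record_fix(f"tag:{tag}")

    return history


bucket = {"name": "lascon-2", "versioning": False}
for entry in apply_guardrails(bucket, {"cost_center": "demo"}):
    print(entry)
```

Running the function a second time on the same bucket returns an empty history, which is the property you want from a guardrail: it only acts, and only records, when something has drifted from policy.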
It added logging, it updated the policies, it added the tags, etc. This is the visibility we're talking about, where people now know what's going on in that environment and they can see their history. In addition, we can see the full activity in the history. We can see we created a bucket. Oh, tags weren't on correctly, so it opened an alarm. We checked other things, like whether the name was correct and versioning was on, so we opened a bunch of alarms, then we automatically corrected them, and then we automatically closed all the alarms. So this is a world where you're in true ops in real time: you create an alarm and you've fixed it within seconds. Notice all these things happened in the same minute. If I dive into one of these, for example the tagging, we can see the history of what happened: the alarm was opened, the tags were corrected, and then it closed the alarm. So we have that history of everything going on in that scenario. That visibility is critical to those teams. In addition, when you want to troubleshoot, you can really go down and see the event that led to it and all that stuff. Again, this is one example, but what's critical here is that your teams know what's going on, they can understand why, and they can see all of that. This isn't security and compliance auditing anymore; this is the change history and real-time automation that your devs require when you're working with them in that way.
And on the security side it can also be a heads-up, because if someone just starts spinning stuff up in a new region, all of a sudden my SIEM is pulling these logs and saying, hey, normally you're in these five regions, what the hell are you doing in ap-southeast-2? I'm going to flag this anomaly, someone goes and checks on it, picks up the phone, calls the guy who's tagged on the account, and he goes, oh yeah, we're starting the POC build for this, right, to
see what the latency is. I never would have known that beforehand. So now I know all my other tools are going to start showing this, my other teams are going to start looking at this, and I need to inform the network team that there's going to be a whole new set of traffic coming in. But we now know, instead of the: hey, we're releasing in Australia tomorrow, I need you to approve all this and do this emergency change.
Even better than that, you can actually set what the approved regions are, and if they create a bucket outside those regions, it's deleted within seconds. Nope. If it was more than half an hour old, of course, you don't delete it; you just raise an alarm, because you have rules about how to automate those decisions that you'd normally make with humans. So you can start to do that stuff.
Now from a control point of view, you end up with a summary of the status of this bucket: is it up to date in the CMDB, does it have a good name, is logging on, etc. You can see the status of it, and all of that is subject to the policies we were talking about. The controls are like the guardrails; the policies are the rules. So we can set rules here, for example: this bucket is exempt from the encryption-at-rest rule because it's used for public stuff. So we set that rule, whether it's temporary or not, all that sort of stuff, because this is a world of exceptions, but we just create that exception. Now what's awesome about this, if you do the right tooling: you need to be able to see where exceptions are. So if we go up here, as a security team I can see every exception to the S3 encryption rule in my entire environment in one place. So as you start to build those exceptions to those automations, make sure you think about how to give visibility to those teams across those different parts. What's even better, though, is once I change that rule, if I just refresh, we'll see that in the background it automatically changed the policy,
right? So I didn't have to think about any of the JSON or change any of that; the policy is able to handle those things for me.
So, just to close, a couple of other quick things. If we go to that account, the other thing we talked about was having patterns. One example of the patterns we mentioned was the permissioning.
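The approved-regions rule mentioned a moment ago, where a new bucket outside the approved regions is deleted but one older than half an hour only raises an alarm, is a good example of a human decision turned into code. A minimal sketch follows; the region list and the `decide_action` function are assumptions for illustration, and only the half-hour threshold comes from the talk.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration only.
APPROVED_REGIONS = {"us-east-1", "us-west-2"}
MAX_AGE_FOR_DELETE = timedelta(minutes=30)


def decide_action(region: str, created_at: datetime, now: datetime) -> str:
    """Return "ok", "delete", or "alarm" for a bucket created in `region`."""
    if region in APPROVED_REGIONS:
        return "ok"
    age = now - created_at
    if age <= MAX_AGE_FOR_DELETE:
        # Brand-new resource in an unapproved region: safe to remove.
        return "delete"
    # Older resources may hold real data, so a human gets an alarm instead.
    return "alarm"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(decide_action("us-east-1", now - timedelta(days=3), now))
print(decide_action("ap-southeast-2", now - timedelta(seconds=20), now))
print(decide_action("ap-southeast-2", now - timedelta(hours=2), now))
```

The age check is what keeps the automation from being destructive: the rule is aggressive only where the blast radius is known to be small.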
When you think about permissions, you can start to simplify them down: DynamoDB admin, owner, metadata; S3 admin, metadata. Common language makes this much easier to understand, to grant, and to handle in that environment. You can simplify this stuff down if you get the balance right on those things. And you can take it down to the OS level and the database level, where you're still using those same patterns but everyone in the tier-one team can now SSH into those boxes, period.
From a visibility point of view, give some thought to things like search. For example, what's all the LASCON stuff in the environment? We can see there we've got two S3 buckets and an account called that. It allows you to search by IP, stuff like that, so you can get straight through troubleshooting to the issues you care about quickly.
And the last thing: as you build automation tools, those dev teams need to automate the automation, so that they can set the policies they're allowed to set and set things up with their apps. So make sure you're thinking about an API to your own automation at scale, because otherwise those dev teams end up coming to you all the time saying, how do I do this, and I need this policy change at the same time as this. So you've got to think about that sort of visibility, the automation-of-the-automation pieces, and where your notifications are going to go.
How this team over here wants to be notified is completely different from how that team wants to be notified, because those guys may be Slack power users, these guys may be on HipChat, and those guys want a phone call. So how do you manage that, and how do you allow them to set that stuff? How do you know who to contact? Well, now you have to have tags on the account. Do all those tags need to go down to the instances? No, but they need to be available when I need to make a phone call. Yep. So,
to summarize, basically what you're looking at is an activity system, to use a very fancy Michael Porter-type strategy idea. Really, you're combining things: by isolating workloads, we can ride the different rockets that we have available, and we can give teams more access.
We can then learn with them by creating policy exceptions. There's a whole bunch of pieces here that tie together, building an incredibly agile whole. But if you start to undo parts of this and go, well, no visibility, then automation is much, much harder to use. Well, I can't isolate workloads: well, how are you going to do self-service, right? So there's a bunch of pieces that interact, and you need to think about your architecture a little bit holistically as you tie those together. But once you do that, you get the ability to move at the speed of cloud. You've got those common patterns. You're nailing your security and compliance story because it's automated in real time. You've got cost control by separating out all those different teams. And now your teams are using the optimal skills: they're not lost in the weeds of the basics, they're not learning your stuff, they're using the standard cloud capabilities. That really aligns them with the cloud team and lets you move with a lot of agility. So software-defined operations really unlocks software-defined infrastructure in a different way. Any questions?
From what I'm seeing at my work, this seems more like a developer, ops, and security thing. So what does the pushback look like from the actual current ops team? Do they feel like they're going to be RIF'd? Are they scared that the development side is making their own pipeline, essentially? Because I'm seeing it in my environment.
The better ones weren't, because they went: this helps me do my job, I can move faster, I can now fix all of these things instead of having to yell at these guys that it's not working. The guys that were just kind of coasting, well, most of them are no longer with the company, but they freaked out because all of a sudden we were making them actually do work. We were making them talk with the developers to find out what's going on, and not just give them a
finger and tell them no, no, you can't do that. Why not? That's the standard, everyone else agrees on that standard. I don't want to be in meetings all day to find out that yes, we're going to be doing it anyway.
Exactly. There's a fundamental power shift, which is difficult to handle. Infrastructure had a lot of power: they owned the capex, and they owned the capacity. The power now is shifting more to the application teams. So there are definitely challenges in how you think about that as an organization. I think Mike's right that some people rise to that and others shrink away. But what's exciting about it is that, as an ops person, there's never been a more interesting time to be alive. You're sitting there now asking, how am I going to use these 100 services? I used to use databases and Linux servers, probably Windows with an image we developed six years ago, and now it's SQS queues, this, that, architectural patterns, and I'm helping this team and that team. That's a highly dynamic place to be, if you want to go there.
If you don't want to go there, then often it's better to keep looking after the other stuff for a while.
Well, and one of the things that I neglected to point out earlier is that once you pass a threshold of about 15 accounts, this kind of stuff is crucial. You will never manage it when you get six or seven hundred accounts underneath you. There's just no way, unless you have some intern whose whole job is just to manage access. And if I could find that person that I could pay 30 grand a year to just manage access to everything, I've got my unicorn, man. Other questions? Yeah.
So with central services such as this and AWS Organizations, how do you test, canary, and verify any changes that you're going to make before you press the button in the central service, and make sure that everything doesn't fall over all at once?
Yeah, it's a fun question, and not easy to do. There are a couple of approaches. One is basically to make sure that you're separating your policies by account environment, so you can set rules and only enforce them in one environment before you move them to the other environment. We think of that as resource groups. I didn't talk about them much, but basically it's a way to have policy sets that apply at points in the hierarchy, which allows you to test policy sets in one area and then move them to another area more completely.
In terms of a basic upgrade of a central service, one of the other things you have to do is accept one thing: this central service is critical to providing access and providing guardrails, but it's actually not critical to the ongoing operation of the tools I'm using. Yes, it has great power, so it could in theory blow things up if it applied the wrong rule at the wrong time. But at the end of the day it's really controlling access and the things around it, rather than sitting in the core operational capability of that service. They are running directly on their own, and it's
sitting to the side and managing them. So worst case it's more like a fat-fingering than it is bringing the whole thing down at once. That's not a complete answer; it's a big discussion. But you try to do gradual rollout. The other thing we of course do is have customers run dev environments of the central service to test some things, separate from the prod one. And often we'll separate further: we work with a lot of FinServ companies, for example, and they want slow change, so we'll separate that out into a completely different version of the environment.
Any other questions? Okay, let's get a round of applause for Michael and Nathan. Thank you.
If you need any assistance, let us know in our Slack community #guardrails channel. If you are new to Turbot, connect with us to learn more!