Case Study

AWS re:Inforce 2019 - Cloud DevSecOps - multi-year implementation of cloud automation

McGraw-Hill discusses how to effectively manage cloud operations

Turbot Team
5 min. read - Aug 09, 2019

Disclaimer: Automated Transcript

[00:00:00] Good morning everyone.

[00:00:03] My name is Chinmay Tripathi. I'm the Director of Cloud Engineering for McGraw-Hill. And today's talk is: Lessons learned from the multi-year implementation of cloud governance automation. And we'll be talking about how we effectively manage cloud automation and operations with regard to security, networking, IAM controls, operating system hardening, and patching. And I'll be joined by Nathan Wallace, who is the CEO and Founder of Turbot.

[00:00:43] All right. So, some of the lessons learned. And before we get into the lessons learned, a quick introduction to McGraw-Hill. McGraw-Hill is the learning science company, and our data scientists are constantly curating billions of data interactions, which are created by our students interacting with our digital products. And we are trying to identify trends, patterns, and opportunities to improve learning for individual learners as well as helping our teachers to teach better. And we have partnered with over 14,000 authors and educators, including 50 Nobel laureates. On the technology scale, we have more than 80 dev teams, more than a hundred AWS accounts, more than 10 Kubernetes clusters, and 4,000 Amazon Elastic Compute Cloud (EC2) instances. So, how did we get to hundreds of AWS accounts? I think that is going to be the central theme of this talk, and how we manage operating in the cloud at this scale.

[00:01:51] So chances are you are in the cloud because you want agility; you want elasticity; you want auto-scaling; and all that good stuff which cloud brings to you. Most organizations start small. And we did that too. You start with a small POC, maybe migrating one or two applications from the data center into the cloud. You start with a small team working on a shared goal, and the goal is to bring the stuff into the cloud. And eventually, you are successful. After a few weeks or months, you build a team, you build a pipeline, you build services around it; maybe you are using config management tools like Puppet, Salt, or Chef. And then you grow. You are now moving maybe 10 or 15 applications into the cloud. Now you go to stage 2, which we call a Share House. What it means is that you have multiple teams working on different projects in one single AWS account. And this model works as well for some time. And then inevitably it happens that somebody makes a security group change thinking they are doing it in a dev environment, but that security group is shared by a production application. Or maybe a routing table change, or maybe a NACL rule change. Or maybe somebody updated the AMI and updated the Java version thinking, you know, it's not going to affect anybody else, but that AMI is shared by other applications. So that model is also not working. So you need to separate your production workload from your dev environment. And that brings us to the next stage, which is Hosted Service. What it means is now you split your environment into a dev, test, and prod kind of model and you separate your production workload from your dev environment. And this is the model where we think most organizations stay a bit longer than in the first two models. You could go on for maybe six months, a year. But this is also not a scalable model. Right. Now you're in the cloud. You have built a team, a centralized team.
You can call it your cloud team or your DevOps engineers. They're getting comfortable. They are building your CI/CD pipelines, building services, your EC2 infrastructure, all that good stuff cloud brings. They have the tribal knowledge. They have created a lot of services around it.

[00:04:22] But business demands more. Now you're not competing with yourself from the data center days. You know, you want more frequent releases; you want ten releases a day. Now when you scale it to hundreds of services, how do you do that? This core team, which has been doing your pipelines and managing your workload in the cloud, cannot scale with the pace of innovation the business wants; the business wants more agility, you know, bringing more value to the business. We faced the same issue. So we tried to solve it. We hired more people. But ultimately we realized that this model is not scalable. And that brings us to the next stage. So we looked into this problem and we thought, OK, maybe we'll give more developers and engineers access to the cloud API, because ultimately it's all in code. Developers can code their application. They can code infrastructure as well. But you want to ensure that the separation of duties is maintained; you need a lot of security controls. And we thought we could do that. So we started writing IAM policies, right, so that if a developer is working on a few services, he's kind of separated from the other workloads running in the cloud. Right. And we tried to write the IAM policies. It doesn't work. And the reason for that is IAM policies are hard, right? And if that is all you're doing constantly, eventually, you know, it just doesn't scale; you're doing back and forth, too many change control meetings, and, you know, it doesn't work. So we needed a better solution. So we looked to the model of Amazon. Amazon doesn't build a data center for every one of their customers. They have a multi-tenant model. So what we did: we started to give accounts to each of our teams or business units, and we ended up with hundreds of AWS accounts. So the benefit of that model was that now we have a two-pizza team in one account. Right. They may be running four to five services.
Six to eight developers and engineers. And the blast radius is minimized now. An activity done in one account would not impact other accounts. So you have solved the problem of agility, the freedom, and now your pace of innovation is much faster. Which is what you need to do. You don't want to slow down your developers because of your security controls and operational controls and things like that. But you have solved one set of problems and traded it for another set of problems. And that brings us to the next slide. So when we looked into our overall footprint, you know, we had issues in security, cost management, networking, and the DevOps pipeline. Things like: how do you replicate all your security best practices across hundreds of AWS accounts? How do you ensure that not all the services are enabled in every single account? If a team comes to you and asks for an account and they just want to run serverless, maybe you don't need to give ECS access or ACM access or, you know, S3 access maybe. You want to restrict the services to what is required in that particular account. Service whitelisting is very important. How do you ensure your cost doesn't go through the roof? So there has to be some kind of a budget enabled in each account. Maybe you need to create a certain key pair in every account by default so that future automation is easy. Maybe you need to create a service account later at some point. Or maybe you need to do user management. In security, you want to ensure all your AMIs are managed properly. Now your scale is so big and it isn't just your core DevOps team anymore. It is open across all your organization. So you need to have the security controls and policies and rules and regulations within which all these accounts can operate.

[00:08:34] Alright. So that brings me to the next slide. And we will be talking about some of the lessons learned in the area of security. The first is root credential management, and I think at this point everybody knows that using root credentials is a bad practice. Nobody should do this, but I'll repeat it anyway.

[00:08:52] Make sure you are rotating your root credentials. Root passwords. Enable your MFA. Enable your security questions.

[00:09:02] And don't allow any application to use the root keys in their pipelines. But here's one extra tip to discourage people from using root: have your root credentials managed by two separate teams. What we do is essentially have the password managed by one team and the MFA managed by another team. That way they always have to coordinate and collaborate if you want to log into your AWS account as the root user. And of course, rotate your keys every few months.

[00:09:35] Service accounts. What we are trying to say here is: how many times has it happened that you were in the middle of a big product release, and everything is going well. Everybody is excited. You're running your pipeline job and it doesn't work. The job fails. You look into the logs and you find out that, you know, the credentials don't work anymore. What happened?

[00:10:00] The user who owns the credential is no longer part of the organization. So good job expiring the credential, but you shouldn't have used that credential in the first place. So, recommendation: use machine access for machine jobs. If you're running a pipeline or a release or deployments, use service accounts or machine access like IAM roles to deploy your code, and not credentials which are tied to individual users, because users are eventually going to leave.

[00:10:29] If you do use users, use them for POCs, or for testing, troubleshooting, and stuff like that. The next is automated account configuration, and that is very important, especially when you're operating at a scale of hundreds of accounts and you have to provision these accounts all the time.

[00:10:49] You want to ensure all your security controls and your security policies are fully automated. Your VPC creation, your security groups, your user management, your access management: it should all be automated.

[00:11:03] And now we're going to talk about some of the security services which we have found very helpful in managing our cloud environment. So before I get into that, can I get a show of hands: how many of you use Flow Logs, CloudTrail, and GuardDuty? OK, that's pretty good. So for those who don't, a quick rundown on these services. Flow Logs. How many times does it happen that you are trying to log in to an EC2 instance and you cannot log in, right? Happens with us all the time.

[00:11:38] It could be a problem in the network you are operating in. Could be a routing issue. Could be a firewall issue in your data center. Could be an issue with the security group on the EC2 instance. Or could be an issue with your NACL. Or it could be a local issue on the EC2 instance itself. Maybe the user is not set up yet. Or maybe the keys have not been created yet.

[00:12:03] Use flow logs. A great tool for network connectivity troubleshooting.

[00:12:10] It basically captures all the IP traffic metadata, including source IP address, destination, port, protocol, as well as whether traffic was accepted or rejected. So, great tool. We use it all the time; it helps tremendously. But it is also a great security monitoring tool. So the challenge here is how do you take all these logs and put them in a centralized location where security teams can analyze these logs and create some analytics, or maybe integrate with CloudWatch Events to send out some notification and create some actionable workflow. So you can use CloudWatch Logs or an Amazon S3 bucket and use a Kinesis stream to send it to your SIEM solution, where they get a single pane of glass and know what's happening across all of your hundreds of accounts.
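As a rough illustration of the metadata described above, here is a minimal Python sketch that parses flow log records in the default (version 2) space-separated format and picks out rejected SSH attempts. The function names and the sample records (account ID, interface ID, addresses) are made up for illustration; this is not the tooling used in the talk.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_flow_log(line: str) -> dict:
    """Split one space-separated flow log record into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_ssh(lines):
    """Return parsed records for rejected TCP (protocol 6) traffic to port 22."""
    records = (parse_flow_log(line) for line in lines)
    return [
        r for r in records
        if r["action"] == "REJECT" and r["protocol"] == "6" and r["dstport"] == "22"
    ]


# Illustrative records only; the account ID, interface ID, and addresses are made up.
SAMPLE_LINES = [
    "2 123456789012 eni-0abc1234 10.0.1.5 10.0.2.9 49152 22 6 10 840 1565300000 1565300060 REJECT OK",
    "2 123456789012 eni-0abc1234 10.0.1.5 10.0.2.9 49153 443 6 12 960 1565300000 1565300060 ACCEPT OK",
]

for record in rejected_ssh(SAMPLE_LINES):
    print(record["srcaddr"], "->", record["dstaddr"], record["action"])
```

A filter like this is the kind of thing that would usually run downstream of the centralized log stream rather than inside each account.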

[00:13:00] The next is CloudTrail. Another great tool from Amazon.

[00:13:04] You are troubleshooting; maybe you are doing an RCA. Something went wrong.

[00:13:09] A security group was deleted. Nobody knows who did it. You're trying to find out. Enable Amazon CloudTrail, and it will tell you every single AWS API call made in every one of your accounts. And who did it. What time it was done. What action was taken. It enables governance, compliance, and risk auditing tremendously in an account. So definitely a very important service to be enabled in all accounts. And the point here is: how do you do it at scale? So you have to have a process when you provision an account. All these basic services need to be enabled by default. At the same time, you have to ensure that nobody goes and changes it.
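To make the "ensure that nobody goes and changes it" point concrete, here is a small Python sketch that scans CloudTrail event records for calls that stop, delete, or reconfigure a trail. The `eventSource`, `eventName`, and `userIdentity` fields follow the CloudTrail record format; the function names and the sample events are hypothetical, and a real guardrail would also re-enable logging rather than only alert.

```python
# CloudTrail management calls that stop, delete, or reconfigure a trail.
TAMPER_EVENTS = {"StopLogging", "DeleteTrail", "UpdateTrail"}


def is_trail_tampering(event: dict) -> bool:
    """True when the event is a CloudTrail API call that weakens logging."""
    return (
        event.get("eventSource") == "cloudtrail.amazonaws.com"
        and event.get("eventName") in TAMPER_EVENTS
    )


def tampering_alerts(events):
    """Build one human-readable alert line per tampering event."""
    alerts = []
    for event in events:
        if is_trail_tampering(event):
            who = event.get("userIdentity", {}).get("arn", "unknown")
            alerts.append(f"{event['eventName']} by {who}")
    return alerts


# Made-up events for illustration; the ARN is not a real identity.
SAMPLE_EVENTS = [
    {
        "eventSource": "cloudtrail.amazonaws.com",
        "eventName": "StopLogging",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/example"},
    },
    {"eventSource": "ec2.amazonaws.com", "eventName": "RunInstances"},
]

print(tampering_alerts(SAMPLE_EVENTS))
```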

[00:13:45] It's possible that you enable CloudTrail and somebody is trying to remove the traces of their activity. You should get an alert and maybe have an automated guardrail to correct that mistake. So immediate detection and correction is very, very important when you're managing hundreds of accounts. You cannot, you know, do it after two weeks or three weeks, because by then it's probably too late already. And that brings me to the third service, which is my personal favorite. And that is GuardDuty. This is a relatively new service from Amazon. So with Flow Logs and CloudTrail.

[00:14:20] You have a lot of data. You have all the API call event logs; you have your network traffic.

[00:14:26] So security is getting all these logs and they are constantly analyzing them, and it's very easy for them to get overwhelmed. It's a lot of data. They can try, but it doesn't really work very well, in the sense that it's just too much information. What do you do with it? How do you create your events?

[00:14:45] GuardDuty solves that problem tremendously. It is a threat detection service which is continuously analyzing all the VPC flow logs and all your CloudTrail events, and trying to do threat detection by analyzing these logs and finding any kind of malicious activity or, you know, a potential threat in your account. Within the first week of enabling GuardDuty, we had some major successes. We found out many applications were still using root credentials, within the very first week of enabling GuardDuty. The other day we had an issue where we got an alert from GuardDuty.

[00:15:27] That some user had just terminated 20 EC2 instances. We got the alert, started looking into it, and we were like, this happens all the time, right? People go and spin up instances and infrastructure and terminate them. So what's the big deal, right? We started looking into it and found out the user had no business in that account. It was a mistake. He was supposed to work in a different account, but, you know, the keys got mixed up or whatever. He landed in a different account and terminated the instances. It's a great tool for things like that. Yup. SSH port. It should never be open from the internet. We get alerts from Amazon GuardDuty all the time. People make mistakes. Even when they know the best practices, sometimes they're in a hurry, they're troubleshooting something, they enable port 22 from the internet, and next thing you know Amazon GuardDuty sends an alert, and next thing you know you are getting a brute force attack. So, great tool. If you have not checked it out, please do so. It's going to make your life very easy.
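The port 22 mistake can also be caught before an attack starts by inspecting security group rules directly. The following Python sketch is an assumption-laden illustration: it expects rules shaped like the `IpPermissions` list that the EC2 API returns for a security group, and the function name and sample rules are hypothetical.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def ssh_open_to_internet(permissions) -> bool:
    """True if any ingress rule exposes TCP port 22 to every address."""
    for perm in permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            # "-1" means all protocols and all ports.
            covers_ssh = True
        elif proto == "tcp":
            covers_ssh = perm.get("FromPort", 0) <= 22 <= perm.get("ToPort", 65535)
        else:
            covers_ssh = False
        if not covers_ssh:
            continue
        cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
        if any(cidr in OPEN_CIDRS for cidr in cidrs):
            return True
    return False


# Illustrative rules only.
RISKY = [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
          "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]
SAFE = [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "10.0.0.0/8"}]}]

print(ssh_open_to_internet(RISKY), ssh_open_to_internet(SAFE))
```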

[00:16:23] Automated security response. And that brings me to the topic of automated guardrails. And what we are trying to say here is: if you as an organization have a policy to, let's say, encrypt all your S3 buckets. Right. And you have all your best practices laid out.

[00:16:43] But now you have hundreds of people who are creating S3 buckets every day. And chances are that some of them will not encrypt their bucket. So what do you do? Do you get an alert and send an email, and do follow-up meetings with your developers? Or do you actually go and correct it? That kind of security response has to be automated. If an AMI is old, older than six months or maybe a year, do you want to allow your EC2 instances to spin up using that old AMI, or do you want to stop everybody who is trying to use an old AMI? So that automated security response is a very critical aspect of managing cloud at scale. And that brings me to this slide, which basically tells you what our environment looks like. In the middle, you have the guardrail automation. These are the security rules and policies we have in a framework, and we use a tool called Turbot, which is constantly monitoring all the accounts against our security policies and ensuring that the desired end state is always maintained. And I'll give you some examples. Things like cross-account access; maybe you don't want cross-account access for an S3 bucket or Lambda. You want versioning enabled all the time; you want to ensure your access logging policy is maintained on your S3 bucket. Right. Your AMIs. You want to ensure that nobody spins up in a region which is not approved by your organization. So this guardrail is constantly working for us behind the scenes, and we don't have to worry about somebody creating an S3 bucket without encryption turned on. Or maybe somebody is using an S3 bucket with no encryption in transit. On the left we have the networking automation; all the accounts get provisioned using automation; all of these VPCs, routing tables, subnets, and security groups are created by automation in a repeatable and consistent way. And of course, all of this is tied to your Active Directory.
So that you don't have to manage IAM users in your account. When you are managing hundreds of accounts, you can't do that. You have to tie it to some kind of federation. And people just use that to get into any of the accounts they want. And we use Turbot for access management and user management. It makes our life so easy. And we talked about VPC Flow Logs, CloudTrail, and GuardDuty. All these logs need to be centralized. So we use CloudWatch Logs, send all those CloudWatch Logs to a stream, and eventually stream them to our SIEM solution. Now our security teams have a single pane of glass where they have visibility into every single activity happening in all the accounts in one place. They don't need to log in to individual accounts. They have full visibility into everything. And that brings me to the next slide, cost management. I know this is a security conference, but it's still part of our lessons and some of the aspects do belong to security. So right-sizing infrastructure is very critical. The cost in cloud can go through the roof.

[00:20:08] If you're not careful, especially if you have hundreds of accounts, people can spin up 16xlarge instances, leave them running, and forget about them. A week later or two weeks later you might get the bill and you are like ten thousand dollars, you know, over your budget. So right-sizing is very critical; scaling in and scaling out when the demand goes up and down is very important to keep the costs down. But Amazon also gives you an option to purchase Reserved Instances, and what it means is you commit to a certain instance usage over a one-year or three-year term and Amazon gives you a 60-70 percent discount.

[00:20:43] So use that. Some tips over there: go short term if you're not sure about your workload. You can go with one year if you're not sure. Because this happens all the time at re:Invent. People go to re:Invent and come back: "Hey, Amazon is launching a new instance type, it's pretty cool, let's use that." They come back and spin up their infrastructure on the new instance type. And now you're paying double; you're paying for your reserved instances as well as on-demand pricing. So go with one year if you're not sure; that way you have an opportunity to correct mistakes after one year rather than getting locked into a three-year contract. And that brings us to some of the other cost-saving opportunities. Monitor for your unused resources, things like volumes which are not associated with an EC2 instance anymore. And this happens all the time. People spin up EC2 infrastructure. They terminate it; volumes get left behind. They spin up ELBs and then leave them running for months. They forget about it, because next time they're going to use another ELB. Or your NAT gateways, or your snapshots. Individually these don't cost a lot, but we were able to save hundreds of thousands of dollars by just automating EC2 volume removal. And we use the guardrails for that as well. If a volume is not associated with an EC2 instance, you tag it. You wait for a week. Take a snapshot. Put it aside and then remove the snapshot after three months or six months or whatever your rotation policy is. And the same thing goes for your NAT gateways as well; they do cost money. Another tip: for your instances, your dev instances, your sandbox instances, or testing instances, which you don't use 24 by 7, shut them down. 8:00 p.m. to 8:00 a.m., maybe you want to shut them down, and by the time your developers come in they can be up and running. Or during the weekends, or when there is no testing happening, your testing instances don't need to be up and running.
You can use a service called Instance Scheduler by Amazon to automate that part for you. And that brings me to the networking aspect of managing your cloud environment.
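The volume cleanup steps described above (tag, wait a week, snapshot, expire the snapshot later) can be sketched as a small decision function. This is a hypothetical Python illustration: the dictionary keys, the action names, and the 90-day retention value are assumptions standing in for whatever the real guardrail and rotation policy use.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=7)         # "you wait for a week"
SNAPSHOT_RETENTION = timedelta(days=90)  # assumed; the talk says three or six months


def next_action(volume: dict, now: datetime) -> str:
    """Decide the next cleanup step for one volume."""
    if volume["attached"]:
        return "none"
    tagged_at = volume.get("orphan_tagged_at")
    if tagged_at is None:
        return "tag"                     # first sighting: mark it as orphaned
    if now - tagged_at < GRACE_PERIOD:
        return "wait"                    # give the owner time to reclaim it
    return "snapshot_and_delete"


def snapshot_expired(created_at: datetime, now: datetime) -> bool:
    """True once a cleanup snapshot has outlived the retention period."""
    return now - created_at >= SNAPSHOT_RETENTION


now = datetime(2019, 8, 9)
print(next_action({"attached": False, "orphan_tagged_at": datetime(2019, 7, 1)}, now))
```

Keeping the decision separate from the API calls makes it easy to dry-run the policy across many accounts before anything is deleted.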

[00:23:08] So now with CI/CD pipelines and infrastructure as code, developers are the king. They can build your infrastructure. They can build the application. They can run code at scale and things like that. But developers are not network engineers. While VPC creation is a relatively easy task (you make an API call and your VPC is created), a lot more goes into managing a network, especially if you're operating at the level of hundreds of accounts. You need to spend some time thinking about your network. Ensure that you are using some kind of design principle. Things like how many Availability Zones you want to enable in every account. We recommend using three, because that gives you high availability and, you know, if an Availability Zone goes down you have some backup. But you have to have some kind of consistent way of replicating this in all your accounts. So, automated VPC creation: have a repeatable and consistent way. Make sure that naming conventions are followed. Tagging is right. Because it helps when you want to scale your network, and everybody is going to do that at some point, and I'll give you a story. Recently we had to roll out Transit Gateway. And for those who don't know about this, it's a relatively new service from Amazon. It allows you to unify all your networks. It makes them one single network. Now you don't need to peer accounts if you want high bandwidth when you are making service-to-service calls between two networks. They can go over Transit Gateway as if they are sitting in one network. So we had to roll out our Transit Gateway. In the accounts where we had a consistent naming convention and tagging and consistent network design, with regard to VPC names and subnets and Availability Zones and routing tables.

[00:24:59] We were able to roll it out within seconds without creating any impact to any application across hundreds of accounts. But in the older accounts where things were done manually, you know, hundreds of subnets, hundreds of route tables, and stuff like that, it was absolutely a nightmare to update the route tables without impacting the applications. So we ended up doing a lot of hand massaging and handcrafting scripts, manually verifying that the route tables are correct before and after, and things like that. So it will pay off in the long run if you are automating and giving it some thought when you're designing a network at scale. Things like: do you want to run your instances in the public subnet? We recommend not to. There is no reason for you to run instances in the public subnet. It improves your security posture if you run all your instances in the private subnet. And to that effect, we do two things. We deliberately create our public subnets with a smaller range. We don't give too many IPs, like a /27 only. With thirty-two IPs, you can't do much. Maybe you can run your ELBs and ALBs, and that's what it should be limited to, not running your EC2 instances. They should be running in the private subnet. And that brings me to the last topic in this networking aspect.
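The arithmetic behind the small public subnet can be checked with Python's standard `ipaddress` module. This sketch relies on the documented AWS behavior of reserving five addresses in every subnet (the first four and the last one); the `10.0.0.0/27` range and the function name are only illustrative.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses left for resources in a subnet of this size."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


print(ipaddress.ip_network("10.0.0.0/27").num_addresses)  # total addresses in a /27
print(usable_ips("10.0.0.0/27"))                          # what is left for load balancers
```

A /27 therefore leaves room for a handful of load balancer interfaces and little else, which is the point of keeping the public subnet small.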

[00:26:20] The default VPCs.

[00:26:21] You don't want to use them. Eventually, you're going to run into various issues at some stage where there is a CIDR conflict. In fact, we had a story there a few weeks ago. We were trying to troubleshoot an application's performance. And it had two services. One was running in its own network. The other service was running in its own network. And when a service-to-service call was happening, it was slow. So from my point of view it was an easy fix: just peer them, right? But we couldn't, because they were using default VPCs. So you should not use default VPCs. You are going to run into some kind of problem. And of course, ensure that VPC CIDRs don't overlap. You are going to run into serious issues if you let it happen. Eventually, when your services grow, they need to talk to each other. They need to make service-to-service calls, and it will be a problem even to connect to your internal network. It will be an issue. And that brings us to the DevOps

[00:27:23] practice. So a typical CI/CD workflow of DevOps: basically, a developer checks the code into version control. A CI server builds the code, creates an artifact, and sends it to Artifactory, from where you deploy your code. And the CI/CD workflow for DevSecOps: basically, you want to inject some security controls in it. Things like a pre-commit hook. Make sure your secrets and passwords are not making their way into your version control. So, there is an opportunity to scan the code before it goes into version control. And if it goes there, have a periodic scanning of your version control to make sure the secrets and passwords are identified, rotated, and removed from the code. And once your artifact is in Artifactory, there's an opportunity to do some scanning of your images as well. But the important thing here is when you are running your services in your dev and production accounts: how many times does it happen that you are trying to release your application, which has worked for months, and the same pipeline works in your non-prod environment, but it doesn't in production? And the reason is that a certain service which it depends on is not enabled. As I said earlier, we do not allow all the services in all the accounts. It depends upon the requirement. When we get the requirement, we provision the account only for those services, so service whitelisting is important. So ensure that your production account and non-prod account match with regard to every single security rule and policy. And that's why we say professionals practice like it's real. You don't want to work in a non-prod which is different from production. Your security rules should be the same in both accounts. S3 bucket encryption should be enabled in both accounts. So we created an API which you can call from your pipeline, before the production release, to make sure your two accounts match.
With respect to every single service and your settings and configuration, so that you don't get surprises later. And if there are differences, stop your pipeline and fix that first before you proceed with the pipeline. So that is a very important aspect of DevSecOps: to ensure your production and non-prod accounts match. There is no reason to keep them different. AMIs. So.
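The account-parity check described above can be sketched in a few lines of Python. The talk does not show the actual API, so everything here is hypothetical: the setting names, the flat dictionary shape, and the idea of raising an error to stop the pipeline are assumptions used only to illustrate the comparison.

```python
def config_drift(nonprod: dict, prod: dict) -> dict:
    """Map each differing setting to its (nonprod, prod) values."""
    keys = sorted(set(nonprod) | set(prod))
    return {
        key: (nonprod.get(key), prod.get(key))
        for key in keys
        if nonprod.get(key) != prod.get(key)
    }


def gate_release(nonprod: dict, prod: dict) -> None:
    """Stop the pipeline (by raising) when the two accounts do not match."""
    drift = config_drift(nonprod, prod)
    if drift:
        raise RuntimeError(f"account settings differ: {drift}")


# Hypothetical settings for illustration.
NONPROD = {"s3.encryption": True, "services.ecs": True, "services.lambda": True}
PROD = {"s3.encryption": True, "services.ecs": False, "services.lambda": True}

print(config_drift(NONPROD, PROD))
```

Calling `gate_release(NONPROD, PROD)` as an early pipeline step would fail the release here because ECS is enabled in only one of the two accounts.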

[00:29:57] With infrastructure as code and, you know, microservices architecture, it's really easy to speed up your auto-scaling. We recommend relying less on config management at boot time. Do the heavy lifting in the AMI itself. Have some kind of AMI pipeline which builds your AMI with your favorite operating system. Test your AMI. Send an API call to a scanner which scans it and sends the results to your security team, which can look into the results and approve or reject it depending upon their findings. And then you publish it to all of your accounts, and this is how it looks. So, basically you have the AMI repository, where we are building all AMIs. They get approved. And once they are approved, they are published to all your accounts. That way, everybody's using the new AMI. We do this on the 1st of every month: the AMIs get rotated and the golden AMI gets built. We call it the base image. They get built every first of the month, and now teams have the opportunity to use this new and approved AMI in their code. But an important aspect of AMI management is also to deprecate your older AMIs. If you don't, then you will end up with AMI sprawl. You may end up with hundreds of AMIs. And it'll be a nightmare to manage and patch them. So we recommend having a process where older AMIs get disposed of when they're not in use.
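A minimal Python sketch of that deprecation rule follows: an AMI is a candidate for retirement when it is past a maximum age and no running instance still references it. The dictionary shape, the AMI IDs, and the 180-day threshold are assumptions for illustration; the talk only says older, unused AMIs should be disposed of.

```python
from datetime import datetime, timedelta


def amis_to_retire(amis, in_use_ids, now, max_age=timedelta(days=180)):
    """Return IDs of AMIs that are older than max_age and not in use."""
    return [
        ami["id"]
        for ami in amis
        if ami["id"] not in in_use_ids and now - ami["created"] > max_age
    ]


# Made-up inventory for illustration.
AMIS = [
    {"id": "ami-old-unused", "created": datetime(2018, 6, 1)},
    {"id": "ami-old-in-use", "created": datetime(2018, 6, 1)},
    {"id": "ami-current", "created": datetime(2019, 8, 1)},
]

print(amis_to_retire(AMIS, in_use_ids={"ami-old-in-use"}, now=datetime(2019, 8, 9)))
```

The old AMI that is still in use is deliberately left alone; it would show up in a separate report asking the owning team to move to the current base image.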

[00:31:43] Another aspect of AMI management is, while we recommend rotating your AMIs every few days or weeks, whatever your policy is, have a backup solution. When something like Spectre or Meltdown happens, you want to have the ability to patch when you need to. So bake your AMIs with your patching agent,

[00:32:00] your security agent, your monitoring agent, and any tools which your organization needs, to ensure that you have a way to patch later if you need to. And with that, I'd like to call upon Nathan.

[00:32:18] To talk about the guardrails. Thanks, Chinmay. Thank you. Nathan:

[00:32:25] So I thought what I might do is talk a little bit about governance as a concept, building on what Chinmay has summarized, having given you a lot of detail about how to really implement that. And now try to ask what are the lessons learned from that and how can we bring those to different organizations. The first thing I like to do when I think about governance is actually to think about government. What do you expect from your government? Right. Most people will disagree on how big or small that government might be, but we would generally agree that government should give us some level of protection. Protection from internal actors and external actors. Alright. Government should give us a set of shared services to make our life better. And government should give us the freedom to better live our lives and operate. And so when we think about cloud governance, I'd like to come back to those principles of how we are protecting, how we are providing services, and then how we are giving our application teams the freedom to operate. So the first thing we think about when we have an organization is: what are the rules and regulations we want to live by in our organization? Are we heavily regulated? Are we more focused on innovation and speed? Maybe a combination of both, in nirvana?

[00:33:36] But basically you have to think about those rules and regulations, and if you think about the average enterprise and how they've worked over the years, there's probably a thousand policy documents spread around the organization. There's someone called Joe in the networking team who happens to know how to name every subnet as it gets created. You have a whole bunch of enterprise knowledge, scripts, people, and processes that have established how that works over time.

[00:33:59] Now the challenge is, as you move to cloud, you've got to work out how you're going to make that work at the scale and speed of cloud. So I'd like to think about the 10-minute server as a way to challenge ourselves with that thought. How do I get from zero to a full server I can log into in ten minutes?

[00:34:16] In a way that we can all sit around a table in front of the CIO and agree that that's an official, valid server. So Chinmay spoke about the importance of AMI build pipelines, patching schedules, the way to do authentication, and stuff into those servers. But if you think about breaking up that process, what does it mean to have that server? We have to have a network we agree on that's valid. We have to have security group rules we're happy with. We have to know where we're deploying to. We have to build an image we're happy with. We have to launch that at the correct size. Get that running, and have a user actually log in and use it, or have it be part of that service meaningfully, within that short period.
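The checklist above (network, security group, location, image, size) can be sketched as a set of explicit rules that a launch request is tested against. This is only an illustration under stated assumptions: the rule values, the `LaunchRequest` type, and `validate_launch` are hypothetical names, not anything McGraw-Hill or Turbot described.

```python
from dataclasses import dataclass

# Hypothetical rule values; a real organization would define its own.
APPROVED_SUBNETS = {"subnet-app-private-a", "subnet-app-private-b"}
APPROVED_SECURITY_GROUPS = {"sg-web-standard", "sg-app-standard"}
APPROVED_REGIONS = {"us-east-1", "us-west-2"}
APPROVED_IMAGES = {"ami-hardened-2019-08"}
APPROVED_SIZES = {"t3.medium", "m5.large"}


@dataclass
class LaunchRequest:
    subnet: str
    security_group: str
    region: str
    image: str
    size: str


def validate_launch(request: LaunchRequest) -> list[str]:
    """Return every rule violation; an empty list means the server is valid."""
    checks = [
        ("subnet", request.subnet, APPROVED_SUBNETS),
        ("security group", request.security_group, APPROVED_SECURITY_GROUPS),
        ("region", request.region, APPROVED_REGIONS),
        ("image", request.image, APPROVED_IMAGES),
        ("size", request.size, APPROVED_SIZES),
    ]
    return [
        f"{name} '{value}' is not approved"
        for name, value, allowed in checks
        if value not in allowed
    ]
```

Because every decision is a lookup against a defined rule, the check can run in milliseconds instead of waiting on a review meeting, which is the point of the 10-minute server.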

[00:34:52] Now, if we haven't defined every aspect of the naming, the services, the networking, the different parts of that, and automated them; if every one of those decisions isn't defined in a rule, we cannot automate it. So one of the huge challenges when you get to real high scale in governance is just having the rules defined. So you typically start with a standard set of rules or regulations, like the best practices that are out there, and that's a great place to start. But you will quickly balloon to hundreds and thousands of policies. I'm not kidding. Thousands of policies. Right. How do you name things? What subnet sizes are your servers in? What do I want to do with EC2s? How do you feel about your external access or cross-account access to aliases of Lambda functions? Right. That's the sort of coverage you have to start getting at scale in this environment. So you have to think about your rules and regulations. And more importantly, you've got to start to wonder how am I going to set and manage these policies and how am I going to handle exceptions in that environment. When you have hundreds of teams or hundreds of applications, you have to be able to do exceptions: S3 is always encrypted, except for this one key case. Right. Well, these sorts of services are appropriate except for these situations. So how do you manage exceptions at that sort of scale? And in addition, of course, you know why we love Amazon: they're bringing out a thousand new features a year. So now your policy framework and your way of thinking about that governance has to be able to deal with that pace of change. We did not bring out a thousand new features a year in the old data center. Right. That's a whole new pace and challenge. We have to work out how we're going to handle it as security professionals in our enterprise, and be prepared for that. The second part of that, of course, is setting up that infrastructure, the shared services that allow our teams to benefit from what we know.
What is the best practice security group? What's the best practice for configuration of different encryption settings? Or the setup of different environments? We need to work out how we're going to do that at scale. How do we want to do identity repeatedly? So we have to have a method to be able to define those parts and make sure we're feeling comfortable about those services moving people faster. The big change, of course, as you move to cloud is that application teams suddenly have the power. Traditionally our infrastructure teams had the power in the environment. They could make the decisions; they owned the budget. Now application teams are actually more in charge of when servers are deployed, when things happen, how things change. That change of power balance means we have to work out how to handle, wrap, and respond to them. We have to put them in guardrails; we can't put them behind a gate. The other thing that happens when you do that, of course, is you move from a world of support, "Please start a server for me," to a world of help, "I want to start a server. How can I do that?" And we have to work out how to combine security or bake security into those processes, because if we kill that 10-minute server, we've killed the agility of the cloud. If we prevent people from creating a Lambda function, which includes creating an IAM role, we're killing the agility of the cloud. So we have to think about how our security is going to enable that agility and work with it at that speed. Or we've become the block in the organization. Right. Generally, in security, we're quick to say no. But we don't really want to be the block.

[00:38:20] So that's where we come to real-time guardrails. The ability to protect these services and capabilities and respond in real-time, when people are creating these capabilities all the time, whether it's using the console, Terraform, whatever they're doing.

[00:38:33] We need to fix it immediately. Nothing else will get us there. If we want to say "I have to review everything before you publish it," we become the block.

[00:38:46] And by the way, yeah, life sucks, right? Because you're spending your whole time reviewing JSON documents and things for people, and you just become the bottleneck all the time. Instead, what we want to think about is what is the framework of policies and regulations we care about. You can use SQS as much as you like. Just don't do cross-account access. Right. Lambda is fine with IAM roles, provided you have an appropriate boundary for how you're doing it. Using EC2 is great, but it must be inside a VPC that's private and it must be of such and such. So long as we have those rules we can give the business the freedom, and we're not arguing anymore about whether it's a good idea to do this. We're basically saying, hey, so long as you're patched and you're not putting us at risk, go nuts, run a thousand servers. You're meeting my requirement.
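A rule like "use SQS as much as you like, just don't do cross-account access" can be expressed as a check on the queue's resource policy. The sketch below is a simplified illustration, not Turbot's implementation: `cross_account_principals` is a hypothetical helper, it only inspects `Allow` statements with `AWS` principals, and it ignores `Condition` blocks that might narrow a wildcard.

```python
import re

# Matches a bare 12-digit account ID or an IAM ARN, capturing the account ID.
ACCOUNT_ID = re.compile(r"^(?:arn:aws:iam::)?(\d{12})(?::.*)?$")


def cross_account_principals(policy: dict, owner_account: str) -> list[str]:
    """List principals in Allow statements that are outside the owner account."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]

    offenders = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if principal == "*":  # anyone, which includes other accounts
            offenders.append("*")
            continue
        aws = principal.get("AWS", [])
        if isinstance(aws, str):
            aws = [aws]
        for entry in aws:
            match = ACCOUNT_ID.match(entry)
            if match is None or match.group(1) != owner_account:
                offenders.append(entry)
    return offenders
```

An empty result means the queue meets the rule and the team can keep working without a review; a non-empty result is what a guardrail would raise an alarm on.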

[00:39:30] Right. And then we can let them make more of their own decisions and how they want to run with that. As you move to software-defined infrastructure, you need software-defined operations. You need software-defined security. Nothing else is going to keep up. We've given up our custom data centers to move to the cloud for the agility, consistency, etc. that provides.

[00:39:53] We have to now start wrestling with the idea that we've got to give up custom processes, custom scripts, and a bunch of hacks to cover the security and process of that, and think about how we're going to define those operations in a secure, software-based framework that can act in real-time. That's what governance at cloud speed and cloud-scale looks like. It's a software problem. Once we get that right we can actually give the people the freedom they needed which was the whole point of why we exist in the first place, why we're doing this. Application teams need the freedom to use these services. That freedom gives the business the ability, the agility to move. The ability to innovate etc. That's what our goal is.

[00:40:43] Give them freedom while keeping them protected and accelerating them through services.

[00:40:50] So as we think about all these security decisions, we have to keep in mind that's ultimately where we want to get to. And how can we automate those decisions through automated response, governance, and appropriate boundaries around people? It's not about an opinion on how it should work; it's about the facts of what must happen. And you know you'll win if it gets escalated up. Right. And then there are things that should happen, which is to make them faster.

[00:41:16] So with that, I thought I'd give you a quick look at what governance can look like when you move at that sort of scale and automation.

[00:41:23] So at McGraw-Hill they've used Turbot for a while, and what they do there is basically have Turbot running inside their VPC in their environment, and it provides a model for identity and for access across the accounts, as Chinmay said, restricting services to different teams. So each user would log into Turbot and they see the handful of Amazon accounts they have access to. The permissions in those accounts are actually simplified down to standard levels like admin, owner, operator, read-only, etc. And that's per service as well: S3 operator, EC2 metadata. You need standard language for that type of IAM, or you get crushed at scale.

[00:41:57] So they have that ability. But the second thing we want is basically the chance for those developers to actually just use those services with freedom. Use the Amazon console, use Terraform, use CloudFormation, whatever they might like. So I'm just gonna do the simple bucket creation thing.

[00:42:11] Everyone likes talking about buckets because we understand them.

[00:42:20] So we want to give people the freedom to do that sort of stuff. But what's interesting now is what happens next. So in the next 10 seconds, assuming the demo gods smile on me today, what should happen is Turbot will detect that bucket, automatically record that change in the CMDB along with who did it, test it against the policy posture for the organization, and then automatically remediate any problems it's found. What that means is we can now give our junior developers the ability to work without fear of them making mistakes or leaving us vulnerable, and we can speed up our senior developers, who don't have to worry about this anymore. So let's see what happened. So if I come into that bucket that I just created, you can see it's already turned on the versioning and it's set the tags. Right. And it's moving through the different things it needed to do. Stuff like the tags includes rules like the cost center and so on, context, right from your organization for this one account.

[00:43:16] So those things have been automatically fixed in that environment already. Maybe the default encryption will come through as well. Let's see what happens. So back in Turbot, we can see the bucket's got created. And we see a couple of things going on. First, Turbot's tracked in the CMDB all the details about that bucket.

[00:43:42] It started to standardize some of the information like tagging which helps when you get to a multiplatform world and you want sort of things working across different environments.

[00:43:51] Alright, so we have that tracking in that CMDB. You can see that it's recorded who created that. It was me with a pretty picture of me behind a pretzel.

[00:44:00] And then coming up here you have a bunch of the alarms that were raised: versioning wasn't correct, default encryption wasn't on. It then automatically remediated those issues.

[00:44:11] And closed the alarms. That's a five-second ticket close.
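The sequence described in the demo (detect the bucket, record it in the CMDB with who created it, test it against policy, remediate, close the alarms) can be sketched as a small in-memory loop. Everything here is an assumption for illustration: the `Bucket` type, the `REQUIRED_TAGS` values, and the function names are hypothetical, and no AWS or Turbot APIs are called.

```python
from dataclasses import dataclass, field

# Hypothetical policy posture: versioning on, default encryption on, cost center tagged.
REQUIRED_TAGS = {"cost_center": "cc-1234"}


@dataclass
class Bucket:
    name: str
    created_by: str
    versioning: bool = False
    default_encryption: bool = False
    tags: dict = field(default_factory=dict)


def evaluate(bucket: Bucket) -> list[str]:
    """Test the bucket against the posture and return the alarms raised."""
    alarms = []
    if not bucket.versioning:
        alarms.append("versioning")
    if not bucket.default_encryption:
        alarms.append("default_encryption")
    if any(bucket.tags.get(key) != value for key, value in REQUIRED_TAGS.items()):
        alarms.append("tags")
    return alarms


def remediate(bucket: Bucket, alarms: list[str]) -> None:
    """Fix each alarmed setting in place."""
    for alarm in alarms:
        if alarm == "versioning":
            bucket.versioning = True
        elif alarm == "default_encryption":
            bucket.default_encryption = True
        elif alarm == "tags":
            bucket.tags.update(REQUIRED_TAGS)


def handle_bucket_created(bucket: Bucket, cmdb: dict) -> list[str]:
    """Record, test, remediate; return any alarms still open afterwards."""
    cmdb[bucket.name] = {"created_by": bucket.created_by}
    raised = evaluate(bucket)
    remediate(bucket, raised)
    cmdb[bucket.name]["alarms_closed"] = raised
    return evaluate(bucket)
```

In a real system the trigger would be a cloud event and the remediation would be API calls; the shape of the loop is what matters here.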

[00:44:16] So that's the sort of automation you want through hundreds of services. We have like seventeen hundred policies like that, moving through different services in the environment. So that's what it looks like at scale. Now if we go into one of these things, we see this bucket is actually not approved. Turbot tells us why. It actually says it's in an unapproved region. And you want to prevent everything you can. We do extensive work to prevent things in IAM, but there's a bunch of things you just can't prevent, right? And this is just one example of when you might not be able to. Of course, there are rules here where you could. So, an unapproved region: it's raised an alarm and said, okay, this one's not approved. Let's look over here at policies. Controls are what give us alarms on the state of the environment, but they're based on policies. As I said, we need a whole suite of policies in our environment for what's valid. Which regions are allowed? What is encrypted? Your naming standards? So if we look at the policies here we can see something like what the approved regions are. And in this case, we're saying all U.S. regions are valid for buckets.

[00:45:13] Now in Turbot, the way the policy engine works is it comes down hierarchically. I set a rule at the top: I want to only use U.S. for all the buckets in all of my accounts, and that would be true now and into the future for every new thing.

[00:45:27] But what you often need to do is start to set exceptions, because maybe it's actually okay that this one bucket is in Ireland. So you can create an exception to say, well, actually, this one I'll let live in Europe, and this is an exception that beats a MUST (required) rule from above. Recommended rules you can let teams beat any time; MUST rules they require permission to beat. And we can save that exception. As soon as we do that, Turbot knows about that exception and it will re-review that policy and the control to say, is it approved? So if we go back to the control, we can see it's flipped to an OK state. This bucket is now approved to live in that region.
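One way to picture a hierarchical policy with exceptions is a lookup that walks from the resource up to the top of the hierarchy and takes the nearest setting. The sketch below makes that assumption and nothing more: the `PARENT` and `SETTINGS` tables and the resource paths are hypothetical, and the required-versus-recommended precedence mentioned in the talk is deliberately left out.

```python
# Hypothetical hierarchy: each node maps to its parent (None at the top).
PARENT = {
    "org/account-a/bucket-ireland": "org/account-a",
    "org/account-a/bucket-logs": "org/account-a",
    "org/account-a": "org",
    "org": None,
}

# Policy settings: one rule at the top, one exception on a single bucket.
SETTINGS = {
    "org": {"approved_regions": {"us-east-1", "us-east-2", "us-west-1", "us-west-2"}},
    "org/account-a/bucket-ireland": {"approved_regions": {"eu-west-1"}},
}


def resolve(resource: str, policy: str):
    """Return (value, node it was set on), preferring the most specific setting."""
    node = resource
    while node is not None:
        if policy in SETTINGS.get(node, {}):
            return SETTINGS[node][policy], node
        node = PARENT[node]
    raise KeyError(f"no setting for policy '{policy}' above {resource}")


def is_approved(resource: str, region: str) -> bool:
    """Control check: is this resource allowed to live in this region?"""
    approved, _ = resolve(resource, "approved_regions")
    return region in approved
```

The top-level rule automatically covers any bucket added later, while the exception lives in exactly one place and can be found by asking where the effective value was set.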

[00:46:09] In general those simple policies give us the power we need for exceptions and stuff like that. And we try to keep that language very, very simple. What's cool though is you get to see all the exceptions in your environment. Like I said, you end up with a lot of policies. So if we come up here we can actually see a list of all the exceptions we've granted around bucket regions. If you're a security team trying to manage all the exceptions you've got in that environment, you don't want them hardcoded in JSON in a thousand locations. What you want is a simple place to see those decisions, have them expire, set rules around those things. Now sometimes you want to get more fancy in your rules, and decide that, okay, it might be okay to have a bucket in Europe if the tag is such and such. People do really, really crazy things when you get to hundreds of accounts and scale. So what you want there is stuff that gets even more powerful about the decisions. So we call these calculated policies. Now in Turbot, because you've got all that information automatically discovered into the same CMDB,

[00:47:07] you can actually search that as part of a policy decision. So here we can do a quick search for, like, the tags for this bucket. Turbot has normalized some information like the tags; that's a GraphQL input query against the CMDB, and it's found all the tags for that bucket. And then we can start to say, with that information there, now what do I want to do with that? I'm just gonna grab a quick code snippet here.

[00:47:34] So that's some templating language, just some Jinja2, if you're into that sort of thing. Right. And what it's doing then is saying, based on the information context for this one bucket and its tags, using this template we can make a decision about what we want this policy value to be. You could look at the name of the bucket, does it have a correct prefix, your tagging, whatever you want; it actually doesn't even matter. Right. And then that policy gets calculated there for that one bucket and it goes along from that. So we can then just save that policy as we did before. Turbot is now going away in the background and calculating that policy for us.

[00:48:07] And determining then the state of the control for that bucket. Now if you have a thousand buckets under that policy it will recalculate all of them individually and contextually.
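The calculated-policy idea, where the policy value is computed per resource from its own context, can be sketched with a plain function standing in for the query plus template. This is not the snippet from the demo, which is not in the transcript: the `data_residency` tag, the region lists, and both function names are hypothetical.

```python
US_REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]


def calculated_approved_regions(bucket: dict) -> list[str]:
    """Compute the policy value for one bucket from its tags (hypothetical rule)."""
    tags = bucket.get("tags", {})
    if tags.get("data_residency") == "eu":
        return ["eu-west-1"]
    return US_REGIONS


def control_states(buckets: list[dict]) -> dict[str, str]:
    """Recalculate the policy for every bucket individually and report its state."""
    return {
        bucket["name"]: "ok" if bucket["region"] in calculated_approved_regions(bucket) else "alarm"
        for bucket in buckets
    }
```

The same rule yields different answers for different buckets, so a tagged bucket in Ireland passes while an untagged one in the same region raises an alarm, with no per-bucket exception to maintain.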

[00:48:16] Right. So again you're now thinking in policy posture, rules across your environment. You're not thinking about small exceptions in text files in a thousand different locations once you get to that sort of scale. Now in the CMDB, we have recorded the information about this bucket and we have it here, and we actually put that in context in Turbot, so we see all the different controls here for this one bucket. So some of the standard controls you might want to think about in your environment are stuff like active: should this still exist at this time? Right. You might set a rule like, in a sandbox account all resources should be no older than 60 days.

[00:48:52] This is to stop people running pseudo-production in there, right? Or things like: is it approved to exist here, is default encryption on? You can go nuts. We have many, many, many of those sorts of things. Now if you roll up in Turbot you can see controls here across the whole region, divided down by the different resource types and the state of different things. We can go up to the account and see our summary of controls there, and we can keep going to see across different platforms, across different environments, and all that sort of stuff. If we sort our alerts here, we can see actually the top one is CIS controls in this environment, and now we can start to drill down and see information about CIS for our account. So Turbot organizes these hierarchically as well, and these, by the way, happen in real-time. You don't want a report you run once a month where you get the results much later. You want to see how it changes all the time. You change one thing, it happens in real-time in Turbot. If we drill into something like logging here we can see different stuff. Like, for example, is flow logging enabled? Right. And then per VPC, we can actually see the rules around that. So you can see that control for each target resource in the environment.

[00:50:03] See the state of it over time how it's changed if it became compliant, became non-compliant, we have all of that history information.

[00:50:11] Now one of the things that's interesting is once you get to this scale of automation, and like I said you will hit hundreds and hundreds of policies, it's a fact, it gets hard to manage.

[00:50:21] What you need to do is start organizing that information very carefully. What are the names of your policies? What's the architecture of your approved, active, configured, your data protection, standard names, and structures through all of that, so you have a language to talk about it? Otherwise, you're buried in minutiae. The other thing you need to do is start to categorize that stuff. So in Turbot, from what we've learned over the years (we started with scripts and have gradually built out), we now categorize everything. So we have "my bucket", which is of type AWS S3 bucket and is of the category storage container.

[00:50:55] Right. Now you can imagine as you go across all the clouds or other environments how that starts to group things into categories of information. Same for controls. CIS is not only a benchmark for AWS; there are benchmarks for Linux, for other providers, other tools out there. And those are categorized by CIS into a control framework. So we can come into Turbot and now view the CIS report by the controls, which is now a cross-platform, full-stack view of the control framework.

[00:51:26] So we can see different things there and then we can start to divide it up so we can say hey show it to me by the control category. Right. We can drill down into different areas and cut it in different ways. Right. So now we're in that maintenance section 6. Then we can say, hey show it to me by resource category. So here's section six broken up by the type of resource.
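Slicing control results by resource category depends on a mapping from each concrete resource type to a category. A minimal sketch of that roll-up follows; only the bucket-to-storage-container pairing comes from the talk, and the other type names, categories, and the `summarize_by_category` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Resource type -> category. Only the S3 bucket entry reflects the talk.
RESOURCE_CATEGORY = {
    "aws-s3-bucket": "storage container",
    "aws-ec2-instance": "compute instance",
    "aws-vpc": "network",
}


def summarize_by_category(results: list[tuple[str, str]]) -> dict[str, dict[str, int]]:
    """Count control states per category from (resource_type, state) pairs."""
    summary = defaultdict(Counter)
    for resource_type, state in results:
        category = RESOURCE_CATEGORY.get(resource_type, "uncategorized")
        summary[category][state] += 1
    return {category: dict(counts) for category, counts in summary.items()}
```

Adding another cloud only means adding rows to the mapping; the report itself does not change, which is what makes a cross-platform view possible.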

[00:51:48] And that gives you a lot of insight into what's happening in the environment. The ability to target your quick wins, the different capabilities you want.

[00:52:00] We spoke about permissions before and the importance of that as part of the setup of the management environment. So Turbot provides a permission model, which is what McGraw-Hill uses at scale across the environment. They use it for Linux as well, actually. So Linux-level authentication as well as the accounts.

[00:52:15] Now what's cool about this model is it actually follows the hierarchy again, so we can grant permission to a user and that's a straight-up search of our directory. We can choose which permissions to grant.

[00:52:26] This is a simplified list. Like, we track thirty-five hundred Amazon permissions at this point. So if you're really trying to do that at scale, I highly encourage you to go to a standard language around how you want to do them. And basically then you want to grant things in simpler, smaller groups, like for example AWS read-only or S3 read-only.

[00:52:45] We support expiration of those sorts of grants. So, you can say, okay, you get read-only on that for the next six hours. Right. And then that will automatically expire. That's really important when you want to start to have models like, say, your security team. You might want to give them access at the top level across hundreds of Amazon accounts for AWS metadata access. For us, metadata means no reading of data. Right. Just seeing the resources. Metadata, read-only, operator, admin, owner: standardize that stuff out.
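Standard permission levels with automatic expiration can be sketched as grants that carry an optional expiry time. This is an illustration only: the `Grant` type and `allows` function are hypothetical, and treating the five levels as a strict ascending order (metadata lowest, owner highest) is an assumption the talk does not state.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed ordering, lowest to highest.
LEVELS = ["metadata", "read-only", "operator", "admin", "owner"]


@dataclass(frozen=True)
class Grant:
    user: str
    service: str
    level: str
    expires_at: Optional[datetime] = None  # None means the grant does not expire


def is_active(grant: Grant, now: datetime) -> bool:
    return grant.expires_at is None or now < grant.expires_at


def allows(grants: list[Grant], user: str, service: str, level: str, now: datetime) -> bool:
    """True if the user holds an unexpired grant at or above the requested level."""
    needed = LEVELS.index(level)
    return any(
        g.user == user
        and g.service == service
        and is_active(g, now)
        and LEVELS.index(g.level) >= needed
        for g in grants
    )
```

A six-hour read-only grant is then just a `Grant` whose `expires_at` is six hours after it was issued; nothing has to remember to revoke it.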

[00:53:14] Right. So once you give people like that access to the metadata, they can see the environment and move right through it. Then you might say, okay, you need admin on a temporary basis to be able to troubleshoot something or fix things, so you use those automatic expirations. We also, by the way, do temporary elevation if you're into the really, really hardcore.

[00:53:29] Stuff around that. So that sort of permission model gives you a lot of flexibility. For your CMDB, you want to be able to do things like search, to find the different resources in the environment or to search by your details, like for example, I want to see everything in my environment that's from the sales department.

[00:53:52] You also want to be able to run queries across your whole environment, for example, to be able to use an API to start getting stats out, running extra reports, going nuts with that sort of thing. That's what software-based governance looks like. It doesn't look like a bunch of small scripts in different places you've got to manage by hand. Now the last thing I'll mention is that we started off with AWS, S3, EC2, but governance really is such a broad topic. With the companies we work with, we're finding that they're struggling with governance across the enterprise. Your internal Domain Name Service, certificate expirations (a little dummy example here). There are hundreds of problems where we're trying to work out how to do this automated governance at scale, across our pipeline tools, our external SaaS providers, etc. So we've started to think about that as something now that's customizable and extensible. So this is an example of a custom policy. We have a simple CLI tool, which I'd be happy to talk about another time, to develop this. But basically, all the policies and stuff are defined using very simple JSON schema language definitions, and then they appear in the policy engine. So you get the hierarchy, you get calculated policies, all that sort of stuff. The control framework, the slicing and dicing, the reporting, and it gives you a way to start thinking about these scripts and how they're deployed, allowing you to have Lambda-style code which is then automatically deployed and run multi-region with high availability, with all the logs centralized. So it takes that governance to a whole different level, and then your users can use those like they do any other policies, with the validation and stuff in the UI to give them access. So if we change this certificate warning period to 365 days for this AWS thing, we'll see, I think, they've got about one hundred seventy-four days left. Right.
So they have time, but then the space you can go to for this type of governance, the sorts of rules you can write for this type of automated governance.

[00:55:43] Is limitless. And that's the future of security.
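The certificate example from the demo reduces to comparing the days left on a certificate with a configurable warning period. A minimal sketch follows, assuming a hypothetical `certificate_control` function and state strings; how the expiry date is obtained (for example, from a TLS handshake) is left out.

```python
from datetime import date


def certificate_control(expires_on: date, today: date, warning_days: int) -> str:
    """Return the control state for a certificate given the warning period policy."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return "alarm: expired"
    if days_left <= warning_days:
        return f"alarm: expires in {days_left} days"
    return "ok"
```

With about 174 days left, a 365-day warning period puts the control in alarm while a 30-day period leaves it OK, which matches the behavior described when the policy value was changed in the demo.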

[00:55:51] Coming up with our rules. Defining those. Consistently with good names and then running them in real-time.

[00:56:01] In an environment where application teams have the power, and our service providers are adding thousands of features a year.

[00:56:09] And that's why what we say is that, in effect, we have to start to accept that movement to a world that's software-defined. As we move to software-defined infrastructure, we have to start moving to software-defined operations. When we do that, the speed of our organization is unparalleled. Our cloud teams move faster because they've got automation; our application teams move faster because they've got services to build on. We're safer than ever before. I mean, imagine saying every data store was encrypted in your internal data center. That's now an automation away. Right. We're better off than we have been. We're making it more accessible: our junior developers can do more stuff because they're not afraid of the mistakes they might make. And we're more productive because we don't have to spend our time reviewing JSON or reviewing things. Once we've set our posture in place for the first application, the first account, it's guaranteed for the second one. We don't have to review each thing by hand anymore. What happens now is, when people do something crazy and come to us and say "I need an exception," they were blocked from the start and now they're asking for permission. The traditional model is they go crazy, do a heap of stuff, and then come at the end for a review before production. And now we're blocking them. Right. Automated guardrails running in dev and prod consistently change that relationship and change the structure of it from one of "fix this for me,"

[00:57:31] "you have to do this for me," to one of "help me solve this problem, help me get this service going in my environment."

[00:57:40] And that's why you see that change of relationship happen, as you start to have automation of all the basic stuff.

[00:57:45] So Chinmay was able to talk today about so much stuff they've done with pipelines and building all these capabilities, and we're excited because we feel like we were able to enable that by doing so many of those basic standard things that organizations have to do over and over. Beneath that, automating the setting up of GuardDuty, automating the setting up of those flow logs, that's all just standard stuff.

[00:58:04] That every organization does. So the thought is, why would we do it by hand?

[00:58:11] And that change of relationship happens once you have that security, automation, and monitoring of the underlying layer. You can give freedom to your application teams to use those services.

[00:58:20] They can work in a relationship with the cloud team to move faster. They want to try a new service, you can create an exception for it in one account. Let them learn, innovate together, right, and work out how to do it, and then automate more of that stuff so others can use it. So you start to turn those standards into things you're gradually automating and improving over time.

[00:58:42] So thank you so much for your time listening to us today. We're super excited to talk about this topic. If you have any questions for Chinmay or myself, hit us up. We'll talk about it all day, and you have a good rest of the show.

If you need any assistance, let us know in our Slack community #guardrails channel. If you are new to Turbot, connect with us to learn more!