Disclaimer: Automated Transcript
[00:00:00] Good morning everyone.
[00:00:03] My name is Chinmay Tripathi. I'm the Director of
Cloud Engineering for McGraw-Hill. And today's talk is: Lessons learned
from the multi-year implementation of cloud governance automation. And
we'll be talking about how we effectively manage cloud automation,
operations with regards to security, networking, IAM controls, operating
system hardening, and patching. And I'll be joined by Nathan Wallace who
is the CEO and Founder of Turbot.
[00:00:43] All right. So, some of the lessons learned. And
before we get into the lessons learned, a quick introduction to McGraw-Hill.
McGraw-Hill is the learning science company, and our data scientists are
constantly curating billions of data interactions, which are created by
our students interacting with our digital products. And we are trying
to identify trends, patterns, and opportunities to improve learning for
individual learners as well as helping our teachers to teach better. And
we partner with over 14,000 authors and educators, including 50 Nobel
laureates. On the technology scale, we have more than 80 dev teams, more
than a hundred AWS accounts, more than 10 Kubernetes clusters, and
4,000 Amazon Elastic Compute Cloud (EC2) instances. So, how did we get to
hundreds of AWS accounts? I think that is going to be the central theme of
this talk, and how we manage operating in the cloud at this scale.
[00:01:51] So chances are you are in the cloud because you
want agility; you want elasticity; you want auto-scaling; and all that
good stuff which cloud brings to you. Most organizations start
small. And we did that too. You start with a small POC, maybe
migrating one or two applications from the data center into the cloud.
You start with a small team working on a shared goal, and the goal is to
bring the stuff into the cloud. And eventually, you are successful. After a
few weeks or months, you build a team, you build a pipeline, you build
services around it; maybe you are using config management tools like Puppet,
Salt, or Chef. And then you grow. You are now moving maybe 10 or 15
applications into the cloud. Now you go to stage 2, which we call
a Share House. What it means is that you have multiple teams working on
different projects in one single AWS account. And this model works as
well for some time. And then inevitably it happens that somebody makes a
security group change thinking they are doing it in a dev environment, but
that security group is shared by a production application. Or maybe a
routing table change, or maybe a NACL rule change. Or maybe somebody
updated the AMI and updated the Java version thinking, you know, it's not
going to affect anybody else, but that AMI is shared by other
applications. So that model is also not working. So you need to separate
your production workload from your dev environment. And that
brings us to the next stage, which is Hosted Service. What it means is now
you split your environment into a dev, test, and prod kind of model, and you
separate your production workload from your dev environment.
And this is the model where we think most organizations stay a bit
longer than in the first two models. You could go on for maybe six
months, a year. But this is also not a scalable model. Right. Now you're
in the cloud. You have built a team, a centralized team. You can call it
your cloud team or your DevOps engineers. They're getting comfortable.
They are building your CI/CD pipelines, building services, your EC2
infrastructure, all that good stuff cloud brings. They have the
tribal knowledge. They have created a lot of services around it.
[00:04:22] But business demands more. Now you're not competing
with yourself from the data center days. You want more frequent releases;
you want ten releases a day. Now, when you scale it to hundreds of
services, how do you do that? Your core team, which has been doing
your pipelines and managing your workload in the cloud, cannot scale with
the pace of innovation the business wants; the business wants more agility,
you know, bringing more value to the business. So we faced the same
issue. So we tried to solve it. We hired more people. But ultimately we
realized that this model is not scalable. And that brings us to
the next stage. So we looked into this problem and we thought, OK, maybe
we'll give more developers and engineers access to the cloud API,
because ultimately it's all in code. Developers can code their
application. They can code infrastructure as well. But you want to
ensure that the separation of duties is maintained; you need a lot of
security controls. And we thought we could do that. So we started writing
IAM policies, right, so that if a developer is working on a few services,
he's kind of separated from the other workloads running in the cloud.
Right. And we tried to write the IAM policies. It doesn't work. And the
reason for that is IAM policies are hard, right. And if that is all you're
doing constantly, eventually, you know, it just doesn't work, doesn't
scale; you're going back and forth, too many change control meetings, and
you know, it doesn't work. So having been through this, we needed a better
solution. So we looked to the model of Amazon. Amazon doesn't build a data
center for every one of their customers. They have a multi-tenant model.
So what we did: we started to give accounts to each of our teams or the
business units, and we ended up with hundreds of AWS accounts.
So the benefit of that model was now we have a two-pizza team in one
account. Right. They may be running four to five services, with six to
eight developers and engineers. And the blast radius is minimized now. Any
activity done in one account would not impact other accounts. So you have
solved the problem of agility, the freedom, and now your pace of
innovation is much faster. Which is what you need to do. You don't want to
slow down your developers because of your security controls and
operational controls and things like that. But you have solved one set of
problems and traded it for another set of problems. And that brings us to
the next slide. So when we looked into our overall footprint, you know,
we had issues in security, cost management, networking, and the DevOps
pipeline. Things like: how do you replicate all your security best
practices across hundreds of AWS accounts? How do you ensure that not
all the services are enabled in every single account? If a team comes to
you and asks for an account and they just want to run serverless, maybe you
don't need to give ECS access or ACM access or, you know, S3 access maybe.
You want to restrict the account to the services which are required in that
particular account. Service whitelisting is very important. How do you ensure
your cost doesn't go through the roof? So there has to be some kind of a
budget enabled in each account. Maybe you need to create a certain key
pair in every account by default so that future automation is easy.
Maybe you need to create a service account later at some point. Or maybe
you need to do user management. In security, you want to ensure all
your AMIs are managed properly. Now your scale is so big, and you
don't have just your core DevOps team anymore. It is open across all your
organization. So you need to have the security controls and policies and
rules and regulations within which all these accounts can operate.
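
Editor's sketch (not from the talk): the per-account service whitelisting described above could be expressed as a deny-by-default policy document. The talk does not say which mechanism McGraw-Hill uses; this assumes an IAM-style policy with a Deny on NotAction, and the function name and the example service list are hypothetical.

```python
import json


def build_service_allowlist_policy(allowed_services):
    """Return an IAM-style policy document that denies every action
    outside the given service prefixes (for example "lambda" or "logs")."""
    if not allowed_services:
        raise ValueError("at least one service prefix is required")
    # "NotAction" inverts the match: the Deny applies to everything that
    # is NOT one of the allowed "<service>:*" patterns.
    not_actions = sorted({f"{svc.lower()}:*" for svc in allowed_services})
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyServicesOutsideAllowlist",
                "Effect": "Deny",
                "NotAction": not_actions,
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical account for a team that only runs serverless workloads.
    policy = build_service_allowlist_policy(["lambda", "logs", "dynamodb"])
    print(json.dumps(policy, indent=2))
```

Generating the document from a short list keeps the allowlist itself as the reviewed artifact; how the document is attached to an account is left out on purpose.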
[00:08:34] Alright. So that brings me to the next slide. And
we will be talking about some of the lessons learned in the area of
security. The first is root credential management, and I think at this
point everybody knows that using root credentials is a bad practice. Nobody
should do this, but I'll repeat it anyways.
[00:08:52] Make sure you are rotating your root credentials,
root passwords. Enable your MFA. Enable your security questions.
[00:09:02] And don't allow any application to use the root
keys in their pipelines. But here's one extra tip to discourage people
from using root: have your root credentials managed by two separate teams.
What we do is essentially have the password managed by one team and
the MFA managed by another team. That way they always have to coordinate
and collaborate if you want to log into your AWS account as a root user.
And of course, rotate your keys every few months.
[00:09:35] Service accounts. What we are trying to say here is:
how many times has it happened that you were in the middle of a big product
release, and everything is going well. Everybody is excited. You're
running your pipeline job and it doesn't work. The job fails. You look
into the logs and you find out that, you know, the credentials don't work
anymore. What happened?
[00:10:00] The user who owns the credential is
no longer part of the organization. So good job expiring the credential,
but you shouldn't have used that credential in the first place. So the
recommendation: use machine access for machine jobs. If you're running
a pipeline or a release or deployments, use service accounts or machine
access like IAM roles to deploy your code, and not credentials which are
tied to individual users, because users are eventually going to leave.
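
Editor's sketch (not from the talk): one way to audit the recommendation above is to look at the principal each pipeline job deploys with and flag anything that is an IAM user or the root user rather than a role. The ARN shapes are standard AWS ARNs; the job inventory is hypothetical, and the sketch assumes for simplicity that every IAM user ARN belongs to a person, which a real inventory would need to refine.

```python
def principal_kind(arn):
    """Classify an AWS principal ARN as 'user', 'role', 'root', or 'other'."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    service, resource = parts[2], parts[5]
    if service == "iam" and resource.startswith("user/"):
        return "user"
    if service == "iam" and resource.startswith("role/"):
        return "role"
    if service == "sts" and resource.startswith("assumed-role/"):
        return "role"
    if service == "iam" and resource == "root":
        return "root"
    return "other"


def jobs_using_user_credentials(jobs):
    """Return the names of jobs whose deploy principal is an IAM user or root.

    `jobs` is a hypothetical mapping of job name to principal ARN.
    """
    return [name for name, arn in jobs.items()
            if principal_kind(arn) in ("user", "root")]


if __name__ == "__main__":
    inventory = {
        "deploy-web": "arn:aws:iam::123456789012:user/alice",
        "deploy-api": "arn:aws:sts::123456789012:assumed-role/ci-deployer/build-42",
    }
    print(jobs_using_user_credentials(inventory))
```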
[00:10:29] If you do use users, use them for POCs, or for
testing, troubleshooting, and stuff like that. The next is automated
account configuration, and that is very important, especially when
you're operating at a scale of hundreds of accounts and you have to
provision these accounts all the time.
[00:10:49] You want to ensure all your security controls, your
security policies, are fully automated. Your VPC creation, your security
groups, your user management, your access management: it should all be
automated.
[00:11:03] And now we're going to talk about some of the
security services which we have found very helpful in managing
our cloud environment. So before I get into that, can I get a show
of hands: how many of you use Flow Logs, CloudTrail, and GuardDuty? OK,
that's pretty good. So for those who don't, a quick rundown on these
services. Flow Logs. How many times does it happen that you are trying to
log in to an EC2 instance and you cannot log in, right? Happens to us all
the time.
[00:11:38] It could be a problem in the network where you are
operating; it could be a local issue. Could be a routing issue. Could be a
firewall issue in your data center. Could be an issue with the security
group on the EC2 instance. Or could be an issue with your NACL. Or it
could be a local issue on the EC2 instance itself. Maybe the user is not
set up yet. Or maybe the keys have not been created yet.
[00:12:03] Use Flow Logs. A great tool for network
connectivity troubleshooting.
[00:12:10] It basically captures all the IP traffic metadata,
including source IP address, destination, port, protocol, as well as whether
traffic was accepted or rejected. So great tool. We use it all the time;
helps tremendously. But it is also a great security monitoring tool. So
the challenge here is how do you take all these logs and put them in a
centralized location where security teams can analyze these logs and
create some analytics, or maybe integrate with CloudWatch Events to send
out some notification and create some actionable
workflow. So you can use CloudWatch Logs or an S3 bucket and
use a Kinesis stream to maybe send it to your SIEM solution, where they
get a single pane of glass and know what's happening across all of your
hundreds of accounts.
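
Editor's sketch (not from the talk): the flow log fields mentioned above (source, destination, port, protocol, accept or reject) can be pulled out of a record with a few lines of code. This assumes the default space-separated version 2 record format; the sample addresses are illustrative and the helper names are hypothetical.

```python
# Field order of the default (version 2) VPC flow log record.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_log_line(line):
    """Split one default-format flow log record into a dict of strings."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_to_port(lines, port):
    """Yield (srcaddr, dstaddr) for REJECT records aimed at the given port."""
    for line in lines:
        record = parse_flow_log_line(line)
        if record["action"] == "REJECT" and record["dstport"] == str(port):
            yield record["srcaddr"], record["dstaddr"]


if __name__ == "__main__":
    sample = [
        "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
        "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
        "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 "
        "49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    ]
    for src, dst in rejected_to_port(sample, 3389):
        print(f"rejected: {src} -> {dst}")
```

When a login attempt fails, filtering for REJECT records on the target port is a quick way to tell a network block apart from a problem on the instance itself.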
[00:13:00] The next is CloudTrail. Another great tool from
Amazon.
[00:13:04] You are troubleshooting; maybe you are
doing an RCA. Something went wrong.
[00:13:09] A security group was deleted. Nobody knows who did
it. You're trying to find out. Enable AWS CloudTrail, and it will
tell you every single AWS API call made in every one of your
accounts: who did it, what time it was done, what action was taken.
It enables governance, compliance, and risk auditing tremendously in an
account. So definitely a very important service to be enabled in all
accounts. And the point here is how do you do it at scale? So you have to
have a process when you provision an account: all these basic services
need to be enabled by default. At the same time, you have to ensure that
nobody goes and changes it.
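
Editor's sketch (not from the talk): making sure nobody changes the trail can start with watching CloudTrail's own records for calls that stop, delete, or modify a trail. The record fields used here (eventSource, eventName, eventTime, userIdentity, recipientAccountId) are standard CloudTrail record fields; the choice of which event names count as tampering is an assumption, and the function name is hypothetical.

```python
# CloudTrail API calls that can hide activity (an assumed, non-exhaustive list).
TAMPERING_EVENTS = {"StopLogging", "DeleteTrail", "UpdateTrail"}


def find_trail_tampering(events):
    """Return who/what/when for records that stop, delete, or alter a trail."""
    findings = []
    for event in events:
        if (event.get("eventSource") == "cloudtrail.amazonaws.com"
                and event.get("eventName") in TAMPERING_EVENTS):
            findings.append({
                "when": event.get("eventTime"),
                "who": event.get("userIdentity", {}).get("arn", "unknown"),
                "what": event["eventName"],
                "account": event.get("recipientAccountId"),
            })
    return findings


if __name__ == "__main__":
    records = [
        {"eventSource": "ec2.amazonaws.com", "eventName": "DeleteSecurityGroup",
         "eventTime": "2019-06-25T14:02:11Z",
         "userIdentity": {"arn": "arn:aws:iam::123456789012:user/bob"},
         "recipientAccountId": "123456789012"},
        {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging",
         "eventTime": "2019-06-25T14:05:40Z",
         "userIdentity": {"arn": "arn:aws:iam::123456789012:user/bob"},
         "recipientAccountId": "123456789012"},
    ]
    for finding in find_trail_tampering(records):
        print(finding)
```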
[00:13:45] It's possible that you enable CloudTrail and
somebody is trying to remove the traces of their activity. You should get
an alert and maybe have an automated guardrail to correct that mistake. So
detection and correction immediately is very, very important when you're
managing hundreds of accounts. You cannot, you know, do it after two
weeks or three weeks, because by then it's probably too late already. And
that brings me to the third service, which is my personal favorite, and
that is GuardDuty. This is a relatively newer service from Amazon. So with
Flow Logs and CloudTrail,
[00:14:20] you have a lot of data. You have all the API call
event logs; you have your network traffic.
[00:14:26] So security is getting all these logs and they are
constantly analyzing them, and it's very easy for them to get overwhelmed.
It's a lot of data. They can try, but it doesn't really work very well, in
the sense that it's just too much information there. What do you do with it?
How do you create your events?
[00:14:45] GuardDuty solves that problem tremendously. It is
a threat detection service which is continuously analyzing all the VPC
flow logs and all your CloudTrail events, and trying to do threat detection
by analyzing these logs and finding any kind of malicious activity or, you
know, a potential threat in your account. Within the first week of enabling
GuardDuty, we had some major successes. We found out many
applications were still using root credentials, within the very first week
of enabling GuardDuty. The other day we had an issue where we
got an alert from GuardDuty
[00:15:27] that some user had just terminated 20 EC2
instances. We got the alert, started looking into it, and we're like, this
happens all the time, right? People go and spin up instances and
infrastructure and terminate them. So what's the big deal, right? We
started looking into it and found out the user had no business in that
account. It was a mistake. He was supposed to work in a different account,
but, you know, the keys got mixed up or whatever, and he landed in a
different account and terminated the instances. It's a great tool for
things like that. Yup. SSH port. It should never be open from the internet.
We get alerts from Amazon GuardDuty all the time. People make mistakes, even
when they know the best practices. Sometimes they're in a hurry, they're
troubleshooting something, they enable port 22 from the internet, and next
thing you know Amazon GuardDuty sends an alert, and next thing you know you
are getting a brute-force attack. So great tool; if you have not checked it
out, please do so. It's going to make your life very easy.
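
Editor's sketch (not from the talk): the "port 22 open from the internet" mistake can also be caught by inspecting security group rules directly. The dictionaries below follow the shape that the EC2 DescribeSecurityGroups call returns (IpPermissions, IpRanges, Ipv6Ranges); the function names and sample groups are hypothetical, and this is a complement to GuardDuty alerts rather than a description of how GuardDuty works.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def _covers_port(permission, port):
    """True if the rule's protocol and port range include the TCP port."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol != "tcp":
        return False
    return permission.get("FromPort", 0) <= port <= permission.get("ToPort", 65535)


def groups_open_to_internet(security_groups, port=22):
    """Return GroupIds that allow inbound traffic on `port` from anywhere."""
    exposed = []
    for group in security_groups:
        for permission in group.get("IpPermissions", []):
            cidrs = [r["CidrIp"] for r in permission.get("IpRanges", [])]
            cidrs += [r["CidrIpv6"] for r in permission.get("Ipv6Ranges", [])]
            if _covers_port(permission, port) and OPEN_CIDRS.intersection(cidrs):
                exposed.append(group["GroupId"])
                break  # one offending rule is enough to flag the group
    return exposed


if __name__ == "__main__":
    groups = [
        {"GroupId": "sg-0aaa", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
        {"GroupId": "sg-0bbb", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "10.0.0.0/8"}]}]},
    ]
    print(groups_open_to_internet(groups))
```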
[00:16:23] Automated security response. And that brings me to
the topic of automated guardrails. And what we are trying to say here is:
if you as an organization have a policy to, let's say, encrypt all your S3
buckets. Right. And you have all your best practices laid out.
[00:16:43] But now you have hundreds of people who are
creating S3 buckets every day. And chances are that some of them will
not encrypt their bucket. So what do you do? Do you get an alert and send
an email, do follow-up meetings with your developers? Or do you
actually go and correct it? That kind of security response has to be
automated. If an AMI is old, older than six months or maybe a year, do you
want to allow your EC2 instances to spin up using that old AMI, or do you
want to stop everybody who is trying to use an old AMI? So that
automated security response is a very critical aspect of managing cloud at
scale. And that brings me to this slide, where you can see what our
environment looks like. In
the middle, you have the guardrail automation. And this is the security
rules and policies we have in a framework, and we use a tool called Turbot,
which is constantly monitoring all the accounts against our
security policies and ensuring that the desired end state is always
maintained. I'll give you some examples. Things like
cross-account access; maybe you don't want cross-account access for an S3
bucket or Lambda. You want versioning to be enabled all the time; you want to
ensure your access logging policy is maintained on your S3 bucket. Right.
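
Editor's sketch (not from the talk): a guardrail of this kind boils down to comparing each resource against a desired state and producing the corrections to apply. The setting names, the bucket summaries, and both functions below are hypothetical; this is not Turbot's API or data model, only an illustration of the desired-state idea using the S3 examples just mentioned.

```python
# Desired state for every bucket (hypothetical setting names).
DESIRED_STATE = {
    "encryption": True,
    "versioning": True,
    "access_logging": True,
    "cross_account_access": False,
}


def evaluate_bucket(bucket, desired=DESIRED_STATE):
    """Return the sorted names of settings where the bucket has drifted."""
    return sorted(key for key, want in desired.items() if bucket.get(key) != want)


def plan_remediation(buckets, desired=DESIRED_STATE):
    """Map each non-compliant bucket name to the settings to enforce on it."""
    plan = {}
    for bucket in buckets:
        drift = evaluate_bucket(bucket, desired)
        if drift:
            plan[bucket["name"]] = {key: desired[key] for key in drift}
    return plan


if __name__ == "__main__":
    observed = [
        {"name": "reports", "encryption": True, "versioning": True,
         "access_logging": True, "cross_account_access": False},
        {"name": "scratch", "encryption": False, "versioning": True,
         "access_logging": False, "cross_account_access": False},
    ]
    print(plan_remediation(observed))
```

Keeping detection (evaluate) separate from correction (plan) makes it possible to run the same rules in alert-only mode before letting them change anything.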
Your AMIs. You want to ensure that nobody spins up in a region which is
not approved by your organization. So this guardrail is constantly working
for us behind the scenes. So we don't have to automate that; we don't
have to worry about somebody creating an S3 bucket without encryption
turned on. Or maybe somebody is using an S3 bucket with no encryption in
transit. On the left we have the networking automation; all the accounts
get provisioned using automation; all of these VPCs, routing tables,
subnets, and security groups are created by automation in a repeatable and
consistent way. And of course, all of this is tied to your Active
Directory, so that you don't have to manage IAM users in your account.
When you are managing hundreds of accounts, you can't do that. You have to
tie it to some kind of federation, and people just use that to get into
any of the accounts they want. And we use Turbot for access management
and user management. It makes our life so easy. And we talked about VPC
Flow Logs, CloudTrail, and GuardDuty. All these
logs need to be centralized. So we use CloudWatch Logs, and send
all those CloudWatch logs to a Kinesis stream and eventually stream them
to our SIEM solution. Now our security teams have a single pane of glass
where they can have visibility into every single activity happening in all
the accounts in one place. They don't need to log in to the accounts.
They have full visibility into everything. And that brings me to the next
slide, cost management. I know this is a security conference,
but it's still part of our lessons, and some of the aspects do belong to
security. So right-sizing infrastructure is very critical. The cost in
cloud can go through the roof
[00:20:08] if you're not careful, especially if you have
hundreds of accounts. People can spin up 16xlarge instances and leave them
running and forget about them; a week later or two weeks later you might get
the bill and you are, like, ten thousand dollars over your budget.
So right-sizing is very critical; scaling in and scaling out when the demand
goes up and down is very important to keep the costs down. But Amazon also
gives you an option to purchase Reserved Instances, and what it means is you
commit to a certain instance usage over a one-year or three-year term and
Amazon gives you a 60-70 percent discount.
[00:20:43] So use that. Some tips there: go short
term if you're not sure about your workload. You can go with one year if
you're not sure. Because this happens all the time at re:Invent. People go to
re:Invent and come back: hey, Amazon is launching a new instance type, it's
pretty cool, let's use that. They come back and spin up their
infrastructure on the new instance type. And now you're paying double: you're
paying for your Reserved Instances as well as on-demand pricing. So go with
one year if you're not sure; that way you have an opportunity to
correct mistakes after one year rather than getting locked
in to a three-year contract. And that brings us to some of the other
cost-saving opportunities. Monitor for your unused resources, things like
volumes which are not associated with an EC2 instance anymore. And this
happens all the time. People spin up EC2 infrastructure and terminate it, and
volumes get left behind. They spin up ELBs and then leave them running for
months. They forget about it because next time they're going to use another
ELB. Idle NAT gateways, old snapshots. Individually these don't cost a lot,
but we were able to save hundreds of thousands of dollars by just
automating EC2 volume removal. And we use the
guardrails for that as well. If a volume is not associated with an EC2
instance, you tag it. You wait for a week. Take a snapshot, put it aside, and
then remove the snapshot after three months or six months or whatever your
rotation policy is. And the same thing with your NAT gateways as well;
they do cost money. Another tip: for your instances, your dev instances,
your sandbox instances, or testing instances, which you don't use 24 by 7,
shut them down. 8:00 p.m. to 8:00 a.m., maybe you want to shut them down,
and by the time your developers come in they can be up and running. Or
during the weekends, or when there is no testing happening, your testing
instances don't need to be up and running. You can use a service called
Instance Scheduler from Amazon to automate that part for you. And that brings
me to the networking aspect of managing your cloud environment.
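
Editor's sketch (not from the talk): the unattached-volume cleanup described above (tag, wait a week, snapshot, remove, then expire the snapshot) can be modeled as a small decision function. The volume dictionaries follow the shape of EC2 DescribeVolumes output (State, Tags); the tag name, the 90-day snapshot retention, and the action labels are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

TAG_KEY = "unattached-since"            # hypothetical tag name
GRACE_PERIOD = timedelta(days=7)        # "you wait for a week"
SNAPSHOT_RETENTION = timedelta(days=90)  # assumed; the talk says 3 or 6 months


def _tag_value(volume, key):
    """Return the value of the tag `key` on the volume, or None."""
    for tag in volume.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return None


def next_action(volume, now):
    """Decide what the cleanup job should do with one volume right now."""
    if volume["State"] != "available":  # "available" means not attached
        return "none"
    marked = _tag_value(volume, TAG_KEY)
    if marked is None:
        return "tag"
    if now - datetime.fromisoformat(marked) >= GRACE_PERIOD:
        return "snapshot-and-delete"
    return "wait"


def snapshot_expired(snapshot_time, now, retention=SNAPSHOT_RETENTION):
    """True once a parked snapshot is older than the retention period."""
    return now - snapshot_time > retention


if __name__ == "__main__":
    now = datetime(2019, 6, 10, tzinfo=timezone.utc)
    volume = {"VolumeId": "vol-0123", "State": "available",
              "Tags": [{"Key": TAG_KEY, "Value": "2019-06-01T00:00:00+00:00"}]}
    print(next_action(volume, now))
```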
[00:23:08] So now with CI/CD pipelines and infrastructure as
code, developers are the king. They can build your infrastructure.
They can build the application. They can run code at scale and things like
that. But developers are not network engineers. VPC creation is a
relatively easy task: you make an API call and your VPC is created. But
a lot more goes into managing a network, especially if you're operating
at the level of hundreds of accounts. You need to spend some time thinking
about your network. Ensure that you are using some kind of design principle.
Things like how many Availability Zones you want to enable in every
account. We recommend using three, because that gives you high
availability and, you know, if an Availability Zone goes down you have some
backup. But you have to have some kind of consistent way of replicating
this in all your accounts. So automate VPC creation; have a repeatable
and consistent way. Make sure that naming conventions are followed.
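
Editor's sketch (not from the talk): a repeatable network layout can be generated rather than typed by hand. The sketch below carves one VPC range per account out of a single supernet, so ranges cannot collide, and names one private subnet per Availability Zone. The supernet, prefix lengths, zone suffixes, and naming pattern are all assumptions; the talk only states that three Availability Zones and consistent naming are recommended.

```python
import ipaddress


def plan_vpcs(supernet, account_names, vpc_prefix=16):
    """Assign each account a distinct VPC CIDR carved from one supernet."""
    blocks = ipaddress.ip_network(supernet).subnets(new_prefix=vpc_prefix)
    plan = {}
    for name in account_names:
        try:
            plan[name] = str(next(blocks))
        except StopIteration:
            raise ValueError("supernet has no free VPC ranges left") from None
    return plan


def plan_subnets(vpc_cidr, name, zones=("a", "b", "c"), subnet_prefix=20):
    """Name and size one private subnet per Availability Zone suffix."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=subnet_prefix)
    return [
        {"Name": f"{name}-private-{zone}", "Zone": zone, "CidrBlock": str(next(blocks))}
        for zone in zones
    ]


if __name__ == "__main__":
    vpcs = plan_vpcs("10.0.0.0/8", ["team-a", "team-b"])
    for account, cidr in vpcs.items():
        print(account, cidr, plan_subnets(cidr, account))
```

Because every account's layout comes from the same function, a later change such as adding routes for a new gateway can be applied by rule instead of account by account.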
Make sure tagging is right. Because when you want to scale your network, and
everybody is going to do that at some point, and I'll give you a story,
it helps. Recently we were able to roll out Transit Gateway. And for those
who don't know about this, it's a relatively new service from
Amazon. It allows you to unify all your networks. It makes them one
single network. Now you don't need to peer accounts if you want
high bandwidth when you are making service-to-service calls between two
networks. They can go over Transit Gateway as if they are sitting in one
network. So we had to roll out our Transit Gateway. In the accounts
where we had a consistent naming convention and tagging and
consistent network design, with regard to VPC names and subnets and
Availability Zones and routing tables,
[00:24:59] we were able to roll it out within seconds without
creating any impact to any application across hundreds of accounts. But
in the accounts which were older accounts, where things were done
manually, you know, hundreds of subnets, hundreds of route tables, and stuff
like that, it was absolutely a nightmare to update the route tables
without impacting the applications. So we ended up doing a lot of hand
massaging and handcrafting scripts, manually verifying the route tables
are correct before and after, and things like that. So it will pay off
in the long run if you are automating and giving it some thought when
you're designing a network at scale. Things like: do you want to run your
instances in the public subnet? We recommend not to. There is no reason for
you to run instances in the public subnet. It improves your security
posture if you run all your instances in the private subnet. And to that
effect, we do two things. We deliberately create our public subnets with a
smaller range. We don't give too many IPs, like a /27 only.
With thirty-two IPs, you can't do much. Maybe you can run your ELBs and
ALBs, and that's what it should be limited to, not running
your EC2 instances. They should be running in the private subnet.
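
Editor's sketch (not from the talk): the arithmetic behind a deliberately small public subnet is easy to check. A /27 holds 32 addresses, and AWS reserves five addresses in every subnet (the first four and the last), so 27 are left for load balancers and similar resources. The function name is hypothetical.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr):
    """Return how many addresses in the subnet are assignable to resources."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    for cidr in ("10.0.0.0/27", "10.0.16.0/20"):
        total = ipaddress.ip_network(cidr).num_addresses
        print(f"{cidr}: {total} addresses, {usable_addresses(cidr)} usable")
```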
And that brings me to the last topic in networking:
[00:26:20] the default VPCs.
[00:26:21] You don't want to use them. Eventually, you're
going to run into various issues at some stage where there is a CIDR
conflict. In fact, we had a story there a few weeks ago. We were trying to
troubleshoot an application's performance. It had two services. One was
running in its own network. The other service was running in its own
network. And when a service-to-service call was happening, it was slow. So
from my point of view it was an easy fix: just peer them, right? But we
couldn't, because they were using default VPCs. So
you should not use default VPCs. You are going to run into some kind of
problem. And of course, ensure that VPC CIDRs don't overlap. You are going
to run into serious issues if you let it happen. Eventually, when your
services grow, they need to talk to each other. They need to make
service-to-service calls, and it will be a problem even to connect to your
internal network. And that brings us to the DevOps
[00:27:23] practice. So a typical CI/CD workflow for DevOps:
basically, a developer checks the code in to version control. A CI server
builds the code, creates an artifact, and sends it to Artifactory,
from where you deploy your code. And the CI/CD workflow for DevSecOps:
basically, you want to inject some security controls into it. Things like a
pre-commit hook. Make sure your secrets and passwords are not
making their way into your version control. So there is an
opportunity to scan it before it goes into version control. And if it
goes there, have a periodic scan of your version control to make sure
the secrets and passwords are identified, rotated, and removed from the
code. And once your artifact is in
Artifactory, there's an opportunity to do some scanning of your images as
well. But the important thing here is when you are running your services
in your dev and production accounts. How many times does it happen that you
are trying to release your application, which has worked for months, and
the same pipeline works in your nonprod environment, but it doesn't in
production? And the reason is a certain service which it depends on is not
enabled. As I said earlier, we do not allow all the services in all the
accounts. It depends upon the requirement. When we get the requirement, we
provision it only for those services, so service whitelisting is
important. So ensure that your production account and nonprod account
match with regard to every single security rule and policy. And that's
why we say professionals practice like it's real. You don't want to work in
a nonprod which is different from production. Your security rules should
be the same in both accounts. S3 bucket encryption should be enabled
in both accounts. So we created an API which you can
call from your pipeline, before the production release, to make sure your
two accounts match with respect to every single service and your settings
and configuration, so that you don't get surprises later. And if there are
differences, stop your pipeline and fix that first before you proceed with
the pipeline. So that is a very important aspect of DevSecOps: to ensure your
production and nonprod accounts match. There is no reason to keep them
different. AMIs. So.
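
Editor's sketch (not from the talk): the pre-release check that nonprod and production accounts match, described above, can be pictured as a dictionary comparison that stops the pipeline on any difference. The talk does not describe the actual API; the setting names, the exception class, and both functions below are hypothetical.

```python
class AccountDriftError(Exception):
    """Raised when nonprod and production settings differ."""


def diff_accounts(nonprod, prod):
    """Return {setting: (nonprod_value, prod_value)} for every mismatch."""
    keys = sorted(set(nonprod) | set(prod))
    return {
        key: (nonprod.get(key), prod.get(key))
        for key in keys
        if nonprod.get(key) != prod.get(key)
    }


def gate_release(nonprod, prod):
    """Stop the pipeline (by raising) unless both accounts are configured alike."""
    differences = diff_accounts(nonprod, prod)
    if differences:
        details = "; ".join(
            f"{key}: nonprod={a!r} prod={b!r}" for key, (a, b) in differences.items()
        )
        raise AccountDriftError(f"fix before release: {details}")


if __name__ == "__main__":
    nonprod = {"s3_encryption": True, "enabled_services": ["lambda", "sqs"]}
    prod = {"s3_encryption": True, "enabled_services": ["lambda"]}
    try:
        gate_release(nonprod, prod)
    except AccountDriftError as error:
        print(error)
```

Running the comparison as a pipeline step before the production stage turns a failed release into an early, readable error.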
[00:29:57] With infrastructure as
code and, you know, microservices architecture, it's really easy to
speed up your auto-scaling. We recommend relying less on config management
at boot time. Do the heavy lifting in the AMI itself; have some kind of
AMI pipeline which builds your AMI with your favorite
operating system. Test your AMI. Send an API call to a scanner which scans
it and sends the results to your security team, which can look into the
results and approve or reject it depending upon their findings. And then
you publish it to all of your accounts, and this is how it looks.
So basically you have the AMI repository, where you are building all AMIs.
They get approved. And once they are approved, they are published to all
your accounts. That way, everybody's using the new AMI. And we do this on the
1st of every month: the AMIs get rotated and we build the golden AMI. We call
it the base image. They get built every first of the month, and now teams
have the opportunity to use this new and approved AMI in their code. But an
important aspect of AMI management is also to deprecate your older AMIs.
If you don't, then you're going to have AMI sprawl. You may end
up with hundreds of AMIs, and it'll be a nightmare to manage and patch
them. So we recommend having a process where older AMIs get disposed of when
they're not in use.
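
Editor's sketch (not from the talk): retiring older AMIs that are no longer in use can be reduced to a filter over the image list. The image dictionaries follow the shape of EC2 DescribeImages output (ImageId, CreationDate); the 180-day threshold, the set of in-use image IDs, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

CREATION_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # e.g. 2019-01-01T00:00:00.000Z


def amis_to_retire(images, in_use_image_ids, now, max_age=timedelta(days=180)):
    """Return ImageIds older than max_age that no running instance uses."""
    retire = []
    for image in images:
        created = datetime.strptime(image["CreationDate"], CREATION_FORMAT)
        created = created.replace(tzinfo=timezone.utc)
        if now - created > max_age and image["ImageId"] not in in_use_image_ids:
            retire.append(image["ImageId"])
    return retire


if __name__ == "__main__":
    now = datetime(2019, 7, 1, tzinfo=timezone.utc)
    images = [
        {"ImageId": "ami-old-unused", "CreationDate": "2018-10-01T00:00:00.000Z"},
        {"ImageId": "ami-old-in-use", "CreationDate": "2018-10-01T00:00:00.000Z"},
        {"ImageId": "ami-current", "CreationDate": "2019-06-01T00:00:00.000Z"},
    ]
    print(amis_to_retire(images, {"ami-old-in-use"}, now))
```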
[00:31:43] Another aspect of AMI management: while we
recommend rotating your AMIs every few days or weeks, whatever your
policy is, have a backup solution. When something like Spectre or Meltdown
happens, you want to have the ability to patch when you need to. So bake
your AMIs with your patching agent,
[00:32:00] your security
agent, your monitoring agent, and any tools which your organization needs, to
ensure that you have a way to patch later if you need to. And with
that, I'd like to call upon Nathan
[00:32:18] to talk about
the guardrails.
Nathan: Thanks, Chinmay. Thank you.
[00:32:25] So I thought what I might do is talk a little bit
about governance as a concept, building on what Chinmay summarized, having
given you a lot of detail about how to really implement that, and now try
to ask what are the lessons learned from that and how can we bring those
to different organizations. The first thing I like to do when I think about
governance is actually talk about or think about government. What do you
expect from your government? Right. And most people would generally agree,
though they will disagree on how big or small that government might be. We
would generally agree that government should give us some level of
protection: protection from internal actors, external actors. Alright.
Government should give us a set of shared services to make our life
better. And government should give us the freedom to better live our lives
and operate. And so when we think about cloud governance, I'd like to sort
of come back to those principles of how are we protecting, how are we
providing services, and then how are we giving our application teams the
freedom to operate. So the first thing we think about when we have an
organization is: what are the rules and regulations we want to live by in
our organization? Are we heavily regulated? Are we more focused on innovation
and speed? Maybe the combination of both, in nirvana?
[00:33:36] But basically you have to think about those rules
and regulations and if you think about most the average enterprise and how
they've worked over the years there's probably a thousand policy documents
spread around the organization. There's someone called Joe in the
networking team who happens to know how to name every subnet as gets
created. You have a whole bunch of enterprise knowledge scripts people
processes that have established how that works over time.
[00:33:59] Now the challenges is as you move to cloud. You got
to work on how you're going to make that work at scale and speed of cloud.
So I'd like to think about the 10-minute server as a way to challenge
ourselves with that thought. How do I get from zero to a full server I can
log into in ten minutes?
[00:34:16] In a way that we can all sit around a table in
front of the CIO and agree that that's an official, valid server. So
Chinmay spoke about the importance of AMI build pipelines, patching
schedules, the way to do authentication, and so on for those servers. But
if you think about breaking up that process, what does it mean to have
that server? We have to have a network we agree on that's valid. We have
to have security group rules we're happy with. We have to know where we're
deploying to. We have to build an image we're happy with. We have to
launch that at the correct size, get that running, and have a user
actually log in and use it, or have it be part of that service
meaningfully, within that short period.
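As a rough illustration of the 10-minute server checklist above, the sketch below treats each decision (network, security group, image, size) as a rule that a launch request can be checked against before anything is started. The rule names, the `LaunchRequest` type, and every identifier are hypothetical placeholders, not values from the talk or from any real account.

```python
from dataclasses import dataclass

# Hypothetical rule set; every identifier here is an illustrative placeholder.
RULES = {
    "approved_subnets": {"subnet-private-a", "subnet-private-b"},
    "approved_security_groups": {"sg-web", "sg-app"},
    "approved_images": {"ami-golden-current"},
    "approved_instance_types": {"t3.small", "t3.medium"},
}


@dataclass(frozen=True)
class LaunchRequest:
    subnet: str
    security_groups: tuple
    image_id: str
    instance_type: str


def violations(request, rules):
    """Return one message per rule the request breaks (empty list = valid)."""
    found = []
    if request.subnet not in rules["approved_subnets"]:
        found.append(f"subnet {request.subnet} is not an approved network")
    unapproved = set(request.security_groups) - rules["approved_security_groups"]
    for group in sorted(unapproved):
        found.append(f"security group {group} is not approved")
    if request.image_id not in rules["approved_images"]:
        found.append(f"image {request.image_id} is not an approved build")
    if request.instance_type not in rules["approved_instance_types"]:
        found.append(f"instance type {request.instance_type} is not an approved size")
    return found
```

A request that returns an empty list is one everyone can agree is a valid server; any decision that has no entry in the rule set is a decision that still has to be made by hand.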
[00:34:52] Now if we haven't defined every aspect of naming,
services, the networking, the different parts of that, and automated them;
if every one of those decisions isn't defined in a rule, we cannot automate
it. So one of the huge challenges that you run into at scale in
governance is just having the rules defined. So you typically start with a
standard set of rules or regulations like the best practices that are out
there and that's a great place to start. But you will quickly balloon to
hundreds and thousands of policies. I'm not kidding. Thousands of
policies. Right. How do you name things? What subnet sizes are your
servers? What do I want to do with EC2s? How do you feel about your
external access or cross-account access to aliases of lambda functions?
Right. That's the sort of coverage you have to start getting at scale in
this environment. So you have to think about your rules and regulations.
And more importantly, you've got to start to wonder: how am I going to set
and manage these policies, and how am I going to handle exceptions in that
environment? When you have hundreds of teams or hundreds of applications,
you have to be able to do exceptions. S3 is always encrypted, except for
this one key case, right? These sorts of services are appropriate,
except for these situations. So how do you manage exceptions at that sort
of scale? And in addition of course, you know why we love Amazon, they're
bringing out a thousand new features a year. So now your policy framework
and your way of thinking about that governance has to be how to deal with
that pace of change. We did not bring out a thousand new features a year
in the old data center. Right. That's a whole new pace and challenge. We
have to work out how we're going to handle it as security professionals
in our enterprise, and be prepared for that. The second part of that, of
course, is setting up that infrastructure, the shared services that
allow our teams to benefit from what we know. What is the best-practice
security group? What's the best practice for configuration, for
different encryption settings, or for the setup of different environments?
We need to be clear on how we're going to do that at scale. How do we want to do
identity repeatedly? So we have to have a method to be able to define
those parts and make sure we're feeling comfortable about those services
moving people faster. The big change of course as you move to cloud is
that application teams suddenly have the power. Traditionally our
infrastructure teams had the power in the environment. They could make the
decisions; they owned the budget. Now application teams are actually more
in charge of when servers are deployed, when they happen, how things
change. That change of power balance means we have to work out how to
handle, wrap, and respond to them. We have to put them in guardrails; we
can't put them behind a gate. The other thing that happens when you do
that, of course, is you move from a world of support. "Please start a
server for me." To a world of help, "I want to start a server. How can I
do that?" And we have to work out how to combine security or bake security
into those processes because if we kill that 10-minute server, we've
killed the agility of the cloud. If we prevent people from creating a
lambda function which includes creating an IAM role, we're killing the
agility of the cloud. So we have to think about how our security is going
to enable that agility and work with it at that speed. Or we become the
block in the organization. Right. Generally, in security, we're quick to
say no. But we don't really want to be the block.
[00:38:20] So that's where we come to real-time guardrails. The
ability to protect these services and capabilities and respond in
real-time, when people are creating these capabilities all the time,
whether it's using the console, Terraform, whatever they're doing.
[00:38:33] We need to be able to fix it immediately. Nothing
else will get us there. If we want to say, "I have to review everything
before you publish it," we become the block.
[00:38:46] And by the way, yeah, life sucks. Right. Because
you're spending your whole time reviewing JSON documents and things for
people and you just become the bottleneck all the time. Instead what we
want to think about is: what is the framework of policies and regulations
we care about? You can use SQS as much as you like. Just don't do
cross-account access. Right. Lambda is fine with IAM roles, provided you
have an appropriate boundary for how you're doing it. Your using EC2 is
great, but it must be inside a VPC that's private and it must be of such
and such. So long as we have those rules, we can give the business the
freedom and we're not arguing anymore about "is that a good idea to do
this?" We're basically saying: hey, so long as you're patched and you're
not putting us at risk, go nuts, run a thousand servers. You're meeting my
requirement.
[00:39:30] Right. And then we can let them make more of their
own decisions and how they want to run with that. As you move to
software-defined infrastructure, you need software-defined operations. You
need software-defined security. Nothing else is going to keep up. We've
given up our custom data centers to move to the cloud for the agility,
consistency, etc. that provides.
[00:39:53] We have to now start wrestling with the idea that
we've got to give up custom processes, custom scripts, and a bunch of
hacks to cover the security and process of that, and think about how we're
going to define those operations in a secure, software-based framework
that can act in real-time. That's what governance at cloud speed and
cloud-scale looks like. It's a software problem. Once we get that right, we
can actually give the people the freedom they need, which was the whole
point of why we exist in the first place, why we're doing this.
Application teams need the freedom to use these services. That freedom
gives the business the ability, the agility to move. The ability to
innovate etc. That's what our goal is.
[00:40:43] Give them freedom while keeping them protected and
accelerating them through services.
[00:40:50] So as we think about all these security decisions,
we have to keep in mind that's ultimately where we want to get to. And how
can we automate those decisions through the automated response, the
governance, appropriate boundaries around people? It's not about an
opinion about how it should work, but it's about the facts of what must
happen. And you know you'll win if it gets escalated up. Right. And then
there's things that should happen, which make them faster.
[00:41:16] So with that, I thought I'd give you a quick look at
what governance can look like when you move at that sort of scale and
automation.
[00:41:23] So at McGraw-Hill they've used Turbot for a while,
and what they do there is basically have Turbot running inside their VPC
in their environment, and it provides a model for identity, for access
across the accounts, as Chinmay said, restricting services to different
teams. So each user would log into Turbot and they see the handful of
Amazon accounts they have access to. The permissions in those accounts are
actually simplified down to standard levels like admin, owner, operator,
read-only, etc. And that's per service as well: S3 operator, EC2
metadata. You need standard language for that type of IAM, or you get
crushed at scale.
[00:41:57] So they have that ability. But the second thing we
want is basically the chance for those developers to actually just use
those services with freedom. Use the Amazon console, use Terraform, use
CloudFormation, whatever they might like. So I'm just gonna do the simple
bucket creation thing.
[00:42:11] Everyone likes talking about buckets because we
understand them.
[00:42:20] So we want to give people the freedom to be able to do
that sort of stuff. But what's interesting now is what happens next. So in
the next 10 seconds, assuming the demo gods smile on me today, what should
happen is Turbot will detect that bucket, automatically record that change
in the CMDB along with who did it, test it against the policy posture for
the organization, and then automatically remediate any problems it's found.
What that means is we can now give our junior developers the ability to
work without fear of them making mistakes or leaving us vulnerable, and we
can speed up our senior developers, who don't have to worry about this
anymore. So let's see what happened. So if I come into that bucket that I
just created, you can see it's already turned on the versioning and it's set
the tags. Right. And it's moving through the different things it needed to
do. Stuff like the tags includes rules like the cost center and so on,
context, right from your organization for this one account.
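The detect, record, test, remediate sequence described here can be sketched as a small event handler. This is a minimal in-memory illustration, not Turbot's implementation and not real AWS calls: the event shape, the `cmdb` dictionary, the `REQUIRED_TAGS` default, and all function names are assumptions made for the example.

```python
# Hypothetical default tags an organization might require on every bucket.
REQUIRED_TAGS = {"cost-center": "unknown"}


def check_bucket(bucket):
    """Test a bucket record against the policy posture; return raised alarms."""
    alarms = []
    if not bucket.get("versioning"):
        alarms.append("versioning")
    if not bucket.get("default_encryption"):
        alarms.append("default_encryption")
    if any(key not in bucket.get("tags", {}) for key in REQUIRED_TAGS):
        alarms.append("tags")
    return alarms


def remediate(bucket, alarm):
    """Return a corrected copy of the bucket record for one alarm."""
    fixed = dict(bucket, tags=dict(bucket.get("tags", {})))
    if alarm == "versioning":
        fixed["versioning"] = True
    elif alarm == "default_encryption":
        fixed["default_encryption"] = True
    elif alarm == "tags":
        for key, value in REQUIRED_TAGS.items():
            fixed["tags"].setdefault(key, value)
    return fixed


def handle_create_event(event, cmdb):
    """Record the new resource and who created it, then fix any alarms."""
    bucket = dict(event["resource"])
    entry = {"resource": bucket, "created_by": event["actor"], "history": []}
    cmdb[bucket["name"]] = entry
    for alarm in check_bucket(bucket):
        bucket = remediate(bucket, alarm)
        entry["history"].append(f"remediated {alarm}")
    entry["resource"] = bucket
    return check_bucket(bucket)  # alarms still open after remediation
```

In a real deployment the remediation step would call the cloud provider's API rather than edit a dictionary; the point of the sketch is only the order of operations and that each alarm is closed by code rather than by a ticket.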
[00:43:16] So those things have been automatically fixed in
that environment already. Maybe the default encryption will come through
as well. Let's see what happens. So back in Turbot, we can see the bucket's
got created. And we see a couple of things going on. First, Turbot's
tracked in the CMDB all the details about that bucket.
[00:43:42] It started to standardize some of the information
like tagging which helps when you get to a multiplatform world and you
want sort of things working across different environments.
[00:43:51] Alright, so we have that tracking in that CMDB. You
can see that it's recorded who created that. It was me with a pretty
picture of me behind a pretzel.
[00:44:00] And then coming up, you have a bunch of the
alarms that were raised: versioning wasn't correct, your default encryption
wasn't on. It then automatically remediated those issues.
[00:44:11] And closed the alarms. That's a five-second ticket
close.
[00:44:16] So that's the sort of automation you want through
hundreds of services. We have like seventeen hundred policies like that,
moving through different services in the environment. That's what it looks
like at scale. Now if we go into one of these things, see this bucket, it's
actually not approved. Turbot tells us why. It actually says it's in an
unapproved region. And you want to prevent everything you can. We do
extensive work to prevent things in IAM, but there's a bunch of things you
just can't prevent, right. And this is just one example of when you might
not be able to. Of course, there's rules here. You could. So, an
unapproved region. It's raised an alarm and said, okay, this one's not
approved. Let's look over here at policies. Controls are what give us
alarms on the state of the environment, but they're based on policies. As I
said, we need a whole suite of policies in our environment for what's
valid. Which regions are allowed? What is encrypted? Your naming
standards? So if we look at the policies here, we can see something like
what the approved regions are. And in this case, we're saying all U.S.
regions are valid for buckets.
[00:45:13] Now in Turbot, the way the policy engine
works is it comes down hierarchically. You set a rule at the top: I want
to only use U.S. for all the buckets in all of my accounts, and that would
be true now and into the future for every new thing.
[00:45:27] But what you often need to do is start to set
exceptions, because maybe this one bucket's actually okay that it's in
Ireland. So you can create an exception to say, well, actually this one
I'll let live in Europe, and this is an exception that beats a MUST
(required) rule from above. Recommended rules you can let teams beat any
time; MUST rules they require permission to beat. And we can save that
exception. As soon as we do that, Turbot now knows about that exception
and it will re-review that policy and the control to say: is it approved?
So if we go back to the control, we can see it's flipped into an OK state.
This bucket is now approved to live in that region.
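One way to picture hierarchical policies with exceptions is the sketch below. It is an assumed model for illustration only, not Turbot's actual engine: settings are attached to hypothetical hierarchy paths, a "must" set higher up beats ordinary settings below it, a "should" can be overridden by a more specific setting, and an explicitly granted exception beats a "must" from above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    value: frozenset         # approved regions at this level
    precedence: str          # "must" (required) or "should" (recommended)
    exception: bool = False  # a granted exception to a "must" from above


# Hypothetical hierarchy paths and region lists.
SETTINGS = {
    "org": Setting(frozenset({"us-east-1", "us-west-2"}), "must"),
    "org/sales-account/my-eu-bucket": Setting(
        frozenset({"eu-west-1"}), "must", exception=True
    ),
}


def ancestors(path):
    """Yield the path, then each parent: 'a/b/c' -> 'a/b/c', 'a/b', 'a'."""
    parts = path.split("/")
    for i in range(len(parts), 0, -1):
        yield "/".join(parts[:i])


def effective_setting(path, settings):
    """Resolve the setting that applies to a resource path."""
    chain = [settings[p] for p in ancestors(path) if p in settings]
    if not chain:
        return None
    for setting in chain:            # most specific exception wins
        if setting.exception:
            return setting
    for setting in reversed(chain):  # otherwise the highest "must" wins
        if setting.precedence == "must":
            return setting
    return chain[0]                  # only "should": most specific wins


def is_approved(path, region, settings):
    setting = effective_setting(path, settings)
    return setting is not None and region in setting.value
```

Because the exception is data rather than a special case buried in a script, re-evaluating the control after the exception is saved is just another call to `is_approved`.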
[00:46:09] In general those simple policies give us the power
we need for exceptions and stuff like that. And we try to keep that
language very, very simple. What's cool, though, is you need to get to see
all the exceptions in your environment. Like I said, you end up with a lot
of policies. So if we come up here we can actually see a list of all the
exceptions we've granted around bucket regions. If you're a security team
trying to manage all the exceptions you've got in that environment, you
don't want them hardcoded in JSON in a thousand locations. What you want
is this simple place to be able to see those decisions, have them expire,
set rules around those things. Now sometimes you want to get more fancy in
your rules, and decide that, okay, it might be okay to have a bucket in
Europe if the tags are such and such. People do really, really crazy
things when you get to hundreds of accounts and scale. So what you want
there is stuff that gets even more powerful about the decisions. So we call
these calculated policies. Now in Turbot, because you've got all that
information automatically discovered into the same CMDB,
[00:47:07] you can actually search that as part of a policy
decision. So here we can do a quick search for this
bucket. Turbot has normalized some information like the tags; that's a
GraphQL input query against the CMDB, and it's found all the tags for that
bucket. And then we can start to say, with that information there, now what
do I want to do with that? I'm just gonna grab a quick code snippet here.
[00:47:34] So that's some templating language, just some
Jinja2, if you're into that sort of thing. Right. And what it's doing then
is it's saying, based on the information context for this one bucket and
its tags, using this template we can make a decision about what we want
this policy value to be. You could look at the name of the bucket, does it
have a correct prefix, your tagging, whatever you want; it actually doesn't
even matter. Right. And then that policy gets calculated there for that one
bucket and it goes along from that. So we can then just save that policy as
we did before. Turbot is now going away in the background and calculating
that policy for us.
[00:48:07] And determining then the state of the control for
that bucket. Now if you have a thousand buckets under that policy it will
recalculate all of them individually and contextually.
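The template used in the demo is not reproduced in the transcript, so the sketch below stands in for it with a plain Python function. It shows the general idea of a calculated policy: the policy value is computed per resource from that resource's own context, here its tags. The tag name `data-residency`, the region lists, and the function names are all hypothetical.

```python
US_REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]


def approved_regions(bucket):
    """Calculated policy: the value depends on this one bucket's tags."""
    if bucket["tags"].get("data-residency") == "eu":
        return US_REGIONS + ["eu-west-1"]
    return US_REGIONS


def evaluate(buckets):
    """Recalculate the policy for every bucket individually."""
    results = {}
    for bucket in buckets:
        allowed = approved_regions(bucket)
        results[bucket["name"]] = "ok" if bucket["region"] in allowed else "alarm"
    return results
```

With a thousand buckets under the policy, `evaluate` would simply run the same calculation a thousand times, each with that bucket's own tags, which is the "individually and contextually" behaviour described above.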
[00:48:16] Right. So again you're now thinking in policy
posture, rules across your environment. You're not thinking about small
exceptions in text files in a thousand different locations once you get to
that sort of scale. Now in the same CMDB, we have recorded the information
about this bucket and we have it here, and we actually put that in context
in Turbot so we see all the different controls here for this one bucket.
So some of the standard controls you might want to think about in your
environment are stuff like "active": should these still exist at this time?
Right. You might set a rule like, in a sandbox account all resources
should be no older than 60 days.
[00:48:52] This is to stop people running pseudo-production in
there, right. Or things like: is it approved to exist here, is the
default encryption on. You can go nuts. We have many, many, many of those
sorts of things. Now if you roll up in Turbot you can see controls here
across the whole region, divided down by the different resource types,
right, and the state of different things. We can go up to the account and
see our summary of controls there, and we can keep going to see across
different platforms, across different environments, and all that sort of
stuff. If we sort our alerts here, we can see actually the top one is CIS
controls in this environment, and now we can start to drill down and see
information about CIS for our account. So Turbot organizes these
hierarchically as well, and these, by the way, happen in real-time. You
don't want a report you run once a month, where you get the results much
later. You want to see how it changes all the time. You change one thing,
it happens real-time in Turbot. If we drill into something like logging
here we can see different stuff. Like, for example, is flow logging
enabled? Right. And then per VPC, we can actually see the rules around
that. So you can see that control for each target resource in the
environment.
[00:50:03] See the state of it over time, how it's changed, if
it became compliant, became non-compliant; we have all of that history
information.
[00:50:11] Now one of the things that's interesting is once
you get to this scale of automation, and like I said you will hit hundreds
and hundreds of policies, it's a fact, and it gets hard to manage.
[00:50:21] What you need to do is start organizing that
information very carefully. What are the names of your policies? What's
the architecture of your approved, active, configured, your data
protection, standard names, and structures through all of that, so you
have a language to talk about. Otherwise, you're buried in minutiae. The
other thing you need to do is start to categorize that stuff. So in
Turbot, what we've learned over the years, we started with scripts and have
gradually built out, we now categorize everything. So we have "my bucket",
which is of type AWS S3 bucket and is of the category storage
container.
[00:50:55] Right. Now you can imagine as you go across all the
clouds or other environments how that starts to group things into
categories of information. Same for controls. CIS is not only a benchmark
for AWS, but it's benchmarks for Linux, for other providers, other tools
out there. And those are categorized by CIS into a control framework. So
we can come into Turbot and now view a CIS report by the controls, which is
now a cross-platform, full-stack view of the control framework.
[00:51:26] So we can see different things there and then we
can start to divide it up so we can say hey show it to me by the control
category. Right. We can drill down into different areas and cut it in
different ways. Right. So now we're in that maintenance section 6. Then we
can say, hey show it to me by resource category. So here's section six
broken up by the type of resource.
[00:51:48] And that gives you a lot of insight to what's
happening in the environment. The ability to target your quick wins the
different capabilities you want.
[00:52:00] We spoke about permissions before and the
importance of that as part of the setup of the management environment. So
Turbot provides a permission model, which is what McGraw-Hill uses at
scale across the environment. They use it for Linux as well, actually. So
Linux-level authentication as well as the accounts.
[00:52:15] Now what's cool about this model is it actually
follows the hierarchy again, so we can grant permission to a user and
that's a straight-up search of our directory. We can choose which
permissions to grant.
[00:52:26] This is a simplified list. Like, we track
thirty-five hundred Amazon permissions at this point. So if you're really
trying to do that at scale, I highly encourage you to go to a standard
language around how you want to do them. And basically then you want
to grant things in simpler, smaller groups, like for example
AWS read-only or S3 read-only.
[00:52:45] We support expiration of those sorts of grants.
So, you can say, okay, you get read-only on that for the next six hours.
Right. And then that will automatically expire. That's really important
when you want to start to have models for, like, your security group, your
security team. You might wanna give them access at the top level across
hundreds of Amazon accounts for AWS metadata access. For us, metadata
means no reading of data. Right. Just seeing the resources. Metadata,
read-only, operator, admin, owner: standardize that stuff out.
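A minimal sketch of standard permission levels with expiring grants might look like the following. It assumes the five levels named above form an ascending ladder (metadata lowest, owner highest) and that a grant at one level of the hierarchy applies to everything beneath it; both are assumptions for illustration, and the `Grant` type, paths, and function names are hypothetical rather than Turbot's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed ascending order of the standard levels named in the talk.
LEVELS = ["metadata", "read-only", "operator", "admin", "owner"]


@dataclass(frozen=True)
class Grant:
    user: str
    scope: str                        # hierarchy path, e.g. "org/acct-1"
    level: str
    expires_at: Optional[datetime] = None


def grant_for(user, scope, level, hours=None, now=None):
    """Create a grant, optionally expiring `hours` from now."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(hours=hours) if hours is not None else None
    return Grant(user, scope, level, expires)


def has_access(grants, user, resource, needed, now=None):
    """True if any unexpired grant in scope is at or above the needed level."""
    now = now or datetime.now(timezone.utc)
    for grant in grants:
        if grant.user != user:
            continue
        if grant.expires_at is not None and now >= grant.expires_at:
            continue  # expired grants simply stop counting
        in_scope = resource == grant.scope or resource.startswith(grant.scope + "/")
        if in_scope and LEVELS.index(grant.level) >= LEVELS.index(needed):
            return True
    return False
```

For example, a security analyst could hold a permanent `metadata` grant at the top of the hierarchy plus a six-hour `admin` grant on one account; once the six hours pass, only the metadata access remains, with no cleanup step required.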
[00:53:14] Right. So once you give people like that access to
the metadata, they can see the environment and move right through it. Then
you might say, okay, you need admin on a temporary basis to be able to
troubleshoot something or fix things, so you use those automatic
expirations. We also, by the way, do temporary elevation if you're into
the really, really hardcore
[00:53:29] stuff around that. So that sort of permission model
gives you a lot of flexibility. For your CMDB, you want to be able to do
things like search, to find the different resources in the environment
or to search by your details. Like, for example, I want to see everything
in my environment that's from the sales department.
[00:53:52] You also want to be able to run queries across your
whole environment, for example, to be able to use an API to start getting
stats out, running extra reports, going nuts, right, with that sort of
thing. That's what software-based governance looks like. It doesn't look
like a bunch of small scripts in different places you've got to manage by
hand.
Now the last thing I'll mention is we've started to understand that, while
we started off with AWS, S3, EC2, governance really is such a broad topic.
With the companies we work with, we're finding that they're struggling with
governance across the enterprise. Your internal Domain Name Service,
certificate expirations, a little dummy example here. There are hundreds of
problems where we're trying to work out how to do this automated
governance at scale, across our pipeline tools, our external SaaS
providers, etc. So we've started to think about that as something now
that's customizable and extensible. So this is an example of a custom
policy. We have a simple CLI tool, which I'd be happy to talk about another
time, to develop this. But basically, all the policies and stuff are
defined using very simple JSON Schema language definitions, and then they
appear in the policy engine. So you get the hierarchy, you get calculated
policies, all that sort of stuff. The control framework, the slicing and
dicing, the reporting, and it gives you a way to start thinking about these
scripts and how they're deployed, allowing you to have Lambda-style code
which then is automatically deployed, run multi-region, high-availability,
with all the logs sort of centralized. So it takes that governance to a
whole different level, and then your users can use those like they do any
other policies, with the validation and stuff in the UI to give them
access. So if we change this certificate warning period to 365 days for the
AWS Amazon thing, we'll see I think they've got about one hundred
seventy-four days left. Right. So they have time. But then the space you
can go to for this type of governance, the sorts of rules you can write for
this type of automated governance,
[00:55:43] is limitless. And that's the future of security.
[00:55:51] Coming up with our rules. Defining those
consistently, with good names, and then running them in real-time.
[00:56:01] In an environment where application teams have the
power, and our service providers are adding thousands of features a year.
[00:56:09] And that's why what we say is that, in effect, we
have to start to accept that movement to a world that's
software-defined. As we move to software-defined infrastructure, we have
to start moving to software-defined operations. When we do that, the
speed of our organization is unparalleled. Our cloud teams move faster
because they've got automation; our application teams move faster because
they've got services to build on. We're safer than ever before. I mean,
imagine saying every data store was encrypted in your internal data
center. That's now an automation away. Right. We're better off than we
have been. We're making it more accessible: our junior developers can do
more stuff because they're not afraid of the mistakes they might make. And
we're more productive because we don't have to spend our time reviewing
JSON or reviewing things. Once we've set our posture in place for the
first application, the first account, it's guaranteed for the second one.
We don't have to review each thing by hand anymore. What we do now is, when
people do something crazy and come to us and say "I need an exception,"
they were blocked from the start and now they're asking for permission. The
traditional model is they go crazy, do a heap of stuff, and then come at the
end for a review before production. And now we're blocking them, right.
Automated guardrails running in dev and prod consistently change that
relationship and change the structure of it from one of "fix this for me,
[00:57:31] you have to do this for me" to one of "help me solve
this problem, help me get this service going in my environment."
[00:57:40] And that's why you see that change of relationship
happen as you start to have automation of all the basic stuff.
[00:57:45] So Chinmay was able to talk today about so much
stuff they've done with pipelines and building all these capabilities, and
we're excited because we feel like we were able to enable that by doing
so many of those basic standard things that organizations have to do over
and over. Beneath that, automating the setting up of GuardDuty,
automating the setting up of those flow logs, that's all just standard
stuff
[00:58:04] that every organization does. So the thought is, I
wonder why would we do it by hand.
[00:58:11] And that change of relationship happens once you
have that security, automation, and monitoring of the underlying layer.
You can give freedom to your application teams to use those services.
[00:58:20] They can work in a relationship with the cloud team
to move faster. They want to try a new service, you can create an
exception for it in one account. Let them learn, innovate together, right,
and work out how to do it, and then automate more of that stuff, and then
others can use it. So you start to turn those standards into the things
you're gradually automating and improving over time.
[00:58:42] So thank you so much for your time listening to us
today. We're super excited to talk about this topic. If you have any
questions for Chinmay or myself, hit us up. We'll talk about it all day.
And you have a good rest of the show.