Disclaimer: Automated Transcript
[00:00:00] So thanks everyone for coming. I know it's the last
session before... I'm between you and drinks. This is a dangerous place to
be. But you know we're going to see how we can go today. So thanks for
coming along. And what we're going to talk about is governance and cloud
governance in particular. Now, when I'm talking about governance what I
actually like to do is to start by thinking about government. What do you
want from a government? And what would that mean in the cloud
environment? Generally, what most people want from government is
three things, really. They want to be protected from both internal and
external actors. They want services that make their lives better. And they
want the freedom to live their lives and do the things they need to do.
Now we think the cloud is basically exactly the same. Your application
teams need the freedom to be able to use those capabilities and the
agility of the cloud. But you have to wrap them in guardrails and
protections to keep them safe. Right. Safe from each other and safe from
the outside world. And then they need a whole range of services to help us
move faster more accurately and more securely through that environment.
Now, the cloud really changes a number of our governance requirements
compared to when we were internal.
The first one is it's just moving so fast. A thousand new features a
year. New services and capabilities coming out all the time. This isn't
your old data center anymore. And so when we think about governance we
have to be ready for them. We have to think about it in a new and
different way that's able to handle that speed. Second. Application teams
now have the power. It used to be that infrastructure teams had the power
because they held the budget. And they could build large services and say
it's not ready yet or you have to use my thing. That doesn't work anymore.
Application teams choose when servers are deployed. They choose which
services to use. They deploy on demand. They turn off on demand. They're in
control of the infrastructure now. And from a governance point of view we
have to be ready for that because that means we can't be proactive in our
reviews or doing these things because that means we're in the way and
slowing them down. What we need instead is to work out how we're going to
be real-time to keep up with the pace of those application teams and move
with them.
[00:02:15] Now while all that action and excitement is going
on we also have higher expectations than ever before. I mean when you're at
a show like this it's kind of like: Oh my God, you didn't turn on flow
logs, what are you thinking, right? Or GuardDuty, you haven't used
that and got your findings yet? You're not encrypting everything? Again.
Can you imagine in our old days of the data center if everything wasn't
encrypted? That was just kind of normal and expected, even though we kind
of wanted it to be. But now you have to do those things; if you're not,
you're missing the mark. And by the way, those things are moving forward
at a thousand features a year. Those expectations are so high. And in
addition, if you get it wrong you probably will end up in the news. So, we
have to now think about that control framework and how we're going to be
real-time accurate with it at that scale and breadth. Now at Turbot what
we believe is that that means that we're really changing now to a world of
software-defined operations. We've had custom data centers, custom
infrastructure, and we've gradually standardized that into the cloud
providers and AWS. We still also had all these custom processes and custom
controls and custom ways of getting things done in our data center, wrapping
that custom data center. We have to rethink that for the cloud. That
software-defined infrastructure needs software-defined operations. Nothing
else will keep up with the speed, consistency, size and scale of it. It has
become a software problem. It's not a scripting problem; it's not a
process problem; it's not a reviewing problem. It's a software problem.
And we have to start to think about and accept it in those terms so we can
really prepare for that future. While we're doing all that we need these
services to speed people up, so we have to start thinking about how we're
going to automate all of this. And you can't automate it if you're not
clear on what it is. Working at a large enterprise, one of the things we
went through as we were trying to move to the cloud was, "Okay. You can go
to the cloud but it has to do everything we do now." And then you ask the
question, "What do you do now?" After you've spoken to about 80 people and
worked out that Joe in the such-and-such department names the subnets and
someone else chose the AMI, and started to unravel all the processes and
procedures, you realize that that's almost an impossible thing to
replicate unless you actually use the same process. And that process takes
six weeks. So the whole 10-minute server really ain't going to function
in that environment. To be able to automate that you have to think about
the whole. If you're going to launch a server in 10 minutes, what do you
need from a security, compliance and architecture perspective? You
need to understand your networking. You need to know where you're going to
land and what subnets you're going in. Does everybody agree that's a valid
network? Does that network have reachability? Once you're in there, what
security groups is it going to use? Who designed those? What ports can it
have? Is this a Java application? Does it have application ports open?
Security groups need to be there. What AMI am I running? Is it currently
approved? Is it old? Is it patched? What size is it allowed to be? Right.
And even after I get in there we might want to tag it for cost reasons
etc. but I got to get into that AMI and log into it. If I don't have
authorized access in that 10 minutes, I didn't have a 10-minute server. I
had a 10-minute server with a three-day wait for authorization. Right. So,
what I need to think about is that 10-minute server. To me that's
actually the ultimate challenge when you're going to cloud. Some of you will
want to do serverless, and those are cool things. But if you're sitting with a
CIO, I think the challenge to put to them is: I want your support to get to a
10-minute server. Because to get to that server we have to solve so many
questions about operations, network, monitoring, security, approval,
processes... those things. That wraps up the whole thing in one question.
That's measurable.
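
The questions behind the 10-minute server can be written down as an automated pre-launch check. The following is a minimal sketch in Python, not anything shown in the talk: the approved lists, the `LaunchRequest` fields and the `preflight` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical organization-wide answers to the questions above.
APPROVED_SUBNETS = {"subnet-app-a", "subnet-app-b"}
APPROVED_AMIS = {"ami-base-current"}
ALLOWED_INSTANCE_TYPES = {"t3.small", "t3.medium"}
ALLOWED_PORTS = {443, 8080}


@dataclass(frozen=True)
class LaunchRequest:
    subnet: str
    ami: str
    instance_type: str
    open_ports: frozenset
    requester_can_log_in: bool


def preflight(request: LaunchRequest) -> list:
    """Return every open question that would block a 10-minute launch."""
    problems = []
    if request.subnet not in APPROVED_SUBNETS:
        problems.append("subnet is not an agreed network")
    if request.ami not in APPROVED_AMIS:
        problems.append("AMI is not currently approved")
    if request.instance_type not in ALLOWED_INSTANCE_TYPES:
        problems.append("instance size is not allowed")
    unexpected = sorted(request.open_ports - ALLOWED_PORTS)
    if unexpected:
        problems.append(f"security group opens unapproved ports: {unexpected}")
    if not request.requester_can_log_in:
        problems.append("requester has no authorized access yet")
    return problems


ready = LaunchRequest("subnet-app-a", "ami-base-current", "t3.small",
                      frozenset({443}), True)
blocked = LaunchRequest("subnet-app-a", "ami-old", "t3.small",
                        frozenset({443, 22}), False)
print(preflight(ready))    # no open questions
print(preflight(blocked))  # AMI, ports and access are still unresolved
```

An empty list means every question has an agreed answer, which is the measurable part: either the launch passes in minutes or the list says exactly what is still undecided.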
[00:06:07] You could show that in a meeting, right. So you got
to think about the whole as you do that architecture and the automation of
it because it's very very repeatable. The other change that happens is
that application teams actually start asking for help.
[00:06:23] When we were in a data center and we had budget
differences, what we ended up with was: I need a server. Well,
you got to put in a request and then I got to track that request. So, then
I get a project manager and then you need a project manager to fight with
my project manager or you're really going to be out of control. So, now
we've got project managers at 10 paces trying to get this stuff deployed.
That completely changes because once you go to the cloud with a model of
self-service you're no longer asking, "give me a server. Is it ready yet?"
You're now saying, "Can I launch a server? How do I do that? Please help
me do this." So, if we get our processes, procedures, security, controls
right, we can stay in that world of self-service. Stay in the world where
we're helping people be successful. And stay out of the world where we're
responsible for every request, where we're the bottleneck for every
approval. But to do that we have to think about our architecture and
automation end to end.
[00:07:21] So for us what it comes down to is a number of key
principles you have to meet in the cloud to get this working. The first
one is you have to understand your rules and regulations for how you want
to operate. You can start with standards like CIS and other things, they
are a great place to start, but actually, you have quite literally
hundreds and thousands of questions to answer which is an awful thing to
say but it's true. What's the name for every server? What's its hostname?
What IP addresses are we using? How do we name Lambda functions? What's
our tagging strategy? Are we allowed to do those? How do you feel about
cross-account access to Lambda aliases? Which, by the way, are separate from
the versions and separate from the functions. You have to answer each of
these questions so you can build that automation framework out and stay in
control while it's moving at a thousand features a year. And what we find
is in actual fact you might make those decisions. And then, of course,
everything is an exception. Right. Every project has one thing they need
different. So, you can't think in terms of "what's my control policy at the
top?" That just won't work. Once you have a thousand accounts and all these
different services and buckets, you need to set rules like: S3 must be
encrypted, except for this one bucket. Or except for this one account. So
you have to think in terms of policies and exceptions. And you're gonna
have a lot of them. You might start with 10 Lambda functions; you're going
to end up with hundreds of policies running in a software package. The
next thing we have to do is think about our infrastructure and that's
really the services we're providing: how we're laying things out and making sure
we're always turning things on the right way. We want to make sure we
always have GuardDuty, for example, or flow logs turned on, that we've
always got CloudTrail on, laying out that common architecture and moving
it forward consistently across those accounts. As we move to cloud, of
course, we're moving to that help model, and that means now it's way more
about collaborating and educating our users and working with them. Now,
that involves for us two models. First, your juniors: you want to
give them a lot of freedom to use it but also keep them safe. Let's just
fix things, so they can learn by watching. And our seniors: we want them to
move forward without having to do all of that grunt work. So let's make
them more productive and our juniors more safe. The way you achieve that
is with real-time guardrails and automatic remediation. When someone
creates an S3 bucket, within seconds it should have encryption on,
the tags should be right, versioning should be on, whatever you've chosen
your posture to be. If you're playing whack-a-mole on tickets, you're
going to spend the rest of your life chasing people and asking them to do
things that they don't care about, but you do. If you automate these
capabilities out, it changes the equation. Because what happens? I created
a bucket and it was public, and you won't let me do that and that's
blocking me. It's like, yeah, that's blocking you. Come and ask for an
exception. Right? We can have a conversation about why you're doing that.
True story. I heard that one time: "It's so I can store the keys for the
external CI server." Right? That's the sort of silly conversation you
end up in, right? But what you want to do instead is start saying: no, yeah,
it's blocked, but we can talk about giving you an exception. We can
discuss how these services should work together. Right? By moving that
conversation to the start through automatic remediation, you have a good
chance for success. Traditionally we're sitting at the end. Right? And
we're stuck in that: I have to go live, my deadline's tomorrow. This unit
needs it. And you're blocking me because you won't let this happen. All
right. By flipping it around, it changes the conversation. Once we have
those pieces in place we can give our teams that freedom. The freedom to
create things, the freedom to work because we can trust that it's in
control. If we don't trust it's in control we have to review everything
before it happens or after it happens. We're constantly fighting that
battle. Once we know we have a policy posture that's being enforced, we
can move forward with way more speed and flexibility and consistency. So,
what I thought I'd do is show you a little bit how Turbot works to try and
achieve that.
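
The guardrail idea described above (a new bucket is brought to the chosen posture within seconds unless an exception has been granted) can be sketched in a few lines. This is an illustrative in-memory model only, not Turbot's implementation and not a real AWS API: `Bucket`, `REQUIRED_TAGS` and `remediate` are assumed names.

```python
from dataclasses import dataclass, field

# Hypothetical posture: tags every bucket must carry, with default values.
REQUIRED_TAGS = {"owner": "unassigned", "cost-center": "unassigned"}


@dataclass
class Bucket:
    name: str
    encrypted: bool = False
    versioning: bool = False
    public: bool = False
    tags: dict = field(default_factory=dict)


def remediate(bucket: Bucket, exceptions: frozenset = frozenset()) -> list:
    """Fix the bucket in place and return the list of actions taken."""
    actions = []
    if not bucket.encrypted and "encryption" not in exceptions:
        bucket.encrypted = True
        actions.append("enabled default encryption")
    if not bucket.versioning and "versioning" not in exceptions:
        bucket.versioning = True
        actions.append("enabled versioning")
    if bucket.public and "public-access" not in exceptions:
        bucket.public = False
        actions.append("blocked public access")
    for key, default in REQUIRED_TAGS.items():
        if key not in bucket.tags:
            bucket.tags[key] = default
            actions.append(f"added tag {key}")
    return actions


new_bucket = Bucket("demo-13", public=True)
print(remediate(new_bucket))  # first pass fixes everything it finds
print(remediate(new_bucket))  # second pass finds nothing left to do
```

Making the function idempotent matters here: it can run on every change event without creating noise, and a granted exception is simply a name passed in `exceptions` rather than a ticket.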
[00:11:37] So Turbot runs as software that basically allows
each user to see the Amazon accounts they happen to have access to log
into. We have a whole identity model for choosing that, which I'll talk a
bit more about. But generally what you want is, you want your users going
straight into the console or the APIs or Terraform or CloudFormation. You
want to give them that freedom. Don't force them into a pipeline, don't
abstract them into something that they can't use. Let them have that
flexibility. Encourage them to do other things like pipelines. Don't force
them there. Once they're in there we just want to do something simple like
creating a bucket.
[00:12:15] So it's a good sign when your demo bucket's called
13. Right. So we're gonna create a bucket and let that go. What's gonna
happen now though, you know, fingers crossed, is in the next 10 to 20
seconds Turbot's gonna detect that new bucket. It's going to record it in
the CMDB along with who made the change. It's gonna find all the problems
with it and then automatically fix those.
[00:12:39] So, we're looking for that automatic remediation
and fixing all of that on the bucket in real time. Check it out. So, we come
in here and we see the versioning has already been enabled.
[00:12:52] You see the default encryption has been turned on.
We see the tags have been set. I didn't have to do anything; all that
happened magically around me. So, now my bucket's compliant. If I go back
to the Turbot console.
[00:13:05] Just to refresh the activity list here.
[00:13:12] We'll see our new bucket's appeared there in our
activity list, created by me. And we can go in to see the detail about this.
It's recorded the full CMDB information here, including every detail of that
bucket. That establishes a baseline and now changes will be tracked with
differences. Below that, we can see the full activity for this one bucket,
so if we go down we'll see when I created it. Turbot came along,
established policies and controls, and then we can see that certain things
went into alarm. And eventually got fixed here. And then resolved to okay.
So, we have a five-second ticket close on those items. It's really good
for your metrics right. If you play that way. Now what's fun here is we've
also pulled out things like tags and made them a top-level citizen. This
is very helpful when you want to start going cross provider, cross
platform. We can see we have one control, the Approved one, that's in alarm.
We go in here for detail on why it's unapproved: because it's in
an unapproved region. Now when you have a control, the big thing to decide
is what's the policy posture for how we made that decision. Right. We
don't want to say it's always unapproved because we need exceptions and
stuff like that. So in Turbot, the way that we do that is with policies.
So over here on the right, I can see policies like Approved and then
things like the regions. A lot of these are very simple. So this
one here is a check that it's approved. It's just in that checking mode.
Others have more flexibility. For example, the regions one is a simple
YAML list like this, of wildcards. Now here we can see the policy is
actually set higher up on a folder. So this is applying to every bucket in
every account now and in the future. It's not one-off, we didn't have to
say I want this bucket this time down this way. This is a posture we've
taken as an organization. But we can now create an exception for that, if
we have permission, and say: actually, you know what, I'm OK for this bucket
to happen to live in eu-west-1. And we'll just save that straight away. By
creating that exception, we are now saying this bucket's okay to live in
a different region to the others, but we've also created visibility to the
differences in our environment. So if we go up we can actually see in
Turbot a list of all the exceptions to this rule. So as a security team you're
kind of like: hey, I've set this rule, I've granted these exceptions, you can
see them all in that environment for that control. So that one now has
gone to approved because it's now in a valid region given that policy
structure. Most of the time your policies and exceptions are pretty simple
like that. But sometimes you want to go a little bit crazy. I want tags
and I want prefixes of the bucket name and you know it's Thursday or
whatever it is to say you should have an exception for that rule. So in
Turbot what we do is we support the idea of calculated policies. Now the
power of this is actually that we can query our CMDB in real time for
information about the resource that's being protected. So that's a GraphQL
query that has now found all the information about the tags and stuff for
that particular bucket. If we combine that with some templating, I'm just
gonna grab a little snippet here. This is Jinja code, for those who like
Python. So, this is a Jinja template saying let's always allow US. But if
the department tag is sales, then they are allowed to create
buckets in Europe. Right. If that tag is set that way. And we can see
that the policy is evaluated to the YAML and eventually comes down to a
list, and we can set that policy. Now what's cool about that is we
can do things like, for example, say this is a valid policy for the next 90
days and then your exception's over. And it'll automatically revoke.
Right. But when we save that policy, what happens now in Turbot is it goes
off and calculates that policy for every bucket in that scope. It's
determining the policy value based on the context of the resource it's
protecting. Right. This is wildly powerful and flexible. You can't even
begin to imagine what's possible as you really get into it. Fingers
crossed, it's gonna calculate in a second. There we go. So it's gone and
calculated that, it's set that for the regions, and that bucket's still approved
because of that calculation. Now if we come back to that resource here,
we can see a couple of things. I'm sorry, wrong place. So, this bucket here,
let's go and browse it. Now what Turbot does as it brings these things into the
CMDB is it actually arranges them all into a hierarchy, which is what we use
for the policy engine structure. Right. So that hierarchy says this bucket
actually has these controls on it. We can see which ones are green, which
ones are skipped, etc. But now we can start moving up the stack. Before I
do that. Notice that we have very standard names for controls. Is this
bucket active as in was it last modified recently, has it been used
recently? You might want to have a sandbox and just delete resources if
they're older than 60 days. You can do that right.
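
The policy model walked through above (a setting on a folder applies to everything beneath it, a more specific setting acts as an exception, exceptions can expire, and a policy can be calculated from the resource's own tags) can be sketched as follows. This is an assumption-laden illustration rather than Turbot's engine: `Node`, `Setting`, `resolve` and the `approved_regions` policy name are hypothetical, and a plain Python function stands in for the template shown in the demo.

```python
from dataclasses import dataclass, field
from datetime import date
from fnmatch import fnmatchcase
from typing import Callable, Optional, Union


@dataclass
class Setting:
    value: Union[list, Callable]      # a static list, or a function of the resource
    expires: Optional[date] = None    # None means the setting never lapses


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    region: str = ""
    tags: dict = field(default_factory=dict)
    settings: dict = field(default_factory=dict)  # policy name -> Setting


def resolve(resource: Node, policy: str, today: date) -> list:
    """Walk up from the resource; the nearest unexpired setting wins."""
    node = resource
    while node is not None:
        setting = node.settings.get(policy)
        if setting is not None and (setting.expires is None or today <= setting.expires):
            value = setting.value
            return value(resource) if callable(value) else value
        node = node.parent
    return []


def region_approved(resource: Node, today: date) -> bool:
    """Check the resource's region against the resolved wildcard list."""
    patterns = resolve(resource, "approved_regions", today)
    return any(fnmatchcase(resource.region, p) for p in patterns)


def us_plus_europe_for_sales(resource: Node) -> list:
    """Calculated policy: always allow US; allow Europe when department is sales."""
    regions = ["us-*"]
    if resource.tags.get("department") == "sales":
        regions.append("eu-*")
    return regions


folder = Node("folder", settings={"approved_regions": Setting(["us-*"])})
account = Node("account", parent=folder)
bucket = Node("demo-13", parent=account, region="eu-west-1")
today = date(2024, 1, 1)

print(region_approved(bucket, today))  # folder policy only allows us-*
bucket.settings["approved_regions"] = Setting(["eu-west-1"], expires=date(2024, 3, 31))
print(region_approved(bucket, today))  # time-boxed exception on this one bucket
```

Because exceptions are just settings lower in the hierarchy, listing every node that carries its own setting for a policy gives the "all exceptions to this rule" view for free.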
[00:18:25] Stuff like turning on versioning,
tagging, standardized default encryption, is this bucket approved to
exist: these are standard categories of controls for us, standard language
we've learned and built over time. Because once you start doing this for
one hundred and twenty services with five resource types each, it gets
pretty confusing if you've hand-built all of that stuff. So that
standardization flows up. We have buckets arranged into regions. We can
see the status of our controls for the region. We can go up to the account
level and we're still calculating the status of all of our controls.
[00:19:00] Now if I sort these by alerts instead, we'll see
that actually the area with the most things in alarm is CIS. Now we
can start drilling down into the CIS report for this one account. We could do
it across all the accounts across all the providers it doesn't matter. But
we can drill into this and see by area. Notice how it's all hierarchically
arranged into these types so you can slice and dice in different ways. Now
if I come down I can actually see the status here of for example flow logs
for this one VPC. And we get the full details of why and all that stuff.
Our CIS controls run in real time. You create a VPC, and within 10 seconds the
status of that's in the CIS report. Now over here we can actually start to
compare or see how this stacks up. This VPC has a few errors. All the
VPCs have a few errors. This one's not much worse than them in general.
Networking is roughly the same. But look at our categories here for the
CIS report. You can see that it's...they're all...failing this right now.
And so we can start to crosscut from that and get into that. The bottom
one to me is the most interesting. Most people know about CIS benchmarks,
the AWS CIS benchmark; there are other ones for other providers. These CIS
benchmarks actually map to the CIS Controls framework put out by the Center for
Internet Security across Linux and all those different things. So you can
actually categorize these controls across those. Turbot supports all those
categories for that reporting. So we can actually see now, in a different
way across the whole environment: what is the status of that one
control? And then we can cut it by different things. For example, let's
cut it by resource category, so we can see for this control here's all the
networking resources, wherever they may live, cut in that different way.
[00:20:50] With that CMDB comes a lot of power. You can start
searching for different things like our bucket we just created. But you
can also search for things like tags. And that will find all the resources
matching that tag. You can combine that with searches against native
fields that are determined by the provider.
[00:21:14] So you can start to find any sets of resources here
and query different new pieces of information out of that CMDB.
Governance. That was a bunch of examples of some cool governance. We have
by the way seventeen hundred policies like that so you can really kind of
go insane trying to work through that and work out how to do it. But what
we've found is actually that governance is a problem that exists not just
in the cloud but across the organization. Right. We need the ability to do
governance of certificate expirations, DNS servers, all sorts of different
things. So Turbot actually supports complete extensibility of that policy
and control framework you saw. You can write your own and have them appear
in that UI with all the power of the inheritance, the calculations, etc. So
the one we wrote a little bit earlier (not that one), this one. So this is
the expiration check on the SSL certificate that we worked on. I'm just
gonna refresh this because it was pre my network change.
[00:22:15] So we have custom policies here to determine things
like the expiration check or the warning period or stuff like that. These
are policies. When you edit these they have a nice enumerated list. Those
are actually defined over in Turbot, here in the policy definition. So
there's a JSON schema tying that together. You can build those and then
push them up to your server and deploy those policies, and then all your users
are subject to that validation for those policy changes, the hierarchy of
exceptions, etc., for whatever governance tasks you may have.
[00:22:49] That works in the normal way, like everything
else we just saw. For example, if we set our expiration warning period here,
instead of 90, to be like 365, we'll see the control will rerun
using that new policy value. And we should see a change in the status of
the control in like one second.
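
The behavior of that custom control can be illustrated with a small sketch, assuming the policy value is a warning period in days; the function name `expiration_control` and the dates are hypothetical, not taken from the demo.

```python
from datetime import date


def expiration_control(not_after: date, today: date, warning_days: int = 90) -> str:
    """Return 'alarm' if the certificate is expired or inside the warning period."""
    days_left = (not_after - today).days
    return "alarm" if days_left <= warning_days else "ok"


today = date(2024, 1, 1)
not_after = date(2024, 6, 29)  # hypothetical certificate with 180 days left

print(expiration_control(not_after, today, warning_days=90))   # ok
print(expiration_control(not_after, today, warning_days=365))  # alarm
```

The check itself never changes; only the policy value does, which is why changing 90 to 365 flips the control's status without touching any code.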
[00:23:10] So Amazon's got 180 days or so to fix this
certificate, right. They're in good shape. Right. But that type of power,
being able to build those sorts of governance controls in this sort of
platform framework, that's what we're talking about when we talk about
software-defined operations. We're not talking about a few Bash scripts
and I'm not talking about a few Lambda functions; we're talking about
consistent language, at scale, supporting exceptions, with a clear model
for how to build it out and run it in real-time. Right. And that's the
sort of capability that Turbot provides.
[00:23:45] So when we think about what are the benefits of
this type of governance. Right. In our organization, the first is
obviously speed. When you bring in this sort of software you don't have to
go and define all those things from scratch. Your cloud team is massively
enabled. Right. Writing those functions from scratch just like everybody
else did is not adding value to your business. For your development teams,
they're moving faster because you can give them more freedom. The freedom
to learn, the freedom to try, the freedom to do things. Second, our safety
has gone way up because we're now in control of the environment, and we're
fixing it in real-time.
[00:24:23] We've made it more accessible to our junior
developers and more productive for our senior developers. And we actually
have a model that will allow us to deal with the sheer pace, breadth, and
depth of this cloud type of environment.
[00:24:39] So that's governance. Thanks for your time. I'm not
standing between you and drinks, but if you have any questions I'm happy
to talk about it for as long as you'd like. Thank you.