Disclaimer: Automated Transcript
Nathan Wallace: [00:00:00] What I am here to talk about is
governance, and in particular governance for the cloud age, and how we
think about governing all of that usage and all that freedom and all that
agility that cloud gives us in a way that lets us move faster and be
safer. [00:00:15][15.1]
[00:00:16] So the first thing we need to think about really is
what do we mean by governance. Now, we're not talking politics. I know we're
in the middle of an election cycle here, and coming from the US we just
don't talk politics right now (right?), but we need to think instead about
the rules and the regulations and the way we want to actually operate.
That's what governance means and having done this for a number of years
now and working with different enterprises from very very large banks,
pharmaceutical companies, education institutions trying to do a lot with
DevOps, the thing that I've gradually come to realize is that frankly
governance is about giving freedom: freedom to application teams and
freedom to our business. When we get governance right it doesn't slow us
down. It doesn't stop us from doing things instead it accelerates us and
enables us to have the freedom to use the cloud. The key is getting the
right combination of those factors together: setting the right rules,
working out how to teach people and accelerate them through good
infrastructure and working out how you can protect them not only from the
outside world but also from themselves. And when we get that right all of
a sudden we enable this freedom; this freedom to innovate, this freedom to
use all the incredible technology that AWS is giving us and also the
freedom from regulation internally and silly processes and crazy meetings.
So as you think about the governance piece for your organization I
encourage you to think a lot about how you're going to give those
application teams that freedom.
[00:01:50] How are you going to give the business the agility they need
while keeping the enterprise control you require? That agility plus
control is the fundamental balance we're trying to get right. When we talk
about governance I think it's also helpful to think about governance just
in terms of your everyday life: when you think about government and good
government, what does it mean to you? Typically what we think of is trains
running on time, health services that are taking care of us and our
family, education that's available for our children. It's lofty but the
goals that we're doing for our organization here are really no different.
We're trying to work out how can we teach those teams, how can we give
them the support they require to be successful and how can we keep them
safe. Now we've been doing that for a while in enterprises governing
things. But traditionally it was a whole bunch of servers probably sitting
in the basement somewhere and traditionally they were protected by a lot
of things like procurement processes: difficult to procure, difficult to set
up, a whole bunch of gates and controls. I mean, we worked with
organizations where even VM provisioning took six weeks. And that was
after it was optimized to pre-create the VMs in advance just so that the
process could be sped up (right?). The amount of paperwork and things we were
used to going through was astounding. Now cloud of course completely
changes those norms. Sometimes we like to try and say it's just the same
except maybe it's software but frankly we're ripping down all of those
previous ways we used to work and trying to think of new ways to really
make it successful. The first thing we have to be aware of when we're building
that governance is the fact that cloud is just moving so fast: a thousand
new features a year.
[00:03:31] Yeah I hope everybody's ready for the S3 batch object handling
that came out yesterday. If you've already got S3 enabled in your
organization what's the status of S3 now? Can people use that? What are
you allowed to do with that? The pace here is crazy, it's exciting and
it's wonderful, but our challenge is to work out how to not fight that,
but instead embrace that and turn it into a positive for our organization.
[00:03:56][24.2]
[00:03:56] We have to ride those rockets. If you're building services in
competition with the cloud you will gradually lose. ElastiCache, RDS, just
backups, things like that: they just work so well, and you'd be crazy to try
competing with them. [00:04:10][13.5]
[00:04:10] The second thing is, and this is often very difficult to accept:
application teams really have a lot of the power now. Traditionally,
infrastructure teams had the power. Why? Because they had the money, they
had the provisioning process, they had the right to say yes or no to
workloads. They got to say which services were approved and accepted
through procurement. Application teams now control the infrastructure
whether it's autoscaling servers or serverless. The point is that now
application teams are really provisioning the things they need,
scaling them in real time, and frankly they have taken a lot more control
of that relationship. You have to work out how you're going to support and
enable that. [00:04:48][37.4]
[00:04:49] From a control point of view though our life didn't get easier
as we're trying to get this balance right. The expectations now are so
high. Can you imagine being in a world where every disk that sat in your
existing data center in the basement had to be encrypted? Oh, and by the way,
it should be encrypted with a key that's probably sitting in an HSM. That's
ridiculous to think about; we were nowhere near it. We were sending tapes
off to a mountain somewhere, hoping that that backup would come back and
work. The expectations in the cloud now mean everything must be encrypted;
everything must use the correct key. You better be doing audit logging.
Oh? You're not doing VPC flow logs to track traffic in your environment?
Have you turned on GuardDuty? The number of services here to help you be
secure and compliant and make sure that your environment is working well
is increasing all the time, which is wonderful, giving us amazing
capabilities we've never had before. Which is also wonderful, but we have
to work out how we're going to meet those expectations because if we miss
we do not want to be that person with that exposed S3 bucket or that one
in the newspaper on a given day. So the expectations are high and we have
to meet them. Of course at the same time all of our physical
infrastructure became software defined and that means it moves in real
time. So our software defined infrastructure really now needs software
defined operations. If you have a manual approval process on that
infrastructure you are too slow. If you're using a spreadsheet you're out
of control. So we have to think about how we're going to move our posture
from one of the old process to one where we're really controlling our
infrastructure with those software defined operations. Operations...
Security... these have moved from being people problems, process problems,
to being software problems; and our challenge is to work out how we're
going to embrace that, and move with that, because if we get it right the
speed, pace and accuracy we will have is way beyond anything we've ever
had before. [00:06:37][108.4]
[00:06:38] Getting automation right, of course, is actually incredibly
difficult. We have to have such clear definitions of how things work.
Every exception better be very very well-defined. Every configuration
needs to be known. If you think about your existing organization and all
the policies and procedures you have, it's probably like 35 documents, plus
things like "Oh, we know how to name those because Bill over in networking
always sets those up." None of this works or scales once we get to a world of
software defined operations; so we really need very very clear and
consistent architecture. [00:07:10][31.4]
[00:07:11] When thinking about how to make cloud successful in large
organizations the single thing I would now ask for when going to
management is I would like your support to get to a place where we can
provision one server in the next 10 minutes that everybody in your
leadership team agrees is an official, valid, blessed server for our
environment. Okay, it's a server, it's a little bit old school in the world
of serverless, but here's why it's interesting: to do that we need
the ability to self-provision. We need to know that the network we're
provisioning into is well defined, managed, etc. We need blessed AMIs. We
need patching. We need the ability to know we're monitoring it. We need
the ability to log into it. The ability to log into it is a big statement
in the world of a server. How do we know that we have permission for that
new server? If your internal procedure requires you to have approval
for access to every new server, that's not a 10 minute server anymore.
That's 10 minute hardware with a three day approval process. So getting
that full end-to-end architecture well defined is critical to the success
of that operational governance environment. [00:08:13][62.2]
[00:08:14] The good news is that it fundamentally changes a whole bunch of
the things about the relationship we have internally. We're used to a
world where people request things and others deliver them. I need a
server, I need access to this thing, I need storage. Oh, fill out this
form... Let's do that etc. Now when we move to cloud those things are
instant. They're available. We can pretend they're infinite. And the
relationship changes with those teams if we can move from one of support:
"I want a server, you need to give it to me," to one instead where it's
like "I need a server, how do I do that again? Which button do I press to
start that? Which API should I call?" We can move our relationship
between our central governance teams and our application teams from one of
support and request giving to instead one of help and teaching. I've
seen this happen; this is not fantasy land. This is what happens when you
can provision that server so quickly: it changes the relationship. And
why? Imagine you had that application and you needed this
infrastructure, so you'd come to a meeting and you'd sit in a meeting room
and say "I need this infrastructure." You bring your project manager, they
bring their project manager, you put your project managers at 10 paces from
each other and start arguing about timelines and when you're going to have
it, all those things. [00:09:28][73.6]
[00:09:29] In a world of 10 minute servers you go to the meeting and you
just start the thing right there and all of that crap melts away. Now the
discussion is how big? When? Well it's up to you, right? You can do those
things provided you are secured, patched and stuff like that. So cloud,
really, and the governance around it can change completely the way we have
those relationships in the business, and gives us so much opportunity to
do it better. [00:09:51][21.9]
[00:09:51] So when we think about that governance for the cloud age,
how do we actually achieve that? There's a bunch of ways you can tackle
it, but here are some thoughts on the key things to think about. The first
one is you better define your rules and regulations. You can be an
incredibly secure, uptight, "I'm going to approve everything" type of
organization, for whatever reason: healthcare, financial services, you
just care a lot. That's a very, very, very aggressive approach to
governance. On the other end you might say "Hey, here's your Amazon account,
there's some keys. Good luck," and frankly we've seen both. Once you have
400 accounts that were in the "here's some keys, good luck" world and then
you start trying to work out how to govern them, it gets increasingly
complex of course. But if you're at the other end, where everything has to
be done exactly, you have locked up your business so much that they have
none of that freedom we care about anymore. So while we're avoiding
politics, there still is a discussion about whether you are a big government
believer or a small government believer when you're thinking about your
governance and how you want to tackle it. [00:10:49][57.3]
[00:10:49] So in Turbot, the way we think about that is you really need a
policy engine. You need a hierarchy of policy settings. We have seventeen
hundred but basically the way it works is we have a hierarchy of those
settings coming down. You set rules at the top like "S3 must be encrypted",
a simple example, and that will then flow to every bucket now and in the
future in real time. But of course you're not doing anything real if you
don't have exceptions. Every enterprise thrives on their exceptions and how
special they are. [00:11:15][25.8]
[00:11:15] So in that policy engine you need the ability to say "this
account does not require that" or "this should just be in check mode, not
enforce mode" or even "this one bucket should be skipped for these
purposes". So as you think about those rules and regulations make sure you
think about how you're going to manage those exceptions at scale and that
flow of policy in the environment. [00:11:32][16.9]
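
A minimal sketch of the policy hierarchy described above, assuming a simple tree in which a setting at any level overrides whatever is inherited from the level above it. The `Node` class, the `resolve` function, the policy name `s3.encryption` and the values `enforce`, `check` and `skip` are hypothetical names chosen for illustration; this is not Turbot's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One level in a hypothetical policy hierarchy (org, account or resource)."""
    name: str
    parent: Optional["Node"] = None
    settings: Dict[str, str] = field(default_factory=dict)


def resolve(node: Node, policy: str, default: str = "skip") -> str:
    """Return the nearest setting for `policy`, walking from the node up to the root."""
    current: Optional[Node] = node
    while current is not None:
        if policy in current.settings:
            return current.settings[policy]
        current = current.parent
    return default


# Rule set at the top: S3 must be encrypted.
org = Node("org", settings={"s3.encryption": "enforce"})
# An account that simply inherits the rule.
prod = Node("prod-account", parent=org)
# Exception: one account is only checked, not enforced.
sandbox = Node("sandbox-account", parent=org, settings={"s3.encryption": "check"})
# Exception: one bucket is skipped entirely.
scratch = Node("scratch-bucket", parent=prod, settings={"s3.encryption": "skip"})
# A brand-new bucket picks up the inherited rule with no extra configuration.
new_bucket = Node("new-bucket", parent=prod)

for resource in (new_bucket, scratch, Node("sandbox-bucket", parent=sandbox)):
    print(resource.name, resolve(resource, "s3.encryption"))
```

Because exceptions are just settings stored lower in the tree, listing every node that carries its own setting gives the "all exceptions below me" view without any extra bookkeeping.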
[00:11:33] The second thing is really infrastructure or services. This is
what we want government to provide, right? Give me great trains. Give me
great health care. Give me good services. Make sure you answer the phone
quickly... And that's what we're thinking of for cloud as well. We want
the ability to have really good networking out of the box: it's nice, set
up, works well, safe for me. One of the most fundamental decisions you're
going to make is your account structure. I'm sure many of you have already
been through it. Should we have one or two? Should we have four or five
hundred? Generally we recommend you go to more accounts, isolating
your workloads almost like mini data centers for each application you're
running, or even each application environment. So really think about that core
infrastructure you're providing as that governance team. What's your
posture for that? How are you gonna set that up, and what are your best
practices? [00:12:16][42.8]
[00:12:17] After that we need to educate our users and make sure they know
what they're doing and how to use this. And one of the interesting
insights we've had working with lots of customers going through this
process is actually, in an enterprise, you want to move more away from
self-service for those accounts and more towards a high-touch onboarding
process. Not because it's hard or it sucks but so you can teach people and
review their architecture and think about that with them as they're coming
into the environment. [00:12:42][25.4]
[00:12:43] By doing that in a high-touch way you start to create those
relationships that lead to that world of helping them achieve their
goals, not one of request > fulfill, request > fulfill... So,
unintuitively, it's actually good to think about helping with that
architecture onboarding: the high-touch process of bringing them in, and
then letting them run with the self-service from there. Then, of course,
protection: protecting those applications from the outside world, from the
other applications in the environment and even from themselves. In our
mind what that means is really real-time guardrails. Whenever an S3
bucket is created it better be encrypted, the tags better be correct,
access logging better be on. When you create a DynamoDB table, backups
better automatically get set. Every network should be configured and
running the right way. [00:13:29][45.4]
[00:13:29] Real time responsiveness to that changing infrastructure. You
don't really control the infrastructure anymore because the application
teams are now in charge (they have that power). What you can control is
the response to that, the automatic remediation and the posture you want
to wrap them in and that's what we think of for the protection.
[00:13:46][16.9]
[00:13:47] The key here is to move from a world of checking and reporting
and playing whack-a-mole with tickets and running around going "oh, you
really got to encrypt your bucket" and instead flip to a world where it's
happening automatically. You want to kill those tickets not just close
them. If you prefer ITIL speak: think of it as... Instead of managing all
those small incidents, think about the problem. What's the problem? They
were able to create infrastructure in a way that is not configured to our
standard. The root cause is we're not automating those fixes. So we need
to bring in automation to automatically remediate and fix those things and
make it happen in real time. That will speed them up and protect our
environment. [00:14:22][34.8]
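
The shift from "check and report" to automatic remediation can be sketched as a list of guardrails, each pairing a check with a fix. This is an illustration only: the bucket is a plain dictionary rather than a real AWS resource, and `GUARDRAILS`, `remediate`, the `cost_center` tag and the other names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Bucket = Dict[str, Any]
# Each guardrail is (name, check, fix). The check returns True when compliant.
Guardrail = Tuple[str, Callable[[Bucket], bool], Callable[[Bucket], None]]

REQUIRED_TAGS = ("cost_center",)  # hypothetical tagging standard


def tags_ok(bucket: Bucket) -> bool:
    tags = bucket.get("tags", {})
    return all(tag in tags for tag in REQUIRED_TAGS)


def fix_tags(bucket: Bucket) -> None:
    tags = bucket.setdefault("tags", {})
    for tag in REQUIRED_TAGS:
        tags.setdefault(tag, "unassigned")


def set_flag(key: str) -> Callable[[Bucket], None]:
    """Build a fix that switches one boolean setting on."""
    def fix(bucket: Bucket) -> None:
        bucket[key] = True
    return fix


GUARDRAILS: List[Guardrail] = [
    ("encryption", lambda b: b.get("encrypted") is True, set_flag("encrypted")),
    ("tags", tags_ok, fix_tags),
    ("access_logging", lambda b: b.get("access_logging") is True,
     set_flag("access_logging")),
]


def remediate(bucket: Bucket, guardrails: List[Guardrail]) -> List[str]:
    """Apply the fix for every failing guardrail and return the names fixed."""
    fixed = []
    for name, check, fix in guardrails:
        if not check(bucket):
            fix(bucket)
            fixed.append(name)
    return fixed


bucket = {"name": "demo-bucket"}
print(remediate(bucket, GUARDRAILS))  # guardrails that needed a fix
print(remediate(bucket, GUARDRAILS))  # a second pass has nothing left to fix
```

Running the same function twice is the point: remediation is idempotent, so it can be triggered on every change event without creating new work or new tickets.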
[00:14:23] Once we get all those things right of course, the goal here is
to get freedom. Freedom for those application teams, and the ability to
use the cloud to do what it's intended to do. In our mind, the key to that
freedom is not abstracting them from that cloud. You need to give your
developers access to the console. They need access to APIs. They need the
ability to use CloudFormation, Terraform, all those infrastructure-as-code
tools. The combination of those things is how they will build their
application, it's how they will learn, it's how they will follow tutorials
on the Internet. Anything you do to abstract your users away from that
is making it harder for them. Harder for them to learn. Google searches
are useless because they can't just follow the instructions. Even
something as simple as saying "we do everything through a pipeline":
that's a beautiful thing by the way (infrastructure as code), we do it all
the time, that's a good goal. But if you say "everything must be done
through a pipeline" you've just created an abstraction. You've now said
"you must always use CloudFormation, there is no console for you", "you must
always use Terraform", "there is no other way to do it". Now, that's
really great, and it's a good goal. Maybe for production that's the
perfect goal. But if you think your whole environment can work that way
with that level of rigidity, you're going to find that you're stifling the
freedom and innovation of those teams. So some of those things are good,
they're necessary, but they're not sufficient from a governance point of
view. From a governance point of view we need to do things like react in
real-time to what's happening in that environment and make sure we're
repairing it. [00:15:49][85.6]
[00:15:49] That gives the teams the freedom to work while knowing they're
still secure, protected and covered in that environment properly. So
freedom is our goal. That agility for those business teams is our goal.
To do that, we need to give them access to the tools and services and
stuff they need, and that means making as many of those available as
quickly as we can, with as much direct capability as we can.
[00:16:13][23.4]
[00:16:13] So what I was gonna do now was give you a quick demonstration
of how Turbot looks and thinks about that problem. This is one way to
tackle it and then we'll come back to talk about some more things. So
Turbot for us runs as software in a customer's environment, not SaaS. We
want our users to have direct access to those Amazon accounts, so each
user would log in and see the handful of accounts they happen to have
access to. The first and primary thing we want them to do, of course, is
actually just use the Amazon console. We don't really want them sitting
abstracted away inside Turbot for that. So I'm just gonna do something as
simple as creating an S3 bucket. Having done that... now, the bucket creation
is not that exciting, I appreciate you bearing with me. What's cool
though is the next 10 seconds. The CloudWatch Events are going to send
that to Turbot. It's going to detect that new bucket. It's going to record
that bucket and who created it into the CMDB. As a result of having that
information in the system it's now going to test it against all of our
real-time controls. Does it have a valid name? Is it in an approved region?
Is encryption on? All of those different controls that you care about.
[00:17:13][59.3]
[00:17:14] It's gonna raise alarms for any problems it finds. It's then
going to automatically remediate those, and close the alarms. So if we go to
that bucket we just created (and if the demo gods are shining on me today)
we should see in the properties that Turbot's come along and started
setting things: it turned versioning on, it set the default encryption up,
it put the tags in place. I spoke before about that policy hierarchy,
that engine of policies, including things like setting metadata for cost
centers, recording things like who created the bucket, that sort of
information, knowing what compliant tags are. All of that flows through the
system here to create those things in real time on that resource after it
is created. We see server access logging now also being enabled.
Meanwhile, back in Turbot we should see the new bucket start to appear in
our notification list. Turbot has detected that bucket and brought it into
the CMDB; in the controls tab, we can see each of the things that it checked
and ran at that time, including, for example, things like the tags.
As I mentioned, you want to be tracking all these changes in real time.
So the first thing was to go from OK to alarm state. The second thing we
did was we fixed those tags, and then the third was we closed the alarm.
That's a five second ticket close. That's difficult to do in a manual
environment. Now, we can do some of these things with Lambdas and stuff
like that, but what we've found is that you really need a lot of
visibility into what's going on in that environment so your developers and
application teams know what's happening, and so you can, for audit purposes,
really get to the heart of it. [00:18:38][84.6]
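
The flow in the demo (event arrives, resource is recorded, control goes to alarm, fix is applied, alarm closes) can be sketched with an explicit history so every step is visible afterwards. The event dictionary shape, the `CMDB` dictionary, the `Control` class and `handle_create_bucket` are assumptions made for this illustration; they do not reflect the real CloudWatch Events payload or Turbot's internals.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Control:
    """A single check on one resource, with its full state history."""
    name: str
    trigger: str = "unknown"
    state: str = "ok"
    history: List[str] = field(default_factory=list)

    def log(self, state: str, note: str) -> None:
        """Move to `state` and keep a record of why."""
        self.state = state
        self.history.append(f"{state}: {note}")


CMDB: Dict[str, Dict[str, Any]] = {}  # stand-in for a real configuration database


def handle_create_bucket(event: Dict[str, Any]) -> Control:
    """Record a new bucket, check its tags, and fix them if needed."""
    bucket = {
        "name": event["bucket"],
        "created_by": event["user"],
        "tags": dict(event.get("tags", {})),
    }
    CMDB[bucket["name"]] = bucket

    control = Control("tags", trigger=event.get("type", "unknown"))
    if "cost_center" not in bucket["tags"]:
        control.log("alarm", "cost_center tag missing")
        bucket["tags"]["cost_center"] = "unassigned"
        control.log("alarm", "cost_center tag set automatically")
        control.log("ok", "alarm closed")
    return control


result = handle_create_bucket(
    {"type": "CreateBucket", "bucket": "demo-bucket", "user": "nathan"}
)
print(result.trigger, result.state)
for entry in result.history:
    print(" ", entry)
```

Keeping the trigger and the history on the control is what makes the later audit question ("why did this change, and who or what changed it?") answerable without digging through separate logs.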
[00:18:39] So we track things like what event led that to
happen. This was a "create bucket". And we track things like the context
that we had at the time: what the policy settings and stuff were at the
time. So we have for posterity how that decision was made and how it acted
in the environment. When you are taking aggressive actions, for example
deleting a bucket in an unapproved region, you want to know why and how you
made that decision. The way the controls work is determined by the policy
engine, so in the policy engine we can see that this one was set to
enforce setting tags on the bucket. I happen to have permission to be
allowed to create an exception; most users wouldn't be allowed to do that.
We like to break our rules into "must": "You must do it this way"
("required" is another word for it), and "should": "Here's a
recommendation", "this is how your posture should be". So in this case
I'm gonna create an exception for this bucket... I'll just set it to
check. Of course you want expirations on those sorts of exceptions: 90
days approval... "Fix it up and then you're back to normal like
everyone else". When we create that exception in Turbot, it will now
manage that bucket according to that. That's a single setting for one
bucket in one account out of potentially hundreds. The cool thing in
Turbot is, of course, that you can then see all of the exceptions below
you in the environment. So for your security teams and your compliance
teams, when they want to know what the posture is in this environment, they
can start to see the full information about the exceptions and the settings in
the environment. [00:19:53][74.3]
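
An expiring exception like the 90-day one in the demo can be sketched as an override that simply stops applying after its expiry date, at which point the inherited setting takes over again. `PolicyException` and `effective_value` are hypothetical names, and the dates are made up for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class PolicyException:
    """A time-limited override of one policy on one resource."""
    resource: str
    policy: str
    value: str
    expires: date


def effective_value(
    inherited: str, exception: Optional[PolicyException], today: date
) -> str:
    """Use the exception while it is still valid, otherwise the inherited setting."""
    if exception is not None and today < exception.expires:
        return exception.value
    return inherited


granted = date(2024, 1, 1)
exc = PolicyException(
    resource="demo-bucket",
    policy="s3.tags",
    value="check",
    expires=granted + timedelta(days=90),
)

print(effective_value("enforce", exc, date(2024, 2, 1)))  # inside the 90 days
print(effective_value("enforce", exc, date(2024, 6, 1)))  # after expiry
```

Nobody has to remember to remove the exception: once the date passes, the bucket is "back to normal like everyone else" the next time the policy is evaluated.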
[00:19:54] Turbot also lets you manage things like permissions for the
environment. We simplify 3.5 thousand AWS permissions at this point, down
to simple levels like: user, metadata, read only, operator, admin and we
do that per service. So if I go to add a grant, we can see things like the
metadata, read only, operator, admin, owner and superuser because there
are special times. For each service we break that up that way, and it
gives us the ability to do that through the whole AWS stack and more. Of
course you want your permissions to be time-based so you can do automatic
expiration, and things like that. [00:20:26][31.7]
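
A rough sketch of per-service permission levels with time-based grants. The level names come from the talk; their ordering (each level including the ones below it), the `Grant` class and `is_allowed` are assumptions made for this illustration, not a description of how Turbot maps levels onto AWS permissions.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum
from typing import Iterable


class Level(IntEnum):
    """Permission levels, assumed here to be strictly increasing in power."""
    METADATA = 1
    READ_ONLY = 2
    OPERATOR = 3
    ADMIN = 4
    OWNER = 5
    SUPERUSER = 6


@dataclass(frozen=True)
class Grant:
    """One user's level on one service, valid until `expires`."""
    user: str
    service: str
    level: Level
    expires: datetime


def is_allowed(
    grants: Iterable[Grant], user: str, service: str, needed: Level, now: datetime
) -> bool:
    """True if the user holds an unexpired grant at or above the needed level."""
    return any(
        g.user == user
        and g.service == service
        and g.level >= needed
        and now < g.expires
        for g in grants
    )


grants = [
    Grant("alice", "s3", Level.OPERATOR, expires=datetime(2024, 6, 1)),
    Grant("alice", "rds", Level.READ_ONLY, expires=datetime(2024, 6, 1)),
]

now = datetime(2024, 5, 1)
print(is_allowed(grants, "alice", "s3", Level.READ_ONLY, now))   # operator covers read only
print(is_allowed(grants, "alice", "rds", Level.OPERATOR, now))   # read only is not enough
print(is_allowed(grants, "alice", "s3", Level.READ_ONLY, datetime(2024, 7, 1)))  # expired
```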
[00:20:27] So that's a quick taste of what governance can look like in
real-time, at large scale, with policy settings and all the stuff we're
talking about there. To give the application teams the freedom to use
that. In terms of the benefits of this... When we get it right what
happens? [00:20:38][11.7]
[00:20:39] The first thing is we just get a massive amount of speed from
something like this, because it's software, right? I said operations is now a
software problem. It's not a people problem, it's not a customization
problem, it's a software problem. If you install this software in an
environment (or choose a different one), you have seventeen hundred policies
coming out of the gate. Best practice for identity and access, networking: all
those things are automatically configured in the environment. The speed
you get from that type of change is astounding, compared to building it
yourself. The second benefit is safety. We have one customer; they
implemented Turbot because their development teams were like "Hey, we got this,
we don't need anyone in the middle, we're going to create our own stuff".
What's the first thing that happens? Someone creates an S3 bucket and
makes it public; Turbot stops it. They start complaining: "Hey, my bucket
is not working, it's meant to be public". Why is your bucket meant to be
public? "Well, I have to store the keys in it." The keys in it? "Yeah. I'm
doing an external pipeline using another SaaS tool that needs access to
the keys, so I want to put those in a bucket, so I can reach them from
there." I mean, you're kidding me... right? But this is the sort of stuff
people do when they're just trying to solve problems. And why? Because
cloud is hard and they lack the experience and that's why we need
governance and guardrails to make it more accessible. By having good
governance and good guardrails and good rules that happen automatically,
we can drastically open up the number of people who can use services and the
number of services they're allowed to use. If we have to approve it all by
hand... Check everything they're doing... Restrict them... We're just
locking down that freedom and creating a hell of a lot of work for
ourselves, which we don't need to do once we're in that world of
automation. For your more experienced users, it's all about productivity.
"Why would I go and recreate all those things?" "Why do I need to learn
how to do that?" One of my favorite stories for that was... We had a
customer, and this was an internal one, and they came and started
complaining about their $70 bill for the month. They were just using S3...
"We're just using S3, this shouldn't be seventy dollars..." And we say
"well, let's have a look", because it's governance, we're multi-account, we
can see what's going on. They're making 100 million requests a month to
that bucket, driving a lot of costs. What was the actual real thing they
needed to do? Fix the application to stop querying. Which, by the way,
was coming from everybody in the company's phones, draining batteries all
over the place. They fixed that, the bill goes to zero, but more
importantly everybody's phone stops draining battery. Governance drives
productivity for senior devs as well. [00:22:58][139.0]
[00:22:58] Finally, the main thing is, as we have this breadth of 120
services and the depth we need to cover them, you need your teams to get
more and more to a Zen-like state where they're comfortable with what the
cloud's firing at them. That is really hard to do; the pace of change here
is so high. It's scary. You go to re:Invent; you go in super excited,
you find out all that and you come away a bit overwhelmed. I'm sure many
of you have had that feeling, but with good governance and software doing
that, we know we're well protected out of the gate, so we can look to
experiment and learn from there, knowing we have that breadth and depth of
coverage. There's just no way we can build it ourselves from scratch.
[00:22:58][0.0]