Okay, this is admittedly a very odd blog post. I’m giving a talk at the AWS re:Invent 2017 conference and I have lots more to say than my time slot allows (actually, that’s always true for me). In particular, I have a slide titled “Tips and lessons learned” that I think may be useful, but will not have time to walk through each point. So, I’ll use this post as a sort of “supplementary materials” section to add some color to my short bullet points.
The talk is “LFS305 - Automated Policy Enforcement for Real-time Operations, Security, and Compliance for Life Sciences”. I’m presenting this jointly with Nathan Wallace, CEO of Turbot Inc.
This is not a comprehensive collection of lessons learned or even a description of our path moving a research organization from pure on-prem to the cloud. I’d like to do that, but that’s a story (book length, really) for another day. For now, here’s just a spare collection of a few useful points related to the theme of the talk, which is about how to use software-defined operations to automate policy enforcement and thereby enable innovation.
This is not a stand-alone post; I’m going to assume that you’ve heard the presentation and have that background and context.
Here’s the slide that I’d like to now annotate:
## Transformation

I didn’t talk much about our enterprise transformation from on-prem to cloud, but there are a number of elements of our technical and strategic approach that were directly beneficial to moving us to a “cloud-first” organization.
## Separate use cases
Core to our approach was the idea of dividing our cloud offerings into six distinct use cases. Most of these will be common to any enterprise, but you may have more or fewer, and some may be a little different. In any event, dividing by use case (perhaps better named service categories) was particularly helpful. Foremost, it allowed us to focus on getting one thing right at a time rather than boiling the ocean. The use cases also share common elements, so solving one well meant that the next was that much easier and could build on the previous.
From a policy, controls, and risk management perspective, each use case has distinct requirements. Therefore, by grouping this way we could define control/monitoring patterns that are unique to the use case and can then be programmatically applied to all instances of that pattern. As a result, we are not analyzing and customizing every workload in a 1:1 ratio. Instead, users use the catalog to self-select the matching use case, and then request access for that use case. When we provision their account, they automatically get the appropriate controls that match what they are doing. We also get high re-use for documentation, training, terms-of-use, and other artifacts and activities. Essentially, we’ve turned a problem of scale N into a problem of a constant 6 (one for each use case).
The model permeates and facilitates all of our communications and technical designs and standards. For example, we have a bit of template code that turns a generic Turbot/AWS account into a control structure for each use case by executing a single command. Also, we have a naming standard so that by simply looking at the first letter of the Turbot account ID, one knows exactly what type of account is being used and all of the guardrail and compliance implications for that account. This is a lot like the different types of license plates, where the class (e.g., passenger vehicle, commercial, motorcycle, official, livery) is readily discernible and there is a built-in reference to the applicable rules.
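As a minimal sketch of how such a prefix convention can be enforced in tooling, here’s a hypothetical mapping; the letters and use-case names below are invented for illustration (ours differ), but the idea of deriving the control class from the account ID is the same:

```python
# Hypothetical mapping from the first letter of an account ID to its
# use case. These letters and categories are illustrative only, not
# our actual naming standard.
USE_CASE_BY_PREFIX = {
    "t": "tech-lab",      # sandbox: no non-public data, minimal controls
    "d": "development",
    "p": "production",
}

def classify_account(account_id: str) -> str:
    """Return the use case implied by the account ID's first letter."""
    use_case = USE_CASE_BY_PREFIX.get(account_id[:1].lower())
    if use_case is None:
        raise ValueError(f"unrecognized account prefix: {account_id!r}")
    return use_case
```

Reporting and automation scripts can then branch on `classify_account(...)` instead of maintaining per-account special cases, which is exactly the N-to-constant reduction described above.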
The first use case that we implemented was the Tech Lab. Tech Labs are sandboxes where we let people explore as much as they want (mostly) with as few controls as possible. In exchange for this freedom, we prohibit use of any non-public data and almost completely restrict access to the internal network. This is of course an important use case, but I highlight it for a couple of reasons. First, it was a good and safe place to start. As noted above, taking one use case at a time helps the learning process and subsequent use cases are fairly small variants on the previous.
The more interesting side effect is that making the Tech Labs available early and easily was instrumental in transforming our internal development community from a pure on-prem group into what is now an amazing team of cloud-savvy engineers building cloud-native solutions. On our “certification wall” we have a large and growing number of developers and administrators who have earned AWS certifications. All of this transformation was fueled by making it easy (trivial - it takes 3 clicks through a portal we created to get immediate access to a Tech Lab) to do hands-on learning, exploration, and experimentation. Of course, we coupled this with all sorts of other org change tools such as setting vision, strategy, communications, motivations, training, and lots of enthusiasm. However, all that would have been wasted if it were not easy and safe for people to try things out. I believe that ultimately the cloud sells itself to a good engineer once they start trying things hands-on. It’s just fun and productive.
As Nathan said during his portion of the talk, “don’t abstract the console.” By using automated controls (Turbot in our case), we are able to provide engineers direct AWS console access. If you believe what I said above about the cloud selling itself and it being fun and productive for engineers, then abstracting the console or boxing in the creativity of developers will surely kill a lot of that fun, value, and agility. Allowing direct console access, and by that I equally mean access to the AWS APIs (via API calls, AWS CLI, CloudFormation, or Terraform), was as critical to our organization’s transformation as providing the Tech Labs.
There is a cost to unbounded “console access”, especially for new cloud users. There’s a lot there! This can easily intimidate and scare people away, or paralyze them with too many choices. This needs to be managed. We did this by documenting a list of services people should focus on and a list of services to avoid (either not allowed or not likely applicable). We also directly addressed this topic as part of our training so that we could get ahead of it and steer appropriately.
We made sure to directly embed Cyber Security, Information Security, and Information Governance representatives into our charter cloud team. This is obvious (though often missed/delayed nonetheless). However, I mention it here because with a software-defined operations and compliance model, it is critical and in no way optional. What we are building is a way to use software automation to replace manual and human processes so we can achieve real-time continuous compliance that scales seamlessly. So if what we are building is a machine to automate the activities of our subject matter experts, then obviously they must be part of the design and evolution and not post facto reviewers.
For full disclosure, the above makes it sound like we actually got this right from the start. In reality, I didn’t quite grok this simple paradigm when we were starting out and I was in the middle of it. I was inclusive, but did not give clear direction about this relationship until after we had learned a lot. If I were starting over, I would create objectives that made the requirements and design role (as opposed to consultant and reviewer roles) of the governance teams clear at the outset.
I spent much of my talk discussing the benefits of our multi-account strategy. Therefore, I’ll keep this section brief and treat it more as a recapitulation of what was presented in the talk. Suffice it to say, after having started with a small and fixed number of accounts, learning, and then pivoting to our present multi-account strategy, I’m strongly convinced that for enterprises with a large variety of workloads, a multi-account approach is a dramatically superior model.
## Essential for diversity
If you have a relatively homogeneous collection of workloads, then you may not require a multi-account approach. However, if you have a highly diverse collection of solutions, then using the “account as a container” approach is a powerful tool.
The main argument against a multi-account approach is that with a lot of accounts, you need a way to manage them all and ensure the necessary controls and compliance with enterprise policies and legal requirements. Fortunately, the entire AWS environment is controllable and monitorable through an incredibly rich suite of API-accessible services. This is the basis for automating management of multiple accounts. You can write your own multi-account automation or buy something off the shelf. We started by writing it ourselves, which helped us understand the scope of the task, and then ultimately went with a buy option. You have choices for how, but in all cases, you will need a robust automation solution as your investment to enable a multi-account approach.
As a side note, if you stick with a single (small number) account approach, you still need automation. It’s just in a different dimension and tends to require a more centralized and disempowered (for development teams) approach.
## API access important
There are layers of multi-account automation. It’s not sufficient to have a single tier of multi-account management; that just won’t be flexible enough. We found that we get great benefit from having multiple tiers (often just one extra layer). That is, there will always be needs for customization, and we’ve developed and rely heavily on a layer of automation that sits between our business and management needs and the cloud management framework.
In our case, Turbot provides an API so that we have RESTful access to everything that can be done in the Turbot console. We’ve used this API access to write our own tools for specific reporting, monitoring, and configuration needs. This also has made it possible for us to automate our processes. For example, we wrote a simple portal for all of our cloud users. One menu pick is to “Join a Tech Lab”. After filling out the form, the app does a little business logic and record keeping, and then makes a call to Turbot to automatically give the requestor access to a Tech Lab. This is a fully self-service model enabled because our cloud management tier is fully API accessible.
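As an illustrative sketch only (the host, path, and payload below are hypothetical, not Turbot’s actual API), the portal’s “Join a Tech Lab” action essentially reduces to the portal doing its business logic and then issuing one REST call:

```python
import json
from urllib import request

TURBOT_HOST = "https://turbot.example.com"  # hypothetical host for illustration

def build_grant_request(user_email: str, account_id: str) -> request.Request:
    """Build the (hypothetical) REST call that grants a user access to a
    Tech Lab account, after the portal has done its record keeping."""
    payload = {
        "user": user_email,
        "account": account_id,
        "role": "tech-lab-user",  # invented role name for illustration
    }
    return request.Request(
        url=f"{TURBOT_HOST}/api/v1/grants",  # invented path, not the real API
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The point is not the specific call but the shape of the design: because every management action is reachable over REST, the portal can complete the whole workflow without a human in the loop.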
It’s not much of a surprise that when your compute and storage are remote from your users, engineers, and data-producing instruments, networking might be kind of important. Even so, in hindsight, getting our networking right was much more of a foreground activity than I had anticipated.
## Networking is hard
It’s probably not worth explaining this in detail since everyone’s situation will be different. However, the big takeaway is that one should not take networking for granted and that it can get a lot more complicated and expensive than you might think at the start. In retrospect, perhaps if we were more effective at requirements management, network architecture reviews, and running more experiments, we might have had a smoother transition. Exacerbating factors include the fact that we have a very complex internal network that has evolved over the years, we don’t manage most of it directly, and our multi-account strategy adds unique demands on the network.
Here’s a sample of some of the issues that we faced. We needed to get multiple “/16” CIDR ranges (64k IPs each) allocated. Those are not readily doled out, and this took time and effort. Our initial routers needed to be replaced (with significant expense) since with a multi-account strategy the number of BGP sessions is large. The root cause of this was that we had a suitable solution for our initial small number of accounts approach, but when we pivoted to the multi-account strategy, we did not revisit how this would impact our networking. Note that our bandwidth requirements didn’t change, but the BGP session count requirements moved from a handful to hundreds (“we’re gonna need a bigger router”).
We had logistical issues too. The cloud is great, it’s all software, right? Except that for our Direct Connects we still needed hardware. We have an elegant and effective solution, but it did require us to procure data center (colo) space at Amazon’s edge so that we could install firewalls, routers, and IDS/IPS, do all this with redundancy, and coordinate the provisioning with our WAN supplier, our networking teams, and AWS. Add to this interesting wrinkles like shipping gear to a country where we don’t have offices, which created tax challenges for the hardware (since we wanted to configure it first at our US site and then ship it to the final destination). You get the picture.
## VIF (and other) limits
As noted above, the large number of accounts (hundreds) that we use is a great enabler, but does put stress on networking infrastructure, not for bandwidth, but due to the large number of interfaces required. As with all AWS services, there are a number of soft and hard limits for the AWS Direct Connect service. It will be important to work with your solution architect to take these limits into account when designing your solution. AWS is always evolving their solutions, so our issues may not be yours, but for sure there will always be constraints that you’ll need to consider since the use case of a large number of small network usage accounts sharing a small number of Direct Connects is not yet a mainstream use case.
Continuing the theme of the stresses that a multi-account strategy uniquely generates, you’ll need to have some deliberate strategies for IP allocation. In our case, we use a single /16 (64k) CIDR in each region. It’s a big deal (changes in many places let alone getting a new /16 allocated) if we run out and need to expand, so I consider this a fixed resource to manage. That’s a lot of IPs for a research organization, but with a multi-account strategy, if not actively managed, the fragmentation will chew through these fast. From the start, we’ll consume say a quarter of these (16k) for use with HPC and other distributed computing use cases. Just as an example, if we gave each account a typical /24 (256 IPs), then 48k/256 = 192 accounts. However, we expect to stabilize in the vicinity of about 500 accounts for our particular organization.
Therefore, our default model is to allocate only a single VPC with a /26 (64 IPs) CIDR. This is good for almost all requests. If a given solution requires more, we’ll give them what they need, as a (slightly) custom configuration. This allows us to have up to 768 accounts in a single region. We get a little more flexibility since we actually split our accounts across regions based on need, so not all accounts need VPCs in all regions.
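The capacity arithmetic behind these numbers is simple enough to sketch; this just restates the figures from the post (one /16 per region, roughly a quarter reserved for HPC, then either a default /24 or our actual default /26 per account):

```python
# Back-of-the-envelope IP capacity math from the post: a single /16 per
# region, about a quarter reserved for HPC and distributed computing,
# and the remainder carved into per-account VPC CIDRs.
TOTAL_IPS = 2 ** (32 - 16)        # a /16 = 65,536 addresses
HPC_RESERVED = TOTAL_IPS // 4     # ~16k set aside for HPC use cases
REMAINING = TOTAL_IPS - HPC_RESERVED  # ~48k left for account VPCs

accounts_at_24 = REMAINING // 2 ** (32 - 24)  # /24 = 256 IPs per account
accounts_at_26 = REMAINING // 2 ** (32 - 26)  # /26 = 64 IPs per account

print(accounts_at_24)  # 192 accounts with a /24 default
print(accounts_at_26)  # 768 accounts with a /26 default
```

Shrinking the default from /24 to /26 quadruples the account capacity, which is what lets the /26 default comfortably clear our expected ~500 accounts.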
Note that you will also need to be mindful of how services like EMR and Lambda consume IPs. For example, Lambdas running outside of your VPC are fine (which is secure and what you should do if they don’t need direct access to your VPC resources), but if running in your VPC (when they need access to your VPC resources), they will consume IPs based on the number of concurrently executing Lambda functions. Unofficially, for us it seems that roughly every five concurrently running Lambdas consume one IP.
Like I said, to me it is odd to provide a random set of tips and learnings, but I figured it would be better to broadcast a few things that I think might help others than to construct a full linear narrative with all the learnings along the way, which I’ll never write and no one would ever read anyway.
If you have questions about any of this or other topics not covered, please comment, and time-permitting, I’ll try to reply. Even better, if you have your own observations, feel free to add them as comments to this post.
Aren’t you glad that I didn’t try to squeeze all this into a live presentation?!