A few weeks ago I was a guest on Corey Quinn's podcast, "Screaming in the Cloud", where we discussed (among other topics) tagging controls as a gateway to more complex governance policies. In this article I will discuss the pros and cons of the common resource tagging governance strategies we see across our customers.

Governance Control Objective: Ensure all cloud resources have a minimal set of required tags.

Business Objective: Tags are most often used to identify ownership and control of resources; missing tags (e.g. Cost Center) can leave you unable to bill appropriately for consumed services. An effective tagging strategy can also speed up troubleshooting in large environments, and tags can be used by other governance controls as metadata for policy decisions.

Most Common Required Tags:

  1. Cost Center: Typically free-form, but sometimes chosen from a list of approved values.
  2. Environment: e.g. Development, Staging, Test, QA, Production...
  3. Owner: Typically an email address or employee ID; sometimes a distribution list (DL) or other group identifier.
  4. Data Classification: e.g. PII, PCI, Highly Restricted, Restricted, Private, Public...
  5. Business Unit: e.g. Sales, R&D, Operations...

Strict Approach:

Delete any (taggable) resource that does not have the required tags/values. This approach ensures that you always have some tag value (if not always a correct one), but it is difficult to retrofit into existing environments with large numbers of resources that need to be remediated, or if you frequently absorb new environments through mergers and acquisitions.

Another potential concern is that when cloud service providers introduce new features, they don't always come with tag support; often it is added after the fact. You need to decide how you will handle untaggable resources, and have a process that lets teams catch up on tagging once services support it. You also need to deal with resources that can't be deleted (e.g. you can't delete a bucket with data in it, or a VPC with deployed resources), so you will need a way to generate alerts and follow up on resource types where you can't take removal actions.
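A minimal Python sketch of the strict decision logic, assuming a hypothetical `Resource` model with `taggable` and `deletable` flags and three required keys (`owner`, `cost-center`, `environment`). It is not Turbot's implementation; it only shows the order of checks described above: untaggable resources are tracked rather than deleted, and resources that can't be deleted raise an alert instead.

```python
from dataclasses import dataclass, field

# Assumed required keys for illustration (the same three used in the scenario later on).
REQUIRED_TAGS = {"owner", "cost-center", "environment"}


@dataclass
class Resource:
    """Hypothetical resource model; not a real cloud SDK type."""
    resource_id: str
    taggable: bool = True   # False for services that don't support tags yet
    deletable: bool = True  # False for e.g. a bucket that still holds data
    tags: dict[str, str] = field(default_factory=dict)


def missing_tags(resource: Resource) -> set[str]:
    """Return the required keys that are absent or have a blank value."""
    return {key for key in REQUIRED_TAGS if not resource.tags.get(key, "").strip()}


def strict_action(resource: Resource) -> str:
    """Decide what a strict tagging control should do with one resource."""
    if not resource.taggable:
        return "track-untaggable"  # revisit once the service supports tags
    if not missing_tags(resource):
        return "compliant"
    if not resource.deletable:
        return "alert"  # can't remove it, so notify and follow up instead
    return "delete"


print(strict_action(Resource("bucket-1", deletable=False)))  # → alert
```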

Watch out: Keep in mind that all governance controls that can delete resources need to be under strict source control and automated testing. You need a change process that protects against malicious insiders, and against coding errors that could cascade-delete all of your cloud data.

When using Turbot to automate these types of enforcements I generally like to take the following approach:

  1. Allow a 1-day grace period to tag resources (e.g. any resource still untagged after 24 hours is flagged for deletion).
  2. Once a resource is flagged, allow a second grace period (e.g. 1-10 days) before the deletion action takes place. Combined with nagging notifications, this should prevent accidental deletion while still keeping the enforcement a deterrent to lazy infrastructure-as-code practices.
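The two grace periods can be modeled as a simple stage function. This is a sketch, assuming a 1-day tagging window and a 7-day deletion window (picked from the 1-10 day range above); `enforcement_stage` and the stage names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy values.
TAG_GRACE = timedelta(days=1)     # time allowed to add tags after creation
DELETE_GRACE = timedelta(days=7)  # time between flagging and deletion


def enforcement_stage(created_at: datetime, now: datetime, compliant: bool) -> str:
    """Map an untagged resource's age to an enforcement stage."""
    if compliant:
        return "ok"  # tagged at any point before deletion: nothing to enforce
    age = now - created_at
    if age < TAG_GRACE:
        return "grace"  # still inside the tagging window
    if age < TAG_GRACE + DELETE_GRACE:
        return "flagged-notify"  # flagged for deletion; send nagging notifications
    return "delete"


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(enforcement_stage(created, created + timedelta(days=3), compliant=False))  # → flagged-notify
```

Because compliance is checked first, a resource that gets tagged while flagged drops back to "ok", which is what gives the notifications their teeth without risking surprise deletions.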

Default Value Approach:

Any (taggable) cloud resource created without the required tags has a default set of tags applied to it, based on who created it (or where it was created).

This approach is slightly more pragmatic and safer to implement across heterogeneous environments. The idea is that you maintain metadata for cloud base resources (e.g. AWS Accounts, GCP Projects, Azure Subscriptions) and also for each authorized user.

When someone creates an untagged resource, the automation looks up the appropriate default values from the account metadata (e.g. Environment: Dev for all resources in a "development" account). In addition, each user who creates resources can be tied back to a default cost center, and becomes the default "owner" of any resource that isn't tagged otherwise.
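A sketch of the default lookup, assuming hypothetical in-memory dictionaries for account and user metadata (in practice this would live in a CMDB, a directory, or the governance tool itself). Tags the creator supplied always win and defaults only fill gaps; in this sketch, user defaults also take precedence over account defaults when both define the same key.

```python
# Hypothetical metadata stores keyed by account ID and by creator identity.
ACCOUNT_DEFAULTS = {
    "111111111111": {"environment": "dev"},
    "222222222222": {"environment": "prod"},
}
USER_DEFAULTS = {
    "barney.stinson@gnb.com": {"cost-center": "gnb1234"},
}


def apply_defaults(tags: dict[str, str], account_id: str, creator: str) -> dict[str, str]:
    """Fill in missing tags from account and creator metadata."""
    defaults: dict[str, str] = {}
    defaults.update(ACCOUNT_DEFAULTS.get(account_id, {}))  # where it was created
    defaults.update(USER_DEFAULTS.get(creator, {}))        # who created it
    defaults.setdefault("owner", creator)  # the creator becomes the default owner
    return {**defaults, **tags}            # explicit tags override every default


print(apply_defaults({}, "111111111111", "barney.stinson@gnb.com"))
# → {'environment': 'dev', 'cost-center': 'gnb1234', 'owner': 'barney.stinson@gnb.com'}
```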

This model is very powerful and avoids many of the pitfalls of the strict model; however, it does require accessible metadata for each cloud landscape and for users. You also need to decide what to do when a user enters an invalid tag value.

In this scenario we have three required tags: ['owner', 'cost-center', 'environment']. Let's say the user creates a VM with the following tags:

{
  owner: Barney,
  cost center: GNB1234,
  environment: sandbox
}

when what you really wanted is:

{
  owner: barney.stinson@gnb.com,
  cost-center: gnb1234,
  environment: sbx
}

The easiest approach is to mark invalid and missing tags, leaving the resource owner to clean up the invalid and missing values:

{
  owner: __invalidvalue__( Barney ),
  cost-center: __missingtag__,
  cost center: GNB1234,
  environment: __invalidvalue__( sandbox )
}
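The marking approach fits in a few lines. This sketch assumes hypothetical validation rules (an `@gnb.com` email for `owner`, `gnb` plus four digits for `cost-center`, and a short list of environments); `RULES` and `mark_tags` are illustrative names. Applied to the VM's tags, it flags `owner` and `environment` as invalid and adds `cost-center` as missing, leaving the user's `cost center` tag untouched.

```python
import re

# Hypothetical validation rules for this scenario's three required tags.
RULES = {
    "owner": re.compile(r"[a-z0-9._%+-]+@gnb\.com"),
    "cost-center": re.compile(r"gnb\d{4}"),
    "environment": re.compile(r"(?:dev|sbx|test|prod)"),
}


def mark_tags(tags: dict[str, str]) -> dict[str, str]:
    """Flag missing or invalid required tags without trying to repair them."""
    marked = dict(tags)  # tags that aren't required are left exactly as entered
    for key, pattern in RULES.items():
        if key not in tags:
            marked[key] = "__missingtag__"
        elif not pattern.fullmatch(tags[key]):
            marked[key] = f"__invalidvalue__( {tags[key]} )"
    return marked


print(mark_tags({"owner": "Barney", "cost center": "GNB1234", "environment": "sandbox"}))
```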

To reduce the amount of manual cleanup (e.g. the number of alerts and tickets related to tagging), you can use slightly more complex logic to clean up tag values that are close but not perfect. The most common tagging issues we see are:

  1. Incorrect case: e.g. 'Dev' instead of 'dev'
  2. Misspelled words: e.g. 'enviroment' instead of 'environment'
  3. Incorrect hyphenation/spaces: e.g. 'cost_center', 'cost-center', 'costCenter', 'cost center'
  4. Incorrect abbreviations: e.g. 'prod' vs. 'prd'
  5. Combinations of the above

When setting tagging standards, we recommend using lower case and stripping special characters from both the key and the value. This makes it easier to write matching conditions, and also helps mitigate differences between clouds and services in which characters are acceptable in keys and values. Consider the following sketch for tag matching:

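import re

required_tags = {
    "environment": ["dev", "test", "prod"],
    # "costcenter": [...],
}
alternates = {
    # alternate spellings of values
    "devel": "dev",
    "development": "dev",
    # alternate spellings of keys
    "enviroment": "environment",
    # ...
}


def normalize_key(key):
    # Lower-case and strip whitespace/special characters: "Cost Center" -> "costcenter"
    return re.sub(r"[^a-z0-9]", "", key.lower())


def check_tags(current_tags):
    # Values are only lower-cased and trimmed here, so owner emails stay intact
    processed_tags = {normalize_key(k): v.lower().strip() for k, v in current_tags.items()}
    result = dict(processed_tags)  # returns the corrected tag set; applying it is up to your automation

    for req_key, req_values in required_tags.items():
        # Find the required key, either directly or via an alternate spelling
        if req_key in processed_tags:
            key = req_key
        else:
            key = next((k for k in processed_tags if alternates.get(k) == req_key), None)
        if key is None:
            result[req_key] = "__missingtag__"  # required key not found
            continue

        value = processed_tags[key]
        if key != req_key:
            del result[key]  # key matched an alternate spelling: remove the current tag
        if value in req_values:
            result[req_key] = value  # tag is good
        elif alternates.get(value) in req_values:
            result[req_key] = alternates[value]  # value misspelled: set the correct value
        else:
            result[req_key] = f"__invalidvalue__( {value} )"  # not a valid value

    return result


current_tags = {"Enviroment": "Development", "Cost Center": "GNB1234"}  # stand-in for $resource.tags
print(check_tags(current_tags))  # → {'costcenter': 'gnb1234', 'environment': 'dev'}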

Lowercase matching and stripping whitespace/special characters will catch the vast majority of mistakes; adding logic to catch misspellings and alternate abbreviations will take you even further. The next step in maturing the tagging automation is additional logic (typically a regex) that checks specific fields for correct formatting (e.g. email).
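Both refinements can lean on Python's standard library. The sketch below assumes `difflib.get_close_matches` for misspellings and a deliberately simple email regex for the owner field; `closest`, `valid_owner`, and the 0.6 similarity cutoff are illustrative choices, not Turbot features.

```python
import difflib
import re
from typing import Optional

# Assumed, deliberately simple email pattern; not a full RFC 5322 validator.
EMAIL = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")


def closest(word: str, candidates: list[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the most similar candidate for a likely misspelling, or None."""
    matches = difflib.get_close_matches(word.lower().strip(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def valid_owner(owner: str) -> bool:
    """Check that an owner tag looks like an email address."""
    return EMAIL.fullmatch(owner.lower().strip()) is not None


print(closest("enviroment", ["environment", "owner", "costcenter"]))  # → environment
print(closest("prd", ["dev", "test", "prod"]))                         # → prod
print(valid_owner("Barney"))                                          # → False
```

Fuzzy matching on very short values can produce false positives, so it is safer to treat the result as a suggestion, or to pair it with an explicit alternates table like the one in the tag-matching sketch.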

I love it when a customer has a really difficult tagging problem that they haven't been able to solve with other tools and we blow them away with a few lines of Nunjucks; Turbot's built-in tagging automation templates really give you tagging superpowers! The best part is that you write this logic once, and Turbot automatically applies it to hundreds of different types of cloud resources across Azure, GCP and AWS.

Do you enjoy cloud governance topics like this? We do too, and we love to talk about them: subscribe to our CTO Newsletter for best practices, tips and industry perspectives.