Set Up Critical Alarms for Turbot Enterprise Database (TED)

In this guide, you will:

Use AWS CloudFormation to set up critical alarms for Turbot Enterprise Database (TED)
Monitor key metrics such as CPU utilization, database connections, and cache evictions
Ensure proactive monitoring and alerting for TED instances

Monitoring your Turbot Enterprise Database is crucial for maintaining optimal performance and ensuring system reliability. By setting up critical alarms, you can proactively detect and respond to potential issues before they impact your operations. This guide will walk you through deploying a CloudFormation template to set up these alarms.

Prerequisites

Access to the AWS account where Turbot Enterprise Database is deployed
AWS IAM permissions to create CloudFormation stacks and necessary resources
Existing Turbot Enterprise Database (TED) installations
Access to the Turbot Guardrails AWS account with Administrator Privileges

Step 1: Access AWS Console

Log in to the AWS Management Console and navigate to the CloudFormation service in the region where your TED is deployed.

Step 2: Copy the CloudFormation Template

Below is the TED Critical Alarms CloudFormation Template that you will use to set up the alarms. Copy the entire template and save it as a file on your local machine with a .yml extension (e.g., ted_alarms_template.yml).

AWSTemplateFormatVersion: "2010-09-09"

Description: Turbot Guardrails Enterprise Database Monitoring

Metadata:
  AWS::CloudFormation::Interface:
    ParameterGroups:
      - Label:
          default: Hive Configuration
        Parameters:
          - HiveName
          - PrimaryRegion

      - Label:
          default: Database - Advanced - Encryption
        Parameters:
          - KeyAliasSsmValue

      - Label:
          default: Database - Advanced - Parameters
        Parameters:
          - MaxConnections
          - MaxConnectionsAlarmThreshold

      - Label:
          default: Advanced - Infrastructure
        Parameters:
          - ResourceNamePrefix
          - StackResourceNamePrefix

    ParameterLabels:
      # Hive Configuration
      HiveName:
        default: Database Hive Name
      PrimaryRegion:
        default: Primary Region

      KeyAliasSsmValue:
        default: Key Alias Parameter

      # Advanced - Parameters
      MaxConnections:
        default: Maximum number of concurrent connections
      MaxConnectionsAlarmThreshold:
        default: Alarm threshold for maximum number of concurrent connections

      # Advanced - Turbot
      ResourceNamePrefix:
        default: Resource Name Prefix

      StackResourceNamePrefix:
        default: Stack Resource Name Prefix

Parameters:
  HiveName:
    Description: Name of the database hive.
    Type: String
    Default: newton
    AllowedPattern: "^[a-z][a-z0-9_-]*[a-z0-9]$"

  PrimaryRegion:
    Description: Region where the primary Database currently resides. If set to empty, Turbot Guardrails will use the Alpha region set by TEF as the database's primary region.
    Type: String
    Default: ""
    AllowedValues:
      - ""
      - ap-northeast-1
      - ap-northeast-2
      - ap-northeast-3
      - ap-south-1
      - ap-southeast-1
      - ap-southeast-2
      - ca-central-1
      - cn-north-1
      - cn-northwest-1
      - eu-central-1
      - eu-north-1
      - eu-west-1
      - eu-west-2
      - eu-west-3
      - sa-east-1
      - us-east-1
      - us-east-2
      - us-west-1
      - us-west-2
      - us-gov-west-1
      # US Gov East 1 is not supported; it is left here for development purposes only
      - us-gov-east-1

  FoundationAlphaRegion:
    Description: Alpha region specified in TEF.
    Type: AWS::SSM::Parameter::Value<String>
    Default: "/turbot/enterprise/alpha_region"

  KeyAliasSsmValue:
    Description: |
      KMS Key alias defined in Turbot Guardrails Enterprise Foundation.
      YOU SHOULD ONLY CHANGE THIS PARAMETER IF YOU USED A NON-DEFAULT PREFIX IN THE TEF STACK
    Type: AWS::SSM::Parameter::Value<String>
    Default: "/turbot/enterprise/foundation_key_alias"

  MaxConnections:
    Description: Sets the maximum number of concurrent connections.
    Type: Number
    MinValue: 6
    MaxValue: 8388607
    Default: 600

  MaxConnectionsAlarmThreshold:
    Description: Sets the alarm threshold for maximum number of concurrent connections.
    Type: Number
    MinValue: 6
    MaxValue: 8388607
    Default: 500

  ResourceNamePrefix:
    Description: >
      Name of the resource prefix used by the Turbot Guardrails Database stack, which
      is a prefix for exported outputs from that stack.
    Type: String
    Default: turbot
    AllowedPattern: "^[a-z][a-z0-9]*$"

  StackResourceNamePrefix:
    Description: >
      Prefix for the names of resources created by this stack.
    Type: String
    Default: monitoring
    AllowedPattern: "^[a-z][a-z0-9]*$"

  ParameterDeploymentTrigger:
    Description: >
      Changes to SSM parameter overrides (e.g. IAM role ARNs) are not automatically detected by CloudFormation.
      Upgrades will recalculate the parameters, but if you wish to refresh your parameters without upgrading, you can toggle this parameter.
      Simply changing it is enough to force the parameters to be re-read and recalculated.
    Type: String
    Default: Blue
    AllowedValues:
      - Blue
      - Green

Mappings:
  Constants:
    Turbot:
      EntityName: Turbot HQ Inc

    Product:
      Name: Turbot Enterprise Database
      Version: ${tedVersion}

    RequiredVersion:
      TEF: ${requiredTefVersion}

Conditions:
  UseFoundationAlphaRegion: !Equals [!Ref PrimaryRegion, ""]
  IsFoundationPrimary: !Equals [!Ref FoundationAlphaRegion, !Ref "AWS::Region"]
  IsTEDRegionPrimary: !Equals [!Ref PrimaryRegion, !Ref "AWS::Region"]

  IsPrimary: !Or
    - !And
      - Condition: IsFoundationPrimary
      - Condition: UseFoundationAlphaRegion
    - !And
      - Condition: IsTEDRegionPrimary
      - !Not
        - Condition: UseFoundationAlphaRegion

Resources:
  HiveSNSTopic:
    Type: "AWS::SNS::Topic"
    Properties:
      TopicName: !Sub "${ResourceNamePrefix}_${StackResourceNamePrefix}_alarms"
      KmsMasterKeyId: !Ref KeyAliasSsmValue

  HiveDashboard:
    Type: "AWS::CloudWatch::Dashboard"
    Properties:
      DashboardName: !Sub
        - "${ResourceNamePrefix}_${StackResourceNamePrefix}_ted_${HiveName}_${Region}"
        - HiveName: !Ref HiveName
          Region: !Join ["_", !Split ["-", !Ref "AWS::Region"]]
      DashboardBody: !Join
        - ""
        - - |
            { "widgets": [

          - !Join
            - ","
            - - !Sub
                - |
                  {
                    "type": "metric",
                    "x": 0,
                    "y": 0,
                    "width": 15,
                    "height": 6,
                    "properties": {
                      "metrics": [
                        [ "AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "${PrimaryHive}", { "id": "m1", "color": "#1f77b4" } ]
                      ],
                      "annotations": {
                        "horizontal": [
                            {
                              "label": "CPUUtilization >= 90 for 12 datapoints within 1 hour",
                              "value": 90
                            }
                        ]
                      },
                      "yAxis": {
                        "left": {
                            "min": 0
                        }
                      },
                      "view": "timeSeries",
                      "stacked": false,
                      "title": "CPU",
                      "region": "${Region}",
                      "period": 300
                    }
                  }
                - PrimaryHive: !Join ["-", !Split ["_", !Sub "${ResourceNamePrefix}_${HiveName}"]]
                  Region: !Ref "AWS::Region"

              - !Sub
                - |
                  {
                    "type": "text",
                    "x": 15,
                    "y": 0,
                    "width": 9,
                    "height": 6,
                    "properties": {
                        "markdown": "_CPU utilization indicates how hard the Hive is working to process requests._\n\n**Healthy:** Consistent load with some spikes.\n\n**Overloaded:** The Hive CPU is overloaded when it is consistently at 50% or higher.\n\n**Under-provisioned:** When there are no errors in Turbot Guardrails operations and CPU is very high, the Hive instances may be too small for the workload.\n\n**Over-provisioned:** When CPU is consistently very low and the largest load spikes are below 50%.\n"
                    }
                  }
                - {}

              - !Sub
                - |
                  {
                    "type": "metric",
                    "x": 0,
                    "y": 6,
                    "width": 15,
                    "height": 6,
                    "properties": {
                      "metrics": [
                        [ "AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "${PrimaryHive}", { "id": "m1", "color": "#1f77b4" } ]
                      ],
                      "yAxis": {
                        "left": {
                            "showUnits": true,
                            "min": 0,
                            "max": ${Max_Connections}
                        }
                      },
                      "annotations": {
                        "horizontal": [
                            {
                              "label": "Number of connections alarm threshold.",
                              "value": ${Max_Connections_AlarmThreshold}
                            }
                        ]
                      },
                      "view": "timeSeries",
                      "stacked": false,
                      "region": "${Region}",
                      "period": 300,
                      "title": "Connections"
                    }
                  }
                - PrimaryHive: !Join ["-", !Split ["_", !Sub "${ResourceNamePrefix}_${HiveName}"]]
                  Region: !Ref "AWS::Region"
                  Max_Connections: !Ref MaxConnections
                  Max_Connections_AlarmThreshold: !Ref MaxConnectionsAlarmThreshold

              - !Sub
                - |
                  {
                    "type": "text",
                    "x": 15,
                    "y": 6,
                    "width": 9,
                    "height": 6,
                    "properties": {
                        "markdown": "_Connection counts should roughly correlate with the number of ECS Tasks and invoked Lambdas._\n\n**Healthy:** Connections should slowly churn over time as Lambdas and Turbot Guardrails Tasks spin up and down.\n\n**Abnormal Spike:** An abrupt spike in connections may indicate a failure in the Tasks or Lambdas that caused lots of reconnections.\n\n**Connections Flood:** Should the connections count continually increase over time, this may indicate stale connections from processes that can't finish.\n"
                    }
                  }
                - {}

              - !Sub
                - |
                  {
                    "type": "metric",
                    "x": 0,
                    "y": 86,
                    "width": 15,
                    "height": 6,
                    "properties": {
                        "view": "timeSeries",
                        "stacked": false,
                        "metrics": [
                          [ "AWS/ElastiCache", "DatabaseMemoryUsagePercentage", { "yAxis": "right" } ],
                          [ "AWS/ElastiCache", "SwapUsage" ]
                        ],
                        "yAxis": {
                          "right": {
                              "min": 0
                          }
                        },
                        "region": "${Region}",
                        "title": "Cache Swap Usage & Memory Usage Percentage",
                        "period": 300,
                        "stat": "Average"
                    }
                  }
                - Region: !Ref "AWS::Region"

              - !Sub
                - |
                  {
                    "type": "text",
                    "x": 15,
                    "y": 86,
                    "width": 9,
                    "height": 6,
                    "properties": {
                        "markdown": "_ElastiCache Swap Usage and Database Memory Usage Percentage._\n\n**Swap Usage:** The amount of swap used on the host (in MB).\n\n**Database Memory Usage Percentage:** Percentage of memory utilization, based on the current memory utilization (BytesUsedForCache) and the maxmemory. Maxmemory sets the maximum amount of memory for the dataset."
                    }
                  }
                - {}

              - !Sub
                - |
                  {
                    "type": "metric",
                    "x": 0,
                    "y": 86,
                    "width": 15,
                    "height": 6,
                    "properties": {
                        "view": "timeSeries",
                        "stacked": false,
                        "metrics": [
                          [ "AWS/ElastiCache", "Evictions" ]
                        ],
                        "yAxis": {
                          "left": {
                              "min": 0
                          }
                        },
                        "annotations": {
                          "horizontal": [
                              {
                                "label": "Number of evictions alarm threshold.",
                                "value": 500000
                              }
                          ]
                        },
                        "region": "${Region}",
                        "title": "Cache Evictions",
                        "period": 300,
                        "stat": "Sum"
                    }
                  }
                - Region: !Ref "AWS::Region"

              - !Sub
                - |
                  {
                    "type": "text",
                    "x": 15,
                    "y": 86,
                    "width": 9,
                    "height": 6,
                    "properties": {
                        "markdown": "_ElastiCache Evictions._\n\n**Evictions:** This metric represents the number of non-expired items that the cache evicted due to memory constraints to allow space for new writes. For ElastiCache Redis, this is derived from the evicted_keys statistic at Redis INFO."
                    }
                  }
                - {}

          - |
            ] }

  HiveCPUUtilAlarm:
    Type: "AWS::CloudWatch::Alarm"
    Condition: IsPrimary
    Properties:
      AlarmDescription: "Database CPU Utilization alarm threshold"
      AlarmActions:
        - !Sub "arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${ResourceNamePrefix}_${StackResourceNamePrefix}_alarms"
      AlarmName: !Sub "${ResourceNamePrefix}_${StackResourceNamePrefix}_${HiveName}_cpu_util_alarm"
      ComparisonOperator: "GreaterThanOrEqualToThreshold"
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Join ["-", !Split ["_", !Sub "${ResourceNamePrefix}-${HiveName}"]]
      EvaluationPeriods: 12
      MetricName: CPUUtilization
      Namespace: AWS/RDS
      Period: 300
      Statistic: Average
      Threshold: 90
      TreatMissingData: missing

  HivePrimaryMaxDBConnectionThresholdAlarm:
    Type: "AWS::CloudWatch::Alarm"
    Condition: IsPrimary
    Properties:
      AlarmActions:
        - !Sub "arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${ResourceNamePrefix}_${StackResourceNamePrefix}_alarms"
      AlarmName: !Sub "${ResourceNamePrefix}_${StackResourceNamePrefix}_${HiveName}_db_max_connections_alarm"
      ComparisonOperator: "GreaterThanOrEqualToThreshold"
      Dimensions:
        - Name: DBInstanceIdentifier
          Value: !Join ["-", !Split ["_", !Sub "${ResourceNamePrefix}-${HiveName}"]]
      EvaluationPeriods: 3
      MetricName: DatabaseConnections
      Namespace: AWS/RDS
      Period: 300
      Statistic: Maximum
      Threshold: !Ref MaxConnectionsAlarmThreshold
      TreatMissingData: missing

  HiveElastiCacheEvictionsThresholdAlarmNodeOne:
    Type: "AWS::CloudWatch::Alarm"
    Properties:
      AlarmActions:
        - !Sub "arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${ResourceNamePrefix}_${StackResourceNamePrefix}_alarms"
      AlarmName: !Sub "${ResourceNamePrefix}_${StackResourceNamePrefix}_${HiveName}_elasticache_evictions_alarm_1"
      ComparisonOperator: "GreaterThanOrEqualToThreshold"
      Dimensions:
        - Name: CacheClusterId
          Value: !Join ["-", !Split ["_", !Sub "${ResourceNamePrefix}-${HiveName}-cache-cluster-001"]]
      EvaluationPeriods: 3
      MetricName: Evictions
      Namespace: AWS/ElastiCache
      Period: 300
      Statistic: Sum
      Threshold: 500000
      TreatMissingData: missing

Step 3: Create New Stack

In the AWS CloudFormation console, click on Create stack and select With new resources (standard).

Under Specify template, choose the Upload a template file option. Click Choose file and select the template file you saved in Step 2. Click Next.

Step 4: Enter Stack Details

Provide a Stack name and enter the required parameters:

Parameter Name	Description
HiveName	The name of the database hive. This should match the hive name used in your TED installation.
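PrimaryRegion	The primary region where the database resides. Leave empty to use the alpha region set by TEF.
FoundationAlphaRegion	SSM parameter that holds the alpha region specified in TEF. Default is `/turbot/enterprise/alpha_region`.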
KeyAliasSsmValue	The KMS Key alias defined in Turbot Guardrails Enterprise Foundation. Leave as default unless you used a non-default prefix in the TEF stack.
MaxConnections	Sets the maximum number of concurrent database connections as defined in TED installation. Default is `600`.
MaxConnectionsAlarmThreshold	Sets the alarm threshold for maximum number of concurrent connections. Default is `500`.
ResourceNamePrefix	Resource prefix used by the Turbot Guardrails Database stack. Default is `turbot`.
StackResourceNamePrefix	Resource prefix for resources created by this stack. Default is `monitoring`.
ParameterDeploymentTrigger	Toggle this parameter if you wish to refresh your parameters without upgrading.
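
CloudFormation rejects values that violate the template's AllowedPattern, MinValue, and MaxValue constraints. The Python sketch below checks a set of values against those constraints before you enter them. The patterns and limits are copied from the template; the function name validate_parameters and the final threshold comparison are assumptions added for illustration and are not enforced by the template.

```python
import re

# Constraints copied from the template's Parameters section.
HIVE_NAME_PATTERN = r"^[a-z][a-z0-9_-]*[a-z0-9]$"
PREFIX_PATTERN = r"^[a-z][a-z0-9]*$"
MIN_CONNECTIONS, MAX_CONNECTIONS = 6, 8388607

# Template defaults for the parameters checked here.
DEFAULTS = {
    "HiveName": "newton",
    "ResourceNamePrefix": "turbot",
    "StackResourceNamePrefix": "monitoring",
    "MaxConnections": 600,
    "MaxConnectionsAlarmThreshold": 500,
}


def validate_parameters(params):
    """Return a list of problems; an empty list means no constraint was violated."""
    problems = []
    if not re.fullmatch(HIVE_NAME_PATTERN, params["HiveName"]):
        problems.append("HiveName does not match the AllowedPattern")
    for key in ("ResourceNamePrefix", "StackResourceNamePrefix"):
        if not re.fullmatch(PREFIX_PATTERN, params[key]):
            problems.append(f"{key} does not match the AllowedPattern")
    for key in ("MaxConnections", "MaxConnectionsAlarmThreshold"):
        if params[key] not in range(MIN_CONNECTIONS, MAX_CONNECTIONS + 1):
            problems.append(f"{key} is outside the allowed range")
    # Extra sanity check, NOT part of the template (assumption for illustration).
    if params["MaxConnectionsAlarmThreshold"] > params["MaxConnections"]:
        problems.append("alarm threshold is higher than MaxConnections")
    return problems


if __name__ == "__main__":
    print(validate_parameters(DEFAULTS))
```

Note that the HiveName pattern requires at least two characters, because it demands a leading letter and a separate trailing letter or digit.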

Click Next after filling in the parameters.
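
The CPU and connection alarms in the template carry Condition: IsPrimary, so whether they are created depends on PrimaryRegion, the TEF alpha region, and the region the stack is deployed in. The sketch below restates the Conditions section in Python to make the outcome easier to predict. The function name is_primary is hypothetical, and the region values in the usage lines are examples only.

```python
def is_primary(primary_region, foundation_alpha_region, current_region):
    """Restate the template's IsPrimary condition (illustrative only)."""
    use_foundation_alpha_region = primary_region == ""
    is_foundation_primary = foundation_alpha_region == current_region
    is_ted_region_primary = primary_region == current_region
    return (
        (is_foundation_primary and use_foundation_alpha_region)
        or (is_ted_region_primary and not use_foundation_alpha_region)
    )


if __name__ == "__main__":
    # PrimaryRegion left empty: the TEF alpha region decides.
    print(is_primary("", "us-east-1", "us-east-1"))
    # PrimaryRegion set explicitly: it overrides the TEF alpha region.
    print(is_primary("us-west-2", "us-east-1", "us-east-1"))
```

When the condition is false, the stack still creates the SNS topic, the dashboard, and the ElastiCache evictions alarm, because those resources have no condition attached.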

Step 5: Configure Stack Options

Optionally, add tags or adjust advanced options as needed. For this guide, you can proceed with the defaults. Click Next.

Step 6: Review and Create Stack

Review the stack details and ensure all parameters are correct. The template in this guide creates only SNS and CloudWatch resources, so an IAM capabilities acknowledgement is not expected; if the console does show a checkbox under Capabilities, select it to continue.

Click Create stack to initiate the stack creation process.

Step 7: Enable Termination Protection

It is recommended to enable Termination Protection on the CloudFormation stack to prevent accidental deletion. After the stack is created, navigate to the stack details, click on Stack actions, and select Enable termination protection.

Step 8: Verify Stack Creation

Wait for the stack status to reach CREATE_COMPLETE. This process may take several minutes.

Step 9: Verify Alarms Setup

Navigate to the CloudWatch service in the AWS Console and select Alarms from the sidebar. You should see the newly created alarms for your Turbot Enterprise Database.

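Database CPU Utilization: Triggers when average CPU utilization is greater than or equal to 90% for 12 consecutive 5-minute periods (1 hour). Created only when the stack is deployed in the primary region.
Database Max Connections: Triggers when the maximum number of database connections is greater than or equal to MaxConnectionsAlarmThreshold for 3 consecutive 5-minute periods (15 minutes). Created only when the stack is deployed in the primary region.
ElastiCache Evictions: Triggers when the sum of cache evictions in a 5-minute period is greater than or equal to 500,000 for 3 consecutive periods (15 minutes).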
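
The time windows above follow from EvaluationPeriods multiplied by Period in the template. The Python sketch below shows that arithmetic and a simplified model of when an alarm would fire. It assumes every datapoint is present and that all of the most recent periods must breach (the template sets no DatapointsToAlarm); it does not model TreatMissingData or CloudWatch's internal evaluation range. The function names are hypothetical.

```python
def evaluation_window_seconds(evaluation_periods, period_seconds):
    """Length of the window an alarm looks at, in seconds."""
    return evaluation_periods * period_seconds


def would_alarm(datapoints, threshold, evaluation_periods):
    """Simplified GreaterThanOrEqualToThreshold check over the latest periods."""
    if evaluation_periods > len(datapoints):
        return False  # not enough data yet in this simplified model
    recent = datapoints[-evaluation_periods:]
    return all(value >= threshold for value in recent)


if __name__ == "__main__":
    # CPU alarm: 12 periods of 300 seconds each.
    print(evaluation_window_seconds(12, 300))
    # Connections and evictions alarms: 3 periods of 300 seconds each.
    print(evaluation_window_seconds(3, 300))
    # One datapoint below 90 within the last 12 keeps the CPU alarm quiet.
    print(would_alarm([95] * 11 + [80], threshold=90, evaluation_periods=12))
```

In this model a single period below the threshold resets the count, which is why short CPU spikes do not trigger the one-hour CPU alarm.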

Step 10: Set Up Notifications

The CloudFormation template creates an SNS topic for alarm notifications and sets it as the action for every alarm. Subscribe to this topic to receive email alerts.

Navigate to the SNS service in the AWS Console.
Find the topic named after your two resource prefixes followed by _alarms (turbot_monitoring_alarms with the default prefixes).
Click Create subscription and enter your email address.
Confirm the subscription via the email you receive.
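
If you prefer to locate the topic by ARN, the alarms reference it using the pattern arn:Partition:sns:Region:AccountId:ResourceNamePrefix_StackResourceNamePrefix_alarms. The Python sketch below assembles that ARN. The partition mapping covers only the region families listed in the template's PrimaryRegion values, the account ID in the usage line is a placeholder, and both function names are hypothetical.

```python
def partition_for(region):
    """Map a region to its AWS partition (covers the regions listed in the template)."""
    if region.startswith("cn-"):
        return "aws-cn"
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    return "aws"


def alarm_topic_arn(region, account_id,
                    resource_prefix="turbot", stack_prefix="monitoring"):
    """Build the SNS topic ARN that the alarms use as their AlarmActions target."""
    topic_name = f"{resource_prefix}_{stack_prefix}_alarms"
    return f"arn:{partition_for(region)}:sns:{region}:{account_id}:{topic_name}"


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID, not a real account.
    print(alarm_topic_arn("us-east-1", "123456789012"))
```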

Next Steps

Monitor the alarms and ensure that they are configured correctly.
If you have additional TED installations, repeat the process for each, creating separate stacks.
Consider integrating with your incident management system for automated alerting and response.

Troubleshooting

Issue	Description	Guide
Permission Issues	If you encounter permission errors during stack creation, ensure that your IAM user or role has the necessary permissions to create CloudFormation stacks and related resources like SNS topics.	AWS Permissions for Turbot Guardrails Administrators
Further Assistance	If issues persist, please open a support ticket and include relevant stack logs and error messages to help us assist you effectively.	Open Support Ticket