- Prerequisites
- Step 1: Access AWS Console
- Step 2: Copy the CloudFormation Template
- Step 3: Create New Stack
- Step 4: Enter Stack Details
- Step 5: Configure Stack Options
- Step 6: Review and Create Stack
- Step 7: Enable Termination Protection
- Step 8: Verify Stack Creation
- Step 9: Verify Alarms Setup
- Next Steps
- Troubleshooting
Set Up Critical Alarms for Turbot Enterprise (TE)
In this guide, you will:
- Use AWS CloudFormation to set up critical alarms for Turbot Enterprise (TE)
- Monitor key metrics such as worker and factory queue backlogs
- Ensure proactive monitoring and alerting for TE instances
Monitoring your Turbot Enterprise environment is crucial for maintaining optimal performance and ensuring system reliability. By setting up critical alarms, you can proactively detect and respond to potential issues before they impact your operations. This guide will walk you through deploying a CloudFormation template to set up these alarms.
Prerequisites
- Access to the AWS account where Turbot Enterprise is deployed
- AWS IAM permissions to create CloudFormation stacks and necessary resources
- Existing Turbot Enterprise (TE) installations
- Access to the Turbot Guardrails AWS account with Administrator Privileges
Step 1: Access AWS Console
Log in to the AWS Management Console and navigate to the CloudFormation service in the region where your TE is deployed.
Step 2: Copy the CloudFormation Template
Below is the TE Critical Alarms CloudFormation Template that you will use to set up the alarms. Copy the entire template and save it as a file on your local machine with a .yml
extension (e.g., te_alarms_template.yml
).
AWSTemplateFormatVersion: "2010-09-09"
Description: Turbot Guardrails Enterprise Monitoring
Metadata: AWS::CloudFormation::Interface: ParameterGroups: - Label: default: Advanced - Infrastructure Parameters: - ResourceNamePrefix - StackResourceNamePrefix - EnterpriseVersion
ParameterLabels: # Advanced - Turbot ResourceNamePrefix: default: Resource Name Prefix
# Temporary parameter, since there are existing alarms and resources in place with ResourceNamePrefix, # adding additional prefix. StackResourceNamePrefix: default: Stack Resource Name Prefix
EnterpriseVersion: default: Enterprise Version
Parameters: ResourceNamePrefix: Description: > Name of the resource prefix used by the Turbot Guardrails Database stack, which is a prefix for exported outputs from that stack. Type: String Default: turbot AllowedPattern: "^[a-z][a-z0-9]*$"
StackResourceNamePrefix: Description: > Name of the resources prefix created by this Stack. Type: String Default: monitoring AllowedPattern: "^[a-z][a-z0-9]*$"
EnterpriseVersion: Description: Select the version of the Enterprise. Type: String Default: 5.46.0
Mappings: Constants: Turbot: EntityName: Turbot HQ Inc
Resources: Dashboard: Type: AWS::CloudWatch::Dashboard Properties: DashboardName: !Sub - "${ResourceNamePrefix}_${StackResourceNamePrefix}_${SafeVersion}_${Region}" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]] Region: !Join ["_", !Split ["-", !Ref "AWS::Region"]] DashboardBody: !Join - "" - - | { "widgets": [
- !Join - "," - - !Sub - | { "height": 2, "width": 24, "y": 0, "x": 0, "type": "text", "properties": { "markdown": "# Events & Workers" } } - {}
- !Sub - | { "height": 6, "width": 15, "y": 2, "x": 0, "type": "metric", "properties": { "metrics": [ [ "AWS/SQS", "NumberOfMessagesReceived", "QueueName", "${WorkerQueueName}", { "stat": "Sum", "period": 300 } ], [ "AWS/SQS", "NumberOfMessagesSent", "QueueName", "${WorkerQueueName}", { "stat": "Sum", "period": 300 } ], [ "AWS/SQS", "NumberOfMessagesDeleted", "QueueName", "${WorkerQueueName}", { "stat": "Sum", "period": 300 } ], [ "AWS/SQS", "NumberOfMessagesReceived", "QueueName", "${FactoryQueueName}", { "stat": "Sum", "period": 300 } ], [ "AWS/SQS", "NumberOfMessagesSent", "QueueName", "${FactoryQueueName}", { "stat": "Sum", "period": 300 } ], [ "AWS/SQS", "NumberOfMessagesDeleted", "QueueName", "${FactoryQueueName}", { "stat": "Sum", "period": 300 } ] ], "view": "timeSeries", "stacked": false, "region": "${AWS::Region}", "title": "Events Queue Activity", "period": 300, "yAxis": { "left": { "min": 0 } } } } - WorkerQueueName: !Sub - "${ResourceNamePrefix}_${SafeVersion}_events" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]] FactoryQueueName: !Sub - "${ResourceNamePrefix}_${SafeVersion}_events_factory" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]]
- !Sub - | { "height": 6, "width": 9, "y": 2, "x": 15, "type": "text", "properties": { "markdown": "_The Events Queue is the heart of running controls, actions, calculated policies, notifications and all worker tasks._\n\n**Healthy**: Tasks done (green) will match exactly tasks planned (red) and tasks attempted (orange). There may be spikes in load, but generally the chart appears as a single steady line. By design, Turbot Guardrails should have very few events when there is no activity in the environment.\n\n**Overloaded**: Tasks planned (red) exceeds tasks done (green) for a sustained period, creating a backlog of work.\n\n**High Task Retry Rate**: Tasks attempted (orange) is at a sustained, high level. Retries happen occasionally (e.g. throttling) but ongoing levels usually indicate a noisy mod or control.\n" } } - {}
- !Sub - | { "height": 6, "width": 15, "y": 8, "x": 0, "type": "metric", "properties": { "metrics": [ [ "AWS/SQS", "ApproximateNumberOfMessagesDelayed", "QueueName", "${WorkerQueueName}" ], [ "AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", "${WorkerQueueName}" ], [ "AWS/SQS", "ApproximateNumberOfMessagesNotVisible", "QueueName", "${WorkerQueueName}" ], [ "AWS/SQS", "ApproximateNumberOfMessagesDelayed", "QueueName", "${FactoryQueueName}" ], [ "AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", "${FactoryQueueName}" ], [ "AWS/SQS", "ApproximateNumberOfMessagesNotVisible", "QueueName", "${FactoryQueueName}" ] ], "view": "timeSeries", "stacked": true, "region": "${AWS::Region}", "yAxis": { "left": { "min": 0 } }, "annotations": { "horizontal": [ { "label": "Number of messages visible >= 50000 for 24 datapoints within 2 hours.", "value": 50000 } ] }, "title": "Events Queue Backlog" } } - WorkerQueueName: !Sub - "${ResourceNamePrefix}_${SafeVersion}_events" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]] FactoryQueueName: !Sub - "${ResourceNamePrefix}_${SafeVersion}_events_factory" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]]
- !Sub - | { "height": 6, "width": 9, "y": 8, "x": 15, "type": "text", "properties": { "markdown": "_Events queue backlog shows any buildup of tasks, often visible as slower processing and handling._\n\n**Healthy**: Backlog (red) and running tasks (orange) are low (often zero) with short buildups (< 30 mins) associated with events like resource imports, mod updates, etc.\n\n**Overloaded**: Sustained backlog growth (red) indicating that ECS workers cannot keep up with incoming events. Increase capacity or check for a flood of unnecessary events.\n\n**Stalled processing**: Running tasks (orange) appear for long periods. Typically indicates a misconfigured mod failing to terminate the task.\n" } } - {}
- | ] }
EventsQueueThresholdAlarmWorker: Type: "AWS::CloudWatch::Alarm" Properties: AlarmName: !Sub - "${ResourceNamePrefix}_${StackResourceNamePrefix}_${SafeVersion}_events_worker_queue_messages_visible_alarm" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]] AlarmDescription: "Events Queue Backlog number of messages visible alarm threshold" Namespace: "AWS/SQS" MetricName: ApproximateNumberOfMessagesVisible ComparisonOperator: "GreaterThanOrEqualToThreshold" Dimensions: - Name: QueueName Value: !Sub - "${ResourceNamePrefix}_${SafeVersion}_events" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]] EvaluationPeriods: 24 Period: 300 Statistic: "Average" Threshold: 50000 AlarmActions: - !Sub "arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${ResourceNamePrefix}_${StackResourceNamePrefix}_alarms"
EventsQueueThresholdAlarmFactory: Type: "AWS::CloudWatch::Alarm" Properties: AlarmName: !Sub - "${ResourceNamePrefix}_${StackResourceNamePrefix}_${SafeVersion}_events_factory_queue_messages_visible_alarm" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]] AlarmDescription: "Events Queue Backlog number of messages visible alarm threshold" Namespace: "AWS/SQS" MetricName: ApproximateNumberOfMessagesVisible ComparisonOperator: "GreaterThanOrEqualToThreshold" Dimensions: - Name: QueueName Value: !Sub - "${ResourceNamePrefix}_${SafeVersion}_events_factory" - SafeVersion: !Join ["_", !Split [".", !Join ["_", !Split ["-", !Ref EnterpriseVersion]]]] EvaluationPeriods: 24 Period: 300 Statistic: "Maximum" Threshold: 50000 AlarmActions: - !Sub "arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${ResourceNamePrefix}_${StackResourceNamePrefix}_alarms"
Step 3: Create New Stack
In the AWS CloudFormation console, click on Create stack and select With new resources (standard).
Under Specify template, choose the Upload a template file option. Click Choose file and select the template file you downloaded earlier. Click Next.
Step 4: Enter Stack Details
Provide a Stack name and enter the required parameters:
Parameter Name | Description |
---|---|
ResourceNamePrefix | Resource prefix used by the Turbot Guardrails stack. Default is turbot . |
StackResourceNamePrefix | Resource prefix for resources created by this stack. Default is monitoring . |
EnterpriseVersion | The version of Turbot Enterprise. This should match the TE version you have installed (e.g., 5.46.0 ). |
Click Next after filling in the parameters.
Step 5: Configure Stack Options
Optionally, add tags or adjust advanced options as needed. For this guide, you can proceed with the defaults. Click Next.
Step 6: Review and Create Stack
Review the stack details and ensure all parameters are correct. Acknowledge that AWS CloudFormation might create IAM resources by selecting the checkbox under Capabilities.
Click Create stack to initiate the stack creation process.
Step 7: Enable Termination Protection
It is recommended to enable Termination Protection on the CloudFormation stack to prevent accidental deletion. After the stack is created, navigate to the stack details, click on Stack actions, and select Enable termination protection.
Step 8: Verify Stack Creation
Wait for the stack status to reach CREATE_COMPLETE. This process may take several minutes.
Step 9: Verify Alarms Setup
Navigate to the CloudWatch service in the AWS Console and select Alarms from the sidebar. You should see the newly created alarms for your Turbot Enterprise environment.
- Worker Queue Alarm: Triggers when the
ApproximateNumberOfMessagesVisible
in the worker queue is greater than or equal to 50,000 for 24 data points within 2 hours. - Factory Queue Alarm: Triggers when the
ApproximateNumberOfMessagesVisible
in the factory queue is greater than or equal to 50,000 for 24 data points within 2 hours.
Next Steps
- Monitor the alarms and ensure that they are configured correctly.
- If you have additional TE installations, repeat the process for each, creating separate stacks.
- Consider integrating with your incident management system for automated alerting and response.
Troubleshooting
Issue | Description | Guide |
---|---|---|
Permission Issues | If you encounter permission errors during stack creation, ensure that your IAM user or role has the necessary permissions to create CloudFormation stacks and related resources like SNS topics. | AWS Permissions for Turbot Guardrails Administrators |
Further Assistance | If issues persist, please open a support ticket and include relevant stack logs and error messages to help us assist you effectively. | Open Support Ticket |