Investigate Event Flood
In this guide, you will:
- Use the AWS Console to investigate an event flood.
An event flood occurs when there is a substantial and sustained backlog of events in the Turbot Guardrails Events queue. Common causes include rogue automation, excessive workloads that exhaust API quotas, and conflicting or misconfigured Guardrails policies. Resolving an event flood requires thorough investigation and addressing the underlying root cause.
The best way to identify an event flood is by checking the Events Queue Backlog graph in the Turbot Guardrails Enterprise (TE) CloudWatch dashboard. Prolonged control processing times and CloudWatch alarms are also common indicators of an event flood.
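As a quick programmatic check, you can also list any CloudWatch alarms that are currently firing in the Guardrails account. The sketch below is a minimal example using boto3; the region is an assumption to adjust for your deployment, and pagination is omitted for brevity.

# Minimal sketch: list CloudWatch alarms currently in the ALARM state.
# The region is an assumption; use the region where Guardrails is deployed.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Pagination omitted for brevity; check NextToken for large accounts.
response = cloudwatch.describe_alarms(StateValue="ALARM")
for alarm in response["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateUpdatedTimestamp"])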
Prerequisites
- Access to the Guardrails AWS account with Administrator privileges.
- Familiarity with the AWS Console and the CloudWatch service.
Step 1: Access AWS Console
Open the AWS Console and navigate to the CloudWatch service in the region where Guardrails is deployed.
Step 2: Navigate to Dashboards
Choose Dashboards from the left navigation menu.
Step 3: Select Dashboard
In Custom dashboards, select the Turbot Guardrails Enterprise (TE) CloudWatch dashboard, which is typically named after the TE version in use.
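If you prefer to locate the dashboard programmatically, the sketch below lists the account's custom dashboards with boto3. The region is an assumption, and the name filter is illustrative; TE dashboards are typically named after the TE version in use.

# Minimal sketch: list CloudWatch dashboards to locate the TE dashboard.
# The region and the "te" name filter are assumptions for illustration.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Pagination omitted for brevity; check NextToken for large accounts.
response = cloudwatch.list_dashboards()
for entry in response["DashboardEntries"]:
    if "te" in entry["DashboardName"].lower():
        print(entry["DashboardName"])  # pick the dashboard named after your TE version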
Step 4: View Events Queue
Select the desired duration from the time range option in the top-right corner, then check the Events Queue Backlog graph in the TE CloudWatch dashboard for sustained spikes that indicate an event flood.
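The same backlog trend can be pulled programmatically from the SQS queue-depth metric. In the sketch below, the region and queue name are placeholders; substitute the events queue used by your Guardrails installation.

# Minimal sketch: pull the approximate backlog of the Guardrails events queue.
# The region and queue name are placeholders; substitute the events queue
# used by your Guardrails installation.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "your-guardrails-events-queue"}],
    StartTime=end - timedelta(hours=6),
    EndTime=end,
    Period=300,
    Statistics=["Maximum"],
)

# Print the backlog over time; sustained high values suggest an event flood.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), int(point["Maximum"]))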
Step 5: Identify Noisy Tenant
Scroll down on the same dashboard page to the Activities section and use the View All Messages By Workspace widget to filter and identify the noisy tenant causing the issue. The number of messages received by the top tenant over the selected duration, along with the gap between the top three tenants, is a strong indicator of an event flood.
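If you want the same breakdown outside the dashboard, the sketch below approximates the widget by counting received SQS messages per tenant in the worker logs. It reuses the fields from the queries in Steps 7–9; the log group name, region, and time window are assumptions, and the widget's exact query may differ.

# Minimal sketch: approximate the "View All Messages By Workspace" widget by
# counting received SQS messages per tenant over the last hour.
# The log group name, region, and time window are assumptions; the dashboard
# widget's exact query may differ.
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs", region_name="us-east-1")

QUERY = """
fields @timestamp, @message
| filter message='received SQS message'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant
| sort Count desc
| limit 5
"""

end = datetime.now(timezone.utc)
start_response = logs.start_query(
    logGroupName="/your/te/worker/log-group",  # placeholder worker log group
    startTime=int((end - timedelta(hours=1)).timestamp()),
    endTime=int(end.timestamp()),
    queryString=QUERY,
)

# Poll until the query completes, then print the per-tenant counts.
while True:
    results = logs.get_query_results(queryId=start_response["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})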
Step 6: Analyze Log Insights
With the workspace identified in the previous step, navigate to CloudWatch > Logs Insights, select the appropriate worker log group for the TE version(s), and choose the desired query duration. This renders the query editor with the selected log group(s), where you can investigate further by analyzing events, event sources, and account IDs for the workspace.
Important: Longer query durations scan more log data and take longer to run, which may result in higher CloudWatch costs.
Note: You can select multiple TE version log groups if required.
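To find the worker log group names without browsing the console, you can list log groups by name prefix. In the sketch below, the prefix and region are placeholders; match them to the naming convention used by your TE installation. The queries in Steps 7–9 can then be run against the chosen group(s), either in the query editor or programmatically as in the Step 5 sketch.

# Minimal sketch: list candidate worker log groups by name prefix.
# The region and prefix are placeholders; match them to your TE installation's
# log group naming convention.
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Pagination omitted for brevity; check nextToken for accounts with many groups.
response = logs.describe_log_groups(logGroupNamePrefix="/your/te/prefix")
for group in response["logGroups"]:
    print(group["logGroupName"])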
Step 7: External Messages by Accounts in a Tenant
In the query editor, use the query below to identify the AWS AccountId(s) contributing to the events.
fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and data.msgObj.meta.tenantId='your-tenant-here.cloud.turbot.com'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.account as AccountId
| sort Count desc
| limit 5
Step 8: External Messages by Source for a Tenant
Use the query below to identify the specific event Source across the different services.
fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and ispresent(data.msgObj.payload.source) and data.msgObj.meta.tenantId='your-tenant-here.cloud.turbot.com'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.source as Source
| sort Count desc
| limit 5
Step 9: External Messages by Event Name
Use the query below to identify the specific EventName associated with a given service (aws.ec2 in this example).
fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and ispresent(data.msgObj.payload.source) and data.msgObj.meta.tenantId='your-tenant-here.cloud.turbot.com' and data.msgObj.payload.source='aws.ec2'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.source as Source, data.msgObj.meta.eventRaw as EventName
| sort Count desc
| limit 5
Step 10: Measures To Fix Event Flood
- Isolate the Noisy Workspace: As an immediate mitigation, move the noisy workspace to a separate TE version to prevent performance issues or throttling for neighboring workspaces.
- Exclude Events from the Event Poller: If the events originate from the Event Poller, update the AWS > Turbot > Event Poller > Excluded Events policy to exclude the problematic event.
Additionally, you may off-board the affected account until the issue is resolved, or turn off the relevant policies where applicable, to prevent further disruptions.
If you encounter further issues, please Open a Support Ticket and attach the following information to help us assist you more efficiently.
- A screenshot of the CloudWatch Events Queue Backlog graph.
- A screenshot of the CloudWatch Log Insights Query results.