Event Flood

Troubleshooting Event Flood in the AWS Console

What is an event flood?

An event flood is a large and sustained backlog of events in the Guardrails Events queue. The "Events Queue Backlog" graph in the Turbot Guardrails Enterprise (TE) CloudWatch dashboard is the best place to see if an event flood is underway. Event floods may have several different causes. This document describes where to go looking for the cause. Resolution of the event flood will depend on the cause(s).
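The "large and sustained" distinction matters: a brief spike is normal, a plateau is a flood. A minimal sketch of that check is below. The threshold and window values are illustrative assumptions, not Guardrails defaults; in practice the readings would come from the CloudWatch metric behind the "Events Queue Backlog" graph.

```python
def is_sustained_backlog(samples, threshold=10_000, min_consecutive=6):
    """Return True if queue-depth readings stay at or above `threshold`
    for at least `min_consecutive` consecutive samples (e.g. six
    5-minute CloudWatch periods = 30 minutes of sustained backlog).

    The threshold and window are illustrative assumptions; tune them
    to your environment's normal event volume.
    """
    run = 0
    for depth in samples:
        run = run + 1 if depth >= threshold else 0
        if run >= min_consecutive:
            return True
    return False

# A brief spike is not a flood; a sustained plateau is.
spike = [200, 15_000, 300, 250, 180, 220, 190, 210]
flood = [500, 12_000, 14_000, 15_500, 16_000, 17_000, 18_500, 20_000]
print(is_sustained_backlog(spike))   # False
print(is_sustained_backlog(flood))   # True
```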

Initial Symptoms

The most visible symptom is a large, sustained rise in the "Events Queue Backlog" graph on the TE CloudWatch dashboard, typically accompanied by slow event processing across the affected workspaces.

Common Causes of Event Floods

Various causes can initiate and sustain an event flood. This list is not exhaustive.

It's not uncommon for event floods to arise from a combination of factors. Look for simple solutions to start.

How to look for an event flood?

If you suspect there is an event flood underway, start with these information sources:

TE CloudWatch Dashboard: Start here. The top two graphs on the TE dashboard will show whether you are in a flood state. The "Activities" section at the bottom of the TE dashboard is also very helpful.

TED CloudWatch Dashboard: Check DB and Redis health. While not common, an under-provisioned DB may cause problems.

RDS Performance Insights: Use this when you need additional detail on what may be slowing down event processing.

In an installation with multiple tenants (workspaces), the first step is to identify the noisy tenant that is causing problems. The "View All Messages By Workspace" widget on the TE dashboard can be used to single out the noisy tenant. The number of messages received by the top tenant over the selected duration, and the gap between the top three tenants, are good indicators of an event flood.

Once the noisy tenant is identified, the "View AWS External Messages by AWS Account ID and Events" widget gives more detail about the troublesome account and event within that tenant. Check whether the high event volume correlates with the number of accounts and regions in the workspace, and compare the numbers against the busiest workspace in the environment.
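The "gap between the top tenants" heuristic above can be sketched as a small helper. The input rows mimic the widget's output (tenant, message count); the tenant names and the 3x ratio are illustrative assumptions.

```python
def find_noisy_tenant(rows, ratio=3.0):
    """rows: (tenant, message_count) pairs, e.g. copied from the
    'View All Messages By Workspace' widget. Returns the top tenant
    if its volume is at least `ratio` times the runner-up's, else None.
    The 3x ratio is an illustrative assumption, not a Guardrails value."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    top, runner_up = ranked[0], ranked[1]
    return top[0] if top[1] >= ratio * max(runner_up[1], 1) else None

# Hypothetical workspace volumes: one tenant dwarfs the rest.
rows = [("acme.cloud.turbot.com", 480_000),
        ("globex.cloud.turbot.com", 52_000),
        ("initech.cloud.turbot.com", 41_000)]
print(find_noisy_tenant(rows))  # acme.cloud.turbot.com
```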

Tips and Queries

Navigate to CloudWatch > Logs Insights. In the log group drop-down, select the worker log group for your TE version (for example, /aws/lambda/turbot_5_40_0_worker). Then select the time range you want to query (for example, 1 day or 3 days).

Please note that as the time range increases, the amount of log data scanned (in GB) and the time taken to run the query increase as well. This in turn increases the CloudWatch billing cost.

You can also save these queries in CloudWatch Logs Insights and run them on demand.

In the queries below, replace the example tenant ID 'console-turbot.cloud.turbot.com' with your own tenant ID.
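If you script these queries (for example via the Logs Insights StartQuery API), a tiny helper keeps the tenant ID in one place. The template below is the per-tenant account-breakdown query from this section; the helper just substitutes your tenant ID.

```python
# Per-tenant account breakdown query from this runbook, with the
# tenant ID left as a placeholder.
QUERY_TEMPLATE = """fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and data.msgObj.meta.tenantId='{tenant}'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.account as AccountId
| sort Count desc | limit 15"""

def build_query(tenant_id):
    """Substitute the tenant ID into the query template."""
    return QUERY_TEMPLATE.format(tenant=tenant_id)

print(build_query("console-turbot.cloud.turbot.com"))
```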

Top 10 tenants by external event volume:

fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account)
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant
| sort Count desc | limit 10

Top AWS accounts within a tenant:

fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and data.msgObj.meta.tenantId='console-turbot.cloud.turbot.com'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.account as AccountId
| sort Count desc | limit 15

Top event sources within a tenant:

fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and ispresent(data.msgObj.payload.source) and data.msgObj.meta.tenantId='console-turbot.cloud.turbot.com'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.source as Source
| sort Count desc | limit 15

Top event names for a specific account within a tenant:

fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and ispresent(data.msgObj.payload.source) and data.msgObj.meta.tenantId='console-turbot.cloud.turbot.com' and data.msgObj.payload.account='123456789012'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.account as AccountId, data.msgObj.payload.source as Source, data.msgObj.meta.eventRaw as EventName
| sort Count desc | limit 15

Accounts generating a specific event source (here, aws.tagging) within a tenant:

fields @timestamp, @message
| filter message='received SQS message' and ispresent(data.msgObj.payload.account) and ispresent(data.msgObj.payload.source) and data.msgObj.meta.tenantId='console-turbot.cloud.turbot.com' and data.msgObj.payload.source='aws.tagging'
| filter data.msgObj.type='event.turbot.com:External'
| stats count() as Count by data.msgObj.meta.tenantId as Tenant, data.msgObj.payload.account as AccountId, data.msgObj.payload.source as Source
| sort Count desc | limit 15
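If you run these queries programmatically, the CloudWatch Logs GetQueryResults API returns each result row as a list of {'field': ..., 'value': ...} pairs. A small parser (a sketch; the field names match the aliases used in the queries above, and the sample values are illustrative) turns that into plain dicts for easy comparison:

```python
def parse_insights_rows(results):
    """Convert Logs Insights results -- a list of rows, each a list of
    {'field': ..., 'value': ...} dicts, as returned by the
    GetQueryResults API -- into a list of plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]

# Shaped like a response to the per-source query above (values invented).
sample = [
    [{"field": "Tenant", "value": "acme.cloud.turbot.com"},
     {"field": "Source", "value": "aws.tagging"},
     {"field": "Count", "value": "91234"}],
    [{"field": "Tenant", "value": "acme.cloud.turbot.com"},
     {"field": "Source", "value": "aws.s3"},
     {"field": "Count", "value": "1201"}],
]
rows = parse_insights_rows(sample)
print(rows[0]["Source"], rows[0]["Count"])  # aws.tagging 91234
```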

How to fix an event flood?

As an immediate mitigation, you can move the noisy workspace to a separate TE version so that neighbouring workspaces do not suffer performance issues or throttling.

If the events are coming from the Event Poller, update the AWS > Turbot > Event Poller > Excluded Events policy to exclude the noisy event.

Contact the internal team responsible for the noisy account, explain the issue, and ask them to review any internal automations and/or turn off policies where applicable.