Events Queue Backlog (Count)
AlarmName: <prefix>_<te_version>_events_queue_messages_visible_alarm
(example: turbot_5_40_0_events_queue_messages_visible_alarm).
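The naming pattern can be sketched as a small helper. This is an illustrative sketch only: the function name is hypothetical, and it assumes the version segment is the TE version with dots replaced by underscores, as the example suggests.

```python
def events_queue_alarm_name(prefix: str, te_version: str) -> str:
    """Build the expected alarm name from a prefix and a TE version string.

    Assumption: the version segment is the TE version with any leading "v"
    dropped and dots replaced by underscores ("v5.40.0" -> "5_40_0").
    """
    version_part = te_version.lstrip("vV").replace(".", "_")
    return f"{prefix}_{version_part}_events_queue_messages_visible_alarm"


if __name__ == "__main__":
    print(events_queue_alarm_name("turbot", "5.40.0"))
    # prints: turbot_5_40_0_events_queue_messages_visible_alarm
```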
Configuration: The Lambda reserved concurrency is configured in the
"Advanced - Worker" section of TEF. All new TE installations pick up the
value from TEF. You can also override this value at the TE level by giving it a
new value. When you change WorkerLambdaReservedConcurrency in TEF, remember to
flip the "Parameter Deployment Trigger" of the TE from Blue to Green or vice
versa.
Condition: ApproximateNumberOfMessagesVisible >= 5000 for 3 datapoints within 15 minutes.
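To make the condition concrete, here is a minimal sketch that checks a series of queue-depth samples against it. It assumes a 5-minute metric period, so that "3 datapoints within 15 minutes" means three consecutive samples; the function name and parameters are hypothetical, not part of Guardrails or CloudWatch.

```python
from typing import Sequence


def alarm_would_fire(
    samples: Sequence[float],
    threshold: float = 5000,
    datapoints_to_alarm: int = 3,
    window: int = 3,  # assumption: 15 minutes / 5-minute period = 3 samples
) -> bool:
    """Return True if any `window` consecutive samples contain at least
    `datapoints_to_alarm` values at or above `threshold`."""
    if len(samples) < window:
        return False
    for start in range(len(samples) - window + 1):
        breaching = sum(1 for v in samples[start:start + window] if v >= threshold)
        if breaching >= datapoints_to_alarm:
            return True
    return False


if __name__ == "__main__":
    # A short spike does not trip the alarm; a sustained backlog does.
    print(alarm_would_fire([100, 6000, 200, 7000, 300]))  # False
    print(alarm_would_fire([5000, 5200, 9000]))           # True
```

Under these assumptions, a brief buildup from a mod update stays below the alarm condition unless it holds at 5000 or more for the full window.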
Healthy: Backlog (red) and running tasks (orange) are low (often zero) with short buildups (< 30 mins) associated with events like resource imports, mod updates, etc.
Overloaded: Sustained backlog growth (red) indicating that ECS workers cannot keep up with incoming events. Increase capacity or check for a flood of unnecessary events.
Stalled processing: Running tasks (orange) appear for long periods. Typically indicates a misconfigured mod failing to terminate the task.
Troubleshooting
- How many events are in the events queue?: 10K, 100K, 1 million?
- When did the event backlog start?: To the nearest 15 minutes, when did the backlog start?
- Worker Concurrency: Is the worker concurrency maxed out?
- Policy Changes: Were any policy changes made before the start of the backlog?
- Guardrails Master Changes: Were any changes made to the Guardrails application?
Answers to these questions will come from the Turbot Guardrails Enterprise (TE) CloudWatch dashboard.
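If you export the queue-depth datapoints from the dashboard, the "when did the backlog start" question can be answered with a short script. The sketch below is one possible approach: it assumes a backlog means the visible-message count is at or above 5000 (the alarm threshold), and it rounds the start time down to a 15-minute boundary. The function name and input shape are hypothetical.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple


def backlog_start(
    datapoints: Sequence[Tuple[datetime, float]],
    threshold: float = 5000,  # assumption: reuse the alarm threshold
) -> Optional[datetime]:
    """Return the 15-minute boundary at or before the start of the backlog
    that is still ongoing at the last datapoint, or None if the most recent
    datapoint is below the threshold."""
    start = None
    for timestamp, visible in sorted(datapoints, key=lambda p: p[0]):
        if visible >= threshold:
            if start is None:
                start = timestamp  # first breaching sample of the current run
        else:
            start = None  # the run ended, so reset
    if start is None:
        return None
    floored_minute = start.minute - start.minute % 15
    return start.replace(minute=floored_minute, second=0, microsecond=0)


if __name__ == "__main__":
    points = [
        (datetime(2024, 1, 1, 10, 0), 100),
        (datetime(2024, 1, 1, 10, 5), 200),
        (datetime(2024, 1, 1, 10, 20), 6000),
        (datetime(2024, 1, 1, 10, 25), 8000),
        (datetime(2024, 1, 1, 10, 30), 12000),
    ]
    print(backlog_start(points))  # 2024-01-01 10:15:00
```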
If the backlog is not consistent and can be correlated with an activity that was just performed on the workspace, no action is needed. Give the backlog some time to clear. Activities such as installing or upgrading a mod, importing an account, or setting a policy that affects thousands of resources usually generate a large queue backlog, which should settle down soon.
Cross-check the environment settings and make sure there is no
misconfiguration. The Lambda concurrency should not be too high and should be
fine-tuned by observing the environment for a couple of days. A Lambda
concurrency of 80 is a safe number, but the right value depends on the size of
the installation. Also check the TEF parameters WorkerLambdaMessageBatch and
WorkerLambdaMaxDBConnections. These should have the same value, which is 2 most
of the time and is set to 4 on rare occasions. The Lambda concurrency should not
be so high that it overwhelms the DB connections. Keep an eye on the number of
DB connections when you increase the Lambda concurrency.
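The checks in this paragraph can be expressed as a rough sanity check. The sketch below is not a Guardrails tool: the function name is hypothetical, `db_connection_limit` is a value you would supply from your own database configuration, and the worst-case estimate assumes each concurrent worker Lambda may open up to WorkerLambdaMaxDBConnections connections.

```python
from typing import List


def check_worker_settings(
    reserved_concurrency: int,
    message_batch: int,
    max_db_connections: int,
    db_connection_limit: int,  # hypothetical: connection budget of your DB
) -> List[str]:
    """Return a list of human-readable findings; an empty list means no
    issue was detected by these two simple checks."""
    findings = []
    # The two parameters are expected to carry the same value (usually 2).
    if message_batch != max_db_connections:
        findings.append(
            f"WorkerLambdaMessageBatch ({message_batch}) differs from "
            f"WorkerLambdaMaxDBConnections ({max_db_connections})"
        )
    # Assumption: worst case is every concurrent Lambda using all its connections.
    worst_case = reserved_concurrency * max_db_connections
    if worst_case > db_connection_limit:
        findings.append(
            f"worst-case worker connections ({worst_case}) exceed the "
            f"DB connection limit ({db_connection_limit})"
        )
    return findings


if __name__ == "__main__":
    print(check_worker_settings(80, 2, 2, 500))       # []
    print(len(check_worker_settings(80, 2, 4, 200)))  # 2
```

With a concurrency of 80 and 2 connections per worker, the estimate is 160 connections, which is the figure to compare against the observed DB connection count before raising concurrency further.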
If the backlog is consistent and you see a pattern in the backlog queue getting flooded, please refer to the event flood documentation.
Need help?
Please reach out to Turbot Support with the following information on your Guardrails environment:
- Turbot Guardrails Enterprise (TE) Version(s)?: (e.g. v5.40.0)
- Turbot Guardrails Master Account: (e.g. 111222333444)
- The WorkerLambdaMessageBatch, WorkerLambdaMaxDBConnections and WorkerLambdaReservedConcurrency parameter values from TEF.
- A screenshot of the events queue backlog in CloudWatch.