Building Notifications Pipelines
The ability to get data out of Guardrails has been greatly enhanced by the introduction of the firehose-aws-sns mod. To facilitate development of notification pipelines, this document describes a variety of architectural capabilities and considerations. A few example pipelines are also included.
Exploration
The Guardrails Activity tab will help in getting acquainted with how the environment operates as well as the volume of change Guardrails can create at scale.
- Create a storage bucket.
- Set some policy to enforce on this bucket.
- Examine the Activity tab for that bucket to get a feel for the rate and volume of change.
- Observe the rate of change of controls from TBD → skip → alarm → ok.
- Make a policy change at the regional level to get a feel for how Guardrails handles change at scale.
Sources of Change
A few sources of change to Guardrails or the cloud environment may induce high volumes of notifications. At a minimum, the notification pipeline should behave properly in all these scenarios.
- Changes to policy settings at the Guardrails, folder or resource levels.
- New, updated or destroyed cloud resources
- Mod installation
- Mod upgrades
- Turbot Guardrails Enterprise upgrades
Recommended Architectural Components
The Notifications pipeline should include some or all of the below capabilities:
- Queueing
- Time Delay
- Filter
- Routing / Recipient List
- Rate Limiting / Message Aggregation
- Data Transformation
- Templating
- Validation
- Pipeline Instrumentation
- External Data Lookup
- Global On/Off Switch
- Scheduled Tasks
- Pipeline Heartbeat
Developers can choose which of these architectural components suit their organizational needs and development time constraints.
Queueing
Queueing helps handle high notification volumes by buffering messages until downstream steps can process them. No one likes missed notifications. Queues also facilitate troubleshooting and notification examination.
Time Delay
Guardrails can generate hundreds or thousands of control alarms when a policy changes or when resources are created in bulk. A time delay step enables the developer to wait for Guardrails to finish processing enforcements. Sending out alarm notifications as soon as they appear signals the end user that something is wrong while Guardrails may still be remediating.
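The time delay idea can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the `DelayBuffer` class, its method names, and the `controlId`/`state` message fields are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow. In a real pipeline a managed queue with a delivery delay would usually play this role.

```python
from collections import deque


class DelayBuffer:
    """Hold messages until they are at least `delay_seconds` old (hypothetical helper)."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self._items = deque()  # (arrival_time, message) pairs in arrival order

    def put(self, message, now):
        """Record a message together with the time it arrived."""
        self._items.append((now, message))

    def ready(self, now):
        """Pop and return every message whose delay has elapsed."""
        released = []
        while self._items and now - self._items[0][0] >= self.delay_seconds:
            released.append(self._items.popleft()[1])
        return released


buf = DelayBuffer(delay_seconds=60)
buf.put({"controlId": "c-1", "state": "alarm"}, now=0)
buf.put({"controlId": "c-2", "state": "alarm"}, now=45)

early = buf.ready(now=30)  # nothing has waited 60 seconds yet
later = buf.ready(now=60)  # only the first message is old enough
```

Messages released by `ready` would then move on to the next step, typically validation.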
Filter
While Guardrails Watches provide considerable control over which notifications are sent, additional filtering may be required.
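As an illustration, an additional filter can be a small predicate applied to each message. The sketch below assumes each notification is a dictionary with a `notificationType` field and a nested `control.state` field; those field names and the `make_filter` helper are assumptions for the example and should be checked against the notifications actually received.

```python
def make_filter(notification_types, control_states):
    """Build a predicate that keeps only the wanted types and control states."""
    types = set(notification_types)
    states = set(control_states)

    def keep(notification):
        if notification.get("notificationType") not in types:
            return False
        # Tolerate notifications with no control section at all.
        control = notification.get("control") or {}
        return control.get("state") in states

    return keep


keep_alarms = make_filter({"control_updated"}, {"alarm"})

messages = [
    {"notificationType": "control_updated", "control": {"state": "alarm"}},
    {"notificationType": "control_updated", "control": {"state": "ok"}},
    {"notificationType": "resource_updated"},
]
kept = [m for m in messages if keep_alarms(m)]
```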
Routing / Recipient List
Introducing a router allows the developer to choose one or more destinations for a given message. Perhaps some notifications are only useful to the SIEM while others should be sent to end users. A router permits that choice to be made, which is especially important since Guardrails will only send messages to a single Firehose SNS topic.
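A router can be sketched as an ordered list of rules, each pairing a predicate with the destinations it selects. The `route` function, the destination names, and the `control.state` field below are hypothetical and only illustrate the recipient-list idea.

```python
def route(notification, rules, default=()):
    """Return the destinations whose rule matches, in rule order, without duplicates."""
    destinations = []
    for predicate, targets in rules:
        if predicate(notification):
            for target in targets:
                if target not in destinations:
                    destinations.append(target)
    return destinations or list(default)


def is_alarm(notification):
    control = notification.get("control") or {}
    return control.get("state") == "alarm"


rules = [
    (lambda n: True, ["siem"]),          # everything goes to the SIEM
    (is_alarm, ["email", "ticketing"]),  # alarms also reach people
]

alarm_routes = route({"control": {"state": "alarm"}}, rules)
ok_routes = route({"control": {"state": "ok"}}, rules)
```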
Rate Limiting / Message Aggregation
Rate limiting combined with Message Aggregation prevents sending high volumes of notifications. Instead of sending hundreds of emails when a policy has changed, send a single email with some reasonable message explaining what happened.
Users may also want a "Nothing changed in the last {time period}" notification, just to ensure that the pipeline is still operational.
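One possible shape for message aggregation is sketched below: notifications collected over a time window are grouped by a key, and any group larger than a threshold is collapsed into a single summary record. The `aggregate` function, the `controlType` field, and the summary record layout are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate(notifications, key, threshold=5):
    """Collapse groups larger than `threshold` into one summary record each."""
    groups = defaultdict(list)
    for notification in notifications:
        groups[key(notification)].append(notification)

    output = []
    for group_key, items in groups.items():
        if len(items) > threshold:
            output.append({"summary": True, "key": group_key, "count": len(items)})
        else:
            output.extend(items)  # small groups pass through unchanged
    return output


window = [{"controlType": "bucketVersioning", "id": i} for i in range(7)]
window.append({"controlType": "bucketEncryption", "id": 99})

result = aggregate(window, key=lambda n: n["controlType"])
```

Here the seven `bucketVersioning` notifications become one summary, while the single `bucketEncryption` notification is delivered as-is.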
Data Transformation
While Guardrails provides JSON, other downstream systems may require CSV, YAML or some other format. A data transform step provides the capability to make these changes.
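A JSON to CSV transform can be sketched with the Python standard library alone. The column names (`controlId`, `reason`) and the `json_to_csv` helper are hypothetical; a real transform would pick the columns its downstream system expects and may need to flatten nested fields first.

```python
import csv
import io
import json


def json_to_csv(json_lines, columns):
    """Convert JSON documents (one per string) into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for line in json_lines:
        record = json.loads(line)
        # Missing columns become empty cells rather than raising an error.
        writer.writerow({column: record.get(column, "") for column in columns})
    return out.getvalue()


raw = [
    '{"controlId": "c-1", "reason": "versioning disabled", "extra": 1}',
    '{"controlId": "c-2", "reason": "bucket, public"}',
]
csv_text = json_to_csv(raw, columns=["controlId", "reason"])
rows = csv_text.splitlines()
```

Values that contain a comma are quoted by the `csv` module, so the output stays parseable.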
Templating
The notifications that come from Guardrails will be in JSON. End users don't typically read JSON, so the raw notification will need to be turned into something readable.
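A templating step can be as small as a string template filled from the notification. The sketch below uses Python's `string.Template`; the field names (`controlType`, `resource`, `state`, `reason`) and the wording of the message are assumptions for illustration only.

```python
from string import Template

EMAIL_BODY = Template("Control $control on $resource is in $state. Reason: $reason")


def render(notification):
    """Turn a raw notification dictionary into a readable sentence."""
    values = {
        "control": notification.get("controlType", "unknown control"),
        "resource": notification.get("resource", "unknown resource"),
        "state": notification.get("state", "unknown state"),
        "reason": notification.get("reason", "not provided"),
    }
    return EMAIL_BODY.substitute(values)


body = render({
    "controlType": "Bucket Versioning",
    "resource": "my-bucket",
    "state": "alarm",
})
```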
Validation
As shown in the exploration above, Guardrails can emit alarm events then quickly change them to ok or skip. A validation step checks back with Guardrails to ask "does this control still exist and is it still in alarm?". A simple GraphQL query to get the current control state would be sufficient. A validation step works closely with a time delay queue to give Guardrails time to process any pending remediations.
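The validation check can be sketched as follows. The function that actually calls Guardrails is injected as `fetch_state`, so the example runs without network access; it is expected to return the current state string, or `None` when the control no longer exists. The query text, the `controlId` field, and all function names are assumptions and should be verified against the GraphQL schema of the workspace in use.

```python
# Assumed query shape; confirm field names against the Guardrails GraphQL schema.
CONTROL_STATE_QUERY = """
query ControlState($id: ID!) {
  control(id: $id) {
    state
  }
}
"""


def still_in_alarm(notification, fetch_state):
    """Return True only if the control exists and is currently in alarm."""
    control_id = notification.get("controlId")
    if control_id is None:
        return False
    current_state = fetch_state(control_id)  # None means the control is gone
    return current_state == "alarm"


# Stand-in for a real GraphQL call: a lookup table of current control states.
current_states = {"c-1": "alarm", "c-2": "ok"}

pending = [{"controlId": "c-1"}, {"controlId": "c-2"}, {"controlId": "c-3"}]
confirmed = [n for n in pending if still_in_alarm(n, current_states.get)]
```

Only `c-1` survives: `c-2` has already moved to ok and `c-3` no longer exists.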
Pipeline Instrumentation
Someone is going to want statistics on how many notifications were sent or a way to know when the notifications pipeline is inoperable (for whatever reason).
External Data Lookup
A particular notification may require looking up data outside of the pipeline. This may be over HTTP, SQL, local file, etc.
Global On/Off Switch
There should be a big red Emergency Stop button for the pipeline.
Scheduled Tasks
Some notifications work best on a periodic interval, such as daily summaries.
Pipeline Heartbeat
The pipeline should be able to tell the difference between no activity because nothing is happening and no activity because something upstream of the pipeline has failed.
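One way to make that distinction is to have an upstream component emit a periodic heartbeat and compare it with the time of the last real notification. The sketch below is a hypothetical illustration: the `pipeline_status` function, its status names, and the default intervals are all assumptions.

```python
def pipeline_status(last_message_at, last_heartbeat_at, now,
                    heartbeat_interval=300, quiet_after=3600):
    """Classify the pipeline as active, quiet, or suffering an upstream failure.

    All times are in seconds on the same clock.
    """
    # Two missed heartbeats in a row suggest something upstream has failed.
    if now - last_heartbeat_at > 2 * heartbeat_interval:
        return "upstream_failed"
    # Heartbeats are arriving, so silence just means nothing is happening.
    if now - last_message_at > quiet_after:
        return "quiet"
    return "active"


active = pipeline_status(last_message_at=950, last_heartbeat_at=900, now=1000)
quiet = pipeline_status(last_message_at=0, last_heartbeat_at=4900, now=5000)
failed = pipeline_status(last_message_at=4990, last_heartbeat_at=1000, now=5000)
```

A "quiet" result could also drive the "Nothing changed in the last {time period}" notification described under Rate Limiting / Message Aggregation.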
Gotchas
Volume of Change for Policy Values
Making changes to policy settings can induce very high message volumes for the policy_values notification type. It's easy to have a few hundred policy settings but hundreds of thousands of policy values. Make sure the message volume can be handled by the notifications pipeline.
False Positives on Alarms
It is normal for Guardrails to set a control to alarm when processing a new resource or on a policy change. Avoid the situation where Guardrails sets a control to alarm, the notification pipeline sends a nasty-gram, then Guardrails resolves the control by setting it to skip or ok. The Time Delay and Validation steps mentioned above should reduce or eliminate these kinds of false positive notifications.
Example Pipelines
General Design Requirements
- Don't spam the user with duplicates or high volume for the "same problem".
- Notify of controls in 'alarm' state for at least 1 minute.
- Notify when Guardrails takes an action on a cloud resource.
- No interest in changes to resources, permissions, or favorites as these are controlled by a different pipeline or only of interest to individual Guardrails Admins.
- Guardrails is in Enforce mode.
Pipeline
Queues with different delays
In this case, requirements state that validation should be done on different delays.
Guardrails → Firehose → SNS → Lambda Subscription → Router (1)(2)(3)
- (1) → Queue (with time delay (1m)) → Validation → Rate Limit → Rate Limit/Data Aggregate → Template → Final Delivery to user via email
- (2) → Queue (with time delay (10m)) → Validation → Transform (JSON to CSV) → Final Delivery to BI dashboard that only accepts CSV.
- (3) → Queue (with time delay (5m)) → Validation → Data Lookup → Transform (JSON to YAML) → Final Delivery to Ticketing system
Validate all Notifications
A slightly more efficient pipeline. Since we want to validate all notifications, we can swap the order of the queue and the router. Since none of our sub-pipelines require very fast responses or actions, a single small delay is fine.
Guardrails → Firehose → SNS → SQS Queue (with time delay) → Validation → Router (1)(2)(3)
- (1) Rate Limit → Rate Limit/Data Aggregate → Template → Final Delivery to user via email
- (2) Transform (JSON to CSV) → Final Delivery to BI dashboard that only accepts CSV.
- (3) Data Lookup → Transform (JSON to YAML) → Final Delivery to Ticketing system
Simple Pipeline for Email Notifications
A simple approach where each action Guardrails takes will result in one notification to the user.
Guardrails → Firehose → SNS → SQS Queue (with time delay) → Validation → Router (1)
- (1) Rate Limit/Data Aggregate → Template → Email Address Lookup → Final Delivery to user via email
Troubleshooting
Refer to the CloudWatch logs for /aws/lambda/turbot-turbotfirehoseawssnssender-{hash} if messages don't appear as expected. These logs only cover publishing to the topic by the firehose. Activity after publishing to the topic cannot be covered here.
Reference
- Firehose Installation Instructions: Basic overview and capabilities
- Firehose Terraform Bootstrap: Terraform for setting up the SNS topic then configuring the appropriate Guardrails policies. Requires AWS and Guardrails credentials to execute.
- Firehose Notification Templates: Each Notification type has a template. These can be altered to include or exclude required information. These templates exclusively alter the formatting and included info that is sent to the Firehose SNS topic. Be conservative with changes here.