Pipes enables users to schedule and share persistent snapshots of their workspaces. Now it also provides a powerful, and complementary, form of persistence: Datatank. With this new feature you can automatically sync live data to persistent tables. Why? Although Pipes parallelizes your queries and can pull data insanely fast, the laws of data physics still apply. If you select * from aws_s3_bucket across many accounts, you'll trigger a flurry of API calls. That's why our guidance has been "don't select *." But no longer. With Datatank you can query broadly, in scheduled background tasks, without tapping your fingers while you wait for results. The data lands in persistent tables that respond instantly, speed up benchmarks and dashboards, and stay as fresh as you need them to be. It's the best of both worlds: instant access to live data at scale when APIs can support it, and instant access to nearly-live data when they can't.
Create a Datatank
On the settings/connection page in your workspace, click New Connection and choose the Create Datatank option. Then name it, describe it, and click the Create Datatank button.
Synchronize a live table
You can create a Datatank in a workspace that uses the db1.small instance type.
When you create a new Datatank, you're prompted to add a table. It's as easy as picking a schema from the ones available in your workspace and choosing a refresh frequency. In this example we choose the all_aws schema (which aggregates a set of AWS connections), the aws_s3_bucket table, and a sync frequency. The initial query still takes as long as it normally does, which in our demo can be up to 30 seconds. Once it's done you'll see a report showing that the table built successfully from all three underlying connections.
Now, when you visit the query pane you'll see a new my_datatank schema, and within it an aws_s3_bucket table that supersedes the corresponding live table. When you run select * from aws_s3_bucket here, the response is always instant, and so are the dashboards that depend on the table.
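For example, assuming the Datatank and table names used in this walkthrough, a schema-qualified query like this reads from the persisted table rather than making live API calls (the columns shown are a few of the table's standard columns):

```sql
-- Reads from the persisted Datatank table, not the live AWS APIs.
-- With my_datatank ahead of all_aws in the search path, the bare
-- table name aws_s3_bucket resolves here too.
select
  name,
  region,
  account_id
from
  my_datatank.aws_s3_bucket;
```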
The data will always be as fresh as you need it to be. Need more granular control than Weekly/Daily/Hourly? Write cron expressions to run the sync on any schedule you like, as often as you want.
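For instance, a standard five-field cron expression can describe schedules the presets can't; the expression below is just an illustration, not one from the walkthrough:

```
# minute  hour   day-of-month  month  day-of-week
  30      */4    *             *      1-5
# runs every four hours at half past, on weekdays only
```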
Persist a view
Not every table can be synchronized this way. Some tables require qualifiers ("quals") in where or join .. on clauses. For these cases, you can create a Datatank table from a query that specifies the qualifiers. To do that, we'll visit our my_datatank, click the New Table button, and choose the Create Table From Query option.
Here's a query to aggregate max CPU utilization for EC2 instances.
```sql
select
  id,
  label,
  timestamp,
  period,
  value,
  expression
from
  aws_cloudwatch_metric_data_point
where
  id = 'm1'
  and expression = 'select max(CPUUtilization) from schema("AWS/EC2", InstanceId)'
order by
  timestamp;
```
There are two required "quals": id and expression. Let's create the Datatank table m1_max_cpu and schedule it to run daily.
You're prompted to choose an update method. The All at once method updates the table atomically from all aggregated connections. For more granular control you can choose the Per connection method, which updates connections independently and is more resilient to failure. Either way, the query initially takes as long as it normally does; thereafter you have a table that always responds instantly.
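Once the table has built, a query like the following returns instantly from the persisted data. This is a sketch assuming the m1_max_cpu name chosen above; the ordering and limit are just illustrative:

```sql
-- Find the five highest max-CPU data points captured by the daily sync.
select
  label,
  timestamp,
  value
from
  my_datatank.m1_max_cpu
order by
  value desc
limit 5;
```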
See it in action
Get started with Datatank
Steampipe's live-data approach broke new ground. Nobody, ourselves included, thought it would be possible to pull so much data so fast from cloud APIs. That remains a core strength of Turbot Pipes. If you can get live data instantly, you should; perfect freshness is ideal and you can often achieve it. Now, for those times when you can't, you can get nearly-live data instantly. It's effectively another cache that's under your control, easy to use, and able to speed up any slow query. We can't wait to hear your stories once you put your foot down on the Datatank accelerator!