Announcement

Datatank: Blow past API speed limits with scheduled data sync

Pipes queries deliver more live data faster than you’d think possible, but APIs do impose limits. Now you can blow past them with Datatank tables that sync your live tables to the degree of freshness you choose.

Turbot Team
5 min. read - Oct 05, 2023

Pipes enables users to schedule and share persistent snapshots of their workspaces. Now it also provides a powerful and complementary form of persistence: Datatank. With this new feature you can automatically sync live data to persistent tables. Why? Although Pipes parallelizes your queries and can pull data insanely fast, the laws of data physics still apply. If you select * from aws_s3_bucket across many accounts, you'll trigger a flurry of API calls. That's why our guidance has been not to select *. But no longer. With Datatank you can query broadly, in scheduled background tasks, without tapping your fingers waiting for results. The data lands in persistent tables that respond instantly, speed up benchmarks and dashboards, and stay as fresh as you need them to be. It's the best of both worlds: instant access to live data at scale when APIs can support it, and instant access to nearly-live data when they can't.

Create a Datatank

From the settings/connections page in your workspace, click New Connection and choose the Create Datatank option. Then name it, describe it, and click the Create Datatank button.

Synchronize a live table

You can create a Datatank in a workspace that uses the db1.small instance type.

When you create a new Datatank, you're prompted to add a table. It's as easy as picking a schema and table from those available in your workspace and choosing a refresh frequency. In this example we choose the all_aws schema (which aggregates a set of AWS connections), the aws_s3_bucket table, and a Daily frequency.
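You don't write any SQL for this step, but conceptually each scheduled sync behaves like rebuilding a persistent copy of the live aggregator table. Here's a rough sketch of that idea, not the actual Pipes internals; the my_datatank schema name comes from the example below, and Pipes manages the real mechanics for you.

-- illustrative only: approximately what a daily sync accomplishes
drop table if exists my_datatank.aws_s3_bucket;

create table my_datatank.aws_s3_bucket as
select
  *
from
  all_aws.aws_s3_bucket;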

The initial sync still takes as long as the underlying query, which in our demo can be up to 30 seconds. Once it's done you'll see this report, which shows that the table built successfully from all three underlying connections.

Now, when you visit the query pane you'll see a new my_datatank schema and, within it, an aws_s3_bucket table that supersedes the corresponding live table. When you run select * from aws_s3_bucket here, the response is always instant, and dashboards that depend on the table are instant too.
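You can also qualify the table name to be explicit about which copy you're hitting. A sketch, assuming columns such as name and region from the live aws_s3_bucket table:

-- reads the instant, synced copy (my_datatank supersedes the live schema)
select name, region from my_datatank.aws_s3_bucket;

-- still queries the live table, triggering API calls per connection
select name, region from all_aws.aws_s3_bucket;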

The data will always be as fresh as you need it to be. Need more granular control than Weekly/Daily/Hourly? Write a cron expression to sync on any schedule you like, as often as you want.
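For example, a standard five-field cron expression like this one would sync every 15 minutes during weekday working hours. It's a hypothetical schedule shown for illustration; see the Pipes docs for the exact syntax supported.

*/15 9-17 * * 1-5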

Persist a view

Not every table can be synchronized this way. Some tables require qualifiers in where or join ... on clauses. For these cases, you can create a Datatank table from a query that specifies the qualifiers. To do that, we'll visit my_datatank, click the New Table button, and choose the Create Table From Query option.

Here's a query to aggregate max CPU utilization for EC2 instances.

select
  id,
  label,
  timestamp,
  period,
  value,
  expression
from
  aws_cloudwatch_metric_data_point
where
  id = 'm1'
  and expression = 'select max(CPUUtilization) from schema("AWS/EC2", InstanceId)'
order by
  timestamp;

There are two required "quals": id and expression. Let's create the Datatank table m1_max_cpu and schedule it to run daily.

You're prompted to use the All at once method, which updates the table atomically from all aggregated connections. For more granular control you can choose the Per connection method, which updates each connection independently and is more resilient to failure. Either way, the initial run takes as long as the query itself; thereafter you have a table that always responds instantly.
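Once the sync completes, queries and dashboards read the persisted results without touching CloudWatch at query time. A sketch, assuming the table lands in the my_datatank schema as in the earlier example:

-- read the persisted metric data instantly; no API calls at query time
select
  timestamp,
  value
from
  my_datatank.m1_max_cpu
order by
  timestamp desc
limit 10;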

See it in action

Get started with Datatank

Steampipe's live-data approach broke new ground. Nobody, ourselves included, thought it would be possible to pull so much data so fast from cloud APIs. That remains a core strength of Turbot Pipes. If you can get live data instantly, you should; perfect freshness is ideal and you can often achieve it. Now, for those times when you can't, you can get nearly-live data instantly. It's effectively another cache that's under your control, easy to use, and able to speed up any slow query. We can't wait to hear your stories once you put your foot down on the Datatank accelerator!