Incremental Loads
  • 26 Jul 2021
  • 2 Minutes to read
  • Dark
    Light
  • PDF

Incremental Loads

  • Dark
    Light
  • PDF

Article Summary

Default Behavior

On data sources that support it, Panoply uses incremental loads to pull only the data that was updated or added since the last pull and then updates the rows that have been changed. For some data sources, this is the default behavior. On others, users can define an incremental key when setting up the Advanced Settings. This key is then used to identify the most up-to-date rows, saving vast amounts of time in the collection process and improving performance.

For API sources, Panoply's default behavior collects the entire data set from the source every time they are collected. For data sources with small data sets, that's fine. In those cases, the data set is small enough to be fully consumed in just a few minutes.

Large Data Sets

With larger data sets, such as Shopify or Stripe, this can quickly become unfeasible. With a fast-growing data set, reading the entire source might take hours, most of which is spent re-reading unchanged data. For these data sources, Panoply supports the ability to define an incremental key. In most API sources, incremental loads are already included, so there's no need to make any schema changes in your data source.

Incremental Keys

For data sources that do not have the default behavior built in an incremental key can be defined by the user. An incremental key is an attribute of the collected data that can be reliably assumed to indicate the last update point for the rows in that data source. Simple examples are attributes like modification_date. These attributes, once configured, can be used by Panoply to skip over all unchanged rows in the data source, and only fetch the new or updated rows on every iteration, thus reducing the processing time from hours to minutes or seconds.

Whenever Panoply reads a data source, it first reads the last value stored for that incremental key. Then, it uses this value to skip over all rows below that point, assumed to be unchanged since the last time the data source was consumed.

In some data sources, this is simple to reason out and set up. For example, using Amazon S3, we can use the ModifiedDate to avoid re-reading files that were already processed and weren't modified. But in other data sources, like Postgres or Mysql, we really can't make any assumption about the data, unless you explicitly define an incremental key from that data source.

For database and file system sources, you can enter anything as the incremental key, but certain types of fields are most common, such as timestamps or incremental IDs. For APIs, the most common timestamp syntax is ISO 8601 format: yyyy-mm-ddThh:mm:ss.sssZ.


Was this article helpful?

What's Next