Add support for incremental ingestion

Open yruslan opened this issue 11 months ago • 0 comments

Background

Currently, incremental updates are made by overwriting the latest info date partition multiple times a day. This can be inefficient, especially for large tables with many events.

If the source table has a monotonically increasing field (a timestamp, a number, etc.), that field can be used to update the latest partition with only the new records.

Feature

Add support for incremental ingestion (in Pramen).

Example

offset.column {
  name = "created_at"
  type = "datetime"
}

Proposed Solution

This PR adds 'incremental' as a schedule type, and mechanisms for managing offsets (experimental).

Pramen version 1.10 introduces the concept of incremental ingestion. It allows running a pipeline multiple times a day without reprocessing data that was already processed. To enable it, use the incremental schedule when defining your ingestion operation:

schedule = "incremental"

For incremental ingestion to work, you need to define a monotonically increasing field, called an offset. Usually, this field is a counter or a record creation timestamp. The offset field is defined in your source, and the source must support incremental ingestion in order to use this mode.

offset.column {
  name = "created_at"
  type = "datetime"
}
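
To illustrate the idea behind offsets, here is a minimal Python sketch. It is not Pramen's actual implementation: `OffsetStore`, `ingest_incremental`, and the in-memory rows are hypothetical names introduced only for this example. The sketch assumes each run reads only records whose offset column is strictly greater than the last committed offset, and commits the new maximum afterwards.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass
class OffsetStore:
    """Hypothetical in-memory stand-in for a persistent offset store."""
    committed: dict = field(default_factory=dict)

    def get(self, table: str) -> Optional[Any]:
        return self.committed.get(table)

    def commit(self, table: str, offset: Any) -> None:
        self.committed[table] = offset


def ingest_incremental(rows, table, offset_column, store):
    """Return only rows newer than the last committed offset, then advance it."""
    last = store.get(table)
    new_rows = [r for r in rows if last is None or r[offset_column] > last]
    if new_rows:
        # Commit the largest offset seen so the next run starts after it.
        store.commit(table, max(r[offset_column] for r in new_rows))
    return new_rows


# Two runs on the same day: the second run picks up only the new record.
source = [
    {"id": 1, "created_at": datetime(2024, 3, 15, 9, 0)},
    {"id": 2, "created_at": datetime(2024, 3, 15, 10, 0)},
]
store = OffsetStore()
first_run = ingest_incremental(source, "events", "created_at", store)

source.append({"id": 3, "created_at": datetime(2024, 3, 15, 11, 0)})
second_run = ingest_incremental(source, "events", "created_at", store)
```

Note that with a strict `>` comparison, a record that arrives late but shares the last committed timestamp would be skipped. This is one reason the offset field needs to be monotonically increasing.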

Offset types available at the moment:

| Type     | Description                            |
|----------|----------------------------------------|
| integral | Any integral type (short, int, long)   |
| datetime | A datetime or timestamp field          |
| string   | Only string / varchar(n) types         |
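
The offset type matters because it determines how two offsets are compared. The following Python sketch shows this under the assumption that offsets are stored as text and parsed back before comparison. `parse_offset` is a hypothetical helper, not a Pramen function, and the ISO 8601 format for `datetime` is likewise an assumption made for the example.

```python
from datetime import datetime


def parse_offset(offset_type: str, raw: str):
    """Convert a stored text offset into a comparable value (hypothetical helper)."""
    if offset_type == "integral":
        return int(raw)
    if offset_type == "datetime":
        # Assumes ISO 8601 text such as "2024-03-15T11:03:00".
        return datetime.fromisoformat(raw)
    if offset_type == "string":
        return raw
    raise ValueError(f"Unsupported offset type: {offset_type}")


# Numeric comparison orders 9 before 10 ...
integral_ok = parse_offset("integral", "9") < parse_offset("integral", "10")
# ... while lexicographic string comparison orders "10" before "9".
string_order = parse_offset("string", "10") < parse_offset("string", "9")
```

In this sketch, declaring a numeric counter as `string` would make "10" sort before "9", so the declared type should match how the column actually increases.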

Only ingestion jobs support the incremental schedule at the moment. Incremental transformations and sinks are planned to be available soon.

Pramen PR: https://github.com/AbsaOSS/pramen/pull/487

After this issue is completed, an epic will be created for Aqueduct.

yruslan, Mar 15 '24 11:03