Add support for incremental ingestion
Background
Currently, incremental updates are made by overwriting the latest info date partition multiple times a day. This can be inefficient, especially for big tables with many events.
If the source table has a monotonically increasing field (a timestamp, a number, etc.), that field can be used to update the latest partition with only the new records.
Feature
Add support for incremental ingestion (in Pramen).
Example
offset.column {
name = "created_at"
type = "datetime"
}
Proposed Solution
This PR adds 'incremental' as a schedule type, and mechanisms for managing offsets (experimental).
Pramen version 1.10 introduces the concept of incremental ingestion. It allows running a pipeline multiple times a day without reprocessing data that was already processed. To enable it, use the incremental schedule when defining your ingestion operation:
schedule = "incremental"
For incremental ingestion to work, you need to define a monotonically increasing field, called an offset. Usually this is a counter or a record creation timestamp. Define the offset field in your source. The source must support incremental ingestion for this mode to be used.
offset.column {
name = "created_at"
type = "datetime"
}
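The idea behind offset-based ingestion can be sketched in a few lines of Python. This is only an illustration of the concept, not Pramen's implementation: the `Record` class, the `ingest_increment` function, and the in-memory `source` and `target` lists are hypothetical, and the sketch assumes that no record arrives later with an offset equal to or lower than one already ingested.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Record:
    id: int
    created_at: datetime  # plays the role of the offset column


def ingest_increment(source: List[Record],
                     target: List[Record],
                     last_offset: Optional[datetime]) -> Optional[datetime]:
    """Append only records whose offset is greater than last_offset.

    Returns the offset to persist for the next run
    (unchanged if nothing new arrived).
    """
    if last_offset is None:
        batch = list(source)  # first run: take everything
    else:
        batch = [r for r in source if r.created_at > last_offset]
    target.extend(batch)
    if not batch:
        return last_offset
    return max(r.created_at for r in batch)


# Two runs on the same day: the second run picks up only the new record.
source = [
    Record(1, datetime(2024, 1, 1, 8, 0)),
    Record(2, datetime(2024, 1, 1, 9, 0)),
]
target: List[Record] = []
offset = ingest_increment(source, target, None)

source.append(Record(3, datetime(2024, 1, 1, 12, 0)))
offset = ingest_increment(source, target, offset)

print([r.id for r in target])  # [1, 2, 3]
```

The second call appends only the record created after the stored offset, so records 1 and 2 are not written again.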
Offset types available at the moment:
| Type | Description |
|---|---|
| integral | Any integral type (short, int, long) |
| datetime | A datetime or timestamp field |
| string | Only string / varchar(n) types |
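The offset type matters because it determines how offset values are ordered. The following Python sketch shows one way the three types could be compared. It is an assumption-laden illustration, not Pramen code: `parse_offset` and `max_offset` are hypothetical helpers, and the sketch assumes offsets are stored as text and that datetimes are in ISO 8601 format.

```python
from datetime import datetime


def parse_offset(raw: str, offset_type: str):
    """Convert a stored offset value into something comparable."""
    if offset_type == "integral":
        return int(raw)
    if offset_type == "datetime":
        return datetime.fromisoformat(raw)  # assumes ISO 8601 text
    if offset_type == "string":
        return raw  # compared lexicographically
    raise ValueError(f"Unsupported offset type: {offset_type}")


def max_offset(raw_values, offset_type: str):
    """Return the largest offset according to the type's ordering."""
    return max(parse_offset(v, offset_type) for v in raw_values)


print(max_offset(["9", "10"], "integral"))  # 10
print(max_offset(["9", "10"], "string"))    # 9
```

In this sketch a numeric counter stored in a string column orders lexicographically ("9" sorts after "10"), so a string offset is only monotonic if its values are, for example, zero-padded to a fixed width.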
Only ingestion jobs support the incremental schedule at the moment. Incremental transformations and sinks are planned to be available soon.
Pramen PR: https://github.com/AbsaOSS/pramen/pull/487
After the completion of the issue an epic will be created for Aqueduct.