sdk
sdk copied to clipboard
For SQL taps, add support for `custom_queries`
Sometimes DB table columns does not support exactly the correct properties we want to query. For instance, a table having updated_at and created_at, where updated_at is null until a post-creation update occurs. Or sometimes a join between two tables can better represent the proper subset of data we care about.
A proposal would accept custom SQL queries as stream definitions within the tap config.
config.json
{
// ...
"custom_queries":
[
{
"stream_name": "custom-products"
"sql": "SELECT id, name, created_at, updated_at AS updated_at_raw, COALESCE(updated_at, created_at) as updated_at FROM products"
"default_replication_key": "updated_at",
"default_key_properties": ["id"]
"schema": "..." // <<< JSON Schema definition here
}
]
}
A few notes about this implementation:
- The above should modify the behavior during discovery (
--discover).- During discovery, the tap may auto-detect its own schema, in which case the provided
schemawould be unnecessary above. - Not all SQL implementations can discover their own schema - so for those implementations, we still need a
schemainput.
- During discovery, the tap may auto-detect its own schema, in which case the provided
- The
default_prefix signifies that what we are setting are the catalog-layer keys, which may be overridden.- Singer tools like Meltano which can override
replication_keyandkey_propertiesin the catalog may still do so, in which case those catalog-layer overrides would have priority over the defaults entered above.
- Singer tools like Meltano which can override
- In the above, the extra config has a discovery-time role and also a sync-time role. Unless we develop a new convention to store the SQL query text itself in the
catalog, the config will need to be present both during discovery and also during sync.
Inspired by Slack conversation: https://meltano.slack.com/archives/C01PKLU5D1R/p1676415407696879?thread_ts=1676351256.072409&cid=C01PKLU5D1R
For some prior art we had at GitLab - our postgres_pipeline tool had some similar features https://gitlab.com/gitlab-data/analytics/-/tree/master/extract/postgres_pipeline
From the page:
pgp manifest definition: There are 5 mandatory sections and 2 optional sections in a pgp manifest. The 5 sections are as follows:
- import_db: the name of the database that is being imported from
- import_query: this is the SELECT query that is used to extract data from the database. They usually target a single table
- export_schema: this is the schema that the table lives in in the target database
- export_table: this is the name of the table that is being targeted for export by the query
- export_table_primary_key: this is the name of the column that is used as the primary key for the table. It is usually just id
The 6th optional section is called additional_filtering. This field is used when you need to add an additional condition to the import_query that isn't related to incremental loading, for instance to filter some bad rows.
https://gitlab.com/gitlab-data/analytics/-/blob/master/extract/postgres_pipeline/manifests_decomposed/el_customers_scd_db_manifest.yaml is an example manifest
connection_info:
database: CUSTOMERS_DB_NAME
host: CUSTOMERS_DB_HOST
pass: CUSTOMERS_DB_PASS
port: PG_PORT
user: CUSTOMERS_DB_USER
tables:
billing_accounts:
advanced_metadata: true
export_db: EXPORT_DB
export_schema: 'customers'
export_table: 'billing_accounts'
export_table_primary_key: id
import_db: CUSTOMERS_DB
import_query: |-
SELECT id
, zuora_account_name
, zuora_account_id
, salesforce_account_id
, created_at
, updated_at
FROM billing_accounts
I like the considerations around the catalog @aaronsteers . I think this would open up a lot of use cases for folks who don't have access to LOG_BASED replication. A big downside is run time however.
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.
Still relevant
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.