sdk icon indicating copy to clipboard operation
sdk copied to clipboard

For SQL taps, add support for `custom_queries`

Open aaronsteers opened this issue 2 years ago • 4 comments

Sometimes DB table columns does not support exactly the correct properties we want to query. For instance, a table having updated_at and created_at, where updated_at is null until a post-creation update occurs. Or sometimes a join between two tables can better represent the proper subset of data we care about.

A proposal would accept custom SQL queries as stream definitions within the tap config.

config.json

{
  // ...
  "custom_queries":
  [
    {
      "stream_name": "custom-products"
      "sql": "SELECT id, name, created_at, updated_at AS updated_at_raw, COALESCE(updated_at, created_at) as updated_at FROM products"
      "default_replication_key": "updated_at",
      "default_key_properties": ["id"]
      "schema": "..." // <<< JSON Schema definition here
    }
  ]
}

A few notes about this implementation:

  • The above should modify the behavior during discovery (--discover).
    • During discovery, the tap may auto-detect its own schema, in which case the provided schema would be unnecessary above.
    • Not all SQL implementations can discover their own schema - so for those implementations, we still need a schema input.
  • The default_ prefix signifies that what we are setting are the catalog-layer keys, which may be overridden.
    • Singer tools like Meltano which can override replication_key and key_properties in the catalog may still do so, in which case those catalog-layer overrides would have priority over the defaults entered above.
  • In the above, the extra config has a discovery-time role and also a sync-time role. Unless we develop a new convention to store the SQL query text itself in the catalog, the config will need to be present both during discovery and also during sync.

Inspired by Slack conversation: https://meltano.slack.com/archives/C01PKLU5D1R/p1676415407696879?thread_ts=1676351256.072409&cid=C01PKLU5D1R

aaronsteers avatar Feb 14 '23 22:02 aaronsteers

For some prior art we had at GitLab - our postgres_pipeline tool had some similar features https://gitlab.com/gitlab-data/analytics/-/tree/master/extract/postgres_pipeline

From the page:

pgp manifest definition: There are 5 mandatory sections and 2 optional sections in a pgp manifest. The 5 sections are as follows:

  1. import_db: the name of the database that is being imported from
  2. import_query: this is the SELECT query that is used to extract data from the database. They usually target a single table
  3. export_schema: this is the schema that the table lives in in the target database
  4. export_table: this is the name of the table that is being targeted for export by the query
  5. export_table_primary_key: this is the name of the column that is used as the primary key for the table. It is usually just id

The 6th optional section is called additional_filtering. This field is used when you need to add an additional condition to the import_query that isn't related to incremental loading, for instance to filter some bad rows.


https://gitlab.com/gitlab-data/analytics/-/blob/master/extract/postgres_pipeline/manifests_decomposed/el_customers_scd_db_manifest.yaml is an example manifest

connection_info:
  database: CUSTOMERS_DB_NAME
  host: CUSTOMERS_DB_HOST
  pass: CUSTOMERS_DB_PASS
  port: PG_PORT
  user: CUSTOMERS_DB_USER
tables:
  billing_accounts:
    advanced_metadata: true
    export_db: EXPORT_DB
    export_schema: 'customers'
    export_table: 'billing_accounts'
    export_table_primary_key: id
    import_db: CUSTOMERS_DB
    import_query: |-
      SELECT id
      , zuora_account_name
      , zuora_account_id
      , salesforce_account_id
      , created_at
      , updated_at
      FROM billing_accounts

I like the considerations around the catalog @aaronsteers . I think this would open up a lot of use cases for folks who don't have access to LOG_BASED replication. A big downside is run time however.

tayloramurphy avatar Feb 14 '23 23:02 tayloramurphy

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

stale[bot] avatar Jul 18 '23 03:07 stale[bot]

Still relevant

tayloramurphy avatar Jul 18 '23 22:07 tayloramurphy

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

stale[bot] avatar Jul 18 '24 03:07 stale[bot]