astro-sdk Fix temporary table name creation

Describe the bug At the moment, Temporary tables names are created using: https://github.com/astro-projects/astro/blob/9a565e8fa0fa68b4be0c0dbd1866e0b531b8b902/src/astro/sql/table.py#L78-L86

Where MAX_TABLE_NAME_SIZE is currently 63 characters (e.g. Postgres does not support tables with longer names).

The problem is that depending on the DAG ID and the task names, there will be collisions.

Version

Astro: [e.g. 0.7.0]

To Reproduce

Write the DAG with a long name and two tasks with long names, which have the same prefix.

Example:

Create a DAG with the following DAG ID:

a_veeeeery_looooooooong_dag_identifier (40 characters)

Within the DAG, add two transform tasks that output to temporary tables. They should be named:

task_identifier_with_long_name_1 (32 characters)
task_identifier_with_long_name_2 (32 characters)

Run the DAG
See the error: The expectation would be, by the end of the DAG run, to see two temporary tables. However, at the moment, in this particular case, we'll see only one temporary table, with the results of the task which was last executed.

Expected behavior Temporary tables names should:

have under 63 characters
should be unique (across DAG runs and within the same DAG run)
should be logged to the end-users, so they can debug a problem
do not need to have a user-friendly name
should be compatible with the currently supported databases (BQ, Postgres, Snowflake, and Sqlite)

Mar 16 '22 11:03 tatiana

If you don't care that table_name don't include "dag_run.dag_id" and "ti.task_id" "dag_run.id",you can use hash function.

example

import zlib

table_name = f"{dag_run.dag_id}_{ti.task_id}_{dag_run.id}"
table_name=str(zlib.adler32(table_name.encode()))

Mar 16 '22 12:03 tetsuya111

@tetsuya111 that sounds like a great strategy, it should be simple to confirm the names generated are compatible with all the supported databases!

Mar 16 '22 13:03 tatiana

@tetsuya111 @tatiana using random hashes might create problems as we want to create the ability to delete temp tables automatically and I was hoping to do so using DAG ID and run ID.

Perhaps we can do something like

(First 4 letters of DAG ID)-(first 4 letters of run id)-(hash of dag_id+run_id)

to ensure no collisions while still making sure the table name is within size. WDYT?

Mar 16 '22 13:03 dimberman

@dimberman @tatiana - I think we should have a hybrid approach, something like - (First 4 letters of DAG ID)-(first 4 letters of run id)- (task_id) - (random string)

Pros

This way we can cater to scenarios where a single operator can produce multiple temp tables
We can use prefixes to find table names and then delete them.

select table_schema,
       table_name
from information_schema.tables
where table_name like 'SomePredictablePrefix%'
      and table_schema not in ('information_schema', 'pg_catalog')
      and table_type = 'BASE TABLE'
order by table_name,
         table_schema;

Apr 07 '22 21:04 utkarsharma2

We improved how we name temporary tables as part of the 0.9 release: https://github.com/astronomer/astro-sdk/blob/main/python-sdk/docs/CHANGELOG.md#090.

We can re-open this ticket or create a new one if we think the issue was not solved.

Jan 17 '23 09:01 tatiana

astro-sdk astro-sdk copied to clipboard

Fix temporary table name creation

astro-sdk
astro-sdk copied to clipboard