astro-sdk
astro-sdk copied to clipboard
Fix temporary table name creation
Describe the bug At the moment, Temporary tables names are created using: https://github.com/astro-projects/astro/blob/9a565e8fa0fa68b4be0c0dbd1866e0b531b8b902/src/astro/sql/table.py#L78-L86
Where MAX_TABLE_NAME_SIZE
is currently 63 characters (e.g. Postgres does not support tables with longer names).
The problem is that depending on the DAG ID and the task names, there will be collisions.
Version
- Astro: [e.g. 0.7.0]
To Reproduce
- Write the DAG with a long name and two tasks with long names, which have the same prefix.
Example:
Create a DAG with the following DAG ID:
-
a_veeeeery_looooooooong_dag_identifier
(40 characters)
Within the DAG, add two transform tasks that output to temporary tables. They should be named:
-
task_identifier_with_long_name_1
(32 characters) -
task_identifier_with_long_name_2
(32 characters)
-
Run the DAG
-
See the error: The expectation would be, by the end of the DAG run, to see two temporary tables. However, at the moment, in this particular case, we'll see only one temporary table, with the results of the task which was last executed.
Expected behavior Temporary tables names should:
- have under 63 characters
- should be unique (across DAG runs and within the same DAG run)
- should be logged to the end-users, so they can debug a problem
- do not need to have a user-friendly name
- should be compatible with the currently supported databases (BQ, Postgres, Snowflake, and Sqlite)
If you don't care that table_name don't include "dag_run.dag_id" and "ti.task_id" "dag_run.id",you can use hash function.
example
import zlib
table_name = f"{dag_run.dag_id}_{ti.task_id}_{dag_run.id}"
table_name=str(zlib.adler32(table_name.encode()))
@tetsuya111 that sounds like a great strategy, it should be simple to confirm the names generated are compatible with all the supported databases!
@tetsuya111 @tatiana using random hashes might create problems as we want to create the ability to delete temp tables automatically and I was hoping to do so using DAG ID and run ID.
Perhaps we can do something like
(First 4 letters of DAG ID)-(first 4 letters of run id)-(hash of dag_id+run_id)
to ensure no collisions while still making sure the table name is within size. WDYT?
@dimberman @tatiana - I think we should have a hybrid approach, something like - (First 4 letters of DAG ID)-(first 4 letters of run id)- (task_id) - (random string)
Pros
- This way we can cater to scenarios where a single operator can produce multiple temp tables
- We can use prefixes to find table names and then delete them.
select table_schema,
table_name
from information_schema.tables
where table_name like 'SomePredictablePrefix%'
and table_schema not in ('information_schema', 'pg_catalog')
and table_type = 'BASE TABLE'
order by table_name,
table_schema;
We improved how we name temporary tables as part of the 0.9 release: https://github.com/astronomer/astro-sdk/blob/main/python-sdk/docs/CHANGELOG.md#090.
We can re-open this ticket or create a new one if we think the issue was not solved.