astro-sdk icon indicating copy to clipboard operation
astro-sdk copied to clipboard

Fix temporary table name creation

Open tatiana opened this issue 2 years ago • 4 comments

Describe the bug At the moment, Temporary tables names are created using: https://github.com/astro-projects/astro/blob/9a565e8fa0fa68b4be0c0dbd1866e0b531b8b902/src/astro/sql/table.py#L78-L86

Where MAX_TABLE_NAME_SIZE is currently 63 characters (e.g. Postgres does not support tables with longer names).

The problem is that depending on the DAG ID and the task names, there will be collisions.

Version

  • Astro: [e.g. 0.7.0]

To Reproduce

  1. Write the DAG with a long name and two tasks with long names, which have the same prefix.

Example:

Create a DAG with the following DAG ID:

  • a_veeeeery_looooooooong_dag_identifier (40 characters)

Within the DAG, add two transform tasks that output to temporary tables. They should be named:

  • task_identifier_with_long_name_1 (32 characters)
  • task_identifier_with_long_name_2 (32 characters)
  1. Run the DAG

  2. See the error: The expectation would be, by the end of the DAG run, to see two temporary tables. However, at the moment, in this particular case, we'll see only one temporary table, with the results of the task which was last executed.

Expected behavior Temporary tables names should:

  • have under 63 characters
  • should be unique (across DAG runs and within the same DAG run)
  • should be logged to the end-users, so they can debug a problem
  • do not need to have a user-friendly name
  • should be compatible with the currently supported databases (BQ, Postgres, Snowflake, and Sqlite)

tatiana avatar Mar 16 '22 11:03 tatiana

If you don't care that table_name don't include "dag_run.dag_id" and "ti.task_id" "dag_run.id",you can use hash function.

example

import zlib

table_name = f"{dag_run.dag_id}_{ti.task_id}_{dag_run.id}"
table_name=str(zlib.adler32(table_name.encode()))

tetsuya111 avatar Mar 16 '22 12:03 tetsuya111

@tetsuya111 that sounds like a great strategy, it should be simple to confirm the names generated are compatible with all the supported databases!

tatiana avatar Mar 16 '22 13:03 tatiana

@tetsuya111 @tatiana using random hashes might create problems as we want to create the ability to delete temp tables automatically and I was hoping to do so using DAG ID and run ID.

Perhaps we can do something like

(First 4 letters of DAG ID)-(first 4 letters of run id)-(hash of dag_id+run_id)

to ensure no collisions while still making sure the table name is within size. WDYT?

dimberman avatar Mar 16 '22 13:03 dimberman

@dimberman @tatiana - I think we should have a hybrid approach, something like - (First 4 letters of DAG ID)-(first 4 letters of run id)- (task_id) - (random string)

Pros

  1. This way we can cater to scenarios where a single operator can produce multiple temp tables
  2. We can use prefixes to find table names and then delete them.
select table_schema,
       table_name
from information_schema.tables
where table_name like 'SomePredictablePrefix%'
      and table_schema not in ('information_schema', 'pg_catalog')
      and table_type = 'BASE TABLE'
order by table_name,
         table_schema;

utkarsharma2 avatar Apr 07 '22 21:04 utkarsharma2

We improved how we name temporary tables as part of the 0.9 release: https://github.com/astronomer/astro-sdk/blob/main/python-sdk/docs/CHANGELOG.md#090.

We can re-open this ticket or create a new one if we think the issue was not solved.

tatiana avatar Jan 17 '23 09:01 tatiana