data-diff icon indicating copy to clipboard operation
data-diff copied to clipboard

diff tables reports `NotImplementedError: Cannot use a column of type` if the table is empty

Open jsnb-devoted opened this issue 9 months ago • 0 comments

Describe the bug When performing a diff between a table in postres and snowflake -- the diff function will fail if Snowflake primary key field is a TEXT field and the table is empty.

I noticed in the logs for successful diffs that tables that were not empty were reporting that the id field was a String_UUID -- for example:

[2024-04-28, 04:20:44 UTC] {schema.py:44} INFO - [PostgreSQL] Schema = {'id': Native_UUID(_notes=[]), 'updated_at': Timestamp(_notes=[], precision=6, rounds=True)}
[2024-04-28, 04:22:38 UTC] {schema.py:44} INFO - [Snowflake] Schema = {'ID': String_UUID(_notes=[], collation=None, lowercase=True, uppercase=False), 'UPDATED_AT': Timestamp(_notes=[], precision=9, rounds=False)}

For the table that failed the log looked like this:

[2024-04-28, 08:06:27 UTC] {schema.py:44} INFO - [PostgreSQL] Schema = {'id': Native_UUID(_notes=[]), 'validation_logic_json': Text(_notes=[], collation=None), 'in_control_logic_json': Text(_notes=[], collation=None)}
[2024-04-28, 08:08:27 UTC] {schema.py:44} INFO - [Snowflake] Schema = {'ID': Text(_notes=[], collation=None), 'IN_CONTROL_LOGIC_JSON': Text(_notes=[], collation=None), 'VALIDATION_LOGIC_JSON': Text(_notes=[], collation=None)}

I dug in a little deeper and noticed that the types are determined by sample some of the data via that function _refine_coltypes. It feels a bit like without any values to sample it defaults the type to the type in the db and then eventually raises this NotImplementedError

Make sure to include the following (minus sensitive information): We are importing the package and calling it from a python script in airflow -- the abreviated code looks like this:

table1: TableSegment = connect_to_table(**postgres_details)
table2: TableSegment = connect_to_table(**snowflake_details)
differ = HashDiffer()
diff_result_wrapper: DiffResultWrapper = differ.diff_tables(table1, table2)
...

Here is an excerpt of the traceback:

  File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/data_differ.py", line 60, in compare_dbs
    control_flow(
  File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/helpers.py", line 945, in control_flow
    diff_magic(
  File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/helpers.py", line 744, in diff_magic
    for sign, columns in diff_result_wrapper:
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 95, in __iter__
    for i in self.diff:
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 266, in _diff_tables_wrapper
    raise error
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 239, in _diff_tables_wrapper
    yield from self._diff_tables_root(table1, table2, info_tree)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 274, in _diff_tables_root
    return self._bisect_and_diff_tables(table1, table2, info_tree)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 299, in _bisect_and_diff_tables
    raise NotImplementedError(f"Cannot use a column of type {kt} as a key")
NotImplementedError: Cannot use a column of type Text(_notes=[], collation=None) as a key

If possible, please paste these as text, and not a screenshot.

Describe the environment Running in airflow with data-diff==0.11.1

jsnb-devoted avatar Apr 30 '24 22:04 jsnb-devoted