data-diff
data-diff copied to clipboard
diff tables reports `NotImplementedError: Cannot use a column of type` if the table is empty
Describe the bug
When performing a diff between a table in postres and snowflake -- the diff function will fail if Snowflake primary key field is a TEXT
field and the table is empty.
I noticed in the logs for successful diffs that tables that were not empty were reporting that the id field was a String_UUID
-- for example:
[2024-04-28, 04:20:44 UTC] {schema.py:44} INFO - [PostgreSQL] Schema = {'id': Native_UUID(_notes=[]), 'updated_at': Timestamp(_notes=[], precision=6, rounds=True)}
[2024-04-28, 04:22:38 UTC] {schema.py:44} INFO - [Snowflake] Schema = {'ID': String_UUID(_notes=[], collation=None, lowercase=True, uppercase=False), 'UPDATED_AT': Timestamp(_notes=[], precision=9, rounds=False)}
For the table that failed the log looked like this:
[2024-04-28, 08:06:27 UTC] {schema.py:44} INFO - [PostgreSQL] Schema = {'id': Native_UUID(_notes=[]), 'validation_logic_json': Text(_notes=[], collation=None), 'in_control_logic_json': Text(_notes=[], collation=None)}
[2024-04-28, 08:08:27 UTC] {schema.py:44} INFO - [Snowflake] Schema = {'ID': Text(_notes=[], collation=None), 'IN_CONTROL_LOGIC_JSON': Text(_notes=[], collation=None), 'VALIDATION_LOGIC_JSON': Text(_notes=[], collation=None)}
I dug in a little deeper and noticed that the types are determined by sample some of the data via that function _refine_coltypes
. It feels a bit like without any values to sample it defaults the type to the type in the db and then eventually raises this NotImplementedError
Make sure to include the following (minus sensitive information): We are importing the package and calling it from a python script in airflow -- the abreviated code looks like this:
table1: TableSegment = connect_to_table(**postgres_details)
table2: TableSegment = connect_to_table(**snowflake_details)
differ = HashDiffer()
diff_result_wrapper: DiffResultWrapper = differ.diff_tables(table1, table2)
...
Here is an excerpt of the traceback:
File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/data_differ.py", line 60, in compare_dbs
control_flow(
File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/helpers.py", line 945, in control_flow
diff_magic(
File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/helpers.py", line 744, in diff_magic
for sign, columns in diff_result_wrapper:
File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 95, in __iter__
for i in self.diff:
File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 266, in _diff_tables_wrapper
raise error
File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 239, in _diff_tables_wrapper
yield from self._diff_tables_root(table1, table2, info_tree)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 274, in _diff_tables_root
return self._bisect_and_diff_tables(table1, table2, info_tree)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 299, in _bisect_and_diff_tables
raise NotImplementedError(f"Cannot use a column of type {kt} as a key")
NotImplementedError: Cannot use a column of type Text(_notes=[], collation=None) as a key
If possible, please paste these as text, and not a screenshot.
Describe the environment
Running in airflow with data-diff==0.11.1