data-diff
data-diff copied to clipboard
Support for postgres varchar of varying length as key
Hey there! We are looking to use data-diff for verifying some of our data pipelines. While testing it out, we noticed that our postgres varchar fields can not be used as keys. The error for that is below.
WARNING:database:Mixed Alphanum lengths detected in column *********, disabling Alphanum support.
NotImplementedError: Cannot use column of type Text() as a key
Here are some examples of the keys being sampled
x99999m
x9990037
x99_99943z1
Is it possible to add support for varchars with variable length as keys, if so do you have an estimate for the difficulty of adding it?
Hi @k-walton ,
I have implemented a solution for this use-case in PR #235 .
It would be really helpful if you could help us test it!
You can install it using:
pip install git+https://github.com/datafold/data-diff.git@refs/pull/235/head
You might have to first uninstall the current data-diff version in order to use it.
@erezsh Thanks for getting back so quickly! I tested it out and found a bug with the min and max keys
raise ValueError(f"Error: min_key expected to be smaller than max_key! ({self.min_key} >= {self.max_key})")
ValueError: Error: min_key expected to be smaller than max_key!
I'll try and get you an example soon to reproduce it without our real data, but I just wanted to give you a quick thanks until then!
Thanks for letting me know. I have some guess as to why that might happen, but a concrete failing example will still be helpful.
Even just the min_key / max_key values returned by the error would help verify if my guess is correct. If you don't want to post them on github, you're welcome to send them to me directly.
@k-walton I rewrote how alphanumerics work, and I think I got it right this time (I hope!)
Please give this new one a try!
@erezsh I pulled down the latest, and it seems to have fixed that issue!
There is another weird behavior though. I think it something to do with how the string key space is chunked up. The whole table of > 100 million rows seems to get selected at once. Here is the redacted output.
INFO:diff_tables:Diffing tables | segments: 32, bisection threshold: 16384. key-range: ******..******, size: 4.368436102755443e-06
DEBUG:database:Running SQL (PostgreSQL): SELECT "******"::varchar FROM "public"."******" WHERE ('MIN_KEY' <= "******") AND ("******" < 'MAX_KEY')
DEBUG:database:Running SQL (PostgreSQL): SELECT "******"::varchar FROM "public"."******" WHERE ('MIN_KEY' <= "******") AND ("******" < 'MAX_KEY')
INFO:diff_tables:Diff found 0 different rows.
I imagine some way of approximating distance between strings to define midpoints will have to be added? Maybe along the lines of https://splunktool.com/get-the-string-that-is-the-midpoint-between-two-other-strings? But that is just a random guess.
Regardless, once again thanks for the help!
Hi @k-walton , thanks for testing it.
I updated the PR, and I believe it should now segment the table correctly.
Please give it a try and tell me if it works.
I appreciate your help!
Okay, it should be good to go 👍
@erezsh I pulled down the latest again, and everything seems to be working as expected!
Will the fix work for redshift too?
I saw that the PR with the fix was merged recently, and I tried installing from that commit via pip install git+https://github.com/datafold/data-diff.git@67171cb7c2fb3ce4bd9beb341d2327eb0131c20c
but I'm getting the following error after running the command data-diff --conf ~/workspace/data-diff-backup/local/config.toml --run test_diff -v -k subscription_id
[16:26:06] WARNING - [Redshift] Column 'subscription_id' of type 'Text()' has no compatibility handling. If encoding/formatting differs between databases, it may result in false positives.
Traceback (most recent call last):
File "/Users/jayjohnson/.pyenv/versions/airflow/bin/data-diff", line 8, in <module>
sys.exit(main())
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/__main__.py", line 110, in main
return _main(**kw)
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/__main__.py", line 280, in _main
for op, values in diff_iter:
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/diff_tables.py", line 167, in diff_tables
raise error
File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/diff_tables.py", line 109, in diff_tables
raise NotImplementedError(f"Cannot use column of type {key_type} as a key")
NotImplementedError: Cannot use column of type Text() as a key
The key has the type varying(535):
CREATE TABLE staging.subscriptions (
subscription_id character varying(535) ENCODE lzo,
...
if it's easier I can open a separate issue, please let me know
Issue solved and merged to master.
Hi @johnson-jay-l ,
It appears that data-diff didn't recognize your column as alphanumeric.
Please install the latest master branch, and try again. If you're still getting this error, please check if your alphanumeric column contains characters that we don't allow (i.e. not -, _ or a-z0-9)
Happy to create a new issue for this, but since it's relevant to the above. I'm curious why only that set of alphanumeric characters is permitted (-, _ or a-z0-9). We have several key columns that contain . as well - would it be trivial to support that as well?