data-diff Support for postgres varchar of varying length as key

Hey there! We are looking to use data-diff for verifying some of our data pipelines. While testing it out, we noticed that our postgres varchar fields can not be used as keys. The error for that is below.

WARNING:database:Mixed Alphanum lengths detected in column *********, disabling Alphanum support.
NotImplementedError: Cannot use column of type Text() as a key

Here are some examples of the keys being sampled

x99999m 
x9990037
x99_99943z1

Is it possible to add support for varchars with variable length as keys, if so do you have an estimate for the difficulty of adding it?

Sep 14 '22 19:09 k-walton

Hi @k-walton ,

I have implemented a solution for this use-case in PR #235 .

It would be really helpful if you could help us test it!

You can install it using:

pip install git+https://github.com/datafold/data-diff.git@refs/pull/235/head

You might have to first uninstall the current data-diff version in order to use it.

Sep 15 '22 08:09 erezsh

@erezsh Thanks for getting back so quickly! I tested it out and found a bug with the min and max keys

    raise ValueError(f"Error: min_key expected to be smaller than max_key! ({self.min_key} >= {self.max_key})")
ValueError: Error: min_key expected to be smaller than max_key!

I'll try and get you an example soon to reproduce it without our real data, but I just wanted to give you a quick thanks until then!

Sep 15 '22 23:09 k-walton

Thanks for letting me know. I have some guess as to why that might happen, but a concrete failing example will still be helpful.

Sep 16 '22 06:09 erezsh

Even just the min_key / max_key values returned by the error would help verify if my guess is correct. If you don't want to post them on github, you're welcome to send them to me directly.

Sep 16 '22 08:09 erezsh

@k-walton I rewrote how alphanumerics work, and I think I got it right this time (I hope!)

Please give this new one a try!

Sep 16 '22 12:09 erezsh

@erezsh I pulled down the latest, and it seems to have fixed that issue!

There is another weird behavior though. I think it something to do with how the string key space is chunked up. The whole table of > 100 million rows seems to get selected at once. Here is the redacted output.

INFO:diff_tables:Diffing tables | segments: 32, bisection threshold: 16384. key-range: ******..******, size: 4.368436102755443e-06
DEBUG:database:Running SQL (PostgreSQL): SELECT "******"::varchar FROM "public"."******" WHERE ('MIN_KEY' <= "******") AND ("******" < 'MAX_KEY')
DEBUG:database:Running SQL (PostgreSQL): SELECT "******"::varchar FROM "public"."******" WHERE ('MIN_KEY' <= "******") AND ("******" < 'MAX_KEY')

INFO:diff_tables:Diff found 0 different rows.

I imagine some way of approximating distance between strings to define midpoints will have to be added? Maybe along the lines of https://splunktool.com/get-the-string-that-is-the-midpoint-between-two-other-strings? But that is just a random guess.

Regardless, once again thanks for the help!

Sep 16 '22 20:09 k-walton

Hi @k-walton , thanks for testing it.

I updated the PR, and I believe it should now segment the table correctly.

Please give it a try and tell me if it works.

I appreciate your help!

Sep 19 '22 09:09 erezsh

Okay, it should be good to go 👍

Sep 19 '22 14:09 erezsh

@erezsh I pulled down the latest again, and everything seems to be working as expected!

Sep 19 '22 15:09 k-walton

Will the fix work for redshift too?

I saw that the PR with the fix was merged recently, and I tried installing from that commit via pip install git+https://github.com/datafold/data-diff.git@67171cb7c2fb3ce4bd9beb341d2327eb0131c20c

but I'm getting the following error after running the command data-diff --conf ~/workspace/data-diff-backup/local/config.toml --run test_diff -v -k subscription_id

[16:26:06] WARNING - [Redshift] Column 'subscription_id' of type 'Text()' has no compatibility handling. If encoding/formatting differs between databases, it may result in false positives.
Traceback (most recent call last):
  File "/Users/jayjohnson/.pyenv/versions/airflow/bin/data-diff", line 8, in <module>
    sys.exit(main())
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/__main__.py", line 110, in main
    return _main(**kw)
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/__main__.py", line 280, in _main
    for op, values in diff_iter:
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/diff_tables.py", line 167, in diff_tables
    raise error
  File "/Users/jayjohnson/.pyenv/versions/airflow/lib/python3.7/site-packages/data_diff/diff_tables.py", line 109, in diff_tables
    raise NotImplementedError(f"Cannot use column of type {key_type} as a key")
NotImplementedError: Cannot use column of type Text() as a key

The key has the type varying(535):

CREATE TABLE staging.subscriptions (
    subscription_id character varying(535) ENCODE lzo,
...

if it's easier I can open a separate issue, please let me know

Sep 28 '22 23:09 johnson-jay-l

Issue solved and merged to master.

Sep 30 '22 09:09 erezsh

Hi @johnson-jay-l ,

It appears that data-diff didn't recognize your column as alphanumeric.

Please install the latest master branch, and try again. If you're still getting this error, please check if your alphanumeric column contains characters that we don't allow (i.e. not -, _ or a-z0-9)

Sep 30 '22 09:09 erezsh

Happy to create a new issue for this, but since it's relevant to the above. I'm curious why only that set of alphanumeric characters is permitted (-, _ or a-z0-9). We have several key columns that contain . as well - would it be trivial to support that as well?

Jan 25 '23 15:01 RoderickJDunn

data-diff data-diff copied to clipboard

Support for postgres varchar of varying length as key

data-diff
data-diff copied to clipboard