data-diff
data-diff copied to clipboard
ValueError: invalid literal for int() with base 36
Hi Team,
I am trying to run diff between postgres and bq. My primary key column is string(uuid). Example - 'zZsNL-xxt2o-REN'
When doing the diff, I am getting the following exception:
[2022-09-05, 11:40:45 AEST] {taskinstance.py:1902} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 421, in _parse_key_range_result
return cls(mn), cls(mx) + 1
File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/utils.py", line 94, in __add__
res = self.new(int=self.int + other)
File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/utils.py", line 75, in int
return int(self._str, len(alphanums))
ValueError: invalid literal for int() with base 36: 'zZsNL-xxt2o-REN'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 171, in execute
return_value = self.execute_callable()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 189, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/dataquality_metrics/utils/quality_metrics.py", line 54, in find_missing_rows
raise ev
File "/opt/airflow/dags/dataquality_metrics/utils/quality_metrics.py", line 48, in find_missing_rows
diff = list(dif_iter)
File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 414, in diff_tables
raise error
File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 367, in diff_tables
min_key1, max_key1 = self._parse_key_range_result(key_type, next(key_ranges))
File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 423, in _parse_key_range_result
raise type(e)(f"Cannot apply {key_type} to {mn}, {mx}.") from e
ValueError: Cannot apply String_Alphanum(length=15) to 016AL-NjNQP-REN, zZsNL-xxt2o-REN.
I have tried recreating this error by taking the following samples from utils.py code
import string
alphanums = string.digits + string.ascii_lowercase
def numberToBase(num, base):
digits = []
while num > 0:
num, remainder = divmod(num, base)
digits.append(remainder)
return "".join(alphanums[i] for i in digits[::-1])
xx = 'zZsNL-xxt2o-REN'
yy = int(xx, len(alphanums))
str = numberToBase(yy, len(alphanums))
Hi.
Same error here using v0.2.5:
ValueError: Cannot apply String_Alphanum(length=13) to 0840001-50--1, 9635099-89-PV23.
Stack Trace:
File ".../__pypackages__/3.9/lib/data_diff/diff_tables.py", line 414, in diff_tables
raise error
File ".../__pypackages__/3.9/lib/data_diff/diff_tables.py", line 367, in diff_tables
min_key1, max_key1 = self._parse_key_range_result(key_type, next(key_ranges))
File ".../__pypackages__/3.9/lib/data_diff/diff_tables.py", line 423, in _parse_key_range_result
raise type(e)(f"Cannot apply {key_type} to {mn}, {mx}.") from e
The above exception was caused by the following exception:
ValueError: Expected alphanumeric value of length 13, but got '9635099-89-PV23'.
Thanks for the effort and the nice tool!
Hi @IgnorantWalking !
Our alphanum implementation doesn't expect to see dashes (-
) between the numbers, unless it's a UUID.
A few questions:
- In your system, do dashes count as a digit?
- Do you have any other special symbols in your keys?
- Are your keys varying in length? (It looks like some are 13, some are 15, etc.)
Hi @erezsh.
- Yes
- Yes, any other alphanumeric character could be part of the key :/
- Yes :/
But, let me explain the specific use case because maybe other approaches are possible:
We're trying to compare a table that has an internal numeric ID, used as the private key, and an alphanumeric unique "business key". The problem is that the ID uses different sequences in the two different databases we're trying to compare, so their values don't match. All the other columns have the same contents, but using a different internal ID.
So we're trying to use the alphanumeric business key as the key column to make the comparison and fill the "extra columns" parameter with all the other table columns, all but the ID, that we want to keep outside of the comparison.
Yes, I understand your use-case, it makes sense to me.
Keys of varying size aren't supported right now, but they are on our TODO. I believe they shouldn't be too hard to add.
As for (2), I wouldn't consider -
to be alphanumeric. Do you have any other non-alphanumeric characters? Or is it only "a" to "z", "A" to "Z", and "-" ?
Yeah, alphanumeric characters and the "-". Something like: [a-zA-Z0-9-]
Hi @IgnorantWalking ,
I have implemented a solution for this use-case in PR #235 .
It's a complete re-write of how we handle alphanumerics.
It would be really helpful if you could help us test it!
You can install it using:
pip install git+https://github.com/datafold/data-diff.git@refs/pull/235/head
You might have to first uninstall the current data-diff version in order to use it.
Fixed and merged to master.