data-diff icon indicating copy to clipboard operation
data-diff copied to clipboard

ValueError: invalid literal for int() with base 36

Open tripathiabhay09 opened this issue 1 year ago • 6 comments

Hi Team,

I am trying to run diff between postgres and bq. My primary key column is string(uuid). Example - 'zZsNL-xxt2o-REN'

When doing the diff, I am getting the following exception:

[2022-09-05, 11:40:45 AEST] {taskinstance.py:1902} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 421, in _parse_key_range_result
    return cls(mn), cls(mx) + 1
  File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/utils.py", line 94, in __add__
    res = self.new(int=self.int + other)
  File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/utils.py", line 75, in int
    return int(self._str, len(alphanums))
ValueError: invalid literal for int() with base 36: 'zZsNL-xxt2o-REN'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/dataquality_metrics/utils/quality_metrics.py", line 54, in find_missing_rows
    raise ev
  File "/opt/airflow/dags/dataquality_metrics/utils/quality_metrics.py", line 48, in find_missing_rows
    diff = list(dif_iter)
  File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 414, in diff_tables
    raise error
  File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 367, in diff_tables
    min_key1, max_key1 = self._parse_key_range_result(key_type, next(key_ranges))
  File "/home/airflow/.local/lib/python3.8/site-packages/data_diff/diff_tables.py", line 423, in _parse_key_range_result
    raise type(e)(f"Cannot apply {key_type} to {mn}, {mx}.") from e
ValueError: Cannot apply String_Alphanum(length=15) to 016AL-NjNQP-REN, zZsNL-xxt2o-REN.

I have tried recreating this error by taking the following samples from utils.py code

import string 
alphanums = string.digits + string.ascii_lowercase
def numberToBase(num, base):
    digits = []
    while num > 0:
        num, remainder = divmod(num, base)
        digits.append(remainder)
    return "".join(alphanums[i] for i in digits[::-1])

xx = 'zZsNL-xxt2o-REN'
yy = int(xx, len(alphanums))
str = numberToBase(yy, len(alphanums))  

tripathiabhay09 avatar Sep 05 '22 02:09 tripathiabhay09

Hi.

Same error here using v0.2.5:

ValueError: Cannot apply String_Alphanum(length=13) to 0840001-50--1, 9635099-89-PV23.

Stack Trace:
  File ".../__pypackages__/3.9/lib/data_diff/diff_tables.py", line 414, in diff_tables
    raise error
  File ".../__pypackages__/3.9/lib/data_diff/diff_tables.py", line 367, in diff_tables
    min_key1, max_key1 = self._parse_key_range_result(key_type, next(key_ranges))
  File ".../__pypackages__/3.9/lib/data_diff/diff_tables.py", line 423, in _parse_key_range_result
    raise type(e)(f"Cannot apply {key_type} to {mn}, {mx}.") from e

The above exception was caused by the following exception:
ValueError: Expected alphanumeric value of length 13, but got '9635099-89-PV23'.

Thanks for the effort and the nice tool!

IgnorantWalking avatar Sep 06 '22 07:09 IgnorantWalking

Hi @IgnorantWalking !

Our alphanum implementation doesn't expect to see dashes (-) between the numbers, unless it's a UUID.

A few questions:

  1. In your system, do dashes count as a digit?
  2. Do you have any other special symbols in your keys?
  3. Are your keys varying in length? (It looks like some are 13, some are 15, etc.)

erezsh avatar Sep 06 '22 07:09 erezsh

Hi @erezsh.

  1. Yes
  2. Yes, any other alphanumeric character could be part of the key :/
  3. Yes :/

But, let me explain the specific use case because maybe other approaches are possible:

We're trying to compare a table that has an internal numeric ID, used as the private key, and an alphanumeric unique "business key". The problem is that the ID uses different sequences in the two different databases we're trying to compare, so their values don't match. All the other columns have the same contents, but using a different internal ID.

So we're trying to use the alphanumeric business key as the key column to make the comparison and fill the "extra columns" parameter with all the other table columns, all but the ID, that we want to keep outside of the comparison.

IgnorantWalking avatar Sep 06 '22 09:09 IgnorantWalking

Yes, I understand your use-case, it makes sense to me.

Keys of varying size aren't supported right now, but they are on our TODO. I believe they shouldn't be too hard to add.

As for (2), I wouldn't consider - to be alphanumeric. Do you have any other non-alphanumeric characters? Or is it only "a" to "z", "A" to "Z", and "-" ?

erezsh avatar Sep 06 '22 09:09 erezsh

Yeah, alphanumeric characters and the "-". Something like: [a-zA-Z0-9-]

IgnorantWalking avatar Sep 06 '22 11:09 IgnorantWalking

Hi @IgnorantWalking ,

I have implemented a solution for this use-case in PR #235 .

It's a complete re-write of how we handle alphanumerics.

It would be really helpful if you could help us test it!

You can install it using:

pip install git+https://github.com/datafold/data-diff.git@refs/pull/235/head

You might have to first uninstall the current data-diff version in order to use it.

erezsh avatar Sep 20 '22 09:09 erezsh

Fixed and merged to master.

erezsh avatar Sep 30 '22 09:09 erezsh