Wrongly assumed data type when mixing UUIDs and alphanumeric primary keys
Description
When using a table whose primary keys mix UUIDs and alphanumeric strings, reladiff incorrectly assumes all primary keys are UUIDs. This causes the following error during data sampling:
```
packages/reladiff/diff_tables.py", line 322, in _parse_key_range_result
    raise type(e)(f"Cannot apply {key_types} to '{min_key_values}', '{max_key_values}'.") from e
ValueError: Cannot apply [String_UUID()] to '('0001aa71db2877006cc4fd051d96195f',)', '('system52923',)'.
```
The issue appears when the primary keys contain values that happen to match the UUID format (like 0001aa71db2877006cc4fd051d96195f) alongside other alphanumeric values (like system52923) that cannot be parsed as UUIDs.
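The failure mode can be reproduced outside reladiff with the standard library's uuid module: the 32-hex-digit key parses fine, while the other alphanumeric key raises ValueError. A minimal sketch, using the two sample keys from the error message above:

```python
import uuid

# The two key values from the error message: one is 32 hex digits
# (a valid dash-less UUID), the other is plain alphanumeric text.
keys = ["0001aa71db2877006cc4fd051d96195f", "system52923"]

for key in keys:
    try:
        # uuid.UUID accepts 32 hex digits, with or without dashes
        parsed = uuid.UUID(key)
        print(f"{key!r} parses as UUID {parsed}")
    except ValueError:
        print(f"{key!r} is not a valid UUID")
```

If the sampled rows happen to contain only keys of the first kind, every sample parses and String_UUID() looks like the right choice, until a key of the second kind is encountered later in the run.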
Environment
- reladiff version: 0.6.0
- OS: macOS
- Database: PostgreSQL 15
Expected behavior
reladiff should handle mixed primary key formats correctly, potentially by using a more flexible type like String_Alphanum instead of assuming all keys are UUIDs.
Proposed solution
Is there a way to explicitly specify the primary key data type to enforce String_Alphanum or a similar more permissive type? Alternatively, could reladiff be enhanced to automatically fallback to a more general string type when UUID parsing fails?
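As an illustration of the fallback idea (this is not reladiff's actual API; detect_key_type and the type names are hypothetical stand-ins), the detection could try the strictest type first and fall back to a more permissive string type as soon as any sampled key fails to parse:

```python
import uuid

def detect_key_type(sample_keys):
    """Hypothetical sketch: return "String_UUID" only if *every*
    sampled key parses as a UUID; otherwise fall back to the more
    permissive "String_Alphanum"."""
    for key in sample_keys:
        try:
            uuid.UUID(key)
        except ValueError:
            return "String_Alphanum"
    return "String_UUID"

print(detect_key_type(["0001aa71db2877006cc4fd051d96195f"]))
# -> String_UUID  (only UUID-shaped keys sampled)

print(detect_key_type(["0001aa71db2877006cc4fd051d96195f", "system52923"]))
# -> String_Alphanum  (mixed sample triggers the fallback)
```

The catch, discussed below in the thread, is that this only helps if the offending key actually appears in the sample; a fallback triggered mid-diff is a different, harder problem.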
Additional information
The issue occurs during data sampling when reladiff attempts to determine the primary key range. It appears that the type detection logic assumes homogeneous primary key types and fails when encountering mixed formats.
Slightly related to #51
Makes sense. Just to clarify, does the error happen because the random sampling only encounters UUIDs?
I think both ideas are good.
- Fall-back when parsing fails. It's the most user-friendly, but might be tricky to implement. Possibly reladiff would have to restart the diffing process, as changing the type would re-order the keys, which would make it impractical to merge with the initial attempt. But maybe I'm wrong?
- Forcing the type is the simplest and most efficient solution. It requires more user intervention, which is unfortunate, but I think it's reasonable for such an unusual situation.
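To see why changing the type re-orders the keys, here is a toy illustration (not reladiff code): a key type that compares values numerically, or by length first, orders them differently from plain lexicographic string comparison, so segment boundaries computed under one type are meaningless under the other:

```python
# Toy example: the same keys under two interpretations.
keys = ["9", "10", "100"]

as_strings = sorted(keys)            # plain lexicographic comparison
as_numbers = sorted(keys, key=int)   # numeric-aware comparison

print(as_strings)  # -> ['10', '100', '9']
print(as_numbers)  # -> ['9', '10', '100']
```

Any key ranges already segmented under the first ordering can't be reused under the second, which is the restart concern above.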
For data quality checks, we expect the same result each time when using the same dataset.
I'm looking at the code right now. From what I see, reladiff doesn't do deterministic sampling or sort before sampling:
```
samples_by_row = self.query(
    table(*table_path).select(*fields).where(Code(where) if where else SKIP).limit(sample_size), list
)
```
This raises a new concern. Without sorting, we get random samples each time. So one database might get samples that help detect the right data type, while another might get samples that lead to wrong data type assumptions.
Can we add sorting as a simple fix? This would make the behavior predictable.
Yes, that makes sense, and I don't see a reason not to. We already have the keys, and so we can sort by them, in the same way as we sort in the main algorithm.
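A sketch of the deterministic-sampling idea in plain SQL (shown here with sqlite3 for self-containment; reladiff's query builder would express this differently): ordering by the key before applying the limit makes every engine and every run see the same sample.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("system52923",), ("0001aa71db2877006cc4fd051d96195f",), ("a1",)],
)

# Without ORDER BY, row order is an implementation detail and may differ
# between engines and runs; with ORDER BY id the sample is deterministic.
sample = conn.execute("SELECT id FROM t ORDER BY id LIMIT 2").fetchall()
print(sample)
# -> [('0001aa71db2877006cc4fd051d96195f',), ('a1',)]
```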