data-diff
Multiple-column key
I have a database with multiple columns as a primary key. Is there any way to specify multiple columns as a key?
This feature hasn't been implemented yet, but it's definitely something we are going to support.
Is there a place where we can register our support for this feature? Appreciate it!
@yuzeh Here is a good start :)
@yuzeh @vlad-bezden two questions (and for anyone else weighing in wanting this):
- What database(s) are you using, and do they both use a multi-column primary key?
- What column types are in the primary key?
I use Snowflake, with three columns as the primary (unique) key. All three columns are integers.
I use MySQL and have data schemas with composite primary keys as well:
CREATE TABLE `payments` (
`customerNumber` int NOT NULL,
`checkNumber` varchar(50) NOT NULL,
`paymentDate` date NOT NULL,
`amount` decimal(10,2) NOT NULL,
PRIMARY KEY (`customerNumber`,`checkNumber`),
CONSTRAINT `payments_ibfk_1` FOREIGN KEY (`customerNumber`) REFERENCES `customers` (`customerNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Taken from https://www.mysqltutorial.org/mysql-sample-database.aspx
To work around this, we would have to add a surrogate key column, but that's not feasible for schemas we don't manage or own:
alter table payments add column id int auto_increment not null, add unique key (`id`);
@avaitla @vlad-bezden @yuzeh The PR for a compound primary key is ready!
You can install it with:
pip install git+https://github.com/datafold/data-diff.git@refs/pull/158/head
And then you can use multiple -k arguments, for example:
data_diff postgresql:/// table1 postgresql:/// table2 -k id -k id2 -t timestamp
Please give it a try, and let me know if there are any problems!
@vlad-bezden @yuzeh @avaitla I had some time to think about this problem, and I think it should work with the current version (not the new PR), by setting -k to the column with the most distinct values. You can include the other primary key columns with -c.
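For example, applying this workaround to the payments table shown earlier (placeholder connection strings, and assuming customerNumber is the column with the most distinct values) might look like:
data_diff mysql:/// payments mysql:/// payments -k customerNumber -c checkNumber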
I don't think it will be easy to improve on this method in terms of performance, and it seems to be the simplest way forward.
Can you give it a try?
If that doesn't answer your situation, please let me know.
Perfect. Test driving now.
@erezsh Did the test; unfortunately, I got this error:
NotImplementedError: Cannot use a column of type Text() as a key
Any thoughts on how to work around it?
@sirupsen Missed your question here, apologies!
@yuzeh @vlad-bezden two questions (and for anyone else weighing in wanting this):
- What database(s) are you using, and do they both use a multi-column primary key?
- What column types are in the primary key?
- BigQuery. There's no concept of a primary key in BQ; the closest you can get is to designate a combination of columns as "clustering columns"
- string, timestamp, date
@yuzeh Can you please share the SQL type of the column you use as a key, and a few sample values from that column?
Composite Keys - Analysis of problems and proposed solutions
Data-diff's performance relies strongly on its ability to split the table rows into (more-or-less) evenly sized segments.
I see two main problems in providing good support for composite keys.
Any comments are welcome.
Problem 1 - Uneven distribution of keys
For a single-column key, we can rely on the assumption that its values are mostly evenly distributed, because in most situations the key increases by a roughly constant value from one row to the next.
This assumption breaks for composite keys. For example, in the composite key of a many-to-many table, some values may repeat themselves much more often than others. In some cases, one might expect a bell-curve distribution, where some foreign keys are much more "popular" than others.
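To see the effect, here is a small standalone illustration (my own snippet, not data-diff code): when a key column is heavily skewed, splitting its [min, max] range into equal-width segments produces wildly uneven row counts per segment.
import random

random.seed(0)
# Simulate a foreign-key column where small ids are far more "popular"
keys = [min(int(random.expovariate(1 / 50)), 999) for _ in range(100_000)]

lo, hi, n_segments = min(keys), max(keys), 8
width = (hi - lo + 1) / n_segments
counts = [0] * n_segments
for k in keys:
    counts[min(int((k - lo) / width), n_segments - 1)] += 1
print(counts)  # the first segments hold most of the rows, the last are nearly empty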
Possible solutions are to:
1. Sample the rows randomly, to establish the likely distribution. This is complicated to implement, and it's not clear that the cost of sampling will be worth whatever boost we get from more evenly sized segmentation.
2. Recalculate the min-max for new segments. This doesn't solve the problem entirely, but it's relatively cheap, and it lets us throw away some of the unused values, which will somewhat improve our error margin.
I think (2) is the right way to go, but better to test it in practice, rather than make assumptions.
While it's not an easy problem to solve, it's much less severe if at least one of the components of the key is evenly distributed, and it's unlikely that all the components will be significantly uneven.
Problem 2 - Lexicographic splitting
Even if we assume our composite key is evenly distributed, there is no simple way to split it into evenly-sized segments.
Splitting each component of the key separately will produce an uneven distribution (explained here: https://github.com/datafold/data-diff/pull/158#discussion_r922719708).
My proposed solution is to treat the composite key as having a lexicographic order:
Step 1. Find the lowest and highest composite value.
Step 2. Find the bounds of each key component.
Step 3. Split the lexicographic space between min_value to max_value, within the known value bounds. These bounds can be updated for each new segment, as described in solution 2 of problem 1.
Pseudo-code example for a 2-column key (a,b):
min_value = SELECT a, min(b) FROM t WHERE a = (SELECT min(a) FROM t)
max_value = SELECT a, max(b) FROM t WHERE a = (SELECT max(a) FROM t)
min_bound = SELECT min(a), min(b) FROM t
max_bound = SELECT max(a), max(b) FROM t
assert min_bound <= min_value <= max_value <= max_bound
new_segment_splits = split_lexicographic_space(32, min_value, max_value, min_bound, max_bound)
The actual splitting of the lexicographic space isn't trivial either, but here's a pseudo-code implementation of it:
def split_lexicographic_space(split_count: int, min_value: tuple, max_value: tuple, min_bound: tuple, max_bound: tuple):
    space = LexicographicSpace(min_bound, max_bound)
    size = space.subtract(max_value, min_value)
    segment_size = space.divide(size, split_count)
    n = segment_size
    for i in range(split_count - 1):
        yield space.add(min_value, n)
        n = space.add(n, segment_size)
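To make the pseudo-code concrete, here is a minimal runnable sketch of a LexicographicSpace for integer key components (my own illustration, not data-diff's actual implementation): each composite key is encoded as a number in a mixed-radix system, where the i-th radix is the size of the i-th component's value range, so subtract/divide/add reduce to plain integer arithmetic.
from typing import Tuple

class LexicographicSpace:
    def __init__(self, min_bound: Tuple[int, ...], max_bound: Tuple[int, ...]):
        self.min_bound = min_bound
        # Radix of each "digit" = how many values that key component can take
        self.radixes = tuple(hi - lo + 1 for lo, hi in zip(min_bound, max_bound))

    def _to_int(self, value: Tuple[int, ...]) -> int:
        # Interpret a key tuple as a mixed-radix number, offset by min_bound
        n = 0
        for v, lo, radix in zip(value, self.min_bound, self.radixes):
            n = n * radix + (v - lo)
        return n

    def _from_int(self, n: int) -> Tuple[int, ...]:
        # Inverse of _to_int: decode an integer back into a key tuple
        digits = []
        for lo, radix in reversed(list(zip(self.min_bound, self.radixes))):
            n, d = divmod(n, radix)
            digits.append(d + lo)
        return tuple(reversed(digits))

    def add(self, value: Tuple[int, ...], offset: Tuple[int, ...]) -> Tuple[int, ...]:
        return self._from_int(self._to_int(value) + self._to_int(offset))

    def subtract(self, a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
        return self._from_int(self._to_int(a) - self._to_int(b))

    def divide(self, value: Tuple[int, ...], count: int) -> Tuple[int, ...]:
        return self._from_int(self._to_int(value) // count)

# Example: key components bounded by (0, 0)-(9, 9), data spanning (0, 3)-(9, 7)
space = LexicographicSpace((0, 0), (9, 9))
size = space.subtract((9, 7), (0, 3))  # 94 keys apart -> (9, 4)
step = space.divide(size, 4)           # a quarter of that distance -> (2, 3)
print(space.add((0, 3), step))         # first split point -> (2, 6)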
Once we have new splits for our segments, we can query them using lexicographic ordering.
Here's the code for it, using our new query library:
def lexographic_lowerthan(key1: Sequence[Expr], key2: Sequence[Expr]):
    e1 = key1[0]
    e2 = key2[0]
    if len(key1) == 1:
        assert len(key2) == 1
        return e1 < e2
    return (e1 < e2) | ((e1 == e2) & lexographic_lowerthan(key1[1:], key2[1:]))

...

class TableSegment:
    ...

    def make_select(self):
        k = self.key_columns
        return self.table.where(
            lexographic_lowerthan(self.min_key, k) & lexographic_lowerthan(k, self.max_key)
        )
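As a quick sanity check (my own snippet, not part of data-diff), the same recursion evaluated on plain Python values instead of SQL expressions agrees with Python's built-in lexicographic ordering of tuples:
import itertools

def lex_lt(key1, key2):
    # Same structure as lexographic_lowerthan above, on concrete values
    e1, e2 = key1[0], key2[0]
    if len(key1) == 1:
        return e1 < e2
    return (e1 < e2) or ((e1 == e2) and lex_lt(key1[1:], key2[1:]))

# Exhaustively compare against Python's native tuple ordering
vals = range(3)
for a in itertools.product(vals, repeat=2):
    for b in itertools.product(vals, repeat=2):
        assert lex_lt(a, b) == (a < b)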
@erezsh Where I work we're suffering from the multiple key issue, and we're floating the idea of trying to extend the data-diff library to support compound keys ourselves. I'm just curious, how far along is this feature?
@GCCree I am close to finishing a new implementation of compound keys. I will probably post the PR next week.
Just FYI for anyone who is interested, the PR is already up: https://github.com/datafold/data-diff/pull/375
If you want to help us out, take it for a spin and let us know how well it worked!
I hope we'll merge it soon, once the code review process is done.
It was merged a few months ago.
I can't close this issue anymore, but someone should...