
Multiple columns key

Open vlad-bezden opened this issue 2 years ago • 12 comments

I have a database with multiple columns as a primary key. Is there any way to specify multiple columns as a key?

vlad-bezden avatar Jun 27 '22 10:06 vlad-bezden

This feature hasn't been implemented yet, but it's definitely something we are going to support.

erezsh avatar Jun 27 '22 12:06 erezsh

Is there a place where we can register our support for this feature? Appreciate it!

yuzeh avatar Jun 28 '22 05:06 yuzeh

@yuzeh Here is a good start :)

erezsh avatar Jun 28 '22 05:06 erezsh

@yuzeh @vlad-bezden two questions (and for anyone else weighing in wanting this):

  1. What database(s) are you using, and do they both use a multi-column primary key?
  2. What column types are in the primary key?

sirupsen avatar Jun 28 '22 15:06 sirupsen

I use Snowflake, with three columns as the primary (unique) key. All three columns are integers.

vlad-bezden avatar Jun 29 '22 10:06 vlad-bezden

I use mysql and have data schemas with composite primary keys as well:

CREATE TABLE `payments` (
  `customerNumber` int NOT NULL,
  `checkNumber` varchar(50) NOT NULL,
  `paymentDate` date NOT NULL,
  `amount` decimal(10,2) NOT NULL,
  PRIMARY KEY (`customerNumber`,`checkNumber`),
  CONSTRAINT `payments_ibfk_1` FOREIGN KEY (`customerNumber`) REFERENCES `customers` (`customerNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Taken from https://www.mysqltutorial.org/mysql-sample-database.aspx

One workaround is to add a surrogate key column, but this is not feasible for schemas we don't manage or own:

alter table payments add column id int auto_increment not null, add unique key (`id`);

avaitla avatar Jul 02 '22 21:07 avaitla

@avaitla @vlad-bezden @yuzeh The PR for a compound primary key is ready!

You can install it with:

pip install git+https://github.com/datafold/data-diff.git@refs/pull/158/head

And then you can use multiple -k arguments, for example:

data_diff postgresql:/// table1 postgresql:/// table2 -k id -k id2 -t timestamp

Please give it a try, and let me know if there are any problems!

erezsh avatar Jul 12 '22 12:07 erezsh

@vlad-bezden @yuzeh @avaitla I had some time to think about this problem, and I think it should work with the current version (not the new PR), by setting -k to the column with the most distinct values. You can include the other primary key columns with -c.

I don't think it will be easy to improve on this method, in terms of performance, and it seems to be the simplest way forward.
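
For the `payments` table shown earlier, that could look like the following (connection strings are placeholders, and the choice of key column is an assumption; pick whichever primary-key column has the most distinct values):

```shell
# Use the integer primary-key column as the key,
# and include the remaining key column in the comparison via -c
data_diff \
    mysql://user:pass@host1/db payments \
    mysql://user:pass@host2/db payments \
    -k customerNumber \
    -c checkNumber
```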

Can you give it a try?

If that doesn't cover your situation, please let me know.

erezsh avatar Jul 19 '22 16:07 erezsh

Perfect. Test driving now.

yuzeh avatar Jul 31 '22 20:07 yuzeh

@erezsh Did the test; unfortunately I got this error:

NotImplementedError: Cannot use a column of type Text() as a key

Any thoughts on how to work around?

yuzeh avatar Jul 31 '22 21:07 yuzeh

@sirupsen Missed your question here, apologies!

> @yuzeh @vlad-bezden two questions (and for anyone else weighing in wanting this):
>
>   1. What database(s) are you using, and do they both use a multi-column primary key?
>   2. What column types are in the primary key?

  1. BigQuery. There's no concept of a primary key in BQ; the closest you could get is to designate a combination of columns as "clustering columns".
  2. string, timestamp, date

yuzeh avatar Jul 31 '22 21:07 yuzeh

@yuzeh Can you please share the SQL type of the column you use as a key, and a few sample values from that column?

erezsh avatar Aug 01 '22 07:08 erezsh

Composite Keys - Analysis of problems, and proposed solutions

Data-diff's performance relies strongly on its ability to split the table rows into (more-or-less) evenly sized segments.

I see two main problems in providing good support for composite keys.

Any comments are welcome.

Problem 1 - Uneven distribution of keys

For a single-column key, we can rely on the assumption that its values are mostly evenly distributed. That is because in most situations the key is auto-incrementing: each new row increases it by a constant value.

This assumption breaks for composite keys. For example, in the composite key of a many-to-many table, some values may repeat themselves much more often than others. In some cases, one might expect a bell-curve distribution, where some foreign keys are much more "popular" than others.

Possible solutions are to:

  1. Sample the rows randomly, to establish the likely distribution. This is complicated to implement, and it's not clear that the cost of sampling will be worth whatever boost we get from more evenly sized segments.

  2. Recalculate min-max for new segments. It doesn't solve the problem entirely, but it's relatively cheap, and lets us throw away some of the unused values, which will somewhat improve our error-margin.

I think (2) is the right way to go, but better to test it in practice, rather than make assumptions.

While this isn't an easy problem to solve, it's much less severe if at least one of the key's components is evenly distributed, and it's unlikely that all of the components will be significantly uneven.

Problem 2 - Lexicographic splitting

Even if we assume our composite key is evenly distributed, there is no simple way to split it into evenly-sized segments.

Splitting each component of the key separately will produce unevenly sized segments (explained here: https://github.com/datafold/data-diff/pull/158#discussion_r922719708).
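
A toy illustration of the problem, assuming a many-to-many-style key where larger values of the first component carry more values of the second:

```python
from collections import Counter

# Toy composite key: for each a in 0..9, b ranges over 0..a,
# giving 55 rows in total (1 + 2 + ... + 10), skewed toward high a.
rows = sorted((a, b) for a in range(10) for b in range(a + 1))

# Splitting each component at the midpoint of its range produces four
# "quadrants" whose sizes depend entirely on the data:
quadrants = Counter((a >= 5, b >= 5) for a, b in rows)
print(dict(quadrants))  # one quadrant is empty, another holds 25 of 55 rows

# Splitting on lexicographic *position* is even by construction:
segments = [rows[i * 11:(i + 1) * 11] for i in range(5)]
assert all(len(s) == 11 for s in segments)
```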

My proposed solution is to treat the composite key as having a lexicographic order:

Step 1. Find the lowest and highest composite value.

Step 2. Find the bounds of each key component.

Step 3. Split the lexicographic space between min_value to max_value, within the known value bounds. These bounds can be updated for each new segment, as described in solution 2 of problem 1.

Pseudo-code example for a 2-column key (a,b):

min_value = SELECT a, min(b) FROM t WHERE a = (SELECT min(a) FROM t)
max_value = SELECT a, max(b) FROM t WHERE a = (SELECT max(a) FROM t)
min_bound = SELECT min(a), min(b) FROM t
max_bound = SELECT max(a), max(b) FROM t

assert min_bound <= min_value <= max_value <= max_bound

new_segment_splits = split_lexicographic_space(32, min_value, max_value, min_bound, max_bound)

The actual splitting of the lexicographic space isn't trivial either, but here's a pseudo-code implementation of it:

def split_lexicographic_space(split_count: int, min_value: tuple, max_value: tuple, min_bound: tuple, max_bound: tuple):
    space = LexicographicSpace(min_bound, max_bound)
    size = space.subtract(max_value, min_value)
    segment_size = space.divide(size, split_count)
    n = segment_size
    for i in range(split_count-1):
        yield space.add(min_value, n)
        n = space.add(n, segment_size)
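
The `LexicographicSpace` helper above is left abstract; here is one minimal, runnable interpretation for integer keys (an assumption for illustration, not the library's actual implementation), treating each tuple within the bounds as a mixed-radix number:

```python
class LexicographicSpace:
    """Tuples within [min_bound, max_bound] viewed as mixed-radix numbers,
    so tuple arithmetic reduces to plain integer arithmetic."""

    def __init__(self, min_bound: tuple, max_bound: tuple):
        self.min_bound = min_bound
        # Number of possible values for each key component
        self.radixes = [hi - lo + 1 for lo, hi in zip(min_bound, max_bound)]

    def to_int(self, v: tuple) -> int:
        n = 0
        for x, lo, r in zip(v, self.min_bound, self.radixes):
            n = n * r + (x - lo)
        return n

    def from_int(self, n: int) -> tuple:
        digits = []
        for lo, r in zip(reversed(self.min_bound), reversed(self.radixes)):
            digits.append(n % r + lo)
            n //= r
        return tuple(reversed(digits))

    def add(self, a: tuple, b: tuple) -> tuple:
        return self.from_int(self.to_int(a) + self.to_int(b))

    def subtract(self, a: tuple, b: tuple) -> tuple:
        return self.from_int(self.to_int(a) - self.to_int(b))

    def divide(self, a: tuple, by: int) -> tuple:
        return self.from_int(self.to_int(a) // by)


def split_lexicographic_space(split_count, min_value, max_value, min_bound, max_bound):
    space = LexicographicSpace(min_bound, max_bound)
    size = space.subtract(max_value, min_value)
    segment_size = space.divide(size, split_count)
    n = segment_size
    for _ in range(split_count - 1):
        yield space.add(min_value, n)
        n = space.add(n, segment_size)
```

For example, splitting a 2-column key with both components ranging over 0..9 into 4 segments yields the split points (2, 4), (4, 8), (7, 2).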

Once we have new splits for our segments, we can query them using lexicographic ordering.

Here's the code for it, using our new query library:

def lexicographic_lowerthan(key1: Sequence[Expr], key2: Sequence[Expr]):
    e1 = key1[0]
    e2 = key2[0]

    if len(key1) == 1:
        assert len(key2) == 1
        return e1 < e2

    return (e1 < e2) | ((e1 == e2) & lexicographic_lowerthan(key1[1:], key2[1:]))

...

class TableSegment:
    ...
    def make_select(self):
        k = self.key_columns
        return self.table.where( lexicographic_lowerthan(self.min_key, k) & lexicographic_lowerthan(k, self.max_key))
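
Outside the query library, the same recursion can be sanity-checked on plain Python values, where it should agree with the built-in lexicographic ordering of tuples:

```python
from itertools import product

def lexicographic_lowerthan(key1, key2):
    # Plain-value version of the query-builder helper above
    e1, e2 = key1[0], key2[0]
    if len(key1) == 1:
        assert len(key2) == 1
        return e1 < e2
    return (e1 < e2) | ((e1 == e2) & lexicographic_lowerthan(key1[1:], key2[1:]))

# Agrees with Python's tuple comparison on every pair of 3-component keys
keys = list(product(range(3), repeat=3))
assert all(lexicographic_lowerthan(a, b) == (a < b) for a in keys for b in keys)
```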

erezsh avatar Oct 06 '22 15:10 erezsh

@erezsh Where I work we're suffering from the multiple key issue, and we're floating the idea of trying to extend the data-diff library to support compound keys ourselves. I'm just curious, how far along is this feature?

GCCree avatar Jan 25 '23 03:01 GCCree

@GCCree I am close to finishing a new implementation of compound keys. I will probably post the PR next week.

erezsh avatar Jan 25 '23 12:01 erezsh

Just FYI for anyone who is interested, the PR is already up: https://github.com/datafold/data-diff/pull/375

If you want to help us out, take it for a spin and let us know how well it worked!

I hope we'll merge it soon, once the code review process is done.

erezsh avatar Feb 24 '23 14:02 erezsh

It was merged a few months ago.

I can't close this issue anymore, but someone should...

erezsh avatar May 19 '23 08:05 erezsh