ratatool
BigDiffy with datasets that contain duplicates
There are plenty of cases where I've needed to diff a dataset that contains duplicates. The current process is to output to BQ and manually run a query to count by key.
The only way I could think of to compare this data is by counting by a custom key built from the data itself. Does it make sense to add a diff option that just counts by key? Count on both sides, then check that all keys match and that the counts match. This is not meant to solve all cases.
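Rough sketch of what I mean, using Scio (the `countDiff` helper and `keyFn` are just illustrative names, not existing ratatool API):

```scala
import com.spotify.scio.values.SCollection
import org.apache.avro.generic.GenericRecord

// Count records per key on each side, then keep only keys whose counts differ
// (including keys that are present on one side only).
def countDiff(lhs: SCollection[GenericRecord],
              rhs: SCollection[GenericRecord],
              keyFn: GenericRecord => String): SCollection[(String, (Long, Long))] =
  lhs.map(keyFn).countByValue
    .fullOuterJoin(rhs.map(keyFn).countByValue)
    .collect { case (k, (l, r)) if l != r => (k, (l.getOrElse(0L), r.getOrElse(0L))) }
```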
There needs to be a way to disambiguate between different rows. We could automate generating unique keys, but I also think this is something that can be done manually upstream, either by generating a new dataset or by importing the BigDiffy library and pre-processing in code.
This is also why we allow generating concatenated keys from multiple columns. Is there a reason that doesn't work for your case?
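For reference, when importing the library a concatenated key is just a key function over multiple fields, e.g. (field names here are made up, and depending on the ratatool version the key function may return a `String` or a `MultiKey`):

```scala
import org.apache.avro.generic.GenericRecord

// Concatenate two columns into a single key for the diff.
val keyFn: GenericRecord => String =
  r => s"${r.get("user_id")}_${r.get("event_time")}"
```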
Yeah, pre-aggregation was also one of the first things on my mind, because the suggested approach has very little in common with the current features in ratatool-diffy. Two ideas around pre-aggregation:
- Pre-aggregate rows by key with some function (e.g. mean) and then diff
  - This should have the same result for non-duplicates
  - There’s some thought needed for nested records (and for choosing the function in non-numeric cases)
  - Sounds like it would ignore too many corner cases
- Pre-aggregate rows by key by creating one additional level of nesting and then diff (rough sketch after this list)
  - We support arbitrary levels of nesting, so this in theory would work (need to take a closer look at how we handle record types)
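A rough sketch of the second idea, assuming the diff then runs over the nested records (`Row` and `NestedRow` are illustrative placeholders, not ratatool types):

```scala
import com.spotify.scio.values.SCollection

case class Row(userId: String, payload: String)
case class NestedRow(userId: String, rows: List[Row])

// Collapse all rows sharing a key into a single record with one extra level
// of nesting, so the downstream diff sees exactly one record per key.
def nestByKey(in: SCollection[Row]): SCollection[NestedRow] =
  in.keyBy(_.userId)
    .groupByKey
    // Sort so the nested order is deterministic on both sides of the diff.
    .map { case (k, rows) => NestedRow(k, rows.toList.sortBy(_.payload)) }
```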
> There needs to be a way to disambiguate between different rows. We could automate generating unique keys but I also think this is a thing that can be manually done upstream, either by generating a new dataset, or importing the BigDiffy library and pre-processing in code.
> This is also why we allow generating concatenated keys from multiple columns. Is there a reason that doesn't work for your case?
The specific case is that there could be multiple records that look exactly the same across all field values (still valid events). Any combination of columns would yield duplicates here.
The pre-processing does sound like an option, though. I could set up a custom diffy job for these rare cases where I salt the unique key with a counter or similar ({uniquekey}_dup1, {uniquekey}_dup2) and then pass that to the diffy library.
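Something like this (field names are placeholders, and the exact `BigDiffy.diff` signature should be checked against the ratatool-diffy version in use, since the key type has changed between releases):

```scala
import com.spotify.scio.coders.Coder
import com.spotify.scio.values.SCollection
import org.apache.avro.generic.GenericRecord

// Salt the key with a per-duplicate counter so every record ends up with a
// unique key before it is handed to the BigDiffy library.
def saltDuplicates(in: SCollection[GenericRecord],
                   keyFn: GenericRecord => String)
                  (implicit c: Coder[GenericRecord]): SCollection[(String, GenericRecord)] =
  in.keyBy(keyFn)
    .groupByKey
    .flatMap { case (k, records) =>
      // Any deterministic ordering works: if all field values are equal it
      // doesn't matter which record gets _dup1 vs _dup2.
      records.zipWithIndex.map { case (r, i) => (s"${k}_dup${i + 1}", r) }
    }
```

The salted key (the first element of each tuple) would then be what the key function handed to the library returns.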
Yeah, I believe preprocessing will still be effective there: if all values are the same, it doesn't matter whether dup1 and dup2 are assigned differently on each run.
The main issue is that we don't want to bake too much logic into the jobs, and we'd rather provide the flexibility of importing as a library for highly specific use cases. If this winds up being very common then I think the discussion is a bit different, but we haven't really seen that so far. The jobs are already quite unwieldy as is, and we'd like not to bloat them any further unless needed.