SDV Add `get_random_subset` poc utility function

CU-86azvqpqe Resolve #1877

A few considerations regarding this PR: 1 - NaNs handling: Currently, I don't drop NaN foreign keys 2 - Randomness: For reproducibility, I set a seed, is it fine? 3 - Index: Should we reset the index of the tables at the end after all the dropping is done? This is also a question for drop_unknown_references. 4 - Verbosity: To be consistent with the other POC methods, I added a verbose parameter to simplify_schema() 5 - Disconnected schema: Subsampling disconnected schema should work with get_random_subset. I wrote down a test where I mocked the metadata validation since disconnected schemas are not supported there.

Thanks for your review and your thoughts on this ;)

Apr 18 '24 12:04 R-Palazzo

Task linked: CU-86azvqpqe SDV - Add get_random_subset poc utility function #1877

Apr 18 '24 12:04 sdv-team

@R-Palazzo just to respond to some of your considerations:

1 - NaNs handling: Currently, I don't drop NaN foreign keys

I think it might make more sense to treat NaNs similarly to how we do with drop_unknown_references and add a flag to indicate if we should drop them or not. As an added consideration, if we don't drop them, maybe should we update the functionality to drop a proportional number of NaN foreign keys to keep the table balanced?

2 - Randomness: For reproducibility, I set a seed, is it fine?

I'm actually not sure we want a fixed seed here, since it might be helpful to re-run the function to get a different subsample. Maybe instead we could control the randomness in a similar way to how we control it when we sample from synthesizers?

3 - Index: Should we reset the index of the tables at the end after all the dropping is done? This is also a question for drop_unknown_references.

In my opinion, I think we can leave the index as-is. We don't use the index, and users might want to be able to compare the subsampled tables back to their original data.

4 - Verbosity: To be consistent with the other POC methods, I added a verbose parameter to simplify_schema()

Nice, I think this works well :)

Apr 22 '24 18:04 frances-h