Add `get_random_subset` poc utility function
CU-86azvqpqe Resolve #1877
A few considerations regarding this PR:
1 - NaNs handling: Currently, I don't drop NaN foreign keys
2 - Randomness: For reproducibility, I set a seed, is it fine?
3 - Index: Should we reset the index of the tables at the end after all the dropping is done? This is also a question for drop_unknown_references.
4 - Verbosity: To be consistent with the other POC methods, I added a verbose parameter to simplify_schema()
5 - Disconnected schema: Subsampling disconnected schema should work with get_random_subset. I wrote down a test where I mocked the metadata validation since disconnected schemas are not supported there.
Thanks for your review and your thoughts on this ;)
@R-Palazzo just to respond to some of your considerations:
1 - NaNs handling: Currently, I don't drop NaN foreign keys
I think it might make more sense to treat NaNs similarly to how we do with drop_unknown_references and add a flag to indicate if we should drop them or not. As an added consideration, if we don't drop them, maybe should we update the functionality to drop a proportional number of NaN foreign keys to keep the table balanced?
2 - Randomness: For reproducibility, I set a seed, is it fine?
I'm actually not sure we want a fixed seed here, since it might be helpful to re-run the function to get a different subsample. Maybe instead we could control the randomness in a similar way to how we control it when we sample from synthesizers?
3 - Index: Should we reset the index of the tables at the end after all the dropping is done? This is also a question for drop_unknown_references.
In my opinion, I think we can leave the index as-is. We don't use the index, and users might want to be able to compare the subsampled tables back to their original data.
4 - Verbosity: To be consistent with the other POC methods, I added a verbose parameter to simplify_schema()
Nice, I think this works well :)