scvi-tools
Enable custom splits in `DataSplitter`
See #1063. CC @canergen
Two ways we could implement this:
`train_indices`, `validation_indices`, and `test_indices` arguments
- Need to validate that indices are within range and check for overlaps
- Easier to adjust splits by changing indices
- Requires manual tracking of indices
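
To make the validation cost of the index-based option concrete, here is a minimal sketch of the range and overlap checks. `validate_split_indices` and `n_obs` are hypothetical names used for illustration, not existing scvi-tools API, and the sketch uses plain Python rather than whatever `DataSplitter` would use internally.

```python
from typing import Optional, Sequence


def validate_split_indices(
    n_obs: int,
    train_indices: Sequence[int],
    validation_indices: Optional[Sequence[int]] = None,
    test_indices: Optional[Sequence[int]] = None,
) -> None:
    """Hypothetical helper: raise ValueError on out-of-range or repeated indices."""
    splits = {
        "train": train_indices,
        "validation": validation_indices,
        "test": test_indices,
    }
    seen = {}  # index -> name of the split that first claimed it
    for name, indices in splits.items():
        if indices is None:
            continue  # an omitted split is treated as empty
        for idx in indices:
            if not 0 <= idx < n_obs:
                raise ValueError(f"{name} index {idx} is outside [0, {n_obs})")
            if idx in seen:
                # Also catches an index listed twice within the same split.
                raise ValueError(
                    f"index {idx} is listed more than once ({seen[idx]}, {name})"
                )
            seen[idx] = name


# Disjoint, in-range splits pass silently.
validate_split_indices(10, [0, 1, 2], [3, 4], [5])
```

Both checks need only the index lists and the number of observations, so they never touch the underlying data.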
`set_split_obs_key`, which specifies an `obs` column in the data that should contain the values `"train"`, `"validation"`, and `"test"`
- Need to validate that this column only contains the correct values
- Potentially expensive for disk-backed datasets: to validate the values, we would need to load this column for all observations before training. A user who wants to change the splits would also have to write to disk
- Ties the splitting logic to the data, so the split information is carried along with the dataset downstream
- One argument instead of three
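
For comparison, a minimal sketch of how the column-based option could validate labels and derive per-split indices in one pass. `indices_from_split_labels` and `ALLOWED_LABELS` are hypothetical names, not existing scvi-tools API. The function takes any iterable of labels; with AnnData, those would presumably come from `adata.obs[key]`.

```python
from typing import Dict, Iterable, List

# Label values taken from the proposal above.
ALLOWED_LABELS = ("train", "validation", "test")


def indices_from_split_labels(labels: Iterable[str]) -> Dict[str, List[int]]:
    """Hypothetical helper: map each split name to the positions carrying its label."""
    split_indices: Dict[str, List[int]] = {name: [] for name in ALLOWED_LABELS}
    for position, label in enumerate(labels):
        if label not in split_indices:
            raise ValueError(
                f"unexpected split label {label!r} at position {position}; "
                f"expected one of {ALLOWED_LABELS}"
            )
        split_indices[label].append(position)
    return split_indices


splits = indices_from_split_labels(["train", "test", "train", "validation"])
```

Each observation has exactly one label, so no overlap check is needed. The cost is that the whole column has to be read, which is the concern raised for disk-backed datasets.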
This feature will be important for integrating scib-metrics with the training loop as we want good representation of batches and cell types in the validation set.