Enable custom splits in `DataSplitter`

Open martinkim0 opened this issue 5 months ago • 0 comments

See #1063. CC @canergen

Two ways we could implement this:

`train_indices`, `validation_indices`, and `test_indices`

  • Need to validate that indices are within range and check for overlaps
  • Easier to adjust splits by changing indices
  • Requires manual tracking of indices
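
The range and overlap checks for this option could look roughly like the sketch below. It is only an illustration: `validate_split_indices` and its `n_obs` parameter are hypothetical names, not part of scvi-tools, and rejecting duplicates within a single split is an extra assumption.

```python
import numpy as np


def validate_split_indices(n_obs, train_indices, validation_indices=(), test_indices=()):
    """Hypothetical helper: check that split indices are in range and disjoint."""
    splits = {
        "train": np.asarray(train_indices, dtype=np.int64),
        "validation": np.asarray(validation_indices, dtype=np.int64),
        "test": np.asarray(test_indices, dtype=np.int64),
    }
    for name, idx in splits.items():
        if idx.ndim != 1:
            raise ValueError(f"{name} indices must be one-dimensional")
        # Every index must refer to an existing observation.
        if idx.size and (idx.min() < 0 or idx.max() >= n_obs):
            raise ValueError(f"{name} indices must lie in [0, {n_obs})")
        # Assumption: an observation should not appear twice in one split.
        if np.unique(idx).size != idx.size:
            raise ValueError(f"{name} indices contain duplicates")
    # No observation may belong to more than one split.
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if np.intersect1d(splits[first], splits[second]).size:
                raise ValueError(f"{first} and {second} indices overlap")
    return splits
```

For example, `validate_split_indices(10, [0, 1, 2], [3, 4], [5])` returns the three arrays unchanged, while passing index `3` to both train and validation raises a `ValueError`.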

`set_split_obs_key` that specifies an `obs` column in the data that should contain the values `"train"`, `"validation"`, and `"test"`

  • Need to validate that this column only contains the correct values
  • Potentially expensive for disk-backed datasets as we would need to load in this column for all observations prior to training (if we want to perform validation of the values). If a user wants to change the splits, they'll also have to write to disk
  • Ties the splitting logic to the data, so that this information is carried along with the dataset downstream
  • One argument instead of three
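
A minimal sketch of the value check for this option, which also turns the column into per-split indices. The helper name `indices_from_split_labels` is hypothetical and not part of scvi-tools. It takes the labels as a plain array, so with an in-memory AnnData object one would presumably pass something like `adata.obs[key].to_numpy()`. Note that the whole column is materialized, which is the cost mentioned above for disk-backed data.

```python
import numpy as np

ALLOWED_SPLITS = ("train", "validation", "test")


def indices_from_split_labels(labels):
    """Hypothetical helper: validate split labels and return indices per split."""
    labels = np.asarray(labels, dtype=object)
    # Reject any value other than the three allowed split names.
    unexpected = set(labels.tolist()) - set(ALLOWED_SPLITS)
    if unexpected:
        found = sorted(str(value) for value in unexpected)
        raise ValueError(f"split column contains unexpected values: {found}")
    # Each observation carries exactly one label, so the splits cannot overlap.
    return {name: np.flatnonzero(labels == name) for name in ALLOWED_SPLITS}
```

Because every observation has a single label, this route needs no overlap check, but a missing or misspelled label makes the call fail rather than silently dropping the observation.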

This feature will be important for integrating scib-metrics with the training loop as we want good representation of batches and cell types in the validation set.

martinkim0, Feb 22 '24 16:02