Improve `Dataset` class
The current `Dataset` implementation is very limited and does not interoperate nicely with PyTorch, for example.
We should bring it closer to PyTorch's implementation, separate training and testing data more cleanly, and allow passing transforms that are applied to the input samples on the fly.
This is an example of one possible implementation, heavily inspired by the PyTorch implementation:
```python
from typing import Any, Callable, Iterable, Optional

from numpy.typing import NDArray


class Dataset:
    def __init__(
        self,
        X: NDArray,
        y: NDArray,
        feature_names: Optional[Iterable[str]] = None,
        target_names: Optional[Iterable[str]] = None,
        data_names: Optional[Iterable[str]] = None,
        description: Optional[str] = None,
        transform: Optional[Callable[[NDArray], Any]] = None,
    ):
        self.X = X
        self.y = y
        self.feature_names = list(feature_names or [])
        self.target_names = list(target_names or [])
        self.data_names = list(data_names or [])
        self.description = description or "No description"
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x, y = self.X[idx], self.y[idx]
        if self.transform:
            x = self.transform(x)
        return x, y

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]
```
```python
from typing import Sequence


class Subset(Dataset):
    def __init__(self, dataset: Dataset, indices: Sequence[int]):
        self.dataset = dataset
        self.indices = indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

    @property
    def feature_names(self):
        return self.dataset.feature_names

    @property
    def target_names(self):
        return self.dataset.target_names

    @property
    def description(self):
        return "Subset of: " + self.dataset.description
```
These classes support access by index, which is what's required to work with PyTorch's `DataLoader` class.
We could also implement a helper function that creates a train/test split as a pair of `Subset` objects:
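To make the protocol concrete, here is a minimal, hypothetical stand-in (not the full class above) showing that `__len__` plus integer `__getitem__`, with an optional per-sample transform, is all a `DataLoader`-style consumer needs:

```python
import numpy as np

# Condensed, hypothetical stand-in for the Dataset sketched above,
# kept just large enough to demonstrate the indexing protocol.
class IndexedDataset:
    def __init__(self, X, y, transform=None):
        self.X, self.y, self.transform = X, y, transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x, y = self.X[idx], self.y[idx]
        if self.transform:
            x = self.transform(x)  # applied lazily, per sample
        return x, y

ds = IndexedDataset(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    np.array([0, 1]),
    transform=lambda x: x * 2,
)
x0, y0 = ds[0]  # transform runs only when the sample is fetched
```

Because the transform runs inside `__getitem__`, nothing is precomputed: each worker in a `DataLoader` would apply it independently as samples are requested.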
```python
from typing import Tuple, Union

import numpy as np


def train_test_split(
    dataset: Dataset,
    test_size: Union[int, float],
    random_state: Union[int, np.random.Generator, None] = None,
) -> Tuple[Subset, Subset]:
    if isinstance(test_size, int):
        size = test_size
    else:
        size = int(test_size * len(dataset))
    indices = np.arange(len(dataset))
    rng = np.random.default_rng(random_state)
    rng.shuffle(indices)
    test_indices = indices[:size]
    train_indices = indices[size:]
    return Subset(dataset, train_indices), Subset(dataset, test_indices)
```

Note that the slicing uses the computed `size` (an integer count) rather than `test_size` directly, so fractional test sizes work, and `random_state` is typed as `np.random.Generator` since `np.random.default_rng` does not accept a legacy `RandomState`.
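The shuffle-and-slice logic can also be sketched standalone (a hypothetical `split_indices` helper, shown here only to make the partitioning behaviour easy to verify):

```python
import numpy as np

# Hypothetical helper mirroring the index partitioning of train_test_split,
# operating on a dataset length instead of a Dataset object.
def split_indices(n, test_size, random_state=None):
    # An int test_size is an absolute count; a float is a fraction of n.
    size = test_size if isinstance(test_size, int) else int(test_size * n)
    indices = np.arange(n)
    rng = np.random.default_rng(random_state)
    rng.shuffle(indices)
    return indices[size:], indices[:size]  # (train, test)

train_idx, test_idx = split_indices(10, 0.3, random_state=0)
```

The two index arrays are disjoint and together cover every sample exactly once, which is the property the `Subset` pair relies on.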
@Xuzzo What do you think?
Looks good! I would also make it possible to pass torch tensors (and maybe tensors from other frameworks in the future) as X and y. I say this because I have had quite a few issues adjusting the tensor type to the loss function later on in the code, and I believe it is better if done directly by the user, as is normally done in PyTorch methods. Also, this way you can allocate tensors to other devices, e.g. GPUs, using the `tensor.to(device)` PyTorch syntax, and this would solve a lot of parallelisation problems.
The original purpose of `Dataset` was to hold the train/test split, plus maybe label names and ids for groups. Do we need to change that? IIRC we agreed in a daily not to make it into something different. If we need interaction with some library, we can write adapters. As to @Xuzzo's comment on types, I'm not sure I understand the issue. NDArrays have a dtype which can be used for conversion, as is probably done by `torch.from_numpy()`.
@schroedk is this similar to what you suggested?