Improve `Dataset` class
The current `Dataset` implementation is very limited and does not interoperate nicely with PyTorch, for example.
We should bring it closer to PyTorch's implementation, separate training and testing data more cleanly, and allow passing transforms that are applied to the input samples on the fly.
This is an example of one possible implementation, heavily inspired by the PyTorch implementation:
```python
from typing import Any, Callable, Iterable, Optional

from numpy.typing import NDArray


class Dataset:
    def __init__(
        self,
        X: NDArray,
        y: NDArray,
        feature_names: Optional[Iterable[str]] = None,
        target_names: Optional[Iterable[str]] = None,
        data_names: Optional[Iterable[str]] = None,
        description: Optional[str] = None,
        transform: Optional[Callable[[NDArray], Any]] = None,
    ):
        self.X = X
        self.y = y
        self.feature_names = list(feature_names or [])
        self.target_names = list(target_names or [])
        self.data_names = list(data_names or [])
        self.description = description or "No description"
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x, y = self.X[idx], self.y[idx]
        if self.transform:
            x = self.transform(x)
        return x, y

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]
```
```python
from typing import Sequence


class Subset(Dataset):
    def __init__(self, dataset: Dataset, indices: Sequence[int]):
        self.dataset = dataset
        self.indices = indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

    @property
    def feature_names(self):
        return self.dataset.feature_names

    @property
    def target_names(self):
        return self.dataset.target_names

    @property
    def description(self):
        return "Subset of: " + self.dataset.description
```
These classes support access by index, which is what's required to work with PyTorch's `DataLoader` class.
We could also implement a helper function that creates a train/test split as a pair of `Subset` objects:
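To make the protocol concrete, here is a minimal, hypothetical stand-in (not the full class above) showing that `__len__` plus integer `__getitem__`, with an optional per-sample transform, is all a `DataLoader`-style consumer needs:

```python
import numpy as np

# Condensed, hypothetical stand-in for the Dataset sketched above,
# kept just large enough to demonstrate the indexing protocol.
class IndexedDataset:
    def __init__(self, X, y, transform=None):
        self.X, self.y, self.transform = X, y, transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x, y = self.X[idx], self.y[idx]
        if self.transform:
            x = self.transform(x)  # applied lazily, per sample
        return x, y

ds = IndexedDataset(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    np.array([0, 1]),
    transform=lambda x: x * 2,
)
x0, y0 = ds[0]  # transform runs only when the sample is fetched
```

Because the transform runs inside `__getitem__`, nothing is precomputed: each worker in a `DataLoader` would apply it independently as samples are requested.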
```python
from typing import Tuple, Union

import numpy as np


def train_test_split(
    dataset: Dataset,
    test_size: Union[int, float],
    random_state: Union[int, np.random.Generator, None] = None,
) -> Tuple[Subset, Subset]:
    if isinstance(test_size, int):
        size = test_size
    else:
        size = int(test_size * len(dataset))
    indices = np.arange(len(dataset))
    rng = np.random.default_rng(random_state)
    rng.shuffle(indices)
    test_indices = indices[:size]
    train_indices = indices[size:]
    return Subset(dataset, train_indices), Subset(dataset, test_indices)
```

Note that the slicing uses the computed `size` (an integer count) rather than `test_size` directly, so fractional test sizes work, and `random_state` is typed as `np.random.Generator` since `np.random.default_rng` does not accept a legacy `RandomState`.
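The shuffle-and-slice logic can also be sketched standalone (a hypothetical `split_indices` helper, shown here only to make the partitioning behaviour easy to verify):

```python
import numpy as np

# Hypothetical helper mirroring the index partitioning of train_test_split,
# operating on a dataset length instead of a Dataset object.
def split_indices(n, test_size, random_state=None):
    # An int test_size is an absolute count; a float is a fraction of n.
    size = test_size if isinstance(test_size, int) else int(test_size * n)
    indices = np.arange(n)
    rng = np.random.default_rng(random_state)
    rng.shuffle(indices)
    return indices[size:], indices[:size]  # (train, test)

train_idx, test_idx = split_indices(10, 0.3, random_state=0)
```

The two index arrays are disjoint and together cover every sample exactly once, which is the property the `Subset` pair relies on.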
@Xuzzo What do you think?
Looks good! I would also make it possible to pass torch tensors (and maybe tensors from other frameworks in the future) as X and y. I say this because I have had quite a few issues adjusting the tensor type to the loss function later on in the code, and I believe it is better if done directly by the user, as is normally done in PyTorch methods. Also, this way you can allocate tensors to other devices, e.g. GPUs, using the `tensor.to(device)` PyTorch syntax, and this would solve a lot of parallelisation problems.
The original purpose of `Dataset` was to hold the train/test split, plus maybe label names and ids for groups. Do we need to change that? IIRC we agreed in a daily not to make it into something different. If we need interaction with some library, we can write adapters. As to @Xuzzo's comment on types, I'm not sure I understand the issue. NDArrays have a dtype which can be used for conversion, as is probably done by `torch.from_numpy()`.
@schroedk is this similar to what you suggested?