feature request: create dataloader for all experiences combined into a single stream without task boundaries
In the task-agnostic setting, we should just get a stream of (x, y) pairs. These may be generated by extracting N examples from a sequence of empirical distributions (experiences), in a piecewise-stationary fashion, but this should be hidden from the user. I wrote some code to do this (see below), but I feel this should be a first-class citizen.
```python
from torch.utils.data import ConcatDataset, DataLoader, Subset


def make_avalanche_dataloaders(dataset, ntrain_per_dist, ntest_per_batch, batch_size):
    '''Make pytorch dataloaders from an avalanche benchmark.

    ntrain_per_dist: number of training examples from each distribution (experience).
    batch_size: how many training examples per batch.
    ntest_per_batch: how many test examples per training batch.
    '''
    train_stream = dataset.train_stream
    test_stream = dataset.test_stream
    nexperiences = len(train_stream)  # num. distinct distributions
    nbatches_per_dist = ntrain_per_dist // batch_size
    ntest_per_dist = ntest_per_batch * nbatches_per_dist
    train_ndx, test_ndx = range(ntrain_per_dist), range(ntest_per_dist)
    train_sets = []
    test_sets = []
    for exp in range(nexperiences):
        # Take the first N examples from each experience's dataset.
        train_sets.append(Subset(train_stream[exp].dataset, train_ndx))
        test_sets.append(Subset(test_stream[exp].dataset, test_ndx))
    train_set = ConcatDataset(train_sets)
    test_set = ConcatDataset(test_sets)
    # shuffle=False preserves the piecewise-stationary ordering of the experiences.
    train_dataloader = DataLoader(train_set, batch_size=batch_size, shuffle=False)
    test_dataloader = DataLoader(test_set, batch_size=ntest_per_batch, shuffle=False)
    return train_dataloader, test_dataloader
```
You are not the first to ask for it, so I would like to explain why we don't do it.
In Avalanche, streams are sequences of experiences. The content of the experience depends on the problem type, but usually you get a batch of data (AvalancheDataset) and some metadata (task labels and things you need for evaluation). Notice that there is not a one-to-one correspondence between empirical distributions and experiences.
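To make the distinction concrete, here is a simplified, hypothetical sketch of what an experience carries; the real Avalanche classes have a much richer API, and the `Experience` class and field names below are illustrative only:

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for an Avalanche experience:
# a batch of data plus metadata used by training and evaluation.
@dataclass
class Experience:
    dataset: list            # batch of (x, y) samples
    task_label: int          # metadata: which task generated this data
    current_experience: int  # position in the stream

# One empirical distribution may span several experiences, so there is
# no one-to-one mapping between distributions and experiences.
stream = [
    Experience(dataset=[(0.1, 0), (0.2, 0)], task_label=0, current_experience=0),
    Experience(dataset=[(0.3, 0)], task_label=0, current_experience=1),  # same distribution
    Experience(dataset=[(0.9, 1)], task_label=1, current_experience=2),  # new distribution
]
```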
Boundary-free streams in Avalanche are streams of small experiences. Basically, each minibatch is a separate experience. Some advantages over having raw samples are:
- training methods can use the experience metadata
- having an AvalancheDataset instead of raw samples means that training methods can change augmentations more easily
- evaluation methods access the experience's metadata to find which distribution/task generated the current experience
An additional advantage is that many methods that work in the boundary-free setting can be used in the boundary-aware and batch (large batches) settings without any modifications.
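This portability is easy to see in a sketch. The `Exp` class and `train_on_stream` loop below are hypothetical stand-ins, not Avalanche API; the point is that a method written against experiences only ever sees one experience at a time, so the stream's granularity is transparent to it:

```python
def train_on_stream(stream, step):
    # A method written against experiences sees one experience at a time,
    # so it runs unchanged whether the stream holds many tiny experiences
    # (boundary-free setting) or a few large ones (batch setting).
    for experience in stream:
        step(experience.dataset)


# Hypothetical experience: just a container for a batch of data.
class Exp:
    def __init__(self, dataset):
        self.dataset = dataset


samples = list(range(6))
boundary_free = [Exp([s]) for s in samples]          # one minibatch per experience
batch_setting = [Exp(samples[:3]), Exp(samples[3:])]  # two large experiences

seen_bf, seen_batch = [], []
train_on_stream(boundary_free, seen_bf.extend)
train_on_stream(batch_setting, seen_batch.extend)
```

Both runs visit the same samples in the same order; only the experience boundaries differ.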
I think we really need to improve the documentation because the difference between experience and task/distribution shift is confusing to many people.
Another thing we could do is provide a mechanism to extract the stream of samples given the stream of experiences. I guess this would be helpful for people who are using only the Avalanche benchmarks and don't need its training/evaluation functionality.
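Such a mechanism could be as simple as a flattening generator. A minimal sketch, assuming only that each experience exposes a `dataset` iterable of (x, y) pairs, as Avalanche experiences do (the `FakeExperience` class is a stand-in for demonstration):

```python
def samples_from_experiences(stream):
    """Flatten a stream of experiences into one stream of (x, y) samples.

    Experience boundaries are hidden from the consumer, giving the
    task-agnostic view requested above.
    """
    for experience in stream:
        yield from experience.dataset


# Tiny stand-in for an Avalanche experience, for illustration only.
class FakeExperience:
    def __init__(self, dataset):
        self.dataset = dataset


stream = [FakeExperience([("x0", 0), ("x1", 0)]), FakeExperience([("x2", 1)])]
flat = list(samples_from_experiences(stream))  # boundaries are gone
```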
Can you point me to any examples of your existing functionality for boundary-free streams? (We are hoping to submit a paper to COLLAS (https://lifelong-ml.cc/Conferences/2023/call) on March 6th, so would like to try this ASAP. Thx.) More generally, we are using JAX, so just want to use avalanche for the datasets.
See @HamedHemati comment in #1306