
feature request: create dataloader for all experiences combined into a single stream without task boundaries

Open murphyk opened this issue 3 years ago • 3 comments

In the task-agnostic setting, we should just get a stream of (x, y) pairs. These may be generated by extracting N examples from a sequence of empirical distributions (experiences), in a piecewise-stationary fashion, but this should be hidden from the user. I wrote some code to do this (see below) but I feel this should be a first-class citizen.

import torch
from torch.utils.data import DataLoader

def make_avalanche_dataloaders(dataset, ntrain_per_dist, ntest_per_batch, batch_size):
    '''Make pytorch dataloaders from an avalanche benchmark.

    ntrain_per_dist: number of training examples from each distribution (experience).
    batch_size: how many training examples per batch.
    ntest_per_batch: how many test examples per training batch.
    '''
    train_stream = dataset.train_stream
    test_stream = dataset.test_stream
    nexperiences = len(train_stream)  # num. distinct distributions
    nbatches_per_dist = ntrain_per_dist // batch_size
    ntest_per_dist = ntest_per_batch * nbatches_per_dist
    train_ndx, test_ndx = range(ntrain_per_dist), range(ntest_per_dist)

    train_sets = []
    test_sets = []
    for exp in range(nexperiences):
        ds = train_stream[exp].dataset
        train_sets.append(torch.utils.data.Subset(ds, train_ndx))

        ds = test_stream[exp].dataset
        test_sets.append(torch.utils.data.Subset(ds, test_ndx))

    train_set = torch.utils.data.ConcatDataset(train_sets)
    test_set = torch.utils.data.ConcatDataset(test_sets)

    # shuffle=False preserves the piecewise-stationary ordering across experiences.
    train_dataloader = DataLoader(train_set, batch_size=batch_size, shuffle=False)
    test_dataloader = DataLoader(test_set, batch_size=ntest_per_batch, shuffle=False)
    return train_dataloader, test_dataloader

murphyk avatar Feb 26 '23 18:02 murphyk

You are not the first to ask for it, so I would like to explain why we don't do it.

In Avalanche, streams are sequences of experiences. The content of an experience depends on the problem type, but usually you get a batch of data (AvalancheDataset) and some metadata (task labels and things you need for evaluation). Notice that there is no one-to-one correspondence between empirical distributions and experiences.

Boundary-free streams in Avalanche are streams of small experiences. Basically, each minibatch is a separate experience. Some advantages over having raw samples are:

  • training methods can use the experience metadata
  • having an AvalancheDataset instead of raw samples means that training methods can change augmentations more easily
  • evaluation methods access the experience's metadata to find which distribution/task generated the current experience.

An additional advantage is that many methods that work in the boundary-free setting can be used in the boundary-aware and batch (large batches) settings without any modifications.
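To make the "each minibatch is a separate experience" idea concrete, here is a minimal plain-PyTorch sketch (not the actual Avalanche API; `MiniExperience`, `split_into_mini_experiences`, and `origin_id` are hypothetical names for illustration) of how a dataset from one distribution can be chopped into tiny experiences that still carry metadata about their origin:

```python
import torch
from torch.utils.data import TensorDataset, Subset

class MiniExperience:
    """Hypothetical stand-in for an Avalanche experience: a small batch of
    data plus metadata identifying which distribution generated it."""
    def __init__(self, dataset, origin_id):
        self.dataset = dataset      # an indexable dataset (here a Subset)
        self.origin_id = origin_id  # id of the generating distribution/task

def split_into_mini_experiences(dataset, origin_id, exp_size):
    # Each consecutive chunk of `exp_size` samples becomes its own experience.
    return [
        MiniExperience(
            Subset(dataset, range(start, min(start + exp_size, len(dataset)))),
            origin_id,
        )
        for start in range(0, len(dataset), exp_size)
    ]

# Example: 100 samples from one distribution split into 10-sample experiences.
data = TensorDataset(torch.randn(100, 3), torch.randint(0, 10, (100,)))
stream = split_into_mini_experiences(data, origin_id=0, exp_size=10)
print(len(stream))             # 10 experiences
print(len(stream[0].dataset))  # 10 samples each
print(stream[0].origin_id)     # 0
```

A training loop can iterate over such a stream exactly as it would over a boundary-aware one, which is what makes boundary-free methods reusable in the other settings.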

I think we really need to improve the documentation because the difference between experience and task/distribution shift is confusing to many people.

Another thing that we could do is provide a mechanism to extract the stream of samples given the stream of experiences. I guess this would be helpful for people that are using only Avalanche benchmarks and don't need training/evaluation.
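Such a mechanism could be as simple as a generator that chains the experiences' datasets together, hiding the boundaries. A rough sketch (assuming each experience exposes a `.dataset` attribute that iterates over (x, y) pairs; `samples_from_stream` and `FakeExperience` are hypothetical names, not part of Avalanche):

```python
import itertools
import torch
from torch.utils.data import TensorDataset

def samples_from_stream(stream):
    """Flatten a stream of experiences into a single iterator of (x, y)
    pairs, hiding experience boundaries from the consumer."""
    return itertools.chain.from_iterable(exp.dataset for exp in stream)

# Toy stand-in stream: two "experiences", each wrapping a small dataset.
class FakeExperience:
    def __init__(self, dataset):
        self.dataset = dataset

stream = [
    FakeExperience(TensorDataset(torch.zeros(3, 2), torch.zeros(3))),
    FakeExperience(TensorDataset(torch.ones(2, 2), torch.ones(2))),
]
pairs = list(samples_from_stream(stream))
print(len(pairs))  # 5 (x, y) pairs; the 3/2 split is invisible to the consumer
```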

AntonioCarta avatar Feb 27 '23 11:02 AntonioCarta

Can you point me to any examples of your existing functionality for boundary-free streams? (We are hoping to submit a paper to COLLAS (https://lifelong-ml.cc/Conferences/2023/call) on March 6th, so would like to try this ASAP. Thx.) More generally, we are using JAX, so just want to use avalanche for the datasets.

murphyk avatar Feb 27 '23 17:02 murphyk

See @HamedHemati's comment in #1306

AntonioCarta avatar Feb 28 '23 09:02 AntonioCarta