
Adding capability of taking auxiliary data

Open yinweisu opened this issue 3 years ago • 7 comments

The idea of this PR is to enable the benchmark YAML to take in auxiliary data. Auxiliary data can be useful when a task requires more than the train and test datasets; for example, multimodal tasks need images. There are more usage possibilities, so we think it's worth having auxiliary data as a general feature.

Example yaml file:

- name: benchmark
  folds: 1
  dataset:
    target: label
    train: train.csv
    test: test.csv
  auxiliary_data:
    train: train_aux.zip
    test: test_aux.zip  # test aux not required
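To make the optionality of the test aux entry concrete, here is a minimal sketch of how a task definition shaped like the YAML above could be validated. The field names mirror the example, but the `validate_task` function itself is hypothetical and not part of AMLB or this PR.

```python
# Hypothetical validator for a task definition with optional auxiliary data.
# Field names mirror the YAML example above; this is a sketch, not AMLB code.

def validate_task(task: dict) -> dict:
    """Check required dataset fields; auxiliary_data and its test entry are optional."""
    dataset = task.get("dataset", {})
    for required in ("target", "train", "test"):
        if required not in dataset:
            raise ValueError(f"dataset is missing required field: {required}")

    aux = task.get("auxiliary_data")
    if aux is not None and "train" not in aux:
        raise ValueError("auxiliary_data must define at least a train entry")

    # Normalize: test aux is not required, so default it to None.
    normalized = dict(task)
    if aux is not None:
        normalized["auxiliary_data"] = {"train": aux["train"], "test": aux.get("test")}
    return normalized


task = {
    "name": "benchmark",
    "folds": 1,
    "dataset": {"target": "label", "train": "train.csv", "test": "test.csv"},
    "auxiliary_data": {"train": "train_aux.zip"},  # no test aux: allowed
}
print(validate_task(task)["auxiliary_data"])  # {'train': 'train_aux.zip', 'test': None}
```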

The main differences between auxiliary data and the regular dataset are:

  1. We don't require a one-to-one correspondence between train and test aux data. Imagine a use case where some pseudo-label data is used during training but is not needed during testing.
  2. AMLB should let users handle the aux data themselves, because use cases for aux data can vary a lot. AMLB only prepares the data and hands users the path.
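Point 2 can be sketched as follows: the benchmark only extracts the archive and hands the framework a directory path, and everything past that point is the framework's responsibility. The `prepare_aux_data` helper below is hypothetical, written only to illustrate that division of labor.

```python
import tempfile
import zipfile
from pathlib import Path

def prepare_aux_data(archive: Path, dest: Path) -> Path:
    """Extract an auxiliary archive and return the extraction directory.
    The benchmark stops here; interpreting the contents is the framework's job."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest

# Demo: build a tiny aux archive, then "prepare" it as the benchmark would.
work = Path(tempfile.mkdtemp())
archive = work / "train_aux.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("images/img0.png", b"fake-bytes")

aux_dir = prepare_aux_data(archive, work / "aux")
print(sorted(p.relative_to(aux_dir).as_posix()
             for p in aux_dir.rglob("*") if p.is_file()))
```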

yinweisu avatar Dec 07 '21 21:12 yinweisu

There is some duplicate code for extracting auxiliary paths, because I found it confusing to put the logic for extracting auxiliary and regular train/test paths in a single function. I'm sure there are better designs, so feel free to propose them :).
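One possible way to factor out the duplication would be a single helper parameterized by the config section, with the test-entry requirement as a flag. This is purely a design sketch with hypothetical names, not the PR's actual code.

```python
# Design sketch: one path-extraction helper for both config sections.
# 'dataset' requires a test entry; 'auxiliary_data' does not.

def extract_paths(task: dict, section: str, test_required: bool) -> tuple:
    """Return (train, test) file paths from 'dataset' or 'auxiliary_data'."""
    entry = task.get(section)
    if entry is None:
        if section == "dataset":
            raise ValueError("task must define a dataset section")
        return None, None  # auxiliary data is entirely optional
    train = entry["train"]
    # A missing test path raises for the dataset section, defaults to None for aux.
    test = entry["test"] if test_required else entry.get("test")
    return train, test

task = {
    "dataset": {"target": "label", "train": "train.csv", "test": "test.csv"},
    "auxiliary_data": {"train": "train_aux.zip"},
}
print(extract_paths(task, "dataset", test_required=True))
print(extract_paths(task, "auxiliary_data", test_required=False))
```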

yinweisu avatar Dec 07 '21 21:12 yinweisu

@sebhrusen @PGijsbers

yinweisu avatar Dec 07 '21 21:12 yinweisu

Thank you for the contribution. I just wanted to let you know that it might be a little while before I can look at this myself. Though if Seb finds the time, I'm okay with whatever he says :)

In general I think allowing auxiliary data to be present in a task would be a useful addition. These types of tasks have already been the focus of research and AutoML competitions. Given the free-form nature of auxiliary data, having the benchmark framework not process it in any meaningful way is the only approach I see working.

PGijsbers avatar Dec 08 '21 10:12 PGijsbers

@sebhrusen Hey Seb, happy holidays! Can you review the code when you have time?

yinweisu avatar Dec 27 '21 19:12 yinweisu

@sebhrusen Hi, can you review this PR when you have time? Thanks!

yinweisu avatar Feb 14 '22 18:02 yinweisu

Hey @yinweisu, sorry for making you wait on this: I don't have much time for amlb right now, but I'll try to look at your PR during the week. I'll also try to work with @PGijsbers soon on restructuring some parts of the code and setting up a workflow for contributors, to make it easier for you to contribute and for us to review without being afraid of breaking existing logic. Thanks for your understanding.

sebhrusen avatar Feb 14 '22 18:02 sebhrusen

We have an intern joining AutoGluon whom we want to work on multi-modal optimization (i.e. datasets with an image feature). Ideally we would like him to be able to use AutoMLBenchmark as the benchmarking tool, but that is tricky without this functionality being merged. We can make do by hacking it into a forked repo, but I wanted to mention it.

Innixma avatar May 10 '22 00:05 Innixma