automlbenchmark
Adding the capability to take auxiliary data
The idea of this PR is to enable the benchmark yaml to take in auxiliary data. Auxiliary data is useful when a task requires more than train and test datasets; for example, multimodality tasks need images. There are more possible uses, so we think it's worth having auxiliary data as a general feature.
Example yaml file:
- name: benchmark
  folds: 1
  dataset:
    target: label
    train: train.csv
    test: test.csv
    auxiliary_data:
      train: train_aux.zip
      test: test_aux.zip  # test aux not required
The main differences between auxiliary data and the regular dataset are:
- We don't require a 1-to-1 correspondence between train and test aux data. Imagine a use case where some pseudo-label data is used during training but is not needed during testing.
- AMLB should let users handle the aux data themselves, because use cases for aux data can vary a lot. AMLB only prepares the data and hands users the paths (a sketch of this contract follows the list).
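A minimal sketch of what this contract could look like from a framework integration's point of view. The attribute names here (`dataset.auxiliary_data`, `.train.path`, `.test.path`) are assumptions for illustration, not necessarily the names used in this PR:

def run(dataset, config):
    # Regular splits, resolved to local paths by AMLB as today.
    train_path = dataset.train.path
    test_path = dataset.test.path

    # Auxiliary data: AMLB only prepares the files and exposes the paths;
    # interpreting the contents is entirely up to the framework integration.
    aux = getattr(dataset, "auxiliary_data", None)
    if aux is not None:
        aux_train_path = aux.train.path  # e.g. the extracted train_aux.zip
        # test aux is optional, so guard against it being absent.
        aux_test = getattr(aux, "test", None)
        aux_test_path = aux_test.path if aux_test is not None else None
    # ... framework-specific training/prediction would go here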
There is some duplicated code for extracting auxiliary paths, because I found it confusing to put the logic for extracting auxiliary and regular train/test paths in a single function. I'm sure there are better designs, so feel free to propose them :). One possible shape for a shared helper is sketched below.
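A hypothetical shared helper (the function name and signature are invented for illustration, not code from this PR) that could serve both the regular and auxiliary splits: unpack archives, pass plain files through, and return local paths:

import os
import zipfile

def extract_data_paths(entries, dest_dir):
    # entries maps split name to source file, e.g. {"train": "train_aux.zip"}.
    paths = {}
    for split, source in entries.items():
        if source.endswith(".zip"):
            # Archives are extracted; the extracted directory is the path
            # handed to the user. extractall creates the directory if needed.
            out_dir = os.path.join(dest_dir, split)
            with zipfile.ZipFile(source) as zf:
                zf.extractall(out_dir)
            paths[split] = out_dir
        else:
            # Plain files (e.g. train.csv) are passed through unchanged.
            paths[split] = source
    return paths

For example, extract_data_paths({"train": "train_aux.zip"}, "/tmp/aux") would return {"train": "/tmp/aux/train"}; the same helper could resolve regular csv splits without special-casing.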
@sebhrusen @PGijsbers
Thank you for the contribution. I just wanted to let you know that it might be a little while before I myself can look at this. Though if Seb finds the time then I'm okay with whatever he says :)
In general I think it would be a useful addition to allow auxiliary data to be present in a task. These types of tasks have already been the focus of research and AutoML competitions. Given its free form, the only way I see this working is for the benchmark framework not to process the auxiliary data in any meaningful way.
@sebhrusen Hey Seb, happy holidays! Can you review the code when you have time?
@sebhrusen Hi, can you review this PR when you have time? Thanks!
Hey @yinweisu sorry for making you wait on this: don't have much time for amlb right now, but I'll try to look at your PR during the week.
Also, I'll try to work soon with @PGijsbers on restructuring some parts of the code and setting up a contributor workflow, to make it easier for you to contribute and for us to review without being afraid of breaking existing logic.
Thanks for your understanding.
We have an intern joining AutoGluon whom we want to work on multi-modal optimization (i.e. datasets with an image feature). Ideally we would like him to be able to use AutoMLBenchmark as the benchmarking tool, but this is tricky without this functionality being merged in. We can make do by hacking it into a forked repo, but just mentioning this.