
Cache Framework Dependencies in Github Workflows

Open · PGijsbers opened this issue on Jun 11 '21 · 5 comments

The GitHub workflow that runs the validation tests on the frameworks takes a long time, in part because of installation time (in particular for R packages). I think that with careful caching we can skip the installation step and mount a cache instead whenever the framework installation has not changed. Let me know if this is a bad idea or if I missed something.

I think we need to cache the following folders:

  • FRAMEWORK/venv (Python)
  • FRAMEWORK/packages (R)

The FRAMEWORK/.installed file probably has to be generated.

We can load the installation from cache only if none of the following files have changed (all paths relative to automlbenchmark/frameworks; a sketch of such a cache step follows this comment):

  • shared/setup.sh
  • shared/requirements.txt
  • FRAMEWORK/setup.sh
  • FRAMEWORK/requirements.txt

PGijsbers · Jun 11 '21 13:06
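As a rough illustration of the kind of step this would need, here is a minimal sketch using actions/cache. The placeholder FRAMEWORK, the action version, the paths relative to the repository root, and the setup invocation are all illustrative assumptions, not taken from the actual workflow:

```yaml
# Minimal sketch, not the real workflow: restore FRAMEWORK's venv / R packages
# from the Actions cache, keyed on the files that drive the installation.
- name: Cache FRAMEWORK installation
  id: cache-framework
  uses: actions/cache@v3
  with:
    path: |
      frameworks/FRAMEWORK/venv
      frameworks/FRAMEWORK/packages
    key: FRAMEWORK-${{ runner.os }}-${{ hashFiles('frameworks/shared/setup.sh', 'frameworks/shared/requirements.txt', 'frameworks/FRAMEWORK/setup.sh', 'frameworks/FRAMEWORK/requirements.txt') }}

- name: Install FRAMEWORK
  if: steps.cache-framework.outputs.cache-hit != 'true'
  run: python runbenchmark.py FRAMEWORK --setup=only   # assumed invocation; the real workflow may set up differently
```

On a cache hit the setup step is skipped entirely; on a miss the installation runs as today and the cache action saves the listed folders at the end of the job.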

I'm not familiar with the caching possibilities in GitHub workflows, but this looks like a useful improvement.

The FRAMEWORK/.installed file probably has to be generated.

Not sure; I would cache it with the rest, as it mainly records the version of the library (which is available in the venv/packages anyway). For example, it could be copied inside the venv when creating the cache, and copied back to the framework folder when mounting this cache (sketched below).

Or, much simpler and cleaner, we could write all those generated setup files (there are a few of them: .setup_env, .installed, Dockerfile, Singularityfile) into {framework}/.setup; this way you can easily include that folder in the cache. It would also remove this "pollution" from the framework folder.

sebhrusen · Jun 11 '21 15:06
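A minimal sketch of that copy-in/copy-out idea, assuming the cache step above was given id: cache-framework and that the marker file stays at its current location (both assumptions):

```yaml
# Illustrative only: keep .installed inside the cached venv so it travels with the cache.
- name: Stash .installed inside the venv (runs after a fresh installation)
  if: steps.cache-framework.outputs.cache-hit != 'true'
  run: cp frameworks/FRAMEWORK/.installed frameworks/FRAMEWORK/venv/.installed

- name: Copy .installed back next to the framework (runs after a cache hit)
  if: steps.cache-framework.outputs.cache-hit == 'true'
  run: cp frameworks/FRAMEWORK/venv/.installed frameworks/FRAMEWORK/.installed
```

With the {framework}/.setup folder suggested above, these copy steps would become unnecessary: the whole .setup folder could simply be added to the cache path list.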

Excellent suggestion, created a separate issue for that.

I'm not familiar with the caching possibilities in GitHub workflows, but this looks like a useful improvement.

Reading more suggests that only ~5 GB of cache is available, so it's probably not possible to cache the installations of all frameworks. Maybe there's a way to work around this limit (e.g. hosting the cache ourselves, or using different Actions), but it needs a closer look.

PGijsbers · Jun 14 '21 09:06

On a related note, what do you think of caching the input directory / OpenML cache to avoid any contact with the OpenML server? Since the actual communication with the server should be tested and maintained by openml-python, I don't see a problem with relying only on cached data for the benchmark CI. It seems all upside to me (marginally faster, less reliance on external factors); see the sketch below.

PGijsbers · Jun 14 '21 09:06
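A minimal sketch of what that could look like, assuming the OpenML data ends up under ~/.openml (the path, key, and action version are assumptions):

```yaml
# Illustrative sketch: reuse previously downloaded OpenML tasks and datasets so
# the CI jobs never need to contact the OpenML server.
- name: Cache OpenML data
  uses: actions/cache@v3
  with:
    path: ~/.openml
    key: openml-data-v1   # bump manually (or key on a file hash) to refresh the data
```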

what do you think of caching the input directory / OpenML cache to avoid any contact with the OpenML server

Good idea. We don't use many datasets for the workflows, so we might as well keep them on the filesystem. Ideally, we just need this cache to be cleaned up whenever we upgrade openml (or each time the requirements change, if that's easier).

sebhrusen · Jun 14 '21 14:06

The cache is currently already configured to be invalidated when requirements.txt changes, though making the dataset cache depend on the openml-python version shouldn't be too hard either (sketched below).

PGijsbers · Jun 14 '21 14:06
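One possible sketch of that version-dependent variant (the pip query and cache path are illustrative assumptions) derives the key from the installed openml version:

```yaml
# Illustrative: rebuild the dataset cache whenever openml-python is upgraded by
# embedding its installed version in the cache key.
- name: Look up installed openml version
  id: openml
  run: echo "version=$(pip show openml | awk '/^Version:/ {print $2}')" >> "$GITHUB_OUTPUT"

- name: Cache OpenML data
  uses: actions/cache@v3
  with:
    path: ~/.openml
    key: openml-data-${{ steps.openml.outputs.version }}
```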