conf.yaml options vs. run.sh options
In the WHAM ConvTasNet scripts, you can set some options in conf.yaml and some options in run.sh. The run.sh ones seem to take precedence over the conf.yaml ones.
To me this is confusing, since I do not see the reason for having two places to specify these things. In practice, I never use the run.sh ones since I want to keep multiple model configurations anyway, so I end up having multiple conf.yaml files.
My suggestion is to remove the options from run.sh and add a new flag to run.sh, say --conf, that takes a path to a conf.yaml file. This way it is obvious where the config comes from, and you can easily switch between multiple configs.
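For illustration, a minimal sketch of what that could look like, assuming train.py reads the whole configuration from the YAML file passed via --conf (the key names below are just examples):

```python
# Minimal sketch, assuming the recipe's train.py owns all configuration.
# The key layout ("training", "batch_size") is only an example.
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--conf", default="conf.yaml", help="Path to a YAML config file")
args = parser.parse_args()

with open(args.conf) as f:
    conf = yaml.safe_load(f)

print(conf["training"]["batch_size"])
# run.sh would then only forward the flag, e.g.:
#   python train.py --conf local/conf_batch16.yaml
```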
@mpariente and I have discussed this since the beginning of Asteroid.
Basically, right now it is super flexible, but in my opinion it is also confusing.
However, in order to run the full recipe you also have to pass the paths to where the .json data is saved, which the user can modify in run.sh (otherwise the user would have to modify paths in both the yaml and run.sh).
My suggestion is to keep the paths in run.sh and every hyperparameter in the yaml file. Paths are common to multiple runs, so one just specifies them in run.sh and then moves to the yaml to modify hyperparameters for successive runs.
Yes, I agree that this is still confusing and not optimal, and we have discussed it several times.
One of the most important features to have, IMO, is being able to change the hparams of an experiment from the command line, so that you can use a for loop to launch experiments with different hparams without needing to duplicate your conf.yml file.
I personally never duplicate conf files, but we can change the logic to make this simpler.
@jonashaag can you have a look at this PR (#104) and tell me what you think? I think this is close to what I'd like:
- There are no duplicates between bash and yaml
- Every yml field can be modified from the command line
- You could easily pass a config file to start from I guess
The only thing holding me back from doing that is that it's "complicated" and the code can be confusing.
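For reference, here is a generic sketch of the idea (not the code from #104): turn every field of the YAML dict, nested keys included, into a CLI flag so hparams can be overridden without copying conf.yml.

```python
# Generic sketch (not the actual code from #104): expose each YAML field,
# nested keys included, as a --section_key command-line flag.
import argparse
import yaml

def parser_from_dict(dic, parser=None, prefix=""):
    parser = parser or argparse.ArgumentParser()
    for key, value in dic.items():
        if isinstance(value, dict):
            parser_from_dict(value, parser, prefix=f"{prefix}{key}_")
        else:
            arg_type = type(value) if value is not None else str
            parser.add_argument(f"--{prefix}{key}", default=value, type=arg_type)
    return parser

with open("conf.yml") as f:
    conf = yaml.safe_load(f)
args = parser_from_dict(conf).parse_args()
# e.g.  for lr in 0.001 0.0005; do python train.py --optim_lr $lr; done
```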
Looks good to me, but I wonder why roll our own? We could use something like https://github.com/bw2/ConfigArgParse instead (never used it, but it looks good).
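If we went that route, usage could look roughly like this (untested here, based on ConfigArgParse's documented API; the flag names are just examples):

```python
# Rough sketch based on ConfigArgParse's documented API (untested here).
import configargparse

parser = configargparse.ArgParser(
    config_file_parser_class=configargparse.YAMLConfigFileParser,
)
parser.add("--conf", is_config_file=True, help="Path to a YAML config file")
parser.add("--lr", type=float, default=0.001)
parser.add("--batch_size", type=int, default=8)
args = parser.parse_args()
# Precedence (lowest to highest): defaults, values from --conf, explicit CLI flags.
```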
For my use cases, I think best case would be something along the lines of:
- Have a single "model command runner" Python file that knows how to preprocess, train, eval, infer for a certain model. So, more or less a mixture of `run.sh`, `train.py` and `eval.py`, with the addition of inference (e.g. directly from `.wav` files).
- The runner file reads a model YAML file (like `conf.yaml` but without the strictly "localhost" things like data paths). The values read from the YAML file can be changed using CLI args.
- The runner file also parses "localhost" things like data paths from CLI args. These do not make sense in the model YAML file.
- Get rid of `run.sh` entirely, since then everything is covered by the Python script, and this way we also do not have to deal with keeping arg parsing consistent between Bash and Python code.
- (Optional: move dataset preprocessing code to the dataset code itself, since it's duplicated for any model/dataset combination at the moment.)
I have also thought about removing the run.sh file completely and replacing it with an omnipotent Python file.
And yes, in this case, argument parsing gets much simpler.
It's also related to #183, because unit testing Python files would be easier. And we could refactor the evaluation to be part of the Lightning module.
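As a rough idea of that refactor (the class and method bodies below are made up for illustration, not Asteroid's actual System class):

```python
# Hedged sketch: evaluation folded into the LightningModule, so that
# trainer.test(...) replaces a separate eval.py. Names are illustrative.
import torch
import pytorch_lightning as pl

class SeparationSystem(pl.LightningModule):
    def __init__(self, model, loss_func):
        super().__init__()
        self.model = model
        self.loss_func = loss_func

    def training_step(self, batch, batch_idx):
        mix, sources = batch
        loss = self.loss_func(self.model(mix), sources)
        self.log("train_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        # What eval.py does today could live here instead.
        mix, sources = batch
        self.log("test_loss", self.loss_func(self.model(mix), sources))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```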
Would you be willing to put some time into this, @jonashaag? There are lots of things to improve in Asteroid's code, particularly recipe-related things IMO, and your feedback is much appreciated, thanks!
@popcornell What would you say about going full Python?
Were you thinking of something like this?
```bash
python run.py dataprep              # run dataprep, train and eval (default)
python run.py train --someargs      # run train and eval
python run.py eval --otherargs      # run eval only
python run.py predict --files_glob  # perform inference
```
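Something like this could back those commands, sketched here with argparse subparsers and placeholder stage functions (none of this is existing Asteroid code):

```python
# Sketch of a run.py dispatcher. The stage functions are placeholders.
import argparse

def dataprep():
    print("preparing data...")                    # placeholder

def train(conf_path):
    print(f"training with {conf_path}")           # placeholder

def evaluate(exp_dir):
    print(f"evaluating {exp_dir}")                # placeholder

def predict(files_glob):
    print(f"running inference on {files_glob}")   # placeholder

def main():
    parser = argparse.ArgumentParser(prog="run.py")
    sub = parser.add_subparsers(dest="stage", required=True)
    sub.add_parser("dataprep")
    train_p = sub.add_parser("train")
    train_p.add_argument("--conf", default="conf.yml")
    eval_p = sub.add_parser("eval")
    eval_p.add_argument("--exp_dir", required=True)
    pred_p = sub.add_parser("predict")
    pred_p.add_argument("--files_glob", required=True)

    args = parser.parse_args()
    if args.stage == "dataprep":
        dataprep()
    elif args.stage == "train":
        train(args.conf)
    elif args.stage == "eval":
        evaluate(args.exp_dir)
    elif args.stage == "predict":
        predict(args.files_glob)

if __name__ == "__main__":
    main()
```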
IDK, for simple things full Python is nice. But what about when we have to check some dependencies in one recipe and pip install them? An advantage of going full Python is debugging: it is easier, especially with an IDE.
Sure, there are workarounds, but they are not as immediate as they are in bash. Also, data download is not immediate: you always have to use subprocess.
And this comes from someone who is not a bash super fan. I prefer to avoid it when I can, but for some things it is the best tool.
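To make the download point concrete, this is the kind of thing a pure-Python recipe ends up doing (the URL handling and tools used are placeholders):

```python
# Placeholder sketch: what a bash wget/tar one-liner becomes in pure Python.
import subprocess

def download_and_extract(url, target_dir):
    subprocess.run(["wget", "-c", url, "-P", target_dir], check=True)
    archive = f"{target_dir}/{url.rsplit('/', 1)[-1]}"
    subprocess.run(["tar", "-xzf", archive, "-C", target_dir], check=True)
```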
@popcornell is right, parts of data preparation are a mess in Python. How about:
- Centralized bash script for each dataset, registered as a package script (accessible everywhere)
- Python wrapper to that script, with subprocess, inside `asteroid.data.dataset_name`
- Recipe in Python just calling `asteroid.data.librimix.dataprep` in the `dataprep` mode
If we want to rewrite everything in Python for the dataprep later, we also can.
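A possible shape of that wrapper, assuming a hypothetical packaged script named librimix_prepare_data.sh (the script name and its flags are made up, not a real interface):

```python
# Hypothetical sketch of asteroid.data.librimix.dataprep: a thin subprocess
# wrapper around a packaged bash preparation script. Script name and flags
# are assumptions, not the real interface.
import subprocess

def dataprep(storage_dir, n_src=2):
    subprocess.run(
        ["librimix_prepare_data.sh", "--storage_dir", storage_dir, "--n_src", str(n_src)],
        check=True,
    )
```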
I will try to spend some time understanding the similarities and differences between the models and datasets, and come up with a concept.