marian-dev
Allow to specify multiple validation sets
It would be useful to be able to collect statistics over multiple validation sets during training. I am not sure how to implement this on the command line, but in the yaml config one could just specify
valid-sets:
- [set1.src, set1.tgt]
- [set2.src, set2.tgt]
- ...
If valid-sets is just a flat list, keep the normal behavior; if it is a list of lists, assume multiple validation sets. On the command line this could perhaps be a YAML string:
--valid-sets "[[set1.src, set1.tgt], [set2.src, set2.tgt]]"
I guess --valid-script-path might then need to be paired with the sets, also as a YAML list.
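To make the "list vs. list of lists" rule concrete, here is a minimal Python sketch of how an already-parsed valid-sets value could be normalized. This is not Marian code: the function name normalize_valid_sets is hypothetical, and the sketch assumes the YAML has already been parsed into plain Python lists and strings.

```python
def normalize_valid_sets(value):
    """Return a list of validation sets, each a list of file paths.

    Hypothetical helper, not part of Marian. A flat list of paths is
    treated as a single set (the current behavior); a list of lists is
    treated as several sets.
    """
    if not isinstance(value, list) or not value:
        raise ValueError("valid-sets must be a non-empty list")

    # Flat list of strings: one validation set.
    if all(isinstance(item, str) for item in value):
        return [list(value)]

    # List of non-empty lists of strings: multiple validation sets.
    if all(
        isinstance(item, list)
        and item
        and all(isinstance(path, str) for path in item)
        for item in value
    ):
        return [list(item) for item in value]

    raise ValueError("valid-sets must be a list of paths or a list of such lists")


single = normalize_valid_sets(["set1.src", "set1.tgt"])
multiple = normalize_valid_sets(
    [["set1.src", "set1.tgt"], ["set2.src", "set2.tgt"]]
)
print(single)
print(multiple)
```

Mixed input (some entries strings, some lists) is rejected here rather than guessed at; that is a design choice of the sketch, not something the proposal specifies.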
For the command-line option, we could use a custom format like we do with --devices and, for instance, use a semicolon to separate test sets.
I'm waiting for Tom Hoar to complain that he has = or ; in the file name.
For the command-line option, we could use a custom format like we do with --devices and, for instance, use a semicolon to separate test sets.
@snukky, that custom format for --devices will be gone very soon. The changes I am making for distributed training with NCCL also change the format of --devices to something simpler. (Basically, there is now a second option, --num-devices. E.g. --num-devices 2 --devices 0 1 4 6 means that each MPI process uses 2 GPUs, where the first one uses GPUs 0 and 1, and the second one uses 4 and 6. If you only provide num-devices entries, such as --num-devices 2 --devices 0 1, then this "broadcasts", such that both MPI processes use their respective GPUs 0 and 1. You can even just say --num-devices 2, which will default to the same. This is normally what one wants. I made that change because once you run on 64 GPUs, that custom --devices string really becomes unmanageable.)
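The --num-devices / --devices rule described above can be sketched in a few lines of Python. This only illustrates the semantics as stated in this comment and is not the actual Marian implementation; the function name devices_for_process and its parameters (rank, num_processes) are hypothetical.

```python
def devices_for_process(rank, num_processes, num_devices, devices=None):
    """Return the GPU ids used by the MPI process with the given rank.

    Hypothetical helper illustrating the rule from the comment above:
    - no device list: default to GPUs 0 .. num_devices - 1 on every process
    - exactly num_devices entries: every process uses the same ids ("broadcast")
    - num_devices * num_processes entries: each process takes its own slice
    """
    if not 0 <= rank < num_processes:
        raise ValueError("rank must be in [0, num_processes)")

    if not devices:
        return list(range(num_devices))

    if len(devices) == num_devices:
        return list(devices)

    if len(devices) != num_devices * num_processes:
        raise ValueError(
            "expected num_devices or num_devices * num_processes device ids"
        )

    start = rank * num_devices
    return list(devices[start:start + num_devices])


# --num-devices 2 --devices 0 1 4 6, two MPI processes
print(devices_for_process(0, 2, 2, [0, 1, 4, 6]))
print(devices_for_process(1, 2, 2, [0, 1, 4, 6]))
# --num-devices 2 --devices 0 1 (broadcast) and --num-devices 2 (default)
print(devices_for_process(1, 2, 2, [0, 1]))
print(devices_for_process(1, 2, 2))
```

Rejecting device lists of any other length is an assumption of this sketch; the comment does not say how such input is handled.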
@kpu, all valid Linux pathnames should be allowed without Marian-custom escaping hoops. (BTW, this also includes regular files named "stdin" and "stdout"...)
@emjotde, if you want to pass a YAML string, let's please make sure we accept strings that are quoted. E.g.
--valid-sets '[[set1.src, set1.tgt], ["this=great.src", "this=great.tgt"]]'
@frankseide OK, no problem. I just don't think something like --valid-sets "[[set1.src, set1.tgt], [set2.src, set2.tgt]]" is convenient, but unfortunately I don't have a better idea.
It's maybe not pretty, but at least consistent. Introducing a third custom syntax seems to be a bad idea.
Hi there, was this ever taken up? I would also be interested in this enhancement!