marian-dev

Allow to specify multiple validation sets

Open emjotde opened this issue 6 years ago • 8 comments

It would be useful to be able to collect statistics over multiple validation sets during training. I am not sure how to implement this on the command line, but in the yaml config one could just specify

valid-sets: 
  - [set1.src, set1.tgt]
  - [set2.src, set2.tgt]
  - ...

If valid-sets is just a flat list, keep the normal behavior; if it is a list of lists, assume multiple test sets. On the command line this could maybe be a YAML string?

--valid-sets "[[set1.src, set1.tgt], [set2.src, set2.tgt]]"

emjotde, Oct 10 '18 15:10
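The flat-list versus list-of-lists rule proposed above could be sketched roughly as follows. This is only an illustration in Python, not Marian's implementation: `normalize_valid_sets` is a hypothetical helper, and it assumes the valid-sets value has already been parsed from YAML into Python lists.

```python
from typing import List, Sequence, Union

ValidSets = Union[Sequence[str], Sequence[Sequence[str]]]


def normalize_valid_sets(value: ValidSets) -> List[List[str]]:
    """Return a list of validation sets, each a list of file paths.

    Hypothetical helper: a flat list of paths is treated as a single
    validation set (the current behavior); a list of lists is treated
    as several validation sets.
    """
    items = list(value)
    if not items:
        return []
    if all(isinstance(item, str) for item in items):
        # Flat list: one validation set, e.g. [set1.src, set1.tgt].
        return [items]
    if all(isinstance(item, (list, tuple)) for item in items):
        # List of lists: one entry per validation set.
        return [list(item) for item in items]
    # Mixing bare paths and lists is ambiguous, so reject it (an assumption).
    raise ValueError("valid-sets must be a list of paths or a list of such lists")


single = normalize_valid_sets(["set1.src", "set1.tgt"])
multiple = normalize_valid_sets([["set1.src", "set1.tgt"], ["set2.src", "set2.tgt"]])
print(single)
print(multiple)
```

Normalizing both shapes to a list of lists would let the rest of the validation code loop over sets without caring which form the user wrote.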

I guess --valid-script-path might then need to be paired with the sets, also as a YAML list.

emjotde, Oct 10 '18 16:10

For the command-line option, we could use a custom format like we do with --devices and, for instance, use a semicolon to separate test sets.

snukky, Oct 10 '18 16:10

I'm waiting for Tom Hoar to complain that he has = or ; in the file name.

kpu, Oct 10 '18 16:10

> For the command-line option, we could use a custom format like we do with --devices and, for instance, use a semicolon to separate test sets.

@snukky, that custom format for --devices will be gone very soon. The changes I am making for distributed training with NCCL also change the format of --devices to something simpler. (Basically, there is now a second option, --num-devices. E.g. --num-devices 2 --devices 0 1 4 6 means that each MPI process uses 2 GPUs, where the first one uses GPUs 0 and 1, and the second one uses 4 and 6. If you only provide num-devices entries, such as --num-devices 2 --devices 0 1, then this "broadcasts" so that both MPI processes use their respective GPUs 0 and 1. You can even just say --num-devices 2, which will default to the same. This is normally what one wants. I made that change because once you run on 64 GPUs, that custom --devices string really becomes unmanageable.)

frankseide, Oct 10 '18 16:10
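The --num-devices / --devices semantics described above can be illustrated with a small Python sketch. `devices_for_process` is a hypothetical helper written only from the description in this comment; it is not Marian's code. The check that --devices holds a multiple of --num-devices entries is an added assumption.

```python
from typing import List, Optional


def devices_for_process(rank: int, num_devices: int,
                        devices: Optional[List[int]] = None) -> List[int]:
    """GPU ids used by MPI process `rank` (hypothetical helper).

    Mirrors the description above:
    - no --devices given: default to 0 .. num_devices-1 for every process;
    - exactly num_devices entries: every process uses the same ids;
    - otherwise: the list is split into consecutive chunks, one per process.
    """
    if not devices:
        return list(range(num_devices))
    if len(devices) == num_devices:
        # "Broadcast": each process uses its own local GPUs with these ids.
        return list(devices)
    if len(devices) % num_devices != 0:
        # Assumption: partial chunks are treated as an error.
        raise ValueError("--devices must hold a multiple of --num-devices entries")
    start = rank * num_devices
    if start >= len(devices):
        raise ValueError("no devices listed for this process")
    return devices[start:start + num_devices]


# --num-devices 2 --devices 0 1 4 6
print(devices_for_process(0, 2, [0, 1, 4, 6]))
print(devices_for_process(1, 2, [0, 1, 4, 6]))
# --num-devices 2 --devices 0 1
print(devices_for_process(1, 2, [0, 1]))
# --num-devices 2
print(devices_for_process(1, 2))
```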

@kpu, all valid Linux pathnames should be allowed without Marian-custom escaping hoops. (BTW, this also includes regular files named "stdin" and "stdout"...)

@emjotde, if you want to pass a Yaml string, please let's make sure we accept strings that are quoted. E.g.

--valid-sets '[[set1.src, set1.tgt], ["this=great.src", "this=great.tgt"]]'

frankseide, Oct 10 '18 16:10
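One practical detail behind the quoted example above is how the shell treats the outer quotes. The sketch below uses Python's standard `shlex` module, which emulates POSIX shell word splitting, to show what a program would receive as the argument. It says nothing about how Marian parses the value afterwards; the assumption is only that the argument would be handed to a YAML parser as-is.

```python
import shlex

# Single quotes outside: the inner double quotes reach the program intact.
safe = shlex.split(
    """--valid-sets '[[set1.src, set1.tgt], ["this=great.src", "this=great.tgt"]]'"""
)

# Double quotes outside: the unescaped inner double quotes are consumed
# by shell-style quote removal before the program sees the argument.
stripped = shlex.split('--valid-sets "[["this=great.src", "this=great.tgt"]]"')

print(safe[1])
print(stripped[1])
```

In the second case the argument is still a single string, but the quotes around the file names are gone, so any quoting meant for the YAML parser has to be protected from the shell first.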

@frankseide OK, no problem. I just don't think something like --valid-sets "[[set1.src, set1.tgt], [set2.src, set2.tgt]]" is convenient, but unfortunately I don't have a better idea.

snukky, Oct 10 '18 16:10

It's maybe not pretty, but at least consistent. Introducing a third custom syntax seems to be a bad idea.

emjotde, Oct 10 '18 16:10

Hi there, was this ever taken up? I would also be interested in this enhancement!

onadegibert, Dec 14 '23 08:12