ducttape
Allow branchpoints containing variable blocks
We would also like to support nested branch points as an alternative to config blocks. This would make it easy to define language pairs as branch points within the same workflow, something along the lines of:
config {
branchpoint (Lang: en) {
lm_text=en.txt
N=5
}
branchpoint (Lang: ar) {
branchpoint (LmSize: small) {
lm_text=ar_small.txt
N=3
}
branchpoint (LmSize: big) {
lm_text=ar_big.txt
N=4
}
}
branchpoint (LangPair: en-ar) {
# build_lm is defined somewhere in the workflow... (is this bad?)
arpa=@build_lm[Lang:ar]
parallel_text=en-ar.txt
}
branchpoint (LangPair: zh-en) {
arpa=@build_lm[Lang:en]
parallel_text=zh-en.txt
}
branchpoint (LangPair: ar-en) {
arpa=@build_lm[Lang:en]
parallel_text=ar-en.txt
}
}
However, this config example defines no tasks with code, only ConfigAssignments, which makes it rather unlike the switch-case example above. The two still need to be reconciled (or simply left as two separate constructs).
(See also chat, June 14, 11:01am)
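One way to picture what nesting could mean here is to flatten the blocks, so that every assignment carries the branches of all enclosing blocks. The sketch below is a toy Python model, not ducttape code. The `Block` class and `flatten` function are hypothetical names, and the rule that inner assignments inherit the full path of enclosing branches is an assumption about what the proposal intends.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy model of `branchpoint (Name: branch) { ... }` (hypothetical, not ducttape's AST)."""
    branchpoint: str
    branch: str
    assignments: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def flatten(block, path=()):
    """Yield (branch_path, variable, value) for every assignment under `block`."""
    here = path + ((block.branchpoint, block.branch),)
    for name, value in block.assignments.items():
        yield here, name, value
    for child in block.children:
        yield from flatten(child, here)


# The `Lang: ar` block from the example above, with its two LmSize sub-blocks.
ar = Block("Lang", "ar", children=[
    Block("LmSize", "small", {"lm_text": "ar_small.txt", "N": "3"}),
    Block("LmSize", "big", {"lm_text": "ar_big.txt", "N": "4"}),
])

flat = list(flatten(ar))
for branch_path, name, value in flat:
    # The label format is made up for display only.
    label = "+".join(f"{bp}:{b}" for bp, b in branch_path)
    print(f"[{label}] {name}={value}")
```

Under this reading, `lm_text=ar_small.txt` belongs to the combination of `Lang: ar` and `LmSize: small`, which is the same information a pair of nested per-variable branch points would have to spell out separately for `lm_text` and `N`.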
Lane suggests using "branch" instead of "branchpoint"
Here is my original motivating example for wanting some mechanism to group parameters into a branch. Assume a mature tape that is now more or less unchanging. All changeable elements are meant to go in a config file. The tape contains tasks that reference variables that must be defined in the config file, like so:
task match
> dev
> test
:: dev_translations=${dev_translations}
:: test_translations=${test_translations}
< match_script=$match_script
{
${match_script} ${dev_translations} > dev
${match_script} ${test_translations} > test
}
Here, we assume that ${match_script} is defined in a global block in the tape file, and ${dev_translations} and ${test_translations} must be defined in the config file.
My original attempt to solve this was as follows:
config exp_iwslt11_talk_ef {
lm_file="/experiments/models/order-5.srilm"
dev_references="/experiments/tests/iwslt11-ef/data/dev.fr.tok.lc"
test_references="/experiments/tests/iwslt11-ef/data/tst.fr.tok.lc"
dev_translations="/experiments/exp-iwslt11-talk-ef/dev*.out"
test_translations="/experiments/exp-iwslt11-talk-ef/optimized*.out"
}
My assumption here was that this named config would define something like a branch, where all of the variables defined within the named branch would have the same "branch" value of "exp_iwslt11_talk_ef". Jon pointed out that this doesn't work, and to get this behavior I would need to write a config file like so:
config {
lm_file=(DataSet:
exp_iwslt11_talk_ef="/experiments/models/order-5.srilm"
)
dev_references=(DataSet:
exp_iwslt11_talk_ef="/experiments/tests/iwslt11-ef/data/dev.fr.tok.lc"
)
test_references=(DataSet:
exp_iwslt11_talk_ef="/experiments/tests/iwslt11-ef/data/tst.fr.tok.lc"
)
dev_translations=(DataSet:
exp_iwslt11_talk_ef="/experiments/exp-iwslt11-talk-ef/dev*.out"
)
test_translations=(DataSet:
exp_iwslt11_talk_ef="/experiments/exp-iwslt11-talk-ef/optimized*.out"
)
}
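The rewrite from the named config to the per-variable form is mechanical, which is part of why it feels like something the tool could do for me. As a rough illustration only (plain Python string templating, not anything ducttape provides; the function name and the default branch point name `DataSet` are just taken from or invented for this example):

```python
def named_config_to_branchpoint(config_name, assignments, branchpoint="DataSet"):
    """Render a named config's assignments as one single-branch branch point per variable."""
    lines = ["config {"]
    for variable, value in assignments.items():
        lines.append(f"{variable}=({branchpoint}:")
        lines.append(f'{config_name}="{value}"')
        lines.append(")")
    lines.append("}")
    return "\n".join(lines)


# Two of the five variables from the named config above, to keep the output short.
rendered = named_config_to_branchpoint(
    "exp_iwslt11_talk_ef",
    {
        "lm_file": "/experiments/models/order-5.srilm",
        "dev_translations": "/experiments/exp-iwslt11-talk-ef/dev*.out",
    },
)
print(rendered)
```

Each variable turns into three lines where the named config needed one, and the config name is repeated once per variable.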
When I wanted to run a set of experiments using different configs, I could do so in a single config file like so:
config {
lm_file=(DataSet:
exp_iwslt09="/experiments/models/order-5.srilm"
exp_iwslt09_dunk="/experiments/models/order-5.srilm"
exp_x5_iwslt09="/experiments/models/order-5.srilm"
exp_x5_iwslt09_dunk="/experiments/models/order-5.srilm"
exp_x10_iwslt09="/experiments/models/order-5.srilm"
exp_x10_iwslt09_dunk="/experiments/models/order-5.srilm"
)
dev_references=(DataSet:
exp_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
exp_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
exp_x5_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
exp_x5_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
exp_x10_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
exp_x10_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
)
test_references=(DataSet:
exp_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
exp_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
exp_x5_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
exp_x5_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
exp_x10_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
exp_x10_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
)
dev_translations=(DataSet:
exp_iwslt09="/experiments/exp-iwslt09/dev*.out"
exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/dev*.out"
exp_x5_iwslt09="/experiments/exp-x5-iwslt09/dev*.out"
exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/dev*.out"
exp_x10_iwslt09="/experiments/exp-x10-iwslt09/dev*.out"
exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/dev*.out"
)
test_translations=(DataSet:
exp_iwslt09="/experiments/exp-iwslt09/optimized*.out"
exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/optimized*.out"
exp_x5_iwslt09="/experiments/exp-x5-iwslt09/optimized*.out"
exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
exp_x10_iwslt09="/experiments/exp-x10-iwslt09/optimized*.out"
exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
)
}
For me, a successful resolution of this issue should ideally allow me to write something very similar to my original attempt at a named config. That is, I should be able to specify very concisely that a set of variable definitions "go together." Likewise, it should be possible to define a set of experimental configurations in a single config file (like the final example above) more concisely than I can now.
It would be nice to be able to do something like this:
config {
lm_file="/experiments/models/order-5.srilm"
dev_references="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
test_references="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
branch on dev_translations {
exp_iwslt09="/experiments/exp-iwslt09/dev*.out"
exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/dev*.out"
exp_x5_iwslt09="/experiments/exp-x5-iwslt09/dev*.out"
exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/dev*.out"
exp_x10_iwslt09="/experiments/exp-x10-iwslt09/dev*.out"
exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/dev*.out"
}
branch on test_translations {
exp_iwslt09="/experiments/exp-iwslt09/optimized*.out"
exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/optimized*.out"
exp_x5_iwslt09="/experiments/exp-x5-iwslt09/optimized*.out"
exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
exp_x10_iwslt09="/experiments/exp-x10-iwslt09/optimized*.out"
exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
}
}
This can be accomplished similarly under current syntax as:
global {
lm_file="/experiments/models/order-5.srilm"
dev_references="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
test_references="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
dev_translations=(Exp:
exp_iwslt09="/experiments/exp-iwslt09/dev*.out"
exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/dev*.out"
exp_x5_iwslt09="/experiments/exp-x5-iwslt09/dev*.out"
exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/dev*.out"
exp_x10_iwslt09="/experiments/exp-x10-iwslt09/dev*.out"
exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/dev*.out"
)
test_translations=(Exp:
exp_iwslt09="/experiments/exp-iwslt09/optimized*.out"
exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/optimized*.out"
exp_x5_iwslt09="/experiments/exp-x5-iwslt09/optimized*.out"
exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
exp_x10_iwslt09="/experiments/exp-x10-iwslt09/optimized*.out"
exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
)
}
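Compared with the earlier all-`DataSet` version, the saving here comes from pulling out the variables whose value is identical in every experiment and leaving only the varying ones as branch points. A minimal Python sketch of that split follows. The function name `hoist_constants` is hypothetical, and it assumes a plain assignment is an acceptable stand-in for a branch point whose branches all carry the same value.

```python
def hoist_constants(per_variable):
    """Split {variable: {branch: value}} into (constants, varying).

    A variable counts as constant when every branch assigns it the same value.
    Variables with no branches at all are left in `varying` untouched.
    """
    constants, varying = {}, {}
    for variable, by_branch in per_variable.items():
        distinct = set(by_branch.values())
        if len(distinct) == 1:
            constants[variable] = distinct.pop()
        else:
            varying[variable] = dict(by_branch)
    return constants, varying


# Two experiments and two variables from the example above.
per_variable = {
    "lm_file": {
        "exp_iwslt09": "/experiments/models/order-5.srilm",
        "exp_iwslt09_dunk": "/experiments/models/order-5.srilm",
    },
    "dev_translations": {
        "exp_iwslt09": "/experiments/exp-iwslt09/dev*.out",
        "exp_iwslt09_dunk": "/experiments/exp-iwslt09-dunk/dev*.out",
    },
}

constants, varying = hoist_constants(per_variable)
print(constants)        # variables that can become plain assignments
print(sorted(varying))  # variables that still need a branch point
```

In this toy input `lm_file` ends up as a constant and `dev_translations` stays a branch point, mirroring the hand-written version above.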
However, we'd also like to be able to break each experiment out into a separate config file. This effectively inverts the nesting used above: instead of each variable listing its value per experiment, each experiment lists its value per variable. Notice how this becomes increasingly attractive as the number of key=value pairs inside each "Exp" block increases.
global {
lm_file="/experiments/models/order-5.srilm"
dev_references="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
test_references="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
branch (Exp: exp_iwslt09) {
dev_translations="/experiments/exp-iwslt09/dev*.out"
test_translations="/experiments/exp-iwslt09/optimized*.out"
}
branch (Exp: exp_iwslt09_dunk) {
dev_translations="/experiments/exp-iwslt09-dunk/dev*.out"
test_translations="/experiments/exp-iwslt09-dunk/optimized*.out"
}
branch (Exp: exp_x5_iwslt09) {
dev_translations="/experiments/exp-x5-iwslt09/dev*.out"
test_translations="/experiments/exp-x5-iwslt09/optimized*.out"
}
branch (Exp: exp_x5_iwslt09_dunk) {
dev_translations="/experiments/exp-x5-iwslt09-dunk/dev*.out"
test_translations="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
}
branch (Exp: exp_x10_iwslt09) {
dev_translations="/experiments/exp-x10-iwslt09/dev*.out"
test_translations="/experiments/exp-x10-iwslt09/optimized*.out"
}
branch (Exp: exp_x10_iwslt09_dunk) {
dev_translations="/experiments/exp-x10-iwslt09-dunk/dev*.out"
test_translations="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
}
}
Now, each "branch" block could be broken out into its own file. This will become an even more powerful feature once we implement an import mechanism.
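The inversion is just a transposition of a two-level mapping. The sketch below shows it in Python as a way of thinking about what the parser would have to do. The function name is hypothetical, the printed `branch (Exp: ...)` text is the proposed syntax from this thread rather than anything ducttape accepts today, and nothing here reflects how the AST parser or TaskTemplateBuilder is actually structured.

```python
def invert_nesting(per_variable):
    """Turn {variable: {branch: value}} into {branch: {variable: value}}."""
    per_branch = {}
    for variable, by_branch in per_variable.items():
        for branch, value in by_branch.items():
            per_branch.setdefault(branch, {})[variable] = value
    return per_branch


# Current syntax: one branch point per variable (two experiments shown).
per_variable = {
    "dev_translations": {
        "exp_iwslt09": "/experiments/exp-iwslt09/dev*.out",
        "exp_iwslt09_dunk": "/experiments/exp-iwslt09-dunk/dev*.out",
    },
    "test_translations": {
        "exp_iwslt09": "/experiments/exp-iwslt09/optimized*.out",
        "exp_iwslt09_dunk": "/experiments/exp-iwslt09-dunk/optimized*.out",
    },
}

per_branch = invert_nesting(per_variable)
for branch, assignments in per_branch.items():
    print(f"branch (Exp: {branch}) {{")
    for variable, value in assignments.items():
        print(f'{variable}="{value}"')
    print("}")
```

Applying the same function twice gives back the original mapping when every variable is defined for every branch, so the two layouts carry the same information. What a `branch` block should mean when it omits a variable that other blocks define is a question the sketch does not answer.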
Should branch really go inside global?
Assuming we do away with "config", yes. Currently "branch" could go inside "config" as well.
The question was more, should branch be a standalone block rather than inside something else?
Seems reasonable for it to go inside or outside of global, really. I would support either or both, if you have some preference for one of them.
I'm not sure yet.
This will involve changing both the AST parser and TaskTemplateBuilder.