Allow branchpoints containing variable blocks

Open jhclark opened this issue 12 years ago • 11 comments

We would also like to achieve nested branch points as an alternative to config blocks, making it easy to define language pairs as branch points within the same workflow, something along the lines of:

config {
  branchpoint (Lang: en) {
    lm_text=en.txt
    N=5
  }

  branchpoint (Lang: ar) {
    branchpoint (LmSize: small) {
      lm_text=ar_small.txt
      N=3
    }
    branchpoint (LmSize: big) {
      lm_text=ar_big.txt
      N=4
    }
  }

  branchpoint (LangPair: en-ar) {
    # build_lm is defined somewhere in the workflow... (is this bad?)
    arpa=@build_lm[Lang:ar]
    parallel_text=en-ar.txt
  }

  branchpoint (LangPair: zh-en) {
    arpa=@build_lm[Lang:en]
    parallel_text=zh-en.txt
  }

  branchpoint (LangPair: ar-en) {
    arpa=@build_lm[Lang:en]
    parallel_text=ar-en.txt
  }
}

However, this config example does not define any tasks with code, only ConfigAssignments, which is rather unlike the switch-case example above. This still needs to be reconciled (or the two could simply be left as separate constructs).
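
One possible reading of the nested proposal, sketched in Python rather than ducttape syntax: each assignment applies only when every enclosing branch is selected. The dictionary layout and the `flatten` helper are assumptions made for illustration and do not reflect ducttape's actual AST.

```python
# Hypothetical in-memory form of nested branch blocks (NOT ducttape's AST):
# a string key is a variable assignment, a (branch_point, branch) tuple key
# opens a nested block.
config = {
    ("Lang", "en"): {"lm_text": "en.txt", "N": "5"},
    ("Lang", "ar"): {
        ("LmSize", "small"): {"lm_text": "ar_small.txt", "N": "3"},
        ("LmSize", "big"): {"lm_text": "ar_big.txt", "N": "4"},
    },
}


def flatten(block, context=()):
    """Yield (context, variable, value) for every assignment in the block.

    `context` is the tuple of (branch_point, branch) pairs that enclose the
    assignment, outermost first.
    """
    for key, value in block.items():
        if isinstance(key, tuple):
            # Nested branch block: extend the context and recurse.
            yield from flatten(value, context + (key,))
        else:
            yield context, key, value


for context, name, value in flatten(config):
    label = "+".join(f"{bp}:{branch}" for bp, branch in context)
    print(f"[{label}] {name}={value}")
```

Under this reading, `lm_text=ar_small.txt` is visible only when both `Lang: ar` and `LmSize: small` are selected, which is the same information a per-variable branch point would carry.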

(See also chat, June 14, 11:01 am)

jhclark avatar Jul 11 '12 20:07 jhclark

Lane suggests using "branch" instead of "branchpoint"

jhclark avatar Jul 11 '12 20:07 jhclark

Here is my original motivating example for wanting some mechanism to group parameters into a branch. Assume a mature tape that is now more or less unchanging. All changeable elements are meant to go in a config file. The tape contains tasks that reference variables that must be defined in the config file, like so:

task match 
     > dev
     > test
    :: dev_translations=${dev_translations}
    :: test_translations=${test_translations} 
     < match_script=$match_script
{
    ${match_script} ${dev_translations} > dev
    ${match_script} ${test_translations} > test
}

Here, we assume that ${match_script} is defined in a global block in the tape file, and ${dev_translations} and ${test_translations} must be defined in the config file.

My original attempt to solve this was as follows:

config exp_iwslt11_talk_ef {

  lm_file="/experiments/models/order-5.srilm"
  dev_references="/experiments/tests/iwslt11-ef/data/dev.fr.tok.lc"
  test_references="/experiments/tests/iwslt11-ef/data/tst.fr.tok.lc"
  dev_translations="/experiments/exp-iwslt11-talk-ef/dev*.out"
  test_translations="/experiments/exp-iwslt11-talk-ef/optimized*.out"

}

My assumption here was that this named config would define something like a branch, where all of the variables defined within the named branch would have the same "branch" value of "exp_iwslt11_talk_ef". Jon pointed out that this doesn't work, and to get this behavior I would need to write a config file like so:

config {

  lm_file=(DataSet:
    exp_iwslt11_talk_ef="/experiments/models/order-5.srilm"
  )

  dev_references=(DataSet:
    exp_iwslt11_talk_ef="/experiments/tests/iwslt11-ef/data/dev.fr.tok.lc"
  )

  test_references=(DataSet:
    exp_iwslt11_talk_ef="/experiments/tests/iwslt11-ef/data/tst.fr.tok.lc"
  )

  dev_translations=(DataSet:
    exp_iwslt11_talk_ef="/experiments/exp-iwslt11-talk-ef/dev*.out"
  )

  test_translations=(DataSet:
    exp_iwslt11_talk_ef="/experiments/exp-iwslt11-talk-ef/optimized*.out"
  )

}

When I wanted to run a set of experiments using different configs, I could do so in a single config file like so:

config {

  lm_file=(DataSet:
           exp_iwslt09="/experiments/models/order-5.srilm"
      exp_iwslt09_dunk="/experiments/models/order-5.srilm"
        exp_x5_iwslt09="/experiments/models/order-5.srilm"
       exp_x5_iwslt09_dunk="/experiments/models/order-5.srilm"
       exp_x10_iwslt09="/experiments/models/order-5.srilm"
      exp_x10_iwslt09_dunk="/experiments/models/order-5.srilm"
  )

  dev_references=(DataSet:
           exp_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
      exp_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
        exp_x5_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
       exp_x5_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
       exp_x10_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
      exp_x10_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
  )

  test_references=(DataSet:
           exp_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
      exp_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
        exp_x5_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
       exp_x5_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
       exp_x10_iwslt09="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
      exp_x10_iwslt09_dunk="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"
  )

  dev_translations=(DataSet:
           exp_iwslt09="/experiments/exp-iwslt09/dev*.out"
      exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/dev*.out"
        exp_x5_iwslt09="/experiments/exp-x5-iwslt09/dev*.out"
       exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/dev*.out"
       exp_x10_iwslt09="/experiments/exp-x10-iwslt09/dev*.out"
      exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/dev*.out"
  )

  test_translations=(DataSet:
           exp_iwslt09="/experiments/exp-iwslt09/optimized*.out"
      exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/optimized*.out"
        exp_x5_iwslt09="/experiments/exp-x5-iwslt09/optimized*.out"
       exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
       exp_x10_iwslt09="/experiments/exp-x10-iwslt09/optimized*.out"
      exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
  )

}

For me, successful resolution of this issue should ideally allow me to write something very similar to my original attempt at a named config. That is, I should be able to specify very concisely that a set of variable definitions "go together." Likewise, it should be possible to define a set of experimental configurations in a single config file (like the final example above) more concisely than I am now able to.

dowobeha avatar Jul 12 '12 13:07 dowobeha

It would be nice to be able to do something like this:

config {

  lm_file="/experiments/models/order-5.srilm"

  dev_references="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"

  test_references="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"

  branch on dev_translations {
           exp_iwslt09="/experiments/exp-iwslt09/dev*.out"
      exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/dev*.out"
        exp_x5_iwslt09="/experiments/exp-x5-iwslt09/dev*.out"
       exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/dev*.out"
       exp_x10_iwslt09="/experiments/exp-x10-iwslt09/dev*.out"
      exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/dev*.out"
  }

  branch on test_translations {
           exp_iwslt09="/experiments/exp-iwslt09/optimized*.out"
      exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/optimized*.out"
        exp_x5_iwslt09="/experiments/exp-x5-iwslt09/optimized*.out"
       exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
       exp_x10_iwslt09="/experiments/exp-x10-iwslt09/optimized*.out"
      exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
  }

}

dowobeha avatar Jul 12 '12 13:07 dowobeha

This can be accomplished similarly under current syntax as:

global {
  lm_file="/experiments/models/order-5.srilm"
  dev_references="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
  test_references="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"

  dev_translations=(Exp:
           exp_iwslt09="/experiments/exp-iwslt09/dev*.out"
      exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/dev*.out"
        exp_x5_iwslt09="/experiments/exp-x5-iwslt09/dev*.out"
       exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/dev*.out"
       exp_x10_iwslt09="/experiments/exp-x10-iwslt09/dev*.out"
      exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/dev*.out"
  )

  test_translations=(Exp:
           exp_iwslt09="/experiments/exp-iwslt09/optimized*.out"
      exp_iwslt09_dunk="/experiments/exp-iwslt09-dunk/optimized*.out"
        exp_x5_iwslt09="/experiments/exp-x5-iwslt09/optimized*.out"
       exp_x5_iwslt09_dunk="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
       exp_x10_iwslt09="/experiments/exp-x10-iwslt09/optimized*.out"
      exp_x10_iwslt09_dunk="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
  )

}

jhclark avatar Jul 12 '12 14:07 jhclark

However, we'd also like to be able to break each experiment out into a separate config file. This effectively inverts the nesting used above: instead of each variable listing its value per experiment, each experiment lists its value per variable. Notice how this becomes increasingly attractive as the number of key=value pairs inside each "Exp" block increases.

global {
  lm_file="/experiments/models/order-5.srilm"
  dev_references="/experiments/tests/iwslt09-ae/data/ae-dev7-token-*.ref"
  test_references="/experiments/tests/iwslt09-ae/data/ae-dev6-token-*.ref"

  branch (Exp: exp_iwslt09) {
    dev_translations="/experiments/exp-iwslt09/dev*.out"
    test_translations="/experiments/exp-iwslt09/optimized*.out"
  }

  branch (Exp: exp_iwslt09_dunk) {
    dev_translations="/experiments/exp-iwslt09-dunk/dev*.out"
    test_translations="/experiments/exp-iwslt09-dunk/optimized*.out"
  }

  branch (Exp: exp_x5_iwslt09) {
    dev_translations="/experiments/exp-x5-iwslt09/dev*.out"
    test_translations="/experiments/exp-x5-iwslt09/optimized*.out"
  }

  branch (Exp: exp_x5_iwslt09_dunk) {
    dev_translations="/experiments/exp-x5-iwslt09-dunk/dev*.out"
    test_translations="/experiments/exp-x5-iwslt09-dunk/optimized*.out"
  }

  branch (Exp: exp_x10_iwslt09) {
    dev_translations="/experiments/exp-x10-iwslt09/dev*.out"
    test_translations="/experiments/exp-x10-iwslt09/optimized*.out"
  }

  branch (Exp: exp_x10_iwslt09_dunk) {
    dev_translations="/experiments/exp-x10-iwslt09-dunk/dev*.out"
    test_translations="/experiments/exp-x10-iwslt09-dunk/optimized*.out"
  }
}

Now, each "branch" block could be broken out into its own file. This will become an even more powerful feature once we implement an import mechanism.
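
The inversion between the two layouts can be sketched in Python (not ducttape code). The `invert` and `render` helpers and the dictionary layout are hypothetical; the rendered text only imitates the per-variable `(Exp: ...)` form shown earlier.

```python
def invert(by_branch):
    """Turn {branch: {variable: value}} into {variable: {branch: value}}."""
    by_variable = {}
    for branch, assignments in by_branch.items():
        for name, value in assignments.items():
            by_variable.setdefault(name, {})[branch] = value
    return by_variable


def render(branch_point, by_variable):
    """Render each variable as text imitating `name=(BranchPoint: branch="value" ...)`."""
    lines = []
    for name, branches in by_variable.items():
        lines.append(f"{name}=({branch_point}:")
        lines.extend(f'  {branch}="{value}"' for branch, value in branches.items())
        lines.append(")")
    return "\n".join(lines)


# Two of the experiments from the thread, grouped the way the proposed
# "branch" blocks would group them.
by_branch = {
    "exp_iwslt09": {
        "dev_translations": "/experiments/exp-iwslt09/dev*.out",
        "test_translations": "/experiments/exp-iwslt09/optimized*.out",
    },
    "exp_iwslt09_dunk": {
        "dev_translations": "/experiments/exp-iwslt09-dunk/dev*.out",
        "test_translations": "/experiments/exp-iwslt09-dunk/optimized*.out",
    },
}

print(render("Exp", invert(by_branch)))
```

Both layouts hold the same (branch, variable, value) triples, so grouping by branch changes only how the file reads, not which realizations exist.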

jhclark avatar Jul 12 '12 14:07 jhclark

Should branch really go inside global?

dowobeha avatar Jul 13 '12 19:07 dowobeha

Assuming we do away with "config", yes. Currently "branch" could also go inside "config".

jhclark avatar Jul 13 '12 19:07 jhclark

The question was more, should branch be a standalone block rather than inside something else?

dowobeha avatar Jul 13 '12 19:07 dowobeha

Seems reasonable for it to go inside or outside of global, really. I would support either or both, if you have some preference for one of them.

jhclark avatar Jul 13 '12 19:07 jhclark

I'm not sure yet.

dowobeha avatar Jul 13 '12 19:07 dowobeha

This will involve changing both the AST parser and TaskTemplateBuilder.

jhclark avatar Jan 03 '13 21:01 jhclark