
MT Marathon 2022 starting point

Open Mortimerp9 opened this issue 3 years ago • 0 comments

Why ?

This is example code showing how to change the mining pipeline to add extra filtering in the mining phase. It is meant as a starting point for the MT Marathon 2022 in Prague.

Check the changes to mine_bitext_indexes_utils.py: they show where the core mining is done from the list of neighbours and where post hoc filtering could be inserted.

How ?

  1. Try the quickstart (https://facebookresearch.github.io/stopes/docs/quickstart); it shows how to get data and run the mining pipeline end to end.
  2. Check this PR to see how the code is changed to pass a config for the filter, plus the list of original text shards, down to the mining step. You could use filter_config to pass the path to a trained classifier, or other parameters, to the filter.
  3. Replace noop_filter with something smarter to filter out the mined candidates.
  4. stopes caches most steps. If you change the utils, bump the version in stopes/modules/bitext/mining/mine_bitext_indexes.py so that the steps you changed are recomputed. You can then run the whole pipeline with the same command each time.
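
As an illustration of step 3, here is a minimal sketch of what a smarter filter could look like. It is not the signature of noop_filter in this PR: the `Candidate` class, the `length_ratio_filter` name, and the `min_score` and `max_length_ratio` keys are all hypothetical, chosen only to show how parameters passed through filter_config might drive a filtering decision. Adapt the inputs to whatever the mining step actually hands to the filter.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class Candidate:
    """Hypothetical mined sentence pair; this is not a stopes type."""

    score: float   # mining score of the pair
    src_text: str  # source-side sentence
    tgt_text: str  # target-side sentence


def length_ratio_filter(
    candidates: Iterable[Candidate],
    filter_config: Dict[str, Any],
) -> List[Candidate]:
    """Keep pairs that score high enough and have comparable lengths.

    The config keys and their defaults are assumptions made for this sketch.
    """
    min_score = filter_config.get("min_score", 1.06)
    max_ratio = filter_config.get("max_length_ratio", 2.0)

    kept = []
    for cand in candidates:
        # Whitespace token counts are a crude but cheap length proxy.
        src_len = len(cand.src_text.split())
        tgt_len = len(cand.tgt_text.split())
        if src_len == 0 or tgt_len == 0:
            continue  # drop pairs with an empty side
        ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
        if cand.score >= min_score and ratio <= max_ratio:
            kept.append(cand)
    return kept
```

A trained classifier could replace the length-ratio test in the same place: load it from a path given in filter_config and keep a candidate only when the classifier accepts it.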

Mortimerp9, Sep 01 '22 12:09