stopes
MT Marathon 2022 starting point
Why ?
This is example code showing how to change the mining pipeline to add extra filtering in the mining phase. It is meant as a starting point for the MT Marathon 2022 in Prague.
Check the changes to mine_bitext_indexes_utils.py to see where the core mining is done from the list of neighbours and where post-hoc filtering could be inserted.
How ?
- try the quickstart (https://facebookresearch.github.io/stopes/docs/quickstart); it shows how to get data and run the mining pipeline end to end
- check this PR to see how the code is changed to pass down a config for the filter + the list of original text shards to the mining step. You could use the filter_config to pass down the path to a trained classifier or other parameters to the filtering if you wanted.
- replace noop_filter with something smarter to filter out the mined candidates.
- stopes will cache most steps. If you change the utils, bump the version in stopes/modules/bitext/mining/mine_bitext_indexes.py to make sure to recompute what you are changing. Then you can run the whole pipeline with the same command each time.
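
As a rough idea of what "something smarter" than noop_filter could look like, here is a minimal sketch of a filter that drops candidates with a low mining score or a lopsided length ratio. Everything in it is an assumption made for illustration: the Candidate class, the length_ratio_filter name, its signature, and the min_score and max_length_ratio keys are hypothetical and are not taken from the stopes code or from the PR. Check the real noop_filter signature before adapting it.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Mapping


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one mined pair (not a stopes type)."""

    score: float  # mining score of the pair
    src: str      # source sentence text
    tgt: str      # target sentence text


def length_ratio_filter(
    candidates: Iterable[Candidate],
    filter_config: Mapping[str, float],
) -> Iterator[Candidate]:
    """Keep pairs with a high enough score and a balanced token count.

    The config keys below are hypothetical; they stand in for whatever
    parameters you decide to pass down through filter_config.
    """
    min_score = filter_config.get("min_score", 0.0)
    max_ratio = filter_config.get("max_length_ratio", 2.0)
    for cand in candidates:
        # Whitespace tokenization is a crude proxy for sentence length.
        src_len = len(cand.src.split())
        tgt_len = len(cand.tgt.split())
        if src_len == 0 or tgt_len == 0:
            continue  # drop pairs with an empty side
        ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
        if cand.score >= min_score and ratio <= max_ratio:
            yield cand
```

A trained classifier could take the place of the length-ratio test: load it from a path given in filter_config and keep only the pairs it accepts.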