Kenneth Heafield
> Currently there are no palette options in the app anyway. The purpose of this issue is to discuss what the app should do.
Can we call it OPUS?
Aren't we already supposed to be doing this for e.g. JW300? https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/pipeline/clean/fixes/mtdata_JW300.sh
No, it's the sacrebleu importer: https://github.com/mozilla/firefox-translations-training/blob/3b3f33bf2581238d325f05015123fc0a026c394e/pipeline/data/importers/corpus/sacrebleu.sh#L16
There may still be crashes after that if there's e.g. a long URL that SPM splits into a very large number of tokens. Ideally we'd clean based on the SPM length of the sentence rather than its raw length.
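A minimal sketch of what cleaning on SPM length could look like. This is not the pipeline's code: `filter_by_token_length`, `toy_encode`, and the `max_tokens` threshold are hypothetical names and values for illustration, and the tokenizer is passed in as a callable so the example runs without a trained model.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple


def filter_by_token_length(
    pairs: Iterable[Tuple[str, str]],
    encode: Callable[[str], Sequence],
    max_tokens: int = 128,  # hypothetical threshold, not a pipeline default
) -> Iterator[Tuple[str, str]]:
    """Yield only pairs whose source and target both fit in max_tokens."""
    for src, trg in pairs:
        if len(encode(src)) <= max_tokens and len(encode(trg)) <= max_tokens:
            yield src, trg


def toy_encode(text: str) -> list:
    """Stand-in for a subword tokenizer: one 'token' per 4 characters."""
    return [text[i:i + 4] for i in range(0, len(text), 4)]


pairs = [
    ("hello world", "hallo welt"),
    # Few words, but a long URL that explodes into many subword pieces.
    ("see https://example.com/" + "a" * 400, "siehe link"),
]
kept = list(filter_by_token_length(pairs, toy_encode, max_tokens=20))
```

With the `sentencepiece` Python package, `encode` could plausibly be `lambda s: sp.encode(s, out_type=str)` for an `sp = sentencepiece.SentencePieceProcessor(model_file=...)` loaded from the trained vocabulary. The point is that the filter counts pieces, so a sentence with few words but one very long URL is still caught.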
``` MaxConnPerIP 4 MaxConnPerIP 1 MaxConnPerIP 1 MaxConnPerIP 1 SetOutputFilter RATE_LIMIT # SetEnv rate-limit 50000 SetEnv rate-limit 1000 SetOutputFilter RATE_LIMIT SetEnv rate-limit 1000 SetOutputFilter RATE_LIMIT SetEnv rate-limit 1000 SetOutputFilter RATE_LIMIT...
The `interpolate` program accepts `-t` for the file to tune on and `--just_tune` if you want it to stop after tuning weights. Keep in mind these are log-linear weights.
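To illustrate what "log-linear weights" means in practice, here is a toy sketch over explicit unigram distributions. It is not KenLM's implementation (which works on backoff models); `log_linear_interpolate` and the example distributions are made up for illustration. The weights act as exponents on each model's probability, followed by renormalization, rather than as coefficients of a weighted average.

```python
import math
from typing import Dict, List


def log_linear_interpolate(
    dists: List[Dict[str, float]], weights: List[float]
) -> Dict[str, float]:
    """Combine distributions as p(w) proportional to prod_i p_i(w) ** weight_i."""
    vocab = dists[0].keys()
    # Weighted sum of log probabilities, i.e. a product of powers.
    scores = {
        w: sum(lam * math.log(d[w]) for lam, d in zip(weights, dists))
        for w in vocab
    }
    # Renormalize so the result sums to one.
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {w: math.exp(s - log_z) for w, s in scores.items()}


model_a = {"a": 0.5, "b": 0.5}
model_b = {"a": 0.9, "b": 0.1}
mixed = log_linear_interpolate([model_a, model_b], [0.5, 0.5])
# A linear mixture with the same weights would give p("a") = 0.7;
# the log-linear combination gives a different value after renormalizing.
```

One consequence is that tuned weights from `interpolate` should not be read as mixture proportions, and they do not need to sum to one.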
Would require editing the code here: https://github.com/kpu/kenlm/blob/7af246801e05b5f3b9d2f6a34a820f8d9379f41a/lm/builder/corpus_count.cc#L242 Could probably be made into a command line option.
The Python setup.py currently assumes gcc and a command line. If you can add python/kenlm.cpp to the Windows build and link against the appropriate Python binding libraries, it should work....
Yes, I mainly need to learn how to do that.