Kenneth Heafield

Results 290 comments of Kenneth Heafield

Ideally we'd replace buffering then splitting with splitting on the fly. Then if there's something long and no split we throw it out. Here I'm a bit concerned we're throwing...
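A minimal sketch of what splitting on the fly could look like, as opposed to buffering and then splitting. What is being split is not stated in the comment, so the single-character `delimiter`, the `max_len` limit, and the name `split_streaming` are all assumptions made for illustration.

```python
def split_streaming(chunks, delimiter="\n", max_len=1000):
    """Yield delimiter-separated pieces from an iterable of text chunks.

    Hypothetical helper: assumes a single-character delimiter. A piece that
    grows past max_len without a delimiter is thrown out instead of buffered.
    """
    pending = ""          # text seen since the last delimiter
    discarding = False    # True while skipping the rest of an oversized piece
    for chunk in chunks:
        pending += chunk
        while True:
            idx = pending.find(delimiter)
            if idx == -1:
                break
            piece = pending[:idx]
            pending = pending[idx + len(delimiter):]
            if discarding:
                # This is the tail of a piece we already gave up on.
                discarding = False
            elif len(piece) <= max_len:
                yield piece
            # Otherwise the piece is too long and is dropped.
        if len(pending) > max_len:
            # Long and no split so far: drop it rather than keep buffering.
            pending = ""
            discarding = True
    if pending and not discarding:
        yield pending
```

Memory use stays bounded by roughly `max_len` plus one chunk, which is the point of not buffering first.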

It's worse than first thought. Even languages with spaces are losing them. Let's make a `cat` that strips leading and trailing whitespace, like most MT systems will. ``` #!/usr/bin/env python3...
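The script in the comment is cut off, so the following is a separate sketch of the same idea rather than the original code: a `cat`-like filter that strips leading and trailing whitespace from every line. The function names `strip_lines` and `strip_cat` are made up for this example.

```python
import sys


def strip_lines(lines):
    """Yield each input line with leading and trailing whitespace removed."""
    for line in lines:
        yield line.strip()


def strip_cat(src, dst):
    """Copy src to dst line by line, stripping whitespace around each line."""
    for line in strip_lines(src):
        dst.write(line + "\n")


# To use it as a filter, call: strip_cat(sys.stdin, sys.stdout)
```

Note that `str.strip()` also removes the trailing newline, which is why `strip_cat` writes one back after each line.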

Yeah ok we should have used `-s` for the fa->en part where I noticed the issue, then thought fa didn't have spaces, then looked at corpora and realized it does...

But the tokenizer is supposed to change those to `&lt;` and `&gt;` so it probably doesn't matter. (XML support is out of scope for the C++ version)

The default is to prefer static linkage. `-DFORCE_STATIC` means fail to compile if dependencies are only available dynamically.

Ping @phikoehn @hieuhoang this is just a copy from Moses.

Does your file end with a newline? And use UNIX newlines?
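Both questions can be checked mechanically. A small sketch, assuming the file fits in memory and treating any carriage return as a sign of non-UNIX newlines; `check_newlines` is a hypothetical helper name.

```python
def check_newlines(data: bytes) -> dict:
    """Report whether raw file contents end with a newline and use UNIX newlines."""
    return {
        # An empty file counts as not ending with a newline.
        "ends_with_newline": data.endswith(b"\n"),
        # Any carriage return suggests DOS (\r\n) or old Mac (\r) line endings.
        "unix_newlines": b"\r" not in data,
    }
```

For a real file, read it in binary mode so Python does not translate the line endings first, for example `check_newlines(open(path, "rb").read())`.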

This is really bad for e.g. `TildeMODEL-v2018.en-mt` which is sorted.

A git repo of filter configurations? @eu9ene the longer-term plan is for the filters to be displayed as a sort of review next to the data sets on Opus.

Examples are but one use case. We should be collecting the json files more centrally.