preprocess
Corpus preprocessing
When I follow https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md and try to download the dataset with `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines`, it raises Error:  and no wav is saved;...
Adds a `Reserve()` method to `AutoProbing` that is necessary to get Bitextor's document aligner to work with this type of table. I tried multiple setups where I just initialised the...
Single document splitting just directly goes from Perl's STDIN to STDOUT now. In multidoc mode I locally override STDIN and STDOUT to point to variables. I still buffer a single...
When used in `-k` mode, one would expect the sentence splitter to use a small amount of RAM, just enough to store a single line. However, it actually stores the...
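The memory behaviour the report expects can be illustrated with a minimal Python sketch. This is not the Perl splitter's code: `stream_lines` and `toy_split` are hypothetical names, and the toy rule (break after ". ") only stands in for real sentence splitting. The point is that each line is written out before the next one is read, so only one line is held at a time.

```python
import io
from typing import Callable, TextIO


def stream_lines(src: TextIO, dst: TextIO, transform: Callable[[str], str]) -> int:
    """Apply transform to each input line, writing it out before reading the next."""
    count = 0
    for line in src:  # text file objects yield one line at a time
        dst.write(transform(line.rstrip("\n")) + "\n")
        count += 1
    return count


def toy_split(line: str) -> str:
    # Hypothetical stand-in for a real splitter: break after ". "
    return line.replace(". ", ".\n")


if __name__ == "__main__":
    out = io.StringIO()
    stream_lines(io.StringIO("A b. C d.\nE f.\n"), out, toy_split)
    print(out.getvalue(), end="")
```

With `sys.stdin` and `sys.stdout` passed in place of the `StringIO` objects, the same loop would run as a filter without accumulating the input.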
Some documents contain extremely long lines of generated text (most often links to search page results) that take forever to parse with the regular expressions in split-sentences.perl. Using the -c...
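One generic way to keep pathological lines away from expensive regular expressions is a length guard, sketched below in Python. The snippet above is truncated, so this does not describe what the `-c` option does. The threshold `MAX_LEN`, the function `split_guarded`, and the boundary pattern are all assumptions made for illustration.

```python
import re
from typing import List

MAX_LEN = 10_000  # hypothetical threshold, not taken from split-sentences.perl

# Toy boundary rule: whitespace after ., ! or ? and before a capital letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_guarded(line: str, max_len: int = MAX_LEN) -> List[str]:
    """Split a line into sentences, passing over-long lines through unsplit."""
    if len(line) > max_len:
        return [line]  # skip regex work on generated or degenerate lines
    return _BOUNDARY.split(line)


if __name__ == "__main__":
    print(split_guarded("Hello world. This is it! Ok?"))
```

Passing the long line through unchanged is only one possible policy. Dropping it or cutting it at a fixed width are alternatives with different trade-offs.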
@jelmervdl Problem with foldfilter: if we translate from e.g. ko (without spaces) to en then the output concatenates the last English word of a preceding sentence with the first English...
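The concatenation problem can be sketched in Python as a choice of separator when translated pieces are rejoined. How foldfilter rejoins its output is not shown here, so `join_folded` and its `target_has_spaces` flag are hypothetical. The idea is only that the separator should depend on the target language, not on the source.

```python
from typing import Iterable


def join_folded(pieces: Iterable[str], target_has_spaces: bool = True) -> str:
    """Rejoin translated pieces of a folded line.

    The separator follows the target language: a space for languages such
    as English, nothing for languages written without spaces.
    """
    sep = " " if target_has_spaces else ""
    cleaned = [p.strip() for p in pieces]
    return sep.join(p for p in cleaned if p)


if __name__ == "__main__":
    # Target language uses spaces: the pieces must not be glued together.
    print(join_folded(["I went home.", "It was late."]))
```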
```
tokenize_piece_test.cc:(.text.startup+0xb): undefined reference to `boost::unit_test::unit_test_main(bool (*)(), int, char**)'
```
On input `->` the Moses truecase script does `- >` but the C++ does `->`. The additional space seems to appear regardless of what is before `>`.
This is supposed to expect a '\' to separate documents, but the regex looks for any tag, creating issues in non-escaped texts.