preprocess issues

Fail downloading Seamless align data

1

When I follow https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md and try to download the dataset with `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines`, it raises an error: ![image](https://github.com/kpu/preprocess/assets/114667476/c27c2532-b968-4c83-9bfa-af41ca67a1fc) and no wav is saved;...

lzl-mt

Add `Reserve()` to `AutoProbing`

1

Adds a `Reserve()` method to `AutoProbing` that is necessary to get Bitextor's document aligner to work with this type of table. I tried multiple setups where I just initialised the...

jelmervdl

Change split_single_document to work on STDIN & STDOUT

Single document splitting just directly goes from Perl's STDIN to STDOUT now. In multidoc mode I locally override STDIN and STDOUT to point to variables. I still buffer a single...

jelmervdl

Sentence splitter uses unbounded memory in -k mode

When used in `-k` mode, one would expect the sentence splitter to use a small amount of RAM, just enough to store a single line. However, it actually stores the...

kpu
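
To illustrate the memory expectation described in the issue above, here is a minimal Python sketch of a splitter that handles one line at a time and writes each sentence out immediately, so memory is bounded by the longest single line. It is not the logic of split-sentences.perl: the names `split_line` and `stream_split` are hypothetical, the boundary rule is a deliberately naive stand-in, and it assumes each input line can be split independently of the others.

```python
import io
import re
from typing import Iterable, List, TextIO

# Naive boundary: sentence-final punctuation followed by whitespace.
# Illustrative stand-in only, not the rules used by split-sentences.perl.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_line(line: str) -> List[str]:
    """Split a single line into sentences using the naive rule above."""
    text = line.strip()
    return [s for s in _BOUNDARY.split(text) if s]


def stream_split(lines: Iterable[str], out: TextIO) -> None:
    """Consume lines lazily and emit sentences right away.

    Nothing is accumulated across lines, so only the current line
    and its sentences are held in memory at any time.
    """
    for line in lines:
        for sentence in split_line(line):
            out.write(sentence + "\n")


# Usage: any iterable of lines works, e.g. sys.stdin for real input.
demo_out = io.StringIO()
stream_split(iter(["Hello there. How are you?\n", "Fine.\n"]), demo_out)
```

Passing `sys.stdin` and `sys.stdout` instead of the in-memory buffers gives a filter that never holds more than one line, which is the behaviour the issue says one would expect.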

Add -c option to split-sentences.perl

1

Some documents contain extremely long lines of generated text (most often links to search page results) that take forever to parse with the regular expressions in split-sentences.perl. Using the -c...

jelmervdl

foldfilter breaks translation from language without spaces to language with spaces

4

@jelmervdl Problem with foldfilter: if we translate from e.g. ko (without spaces) to en then the output concatenates the last English word of a preceding sentence with the first English...

kpu
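
The concatenation problem described in the issue above can be reproduced with a toy example. This is a sketch under stated assumptions, not foldfilter's actual implementation: `fold`, `toy_translate`, and `translate_folded` are hypothetical names, the folding is a plain fixed-width cut, and the "translation" is a lookup table. It only shows why the string used to rejoin the translated pieces has to suit the target language rather than the source language.

```python
from typing import Callable, List


def fold(line: str, width: int) -> List[str]:
    """Cut a line into pieces of at most `width` characters.

    A source language written without spaces gives no whitespace to cut on,
    so this toy version cuts at fixed character offsets.
    """
    return [line[i:i + width] for i in range(0, len(line), width)]


def toy_translate(piece: str) -> str:
    """Stand-in for a translation system (hypothetical lookup table)."""
    table = {"안녕하세요": "hello", "세계": "world"}
    return table[piece]


def translate_folded(line: str, width: int,
                     translate: Callable[[str], str], joiner: str) -> str:
    """Fold the input, translate each piece, and rejoin with `joiner`."""
    return joiner.join(translate(piece) for piece in fold(line, width))


source = "안녕하세요세계"

# Rejoining with the source-side convention (no separator) glues words together.
glued = translate_folded(source, 5, toy_translate, "")

# Rejoining with a space matches a target language that separates words.
spaced = translate_folded(source, 5, toy_translate, " ")
```

Here `glued` is `"helloworld"` and `spaced` is `"hello world"`, so one possible remedy is to choose the joiner based on the output language.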