
Failed to download Seamless align data

Open lzl-mt opened this issue 1 year ago • 1 comments

When I follow https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md and try to download the dataset with zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines, it raises an error [image] and no wav is saved.

BTW, this script takes a long time to process, but I can't find anything downloaded in my workspace. Is there any way to save each wav or text as it is produced during the whole processing stage? Thanks a lot.
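For anyone who wants to keep the intermediate result of the first three pipeline stages on disk before handing it to wet_lines, here is a minimal Python sketch of the same filtering (keep lines starting with crawl-data, turn tabs into spaces). The function names and the output path are hypothetical, and this only mirrors the zcat | egrep | tr part of the command; it does not replace wet_lines itself.

```python
import gzip


def filter_metadata(lines):
    """Yield metadata rows that start with 'crawl-data', with tabs turned into spaces.

    Mirrors `egrep ^crawl-data | tr '\\t' ' '` from the command in the issue.
    """
    for line in lines:
        if line.startswith("crawl-data"):
            yield line.rstrip("\n").replace("\t", " ")


def filter_metadata_file(src_path, dst_path):
    """Stream a gzipped TSV, write the filtered rows to dst_path, return the row count.

    Both paths are supplied by the caller; nothing here is specific to one language pair.
    """
    count = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for row in filter_metadata(src):
            dst.write(row + "\n")
            count += 1
    return count
```

Something like filter_metadata_file("seamless.dataset.metadata.public.arb-enA.tsv.gz", "filtered.txt") would then leave a plain-text file that can be inspected, or fed to build/bin/wet_lines on stdin, without re-reading the archive each time.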

lzl-mt, Aug 31 '23 06:08

I tried again but still get the same error, and nothing is saved after almost 2 days of running: what(): /home/ubuntu/preprocess/preprocess/wet_lines_main.cc:71 in void Retrieve::Add(util::StringPiece, const Extract&) threw util::Exception because `!extracts.empty() && extracts.back().paragraph_number > extract.paragraph_number'. Metadata should be sorted by paragraph number in each document
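The exception condition says a row arrived with a smaller paragraph number than the previous row for the same document. A rough way to locate the first such row before spending days on a run is sketched below. Note the assumptions: the column positions (doc_cols, para_col) are placeholders, since the metadata layout is not shown in this thread, and the check only compares consecutive rows that share a document key, which may be looser or stricter than what wet_lines does internally.

```python
def find_unsorted(rows, doc_cols=(0,), para_col=1):
    """Return (row_index, doc_key, previous_number, current_number) for the first
    row whose paragraph number is lower than the preceding row of the same
    document, or None if no such row exists.

    doc_cols and para_col are hypothetical column indexes into the
    whitespace-separated row; adjust them to the real metadata layout.
    """
    prev_key = None
    prev_para = None
    for i, row in enumerate(rows):
        fields = row.split()
        key = tuple(fields[c] for c in doc_cols)
        para = int(fields[para_col])
        # Same condition as in the error message: previous > current.
        if key == prev_key and prev_para > para:
            return (i, key, prev_para, para)
        prev_key, prev_para = key, para
    return None
```

Running this over the filtered metadata lines would point at the first out-of-order row, which could help tell whether the input file itself is unsorted or whether something in the pipeline reordered it.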

lzl-mt, Sep 05 '23 07:09