seamless_communication Fail downloading Seamless align data

When I follow https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md and try to download the dataset with `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines`, it raises an error and no wav is saved. How can I fix it? Thanks ;)

Aug 30 '23 10:08 lzl-mt

I tried again but still get the same error, and nothing is saved. It took almost 2 days.

what(): /home/ubuntu/preprocess/preprocess/wet_lines_main.cc:71 in void Retrieve::Add(util::StringPiece, const Extract&) threw util::Exception because !extracts.empty() && extracts.back().paragraph_number > extract.paragraph_number. Metadata should be sorted by paragraph number in each document

Sep 05 '23 07:09 lzl-mt

Hi @lzl-mt, the wet_lines tool rebuilds only the text part, not the audio part. To rebuild the audio, refer to the section "For Audio, the columns correspond to:" in the doc. You can request the public URL directly, convert the audio to 16 kHz, and extract the segments.
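As a rough illustration of the audio step described above, here is a minimal Python sketch that turns a start/end frame pair into an ffmpeg command line. It assumes the frame numbers are sample indices in the audio after resampling to 16 kHz; the file names and frame values are hypothetical, and the helper names (`frames_to_seconds`, `ffmpeg_segment_command`) are made up for this example, not part of seamless_communication.

```python
SAMPLE_RATE = 16_000  # assumption: frame indices refer to 16 kHz audio


def frames_to_seconds(frame, sample_rate=SAMPLE_RATE):
    """Convert a sample index into a time offset in seconds."""
    return frame / sample_rate


def ffmpeg_segment_command(src, dst, start_frame, end_frame, sample_rate=SAMPLE_RATE):
    """Build (but do not run) an ffmpeg command that resamples and cuts one segment."""
    start = frames_to_seconds(start_frame, sample_rate)
    end = frames_to_seconds(end_frame, sample_rate)
    return [
        "ffmpeg", "-i", src,
        "-ar", str(sample_rate),  # resample the output to 16 kHz
        "-ss", f"{start:.6f}",    # segment start, in seconds
        "-to", f"{end:.6f}",      # segment end, in seconds
        dst,
    ]


# Hypothetical values, for illustration only.
cmd = ffmpeg_segment_command("downloaded.mp3", "segment.wav", 16_000, 48_000)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the audio file has been downloaded.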

Regarding the error you get with wet_lines, it looks like some information is missing from the documentation: the metadata should be sorted by cc_lineno. It would also be more efficient to sort by CommonCrawl reference. Could you please try `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines`?

Regards, Onur

Sep 07 '23 07:09 Celebio

@Celebio Thanks for your reply. I tried `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines`, but got the same error again:

terminate called after throwing an instance of 'util::Exception' what(): /home/ubuntu/preprocess/preprocess/wet_lines_main.cc:71 in void Retrieve::Add(util::StringPiece, const Extract&) threw util::Exception because !extracts.empty() && extracts.back().paragraph_number > extract.paragraph_number. Metadata should be sorted by paragraph number in each document

Sep 08 '23 05:09 lzl-mt

I still have another question. After I split the audio according to the start frame and end frame, I found that the corresponding translated text is only part of the corresponding output text (the output of wet_lines). Is there any way to match the segmented audio with the corresponding segmented text processed by wet_lines? Thanks a lot!

Sep 08 '23 12:09 lzl-mt

> hi @lzl-mt , The wet_lines tool will rebuild only the text part, not the audio part. To rebuild the audio, you can refer to the doc, the section "For Audio, the columns correspond to:". You can directly request the public url, convert to 16KHz and get the segments.
>
> Regarding the error you get with wet_lines, it looks like there is an information missing in the document. One should sort the metadata by cc_lineno. Also, it would be more efficient to sort by CommonCrawl reference. Could you please try zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines ?
>
> Regards, Onur

Maybe I should use sort -n?

Sep 12 '23 07:09 lzl-mt

> Maybe i should use sort -n ?

For the 4th column, yes: you can use a numeric key, e.g. `-k4,4n`.
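To make the ordering concrete, here is a small Python sketch of the sort being discussed: group rows by the first column (the CommonCrawl reference) and order them numerically by the fourth column (cc_lineno). The sample rows are hypothetical and shortened to four columns. Two details of `sort` are worth noting: a bare `-k1` key runs from field 1 to the end of the line, and without `n` the comparison is textual, so "10" sorts before "2". The shell equivalent of this sketch would be `sort -k1,1 -k4,4n`.

```python
def sort_metadata(lines):
    """Sort space-separated metadata rows by column 1, then numerically by column 4."""
    def key(line):
        fields = line.split()
        return (fields[0], int(fields[3]))  # (CommonCrawl reference, cc_lineno)
    return sorted(lines, key=key)


# Hypothetical rows: <cc_warc> <cc_sha> <cc_document_url> <cc_lineno>
rows = [
    "crawl-data/B sha1:X http://b.example 10",
    "crawl-data/A sha1:Y http://a.example 9",
    "crawl-data/A sha1:Y http://a.example 10",
    "crawl-data/A sha1:Y http://a.example 2",
]

for row in sort_metadata(rows):
    print(row)
```

A purely textual sort of the same rows would place line 10 before line 2 within `crawl-data/A`, which is the kind of ordering the wet_lines error message complains about.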

Sep 12 '23 09:09 Celebio

> translated text was only part of the corresponding output text (the output corresponding to wet_lines).

You need to extract the sentence from the paragraph by using the sentence_cleaner_splitter

Sep 12 '23 09:09 Celebio

> translated text was only part of the corresponding output text (the output corresponding to wet_lines).
>
> You need to extract the sentence from the paragraph by using the sentence_cleaner_splitter

I used the method from this issue (https://github.com/facebookresearch/seamless_communication/issues/147), but the audio and the text do not match.

E.g., for the metadata file seamless.dataset.metadata.public.eng-zhA.tsv, I use wget to download the audio, e.g. this MP3: http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3

Then I use ffmpeg to cut the wav from the start frame to the end frame given in this row:

CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 **3624960** **3705312** 0 0 0.0 1.1922385 eng-zhA zhA 866761

And I use wet_lines to get the text:

zcat seamless.dataset.metadata.public.eng-zhA.tsv.gz | grep -A 1 66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines | python3 -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"

It returns:

crawl-data/CC-MAIN-2019-13/segments/1552912202672.57/wet/CC-MAIN-20190322155929-20190322181929-00227.warc.wet.gz sha1:66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD http://bazaizhongguo.blogspot.com/2007/06/beorn-is-not-xiao-xiongmao-either.html 0 10105196188100455272 10825353432255915926 0.99567 1.1922385 eng-zhA eng 866761 Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either… Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either...

Finally, I chose "Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either..." as the label of the segmented audio, but they don't match at all. Is there something wrong with the way I handle the data?

Sep 13 '23 03:09 lzl-mt

The hosted audio may have changed in the meantime. We plan to provide additional metadata with the duration we expect the mp3 file to be.

Sep 21 '23 14:09 Celebio

> The hosted audio may have changed in the meantime. We plan to provide additional metadata with the duration we expect the mp3 file to be.

How should I use paragraph_digest in my script? So far I have split the wav by the start_frame/end_frame given in the corresponding metadata file.

Sep 27 '23 10:09 lzl-mt

hi @lzl-mt , we updated the metadata with the duration information, along with the documentation.

In particular, the main change is:

- `paragraph_digest`: expected duration of the whole audio file (without start/end frame trimming)

So, when you split the metadata, one of the columns that used to be empty or 0 should now contain the expected duration.

You can find the updated metadata here.
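One way the expected duration could be used, sketched in Python: compare it with the duration of the file you actually downloaded and skip rows where the two disagree, since that suggests the hosted audio has changed. The function name `duration_matches` and the 1% tolerance are assumptions made for this illustration. The unit of the metadata value is not stated in this thread, so both arguments are assumed to have been converted to the same unit beforehand.

```python
def duration_matches(expected, actual, rel_tol=0.01):
    """Return True if the actual duration is within rel_tol of the expected one.

    Both values must be in the same unit (assumption: the caller converts them).
    """
    if expected <= 0:
        # Older metadata rows carry 0 or an empty value here: nothing to check against.
        return False
    return abs(actual - expected) <= rel_tol * expected


# Hypothetical durations, for illustration only.
print(duration_matches(100.0, 100.5))
print(duration_matches(100.0, 120.0))
```

Rows that fail the check are candidates for being dropped before cutting segments with start_frame/end_frame.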

Sep 27 '23 12:09 Celebio
