Fail downloading Seamless align data

Open lzl-mt opened this issue 1 year ago • 11 comments

When I follow https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md and try to download the dataset with `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines`, it raises an error [image] and no wav is saved. How can I fix it? Thanks ;)

Aug 30 '23 10:08 lzl-mt

I tried again but still got the same error, and nothing was saved. This cost almost 2 days:

what(): /home/ubuntu/preprocess/preprocess/wet_lines_main.cc:71 in void Retrieve::Add(util::StringPiece, const Extract&) threw util::Exception because `!extracts.empty() && extracts.back().paragraph_number > extract.paragraph_number'. Metadata should be sorted by paragraph number in each document

Sep 05 '23 07:09 lzl-mt

Hi @lzl-mt, the `wet_lines` tool rebuilds only the text part, not the audio part. To rebuild the audio, refer to the section "For Audio, the columns correspond to:" in the doc. You can request the public URL directly, convert the audio to 16 kHz, and extract the segments.
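
The audio steps described above (download, convert to 16 kHz, cut the segment) could be sketched roughly as follows. This is only an illustration: it assumes the start and end frames count samples at 16 kHz, the mono downmix (`-ac 1`) is an added assumption, and `build_ffmpeg_cut` is a hypothetical helper, not part of seamless_communication. The file name and frame values are taken from the eng-zhA example later in this thread.

```python
import shlex

SAMPLE_RATE = 16_000  # assumption: frame offsets count samples at 16 kHz


def frames_to_seconds(frame, sample_rate=SAMPLE_RATE):
    """Convert a frame (sample) offset to seconds."""
    return frame / sample_rate


def build_ffmpeg_cut(src, dst, start_frame, end_frame, sample_rate=SAMPLE_RATE):
    """Build an ffmpeg command that resamples src and keeps one segment."""
    if end_frame <= start_frame:
        raise ValueError("end_frame must be greater than start_frame")
    start = frames_to_seconds(start_frame, sample_rate)
    duration = frames_to_seconds(end_frame - start_frame, sample_rate)
    return [
        "ffmpeg", "-i", src,
        "-ar", str(sample_rate),   # resample to 16 kHz
        "-ac", "1",                # downmix to mono (assumption)
        "-ss", f"{start:.4f}",     # segment start, in seconds
        "-t", f"{duration:.4f}",   # segment length, in seconds
        dst,
    ]


# The command is only printed here; pass the list to subprocess.run to execute it.
cmd = build_ffmpeg_cut("8.mp3", "segment.wav", 3624960, 3705312)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the computed offsets before cutting a whole dataset.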

Regarding the error you get with `wet_lines`, it looks like some information is missing from the documentation: the metadata should be sorted by cc_lineno. It would also be more efficient to sort by CommonCrawl reference. Could you please try `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines`?

Regards, Onur

Sep 07 '23 07:09 Celebio

@Celebio Thanks for your reply. I tried `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines`, but got the same error again:

terminate called after throwing an instance of 'util::Exception' what(): /home/ubuntu/preprocess/preprocess/wet_lines_main.cc:71 in void Retrieve::Add(util::StringPiece, const Extract&) threw util::Exception because `!extracts.empty() && extracts.back().paragraph_number > extract.paragraph_number'. Metadata should be sorted by paragraph number in each document

Sep 08 '23 05:09 lzl-mt

I have another question. After I split the audio according to the start frame and end frame, I found that the corresponding translated text was only part of the corresponding output text (the output of `wet_lines`). Is there any way to match the segmented audio with the corresponding segmented text produced by `wet_lines`? Thanks a lot!

Sep 08 '23 12:09 lzl-mt

> Hi @lzl-mt, the `wet_lines` tool rebuilds only the text part, not the audio part. To rebuild the audio, refer to the section "For Audio, the columns correspond to:" in the doc. You can request the public URL directly, convert the audio to 16 kHz, and extract the segments.
>
> Regarding the error you get with `wet_lines`, it looks like some information is missing from the documentation: the metadata should be sorted by cc_lineno. It would also be more efficient to sort by CommonCrawl reference. Could you please try `zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines`?
>
> Regards, Onur

Maybe I should use `sort -n`?

Sep 12 '23 07:09 lzl-mt

> Maybe I should use `sort -n`?

For the 4th column, yes: you can use `-k4n`.
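
To illustrate why the numeric flag matters, here is a minimal Python sketch of the intended ordering: first by CommonCrawl reference (column 1), then numerically by the 4th column. The rows and the helper name `sort_metadata_lines` are hypothetical; it assumes the lines are already whitespace-separated, as they are after the `tr` step.

```python
def sort_metadata_lines(lines):
    """Sort by column 1 (CommonCrawl reference), then numerically by column 4."""
    def key(line):
        fields = line.split()
        return (fields[0], int(fields[3]))

    # Blank lines carry no metadata, so they are dropped before sorting.
    return sorted((line for line in lines if line.strip()), key=key)


# Hypothetical rows from one WET file, with paragraph numbers 10, 9 and 2.
rows = [
    "crawl-data/a.warc.wet.gz sha1:AAA http://example.com/x 10 rest",
    "crawl-data/a.warc.wet.gz sha1:AAA http://example.com/x 9 rest",
    "crawl-data/a.warc.wet.gz sha1:AAA http://example.com/x 2 rest",
]

numeric_order = [line.split()[3] for line in sort_metadata_lines(rows)]
lexical_order = [line.split()[3] for line in sorted(rows)]
print(numeric_order)  # paragraph numbers in numeric order
print(lexical_order)  # plain string sort puts "10" before "2"
```

A plain string comparison places "10" before "2", which is the kind of ordering that triggers the "Metadata should be sorted by paragraph number" check.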

Sep 12 '23 09:09 Celebio

> translated text was only part of the corresponding output text (the output corresponding to wet_lines).

You need to extract the sentence from the paragraph by using the `sentence_cleaner_splitter`.

Sep 12 '23 09:09 Celebio

> translated text was only part of the corresponding output text (the output corresponding to wet_lines).
>
> You need to extract the sentence from the paragraph by using the sentence_cleaner_splitter

I used the method from this issue (https://github.com/facebookresearch/seamless_communication/issues/147), but the audio and text do not match.

For example, take the metadata file seamless.dataset.metadata.public.eng-zhA.tsv. I use wget to download the audio, e.g. this MP3: http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3. Then I use ffmpeg to cut the wav from the start frame to the end frame given in this line:

CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 **3624960** **3705312** 0 0 0.0 1.1922385 eng-zhA zhA 866761

Then I use `wet_lines` to get the text:

`zcat seamless.dataset.metadata.public.eng-zhA.tsv.gz | grep -A 1 66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines | python3 -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"`

It returns:

crawl-data/CC-MAIN-2019-13/segments/1552912202672.57/wet/CC-MAIN-20190322155929-20190322181929-00227.warc.wet.gz sha1:66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD http://bazaizhongguo.blogspot.com/2007/06/beorn-is-not-xiao-xiongmao-either.html 0 10105196188100455272 10825353432255915926 0.99567 1.1922385 eng-zhA eng 866761 Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either… Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either...

Finally, I chose "Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either..." as the label of the segmented audio, but they don't match at all. Is there something wrong with the way I handle the data?
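
In the two rows quoted above, the audio row and the text row both carry 866761 in the 11th column, with the side (zhA or eng) in the 10th. Assuming that column is a shared alignment id, which this thread does not confirm, rows could be paired with a sketch like the one below. The column indices and the helper name `pair_rows` are assumptions made for illustration.

```python
from collections import defaultdict

PAIR_ID_COL = 10  # assumption: 11th column holds the shared alignment id
SIDE_COL = 9      # assumption: 10th column holds the side ("zhA", "eng", ...)


def pair_rows(lines):
    """Group whitespace-separated metadata rows by alignment id, then by side."""
    pairs = defaultdict(dict)
    for line in lines:
        # Split only the leading columns so trailing text stays in one piece.
        fields = line.split(maxsplit=PAIR_ID_COL + 1)
        if len(fields) <= PAIR_ID_COL:
            continue  # not a full metadata row
        pairs[fields[PAIR_ID_COL]][fields[SIDE_COL]] = fields
    return dict(pairs)


# The two rows quoted in this comment (text shortened after the id column).
audio_row = (
    "CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 "
    "3624960 3705312 0 0 0.0 1.1922385 eng-zhA zhA 866761"
)
text_row = (
    "crawl-data/CC-MAIN-2019-13/segments/1552912202672.57/wet/"
    "CC-MAIN-20190322155929-20190322181929-00227.warc.wet.gz "
    "sha1:66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD "
    "http://bazaizhongguo.blogspot.com/2007/06/beorn-is-not-xiao-xiongmao-either.html "
    "0 10105196188100455272 10825353432255915926 0.99567 1.1922385 eng-zhA eng 866761 "
    "Postcards from the Middle Kingdom"
)

pairs = pair_rows([audio_row, text_row])
print(sorted(pairs["866761"]))
```

Pairing this way only links metadata rows. Whether the audio content still matches depends on the hosted file being unchanged.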

Sep 13 '23 03:09 lzl-mt

The hosted audio may have changed in the meantime. We plan to provide additional metadata with the expected duration of the mp3 file.

Sep 21 '23 14:09 Celebio

> The hosted audio may have changed in the meantime. We plan to provide additional metadata with the duration we expect the mp3 file to be.

How should I use paragraph_digest in my script? Until now I have split the wav by start_frame/end_frame from the corresponding metadata file.

Sep 27 '23 10:09 lzl-mt

Hi @lzl-mt, we updated the metadata with the duration information, along with the documentation.

In particular, the main change is:

- `paragraph_digest`: expected duration of the whole audio file (without start/end frame trimming)

So, when you split the metadata, one of the columns that used to be empty or 0 should now contain the expected duration.

You can find the updated metadata here.
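
One way the expected duration might be used is as a sanity check before cutting segments: if the downloaded file's length differs from the value in the metadata, the hosted audio has probably changed. The sketch below assumes the file was already converted to wav and that the expected value is expressed in seconds. The unit is not stated in this thread, so check the documentation before relying on it. The helper names and the tolerance are hypothetical.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_matches(actual, expected, tolerance=0.5):
    """True if both durations (same unit) differ by at most `tolerance`."""
    return abs(actual - expected) <= tolerance


# Demo: write one second of 16 kHz mono silence, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 16-bit samples
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

actual = wav_duration_seconds(demo_path)
print(duration_matches(actual, expected=1.0))   # durations agree
print(duration_matches(actual, expected=30.0))  # file likely changed
```

Rows whose file fails the check can be skipped rather than cut at offsets that no longer point to the right audio.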

Sep 27 '23 12:09 Celebio