Failure downloading Seamless align data
When I follow https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/seamless_align_README.md and try to download the dataset with
zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines
it raises the error below and no wav is saved:
what(): /home/ubuntu/preprocess/preprocess/wet_lines_main.cc:71 in void Retrieve::Add(util::StringPiece, const Extract&) threw util::Exception because `!extracts.empty() && extracts.back().paragraph_number > extract.paragraph_number'. Metadata should be sorted by paragraph number in each document
How can I fix it? Thanks!
I tried again but still get the same error and nothing is saved; it has already cost almost two days.
hi @lzl-mt ,
The wet_lines tool will rebuild only the text part, not the audio part. To rebuild the audio, you can refer to the doc, the section "For Audio, the columns correspond to:". You can directly request the public URL, convert to 16 kHz and get the segments.
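A minimal sketch of that audio step in shell, for anyone following along (the URL and frame values are placeholders, and treating start_frame/end_frame as sample indices at 16 kHz is an assumption here, not something stated in this reply):
url="https://example.org/some_audio.mp3"   # public audio URL column of the metadata (placeholder)
start_frame=0; end_frame=80000             # start/end frame columns (placeholder values)
wget -O audio.mp3 "$url"
# convert to 16 kHz mono wav as suggested above
ffmpeg -i audio.mp3 -ar 16000 -ac 1 audio_16k.wav
# cut one segment; seconds = frame / 16000 if the frames are 16 kHz sample indices
ffmpeg -i audio_16k.wav -ss "$(bc -l <<< "$start_frame/16000")" -to "$(bc -l <<< "$end_frame/16000")" segment.wav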
Regarding the error you get with wet_lines, it looks like some information is missing in the document. One should sort the metadata by cc_lineno. Also, it would be more efficient to sort by CommonCrawl reference.
Could you please try the following?
zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines
Regards, Onur
@Celebio Thanks for your reply. I tried
zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1 -k4 | build/bin/wet_lines
but again got:
terminate called after throwing an instance of 'util::Exception' what(): /home/ubuntu/preprocess/preprocess/wet_lines_main.cc:71 in void Retrieve::Add(util::StringPiece, const Extract&) threw util::Exception because `!extracts.empty() && extracts.back().paragraph_number > extract.paragraph_number'. Metadata should be sorted by paragraph number in each document
I still have another question. After I split the audio according to the start frame and end frame, I found that the corresponding translated text was only part of the output text produced by wet_lines. Is there any way to match the segmented audio with the corresponding segmented text processed by wet_lines?
Thanks a lot!
Maybe I should use sort -n?
For the 4th column, yes, you can use -k4n.
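A sketch of the combined pipeline with that numeric sort (note the explicit field ranges: with a bare -k1 the first key spans the whole line, so a later numeric key is effectively never consulted; -k1,1 -k4,4n restricts each key to a single field):
zcat seamless.dataset.metadata.public.arb-enA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | sort -k1,1 -k4,4n | build/bin/wet_lines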
translated text was only part of the corresponding output text (the output corresponding to wet_lines).
You need to extract the sentence from the paragraph by using the sentence_cleaner_splitter
I used the method from this issue [https://github.com/facebookresearch/seamless_communication/issues/147], but there was a problem: the audio and text could not be matched...
E.g., take the seamless.dataset.metadata.public.eng-zhA.tsv metadata file.
I use wget to download the audio, e.g., this MP3:
http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3
Then I use ffmpeg to cut the wav from start frame to end frame:
CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 **3624960** **3705312** 0 0 0.0 1.1922385 eng-zhA zhA 866761
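For concreteness, the cut for that row would look roughly like this (assuming the two highlighted numbers are start/end frames as 16 kHz sample indices, i.e. 3624960/16000 = 226.560 s and 3705312/16000 = 231.582 s, about 5.02 s of audio):
wget -O 8.mp3 "http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3"
ffmpeg -i 8.mp3 -ar 16000 -ac 1 8_16k.wav
ffmpeg -i 8_16k.wav -ss 226.560 -to 231.582 segment.wav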
And I use wet_lines to get the text with:
zcat seamless.dataset.metadata.public.eng-zhA.tsv.gz | grep -A 1 66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines | python3 -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"
It returns
crawl-data/CC-MAIN-2019-13/segments/1552912202672.57/wet/CC-MAIN-20190322155929-20190322181929-00227.warc.wet.gz sha1:66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD http://bazaizhongguo.blogspot.com/2007/06/beorn-is-not-xiao-xiongmao-either.html 0 10105196188100455272 10825353432255915926 0.99567 1.1922385 eng-zhA eng 866761 Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either… Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either...
Finally, I choose "Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either..." as the label of the segmented audio, but they don't match at all.
Is there something wrong with the way I handle the data?
The hosted audio may have changed in the meantime. We plan to provide additional metadata with the expected duration of the mp3 file.
How should I use paragraph_digest in my script? I used to split the wav by start_frame/end_frame from the corresponding meta file.
hi @lzl-mt , we updated the metadata with the duration information, along with the documentation.
In particular, the main change is:
- `paragraph_digest`: expected duration of the whole audio file (without start/end frame trimming)
So, when you split the metadata, one of the columns that used to be empty or 0 should now contain the expected duration.
You can find the updated metadata here.
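If it helps, one way to use that expected duration is to sanity-check the file currently hosted at the URL before trusting the start/end frames (a sketch; ffprobe just prints the duration in seconds, and which column of the updated metadata holds the expected value is whatever the updated doc specifies):
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 8.mp3
# compare the printed duration with the expected one from the metadata;
# a large mismatch suggests the hosted audio changed since the crawl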