seamless_communication
Processed text and audio do not match in Seamless_align_data
Hi, I found mismatches between some audio segments and their translated text (the label for that audio), and I want to know whether I am processing the data incorrectly.
For example, take this metadata file: seamless.dataset.metadata.public.eng-zhA.tsv
I use wget to download the audio, e.g., this MP3:
http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3
Then I use ffmpeg to cut a WAV segment from the start frame to the end frame given in this metadata row:
CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 3624960 3705312 0 0 0.0 1.1922385 eng-zhA zhA 866761
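For reference, here is a minimal sketch of the cutting step as I understand it. It assumes the third and fourth columns of the row are start and end sample indices at a 16 kHz sample rate; that is an assumption about the metadata format, not something confirmed here. The helper names `parse_audio_row` and `ffmpeg_cut_command` and the file names `8.mp3` and `segment.wav` are made up for illustration.

```python
import shlex

# Assumption: start/end offsets are sample indices counted at 16 kHz.
SAMPLE_RATE = 16_000


def parse_audio_row(row):
    """Pull the URL and the start/end offsets out of one audio metadata row."""
    fields = row.split()
    return {"url": fields[1], "start": int(fields[2]), "end": int(fields[3])}


def ffmpeg_cut_command(src, dst, start, end, sample_rate=SAMPLE_RATE):
    """Build an ffmpeg command that cuts [start, end) samples out of src."""
    start_s = start / sample_rate
    duration_s = (end - start) / sample_rate
    return [
        "ffmpeg", "-i", src,
        "-ss", f"{start_s:.6f}",    # seek to the segment start, in seconds
        "-t", f"{duration_s:.6f}",  # keep only the segment duration
        "-ar", str(sample_rate),    # resample the output to the assumed rate
        dst,
    ]


row = (
    "CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 "
    "3624960 3705312 0 0 0.0 1.1922385 eng-zhA zhA 866761"
)
seg = parse_audio_row(row)
cmd = ffmpeg_cut_command("8.mp3", "segment.wav", seg["start"], seg["end"])
print(shlex.join(cmd))
```

If the offsets are counted at a different rate, or against the file's native sample rate, the computed times shift and the cut audio will not line up with the text, so this conversion is worth double-checking first.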
Then I use wet_lines to get the text with:
zcat seamless.dataset.metadata.public.eng-zhA.tsv.gz | grep -A 1 66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines | python3 -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"
It returns:
crawl-data/CC-MAIN-2019-13/segments/1552912202672.57/wet/CC-MAIN-20190322155929-20190322181929-00227.warc.wet.gz sha1:66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD http://bazaizhongguo.blogspot.com/2007/06/beorn-is-not-xiao-xiongmao-either.html 0 10105196188100455272 10825353432255915926 0.99567 1.1922385 eng-zhA eng 866761 Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either… Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either...
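As a sanity check on the pairing itself, the sketch below compares the trailing columns of the audio row and the text row. It assumes that the 8th to 11th whitespace-separated fields are the alignment score, the language pair, the language of this side, and a shared pair ID; that reading of the columns is my guess from the two rows shown here. The helper name `alignment_key` is made up for illustration.

```python
def alignment_key(row):
    """Return the columns that (by assumption) identify an aligned pair."""
    # The first 11 fields contain no spaces; anything after them is sentence text.
    fields = row.split(maxsplit=11)
    return {
        "score": fields[7],
        "direction": fields[8],
        "lang": fields[9],
        "pair_id": fields[10],
    }


audio_row = (
    "CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 "
    "3624960 3705312 0 0 0.0 1.1922385 eng-zhA zhA 866761"
)
text_row = (
    "crawl-data/CC-MAIN-2019-13/segments/1552912202672.57/wet/"
    "CC-MAIN-20190322155929-20190322181929-00227.warc.wet.gz "
    "sha1:66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD "
    "http://bazaizhongguo.blogspot.com/2007/06/beorn-is-not-xiao-xiongmao-either.html "
    "0 10105196188100455272 10825353432255915926 0.99567 1.1922385 eng-zhA eng 866761 "
    "Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either..."
)

audio_key = alignment_key(audio_row)
text_key = alignment_key(text_row)

# Both sides should carry the same pair ID and score if they belong together.
same_pair = (
    audio_key["pair_id"] == text_key["pair_id"]
    and audio_key["score"] == text_key["score"]
)
print(same_pair, audio_key["lang"], text_key["lang"])
```

For the two rows above the pair ID and score agree, so the rows do appear to be the two sides of one alignment entry. That suggests the mismatch comes either from how the audio is cut or from the alignment itself.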
Finally, I take "Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either..." (the translated text) as the label for the segmented audio, but the two don't match at all.