seamless_communication

Processed text and audio do not match in Seamless_align_data

Open lzl-mt opened this issue 1 year ago • 0 comments

Hi, I found some audio segments that do not match their translated text (the label for the audio), and I want to know whether I am processing the data incorrectly.

For example, take the metadata file seamless.dataset.metadata.public.eng-zhA.tsv. I use wget to download the audio, e.g. this MP3: http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3

Then I use ffmpeg to cut the WAV from the start frame to the end frame given in this row:

CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 3624960 3705312 0 0 0.0 1.1922385 eng-zhA zhA 866761

I then use wet_lines to get the text:

zcat seamless.dataset.metadata.public.eng-zhA.tsv.gz | grep -A 1 66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD | egrep ^crawl-data | tr '\t' ' ' | build/bin/wet_lines | python3 -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"

It returns:

crawl-data/CC-MAIN-2019-13/segments/1552912202672.57/wet/CC-MAIN-20190322155929-20190322181929-00227.warc.wet.gz sha1:66BUOO6LJ5CT4YOWUMAIERUHGBYIYOPD http://bazaizhongguo.blogspot.com/2007/06/beorn-is-not-xiao-xiongmao-either.html 0 10105196188100455272 10825353432255915926 0.99567 1.1922385 eng-zhA eng 866761 Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either… Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either...

Finally, I take the translated text "Postcards from the Middle Kingdom: Beorn is not a xiao xiongmao, either..." as the label of the segmented audio, but the two do not match at all.
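To make the cutting step concrete, here is a minimal Python sketch of how the start and end frames in the audio row could be turned into ffmpeg timestamps. It rests on assumptions that are not confirmed in this issue: that the frame offsets count samples at 16 kHz, that the row fields are whitespace-separated in the order shown above (URL second, start frame third, end frame fourth), and that the local file names are placeholders. The helper names `parse_audio_row` and `ffmpeg_cut_command` are hypothetical, not part of seamless_communication.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: frame offsets in the metadata are sample indices at 16 kHz.
SAMPLE_RATE_HZ = 16_000

# The audio row quoted in this issue (spaces instead of tabs).
ROW = (
    "CC-MAIN-2022-21:2962 http://audio2.abiblica.org/bibles/app/audio/4/24/8.mp3 "
    "3624960 3705312 0 0 0.0 1.1922385 eng-zhA zhA 866761"
)


@dataclass
class AudioSegment:
    url: str
    start_frame: int
    end_frame: int

    @property
    def start_sec(self) -> float:
        return self.start_frame / SAMPLE_RATE_HZ

    @property
    def end_sec(self) -> float:
        return self.end_frame / SAMPLE_RATE_HZ


def parse_audio_row(row: str) -> AudioSegment:
    """Extract URL and frame offsets, assuming the field order shown in the issue."""
    fields = row.split()
    return AudioSegment(url=fields[1], start_frame=int(fields[2]), end_frame=int(fields[3]))


def ffmpeg_cut_command(seg: AudioSegment, src_path: str, dst_path: str) -> list[str]:
    """Build (but do not run) an ffmpeg command that cuts the segment by time.

    -ss/-to select the time range; -ar/-ac resample the output to 16 kHz mono.
    """
    return [
        "ffmpeg", "-i", src_path,
        "-ss", f"{seg.start_sec:.3f}",
        "-to", f"{seg.end_sec:.3f}",
        "-ar", str(SAMPLE_RATE_HZ), "-ac", "1",
        dst_path,
    ]


segment = parse_audio_row(ROW)
# "8.mp3" and "segment.wav" are placeholder local paths.
print(" ".join(ffmpeg_cut_command(segment, "8.mp3", "segment.wav")))
```

Under the 16 kHz assumption, the row above corresponds to a cut of roughly five seconds starting about 226.56 s into the file. If the frames were instead interpreted at the MP3's native sample rate, the cut would land at a different position, which is one thing worth ruling out when the audio and text disagree.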

lzl-mt, Sep 12 '23 08:09