Error while preparing CommonVoice
Hi, I manually downloaded the latest CommonVoice dataset for Dutch (cv-corpus-9.0-2022-04-27-nl) and then tried to prepare the manifests with lhotse using the following command:
lhotse prepare commonvoice -l nl ./cv-corpus-9.0-2022-04-27 ./manifests
But I get the following error for every entry in the dataset, and the final manifest files are empty:
2022-06-09 16:08:02,713 ERROR [commonvoice.py:247] Error when processing TSV file: line no. 10551: 'client_id d09ab3464067c2c3d93c079cbe442fa0f029f8e391d382...
path common_voice_nl_26946122.mp3
sentence Ik hoop, beste collega's, dat er morgen nieman...
up_votes 2
down_votes 0
age twenties
gender male
accents Nederlands Nederlands
locale nl
segment NaN
Name: 10551, dtype: object'.
Original error type: '<class 'AttributeError'>' and message: 'Series' object has no attribute 'accent'
I used a fresh installation of lhotse, set up with the following commands in a new venv, so it should be easy to reproduce:
pip install git+https://github.com/lhotse-speech/lhotse
pip install pandas
Is it possible that
https://github.com/lhotse-speech/lhotse/blob/664a594d872630af312c355b81a270ea00a362e9/lhotse/recipes/commonvoice.py#L28
is different from the latest one, cv-corpus-9.0-2022-04-27-nl?
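One quick way to check a suspicion like this is to compare the header of a release's TSV file with the columns the code tries to read. The sketch below uses only the Python standard library. The helper names `tsv_columns` and `missing_columns` are illustrative and not part of lhotse, and the sample header is assembled from the field names shown in the error message above.

```python
import csv
import io


def tsv_columns(handle):
    """Return the column names from the first row of a tab-separated file."""
    return next(csv.reader(handle, delimiter="\t"))


def missing_columns(handle, expected):
    """Return the expected column names that are absent from the TSV header."""
    present = set(tsv_columns(handle))
    return [name for name in expected if name not in present]


# Header fields as they appear in the error message above (CV 9.0).
sample = io.StringIO(
    "client_id\tpath\tsentence\tup_votes\tdown_votes\t"
    "age\tgender\taccents\tlocale\tsegment\n"
)

# "accent" is the attribute the AttributeError reports as missing.
print(missing_columns(sample, ["path", "sentence", "accent"]))
```

With a real release, an open file handle for one of its TSV files can be passed in place of `sample`.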
@csukuangfj
I noticed that the accent column in cv-corpus-5.1-2020-06-22 contains just single-word values, but in cv-corpus-9.0-2022-04-27-nl they are two words separated by a space, like: Nederlands Nederlands or Frans Nederlands.
Could that be the reason?
It seems they renamed the accent column to accents in the new release. It should be fairly easy to fix: either we assume we no longer support the older CV versions, or we add some logic to resolve the version and parse the right column. Do you mind submitting a PR?
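As a rough illustration of the second option, the sketch below resolves the column per row instead of by release version. It assumes rows are available as plain dicts (here via `csv.DictReader`), whereas the recipe itself works with pandas rows. The helper `get_accent` is hypothetical and is not the change that went into lhotse, and the old-style sample row is made up.

```python
import csv
import io


def get_accent(row):
    """Return the accent value from a CommonVoice row, whichever column name is used.

    Newer releases use "accents"; older ones use "accent". Returns None when
    neither column is present or the value is empty.
    """
    for column in ("accents", "accent"):
        value = row.get(column)
        if value:
            return value
    return None


# Made-up row in the older layout (single "accent" column).
old_release = io.StringIO("path\taccent\nclip_a.mp3\texample_accent\n")
# Row in the newer layout, using values from the error message above.
new_release = io.StringIO(
    "path\taccents\ncommon_voice_nl_26946122.mp3\tNederlands Nederlands\n"
)

for handle in (old_release, new_release):
    for row in csv.DictReader(handle, delimiter="\t"):
        print(row["path"], "->", get_accent(row))
```

Checking for the column directly avoids having to map every release name to a schema, at the cost of silently returning None if a future release renames the column again.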
Of course not. Please see PR #743.
Fixed in https://github.com/lhotse-speech/lhotse/pull/743.