Error while preparing CommonVoice

Open mohsen-goodarzi opened this issue 2 years ago • 4 comments

Hi, I manually downloaded the latest CommonVoice dataset for Dutch (cv-corpus-9.0-2022-04-27-nl) and then tried to prepare the manifests with lhotse using the following command:

lhotse prepare commonvoice -l nl ./cv-corpus-9.0-2022-04-27 ./manifests

But I get the following error for every entry in the dataset, and the final manifest files are empty:

2022-06-09 16:08:02,713 ERROR [commonvoice.py:247] Error when processing TSV file: line no. 10551: 'client_id     d09ab3464067c2c3d93c079cbe442fa0f029f8e391d382...
path                               common_voice_nl_26946122.mp3
sentence      Ik hoop, beste collega's, dat er morgen nieman...
up_votes                                                      2
down_votes                                                    0
age                                                    twenties
gender                                                     male
accents                                   Nederlands Nederlands
locale                                                       nl
segment                                                     NaN
Name: 10551, dtype: object'.
Original error type: '<class 'AttributeError'>' and message: 'Series' object has no attribute 'accent'

I used a fresh installation of lhotse in a new venv, set up with the following commands, so it should be easy to reproduce:

pip install git+https://github.com/lhotse-speech/lhotse
pip install pandas

mohsen-goodarzi avatar Jun 09 '22 14:06 mohsen-goodarzi

Is it possible that https://github.com/lhotse-speech/lhotse/blob/664a594d872630af312c355b81a270ea00a362e9/lhotse/recipes/commonvoice.py#L28 is different from the latest one, cv-corpus-9.0-2022-04-27-nl?

csukuangfj avatar Jun 09 '22 15:06 csukuangfj

@csukuangfj I noticed that the accent column in cv-corpus-5.1-2020-06-22 contains only single-word values, but in cv-corpus-9.0-2022-04-27-nl the values are two words separated by a space, such as Nederlands Nederlands or Frans Nederlands. Could that be the reason?

mohsen-goodarzi avatar Jun 09 '22 17:06 mohsen-goodarzi

It seems they renamed the accent column to accents in the new release. This should be fairly easy to fix if we assume we don't support the older CV versions; alternatively, we can add some logic to resolve the version and parse the right column. Do you mind submitting a PR?
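
For illustration, the second option could look roughly like the sketch below. It is not the code from the lhotse recipe or from any PR; the helper names resolve_accent_column and get_accent and the toy data frames are hypothetical, and only the two column names (accent, accents) come from this thread. The idea is to detect which column the TSV actually has once per file, then index rows by that name instead of using attribute access, which is what raises the AttributeError above.

```python
from typing import Optional

import pandas as pd


def resolve_accent_column(df: pd.DataFrame) -> str:
    """Return whichever accent column name is present (hypothetical helper)."""
    # Newer releases use "accents"; older ones use "accent".
    for name in ("accents", "accent"):
        if name in df.columns:
            return name
    raise ValueError("TSV has neither an 'accent' nor an 'accents' column")


def get_accent(row: pd.Series, column: str) -> Optional[str]:
    """Read the accent value from a row, mapping missing values to None."""
    value = row[column]
    return None if pd.isna(value) else value


# Toy stand-ins for the two TSV layouts; the values are made up for the example.
old_df = pd.DataFrame({"path": ["a.mp3"], "accent": ["example_accent"]})
new_df = pd.DataFrame({"path": ["b.mp3"], "accents": ["Nederlands Nederlands"]})

for df in (old_df, new_df):
    column = resolve_accent_column(df)
    for _, row in df.iterrows():
        print(row["path"], column, get_accent(row, column))
```

Resolving the column name once per file keeps the per-row code identical for both layouts, so older releases would keep working without any version check.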

pzelasko avatar Jun 09 '22 17:06 pzelasko

Of course not. Please see PR #743.

mohsen-goodarzi avatar Jun 10 '22 08:06 mohsen-goodarzi

Fixed in https://github.com/lhotse-speech/lhotse/pull/743.

desh2608 avatar Sep 22 '22 17:09 desh2608