
Empty string sentence in cv-corpus-5-2020-06-22/en/test.tsv

Open antimora opened this issue 5 years ago • 8 comments

While processing entries from cv-corpus-5-2020-06-22/en/test.tsv, I have discovered an empty string sentence ("") on line #557 referencing common_voice_en_16759015.mp3. This entry also exists in validated.tsv. I haven't checked if there are more of the same type of errors.

antimora avatar Jul 22 '20 14:07 antimora

Strange, I thought the code prevented this, but there it is.

Oh, I think I see what happened. The string is not empty; it contains two quote marks.

It looks like the string originally was something like...

"<b> </b>..."

which contains HTML tags, a non-printable character " " (&#160;), and likely other characters removed by common.py#L69.

After the HTML and other stuff was removed it was then turned into...

""

which, as it contains quotes, makes it through the check at corpus.py#L55, which should remove empty strings.
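To illustrate why a string of two quote marks slips past an emptiness check, here is a minimal sketch. It is not the actual code at corpus.py#L55; both function names are hypothetical, and the set of characters treated as "nothing" is an assumption.

```python
def is_blank(sentence: str) -> bool:
    """Naive check (hypothetical): true only for empty or whitespace-only strings."""
    return sentence.strip() == ""


def is_blank_ignoring_quotes(sentence: str) -> bool:
    """Stricter check (hypothetical): also treats strings made up only of
    double quotes, spaces, tabs, and non-breaking spaces as blank."""
    return sentence.strip('" \t\u00a0') == ""


# The problematic sentence is two quote characters, so the naive check keeps it.
print(is_blank('""'))                              # False
print(is_blank_ignoring_quotes('""'))              # True
# A normally quoted sentence is not flagged, and it is not modified either.
print(is_blank_ignoring_quotes('"Hello there"'))   # False
```

The stricter check only decides whether to drop an entry; it does not strip quotes from sentences that are kept.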

Have you listened to common_voice_en_16759015.mp3? Maybe that will give us more info on what the string originally was and how it got validated!?

kdavis-mozilla avatar Jul 23 '20 15:07 kdavis-mozilla

Just listened to it. It's someone reading back HTML tags.

I guess it was originally something like

"<HTML>one equals<em>"

and common.py#L69 turned that into

""

One nice thing is that's the only occurrence of this problem in the en test set and it doesn't occur in the en dev or train set.

kdavis-mozilla avatar Jul 23 '20 15:07 kdavis-mozilla

I guess maybe the solution is to add a language-specific preprocessor that removes strings that are just ""?

kdavis-mozilla avatar Jul 23 '20 15:07 kdavis-mozilla

Thanks for looking into this.

Couldn't the solution be more general than removing "" (double quotes)?

If we could strip the beginning and ending quotes, then the preprocessor would have caught "empty" string sentences. To me, both of these lines are the same:

"Hello there"
Hello there

I see plenty of instances where transcripts start and end with quotes. Also from a developer's perspective it is confusing to see some transcripts are quoted and others not.

Related to the quote topic, I have noticed many transcripts contain "" (double quotes). Is this an artifact created by HTML cleansing logic similar to what you have described? Could we also collapse double quotes, just as we do with double spaces? I saw specific logic for that in the common.py preprocessor.

Here are examples of double quotes from the test set:


"It was also known as the ""Sunflower""."
"Alston commented that he felt the cartoonist ""might have had some racial intent""."
"Karina Smirnoff of ""Dancing With The Stars"" hosted the following month."
"""The wind told me that you know about love"" the boy said to the sun."
"""Like everybody learns,"" he said."
"""Bambalio"" refers to a tendency to stammer."
"""I can work for the rest of today,"" the boy answered."
"The episode ""Father's Day"" depicts two younger versions of Jackie also played by Coduri."
"""Fatima,"" the girl said, averting her eyes."
"""This desert was once a sea,"" he said."
"It included a new production of ""Passion"" directed by Jamie Lloyd."
"He jumped up and turned quickly to face the imagined terror screaming ""Get back!"""
"You might hear ""font families"" more than ""typefaces"", even though they could mean the same thing."
"""Getting to play someone as unrestricted as a vampire is a thrill,"" she says."
"""Good-bye,"" said the boy."
"Philip then tells his parents that he was suspended for ""singing"" the National Anthem."
"It was later revealed that the letter was a prank concocted by ""The eXile""."
"The launch was flawless; all systems were ""go"", except for Doctor Wang's experiment."
"For centuries after her death, Welshmen cried-out ""Revenge for Gwenllian"" when engaging in battle."
"""And I'm certain you'll find it,"" the alchemist said."
"System B, however, does not depend explicitly on ""t"" so it is time-invariant."
"Beachley narrates the Seven Network factual series ""Beach Cops""."
"""Let's stop this,"" another commander said."
"Anna Austen asked about the acceptation of the word ""alliteration""."
"""Should I understand the Emerald Tablet?"" the boy asked."
"This bridge is unofficially referred to as ""Blackwater Bridge"" by Coalition Forces operating there."
"""This is the first phase of the job,"" he said."

antimora avatar Jul 23 '20 16:07 antimora

In my experience

"Hello there"

and

Hello there

are not spoken in the same manner.

The first, with quotes, is spoken with a bit more inflection, with a rising tone in the first word to emphasize that the speaker of the sentence is quoting someone else, whereas the second is spoken with no such effect. Stripping the quotes thus removes information.

Double quotes, generally, indicate escaped quotes. Though this may not be the case for all double quotes in the text. For example

"It included a new production of ""Passion"" directed by Jamie Lloyd."

has escaped quotes around the word "Passion".
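As a rough illustration of the escaping convention, the sketch below parses one line with Python's standard `csv` module, which by default treats a doubled quote inside a quoted field as a single literal quote. The two-column layout and the file name `common_voice_en_0.mp3` are made up for the example; the real TSV has more columns, and this is not necessarily how any consumer of the dataset reads it.

```python
import csv
import io


def parse_tsv_line(line: str) -> list:
    """Parse a single tab-separated line, unescaping doubled quotes
    inside quoted fields (the csv module's default behaviour)."""
    reader = csv.reader(io.StringIO(line), delimiter="\t")
    return next(reader)


# Hypothetical two-column line: file name, then the quoted sentence.
raw_line = (
    'common_voice_en_0.mp3\t'
    '"It included a new production of ""Passion"" directed by Jamie Lloyd."\n'
)

fields = parse_tsv_line(raw_line)
print(fields[1])
# It included a new production of "Passion" directed by Jamie Lloyd.
```

Read this way, the outer quotes are field delimiters and only the quotes around "Passion" belong to the sentence.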

kdavis-mozilla avatar Jul 24 '20 08:07 kdavis-mozilla

In that case, this needs to be made clearer (in documentation or instructions), because from the context neither the reader nor the validator would know that quoted text should be read any differently. But I agree that if the transcript contains quoted text, then it is read differently: He said "hello there", for example.

However, I still believe most quotes surrounding the transcripts are text processing artifacts of some sort. These two transcripts from my previous comment, for instance, contain quotes within quotes. If it is true, as you said, that quoted text is read differently, then how should quotes within quotes be read?

"""Getting to play someone as unrestricted as a vampire is a thrill,"" she says."
"""Good-bye,"" said the boy."

I get that as much information as possible should be preserved generally but in these cases I believe the readers, validators, and developers think quoted transcripts (beginning and end) are nothing more than text surrounded by quotes.

Perhaps this is not the right place to address this quote issue. It should probably be addressed somewhere upstream. If you could point me in the right direction, I'll be happy to follow up.

antimora avatar Jul 24 '20 18:07 antimora

As we don't document that one raises the tone of one's voice when reading a question, we also don't document the change in intonation when reading a quotation. This is simply part of what's entailed in "reading aloud".

However, I agree with you in that I also do not believe most quotes surrounding the various sentences are there to indicate that the sentence is a quotation. In the majority of quoted text, e.g.

"Hello, how are you?"

the quotes simply are a means to delineate the text from its surroundings.

However, there are some cases in which I think the text contains quotes, but I'd have to look in detail at the entire pipeline to really differentiate between these two cases. I think @phirework has a much better view of the entire pipeline than I do. So maybe phirework could chime in?

kdavis-mozilla avatar Jul 27 '20 08:07 kdavis-mozilla

Re: OP - Kelly's correct, the original sentence was <html lang%3D"en">, and it was added to our db early enough that it was probably from before we had any proper sanitization. We can certainly add logic to the CorporaCreator to not pick entries that are just an empty string like that.

On the question of too many quotes, it looks like it has to do with the settings we're using for fast-csv, the library we use to write to TSV, and how it handles quoted fields that can potentially contain the delimiter (\t in our case). Can I get you to report that in https://github.com/Common-Voice/common-voice-bundler/ instead and I'll tweak the config? Feel free to just link to this discussion, since I'll be the one fixing it. Unfortunately I can't just move this issue since they're in separate GitHub orgs.
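For readers who want to see how a writer's quoting settings can produce this pattern, here is a small sketch using Python's standard `csv` module as a stand-in. It does not reproduce fast-csv or the bundler's actual configuration; the helper name and the file names are hypothetical. With default settings, the Python writer wraps any field that contains a quote character in quotes and doubles the inner quotes, while leaving other fields bare.

```python
import csv
import io


def write_tsv_row(fields) -> str:
    """Serialize one row as tab-separated text using the csv module's
    default quoting (quote only when needed, double any inner quotes)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue()


# A sentence that contains quote characters gets wrapped and escaped.
print(write_tsv_row(["a.mp3", '"Good-bye," said the boy.']), end="")
# a.mp3	"""Good-bye,"" said the boy."

# A sentence without quotes or tabs is written as-is.
print(write_tsv_row(["b.mp3", "Hello there"]), end="")
# b.mp3	Hello there
```

Under these assumptions, the mix of quoted and unquoted transcripts comes from the serialization step and says nothing about whether the sentence was meant as a quotation.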

Thanks!

phirework avatar Jul 27 '20 22:07 phirework