
Can't reduce the share of similar sentences in search results

Open LBeaudoux opened this issue 2 years ago • 20 comments

Story

When I search for an English word, I often get a long list of very similar sentences that look like they were generated by a robot. I usually scroll down a few pages and give up because I feel like I'm wasting my time reading sentences that offer little new information.

I would like to be able to increase the diversity of my search results by reducing the number of similar sentences.

Measuring the diversity of the Tatoeba corpus

For a corpus to appear diverse, it must contain significant amounts of original sentences. An added sentence can be considered original when it includes many words that do not appear in a similar context elsewhere in the corpus.

We can therefore measure the originality of a sentence by splitting it into sequences of three consecutive words (trigrams) and calculating the proportion of those sequences that were new at the time the sentence was added. This originality score ranges from 0 to 1. When all the trigrams of a sentence are new, it has the maximum originality of 1. Conversely, when all the trigrams of a sentence have already been observed in other sentences, it is considered unoriginal and has a score of 0.
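The scoring described above can be sketched roughly as follows. This is only an illustration, not the code used to produce the figures in this issue: the whitespace tokenization, the lowercasing, and the names `trigrams` and `originality_scores` are assumptions, and the issue does not say how sentences shorter than three words are scored, so the sketch leaves them unscored (`None`).

```python
from typing import Iterable, List, Optional, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(sentence: str) -> List[Trigram]:
    """Split a sentence into overlapping sequences of three consecutive words."""
    # Assumption: naive lowercase + whitespace tokenization.
    words = sentence.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def originality_scores(sentences: Iterable[str]) -> List[Optional[float]]:
    """Score sentences in order of addition.

    Each score is the share of the sentence's trigrams that were not
    seen in any earlier sentence (1.0 = all new, 0.0 = none new).
    """
    seen: Set[Trigram] = set()
    scores: List[Optional[float]] = []
    for sentence in sentences:
        grams = trigrams(sentence)
        if not grams:
            # Fewer than three words: no trigram, so no score (assumption).
            scores.append(None)
            continue
        new = sum(1 for gram in grams if gram not in seen)
        scores.append(new / len(grams))
        # Only now do this sentence's trigrams count as "already observed".
        seen.update(grams)
    return scores


corpus = ["I like green tea", "I like green apples", "I like green tea"]
print(originality_scores(corpus))  # [1.0, 0.5, 0.0]
```

The order of the input matters: the same sentence gets a different score depending on what was added before it, which is why the score reflects originality "at the time of addition".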

When we measure the originality of Tatoeba sentences and analyze the results by language, we see that the distribution profiles vary greatly. For example, the English corpus contains a very high proportion of unoriginal sentences (i.e. with a score equal to 0).

[Image: distribution of originality scores in the English corpus]

Other corpora, such as the Spanish one, are more balanced, with both original and less original sentences.

[Image: distribution of originality scores in the Spanish corpus]

Finally, contributors from some languages favor original sentences.

[Image: distribution of originality scores in a corpus whose contributors favor original sentences]

The morphological typology of a language affects the originality score of its sentences. The average originality of new additions also decreases mechanically as the corpus grows. Nevertheless, the behavior of the most active contributors seems to be the most decisive factor. For example, more than 80 percent of the unoriginal English sentences can be attributed to a single contributor (through his various accounts).

| Username | Number of unoriginal English sentences | Percentage of all unoriginal English sentences |
|---|---|---|
| CK | 347019 | 60.1 |
| CH | 54846 | 9.5 |
| CT | 26327 | 4.6 |
| CF | 19820 | 3.4 |
| OsoHombre | 18536 | 3.2 |
| Amastan | 16857 | 2.9 |
| CM | 14449 | 2.5 |
| Hybrid | 14304 | 2.5 |
| sundown | 3553 | 0.6 |
| CC | 3453 | 0.6 |

The share of unoriginal sentences is increasing in most corpora, partly through the translation of the numerous unoriginal English sentences. Another growing practice is the multiplication of paraphrases (near-duplicates) linked to a source sentence. The table below shows how the percentage of unoriginal sentences (i.e. with an originality score equal to zero) has evolved in the largest corpora.

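| Language | 2011 | 2016 | 2021 |
|---|---|---|---|
| English | 11.2 | 27.5 | 42.1 |
| Italian | 1.5 | 23.9 | 36.2 |
| Kabyle | nan | 0 | 22.3 |
| Portuguese | 1 | 8.6 | 17.1 |
| Russian | 1 | 9.8 | 17 |
| French | 4.6 | 13.5 | 16.8 |
| Berber | nan | 8.7 | 16 |
| Japanese | 8.8 | 13.7 | 15.1 |
| Esperanto | 3.5 | 9.2 | 12.7 |
| German | 2.2 | 7 | 10.5 |
| Spanish | 2.3 | 9.3 | 10.4 |
| Turkish | 0.3 | 5.6 | 7.4 |
| Hungarian | 0.1 | 2.9 | 4.6 |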
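
For illustration, percentages like those above could be derived from per-sentence scores along these lines. This is a sketch under assumptions: it treats each column as a snapshot of all sentences added up to that year (the issue does not say whether the figures are cumulative), and the `(year_added, score)` record layout and the name `unoriginal_share` are hypothetical.

```python
from typing import List, Optional, Tuple


def unoriginal_share(records: List[Tuple[int, float]], year: int) -> Optional[float]:
    """Percentage of sentences added up to `year` whose originality score is 0."""
    scores = [score for added, score in records if added <= year]
    if not scores:
        # No sentences yet for this language: no percentage to report.
        return None
    unoriginal = sum(1 for score in scores if score == 0)
    return round(100 * unoriginal / len(scores), 1)


# Made-up records for a single language: (year added, originality score).
records = [(2010, 1.0), (2011, 0.0), (2015, 0.0), (2020, 0.5)]
for year in (2011, 2016, 2021):
    print(year, unoriginal_share(records, year))
```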

Idea

During a search, it could be interesting to rely on the originality score to diversify the results by returning only the most original sentences. Users could easily adjust their own minimum originality threshold with a slider located to the right of the results counter. By default, this setting would be kept from one search to the next.
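
A minimal sketch of what the threshold could do on the result side, assuming each search hit already carries a precomputed originality score. The `(text, score)` pairs and the name `filter_by_originality` are hypothetical and do not correspond to any existing Tatoeba code; a real implementation would more likely apply the filter inside the search engine query.

```python
from typing import List, Tuple


def filter_by_originality(
    results: List[Tuple[str, float]], min_originality: float
) -> List[str]:
    """Keep only sentences whose originality score reaches the threshold.

    The original ranking of the search results is preserved.
    """
    if not 0.0 <= min_originality <= 1.0:
        raise ValueError("min_originality must be between 0 and 1")
    return [text for text, score in results if score >= min_originality]


# Made-up search hits: (sentence, originality score).
hits = [
    ("Tom is eating.", 0.0),
    ("The heron stood motionless in the reeds.", 1.0),
    ("Tom is eating an apple.", 0.25),
]
print(filter_by_originality(hits, 0.25))
```

With the slider at 0, every sentence is returned, which matches the current behavior; moving it toward 1 progressively hides sentences made up of already-seen trigrams.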

LBeaudoux, Jul 29 '21 16:07