tatoeba2 Can't reduce the share of similar sentences in search results

Open LBeaudoux opened this issue 3 years ago • 20 comments

Story

When I search for an English word, I often get a long list of very similar sentences that look like they were generated by a robot. I usually scroll down a few pages and give up because I feel like I'm wasting my time reading sentences that offer little new information.

I would like to be able to increase the diversity of my search results by reducing the number of similar sentences.

Measuring the diversity of the Tatoeba corpus

For a corpus to appear diverse, it must contain a significant number of original sentences. An added sentence can be considered original when it includes many words that do not appear in a similar context elsewhere in the corpus.

We can therefore measure the originality of a sentence by splitting it into sequences of three consecutive words (trigrams) and calculating the proportion of those sequences that were new at the time the sentence was added. This originality score ranges from 0 to 1. When all the trigrams of a sentence are new, it has the maximum originality of 1. Conversely, when all the trigrams of a sentence have already been observed in other sentences, it is considered unoriginal and has a score of 0.
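A minimal Python sketch of this score, under several assumptions that the description above does not settle: words are obtained by lowercasing and splitting on whitespace, sentences are supplied in the order they were added, and sentences with fewer than three words get no score (`None`). The names `trigrams` and `originality_scores` are illustrative and not part of Tatoeba.

```python
from typing import Iterable, List, Optional, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(sentence: str) -> List[Trigram]:
    """Split a sentence into sequences of three consecutive words."""
    # Assumption: naive lowercased whitespace tokenization.
    words = sentence.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def originality_scores(sentences: Iterable[str]) -> List[Optional[float]]:
    """Score each sentence against the trigrams seen in earlier sentences."""
    seen: Set[Trigram] = set()
    scores: List[Optional[float]] = []
    for sentence in sentences:
        grams = trigrams(sentence)
        if not grams:
            # Assumption: fewer than three words means no score.
            scores.append(None)
            continue
        new = sum(1 for g in grams if g not in seen)
        scores.append(new / len(grams))
        # Only now do this sentence's trigrams count as "already observed".
        seen.update(grams)
    return scores


corpus = ["Tom is a good boy", "Tom is a good teacher", "Tom is a good"]
print(originality_scores(corpus))
```

In this toy corpus the first sentence scores 1 (all trigrams new), the second scores 1/3 (only "a good teacher" is new), and the third scores 0 (every trigram was already seen). A real implementation would need language-aware tokenization, especially for languages written without spaces.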

When we measure the originality of Tatoeba sentences and analyze the results by language, we see that the distribution profiles vary greatly. For example, the English corpus contains a very high proportion of unoriginal sentences (i.e. with a score equal to 0).

Other corpora like the Spanish corpus are more balanced with both original and less original sentences.

Finally, contributors in some languages favor original sentences.

The morphological typology of a language affects the originality score of its sentences. In addition, the average originality of new sentences mechanically decreases as the corpus grows. Nevertheless, the behavior of the most active contributors seems to be the most decisive factor. For example, more than 80 percent of the unoriginal English sentences can be attributed to a single contributor (through his various accounts).

Username	Number of unoriginal English sentences	Percentage of all unoriginal English sentences
CK	347019	60.1
CH	54846	9.5
CT	26327	4.6
CF	19820	3.4
OsoHombre	18536	3.2
Amastan	16857	2.9
CM	14449	2.5
Hybrid	14304	2.5
sundown	3553	0.6
CC	3453	0.6

The share of unoriginal sentences is increasing in most corpora, partly through the translation of the numerous unoriginal English sentences. Another growing practice is the multiplication of paraphrases (a.k.a. near-duplicates) linked to a source sentence. The table below shows how the percentage of unoriginal sentences (i.e. with an originality score equal to zero) has evolved in the largest corpora.

Language	2011	2016	2021
English	11.2	27.5	42.1
Italian	1.5	23.9	36.2
Kabyle	n/a	0	22.3
Portuguese	1	8.6	17.1
Russian	1	9.8	17
French	4.6	13.5	16.8
Berber	n/a	8.7	16
Japanese	8.8	13.7	15.1
Esperanto	3.5	9.2	12.7
German	2.2	7	10.5
Spanish	2.3	9.3	10.4
Turkish	0.3	5.6	7.4
Hungarian	0.1	2.9	4.6
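
One way such percentages could be computed is sketched below in Python. It assumes that each figure is the cumulative share of zero-score sentences among all sentences added up to the end of the given year, which the table itself does not state. The record layout, the function name `unoriginal_share`, and the language code in the example are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record: (language, year the sentence was added, originality score)
Record = Tuple[str, int, float]


def unoriginal_share(records: Iterable[Record], language: str, year: int) -> Optional[float]:
    """Percentage of zero-score sentences among those added up to `year`."""
    scores = [s for lang, y, s in records if lang == language and y <= year]
    if not scores:
        return None  # no sentences in this language by that year
    unoriginal = sum(1 for s in scores if s == 0)
    return round(100 * unoriginal / len(scores), 1)


records = [
    ("eng", 2010, 0.0),
    ("eng", 2011, 1.0),
    ("eng", 2015, 0.0),
    ("eng", 2016, 0.0),
]
print(unoriginal_share(records, "eng", 2011))
print(unoriginal_share(records, "eng", 2016))
```

With this toy data the share rises from 50.0 in 2011 to 75.0 in 2016, and a language with no sentences yet yields `None` rather than a percentage.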

Idea

During a search, the originality score could be used to diversify the results by returning only the most original sentences. Users could adjust their own minimum originality threshold with a slider located to the right of the results counter. By default, this setting would be kept from one search to the next.
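
The filtering step behind such a slider could look roughly like the Python sketch below. It assumes each search result already carries a precomputed originality score between 0 and 1. The `Result` type and `filter_by_originality` function are hypothetical names for illustration and do not reflect how Tatoeba's search is actually implemented.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    text: str
    originality: float  # assumed precomputed, between 0 and 1


def filter_by_originality(results: List[Result], min_originality: float) -> List[Result]:
    """Keep only results whose score meets the user's slider threshold."""
    return [r for r in results if r.originality >= min_originality]


results = [
    Result("Tom is a good boy.", 0.0),
    Result("The heron stood motionless in the reeds.", 1.0),
    Result("Tom is a good teacher of chess.", 0.4),
]
diverse = filter_by_originality(results, min_originality=0.4)
print([r.text for r in diverse])
```

A threshold of 0 keeps the current behavior (all results), so it would be a natural default for users who have never touched the slider.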

Jul 29 '21 16:07 LBeaudoux
