
GlobalVoices import

Open Sobsz opened this issue 2 years ago • 4 comments

Yet another import suggestion, woo! (see: #1762, #2256, #2637, #2786)

GlobalVoices is a multilingual news site with all articles published under a free license. An indiscriminate import of every single sentence wouldn't be ideal, but there's a lot of decent material to pick and choose from (see gillux's suggestion for crowdsourced filtering before adding).

Articles from GlobalVoices up to the year 2018 are included in the OPUS corpus, with all sentences automatically matched to their translations. The matching isn't flawless, so verification is needed, but it's still an easy source of translated pairs. (Though, if two language versions differ, should they be submitted separately and eventually get near-duplicate translations, or should one language take precedence?)
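To make the "pick and choose" step a bit more concrete, here is a rough sketch of a pre-filter that could run before any human verification. It assumes the matched pairs have been exported as two line-aligned plain-text files (line i of one file is the translation of line i of the other). The function name `candidate_pairs` and all thresholds are made up for illustration and are not part of OPUS or Tatoeba.

```python
from typing import Iterable, Iterator, Tuple


def candidate_pairs(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    min_chars: int = 10,     # hypothetical lower bound on sentence length
    max_chars: int = 200,    # hypothetical upper bound on sentence length
    max_ratio: float = 2.0,  # hypothetical limit on the length ratio of a pair
) -> Iterator[Tuple[str, str]]:
    """Yield aligned pairs that pass a few cheap sanity checks.

    This only narrows down what reviewers have to look at; it does not
    replace verification of the automatic alignment.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty, so there is nothing to pair
        if src == tgt:
            continue  # identical text on both sides, likely untranslated
        shorter, longer = sorted((len(src), len(tgt)))
        if shorter < min_chars or longer > max_chars:
            continue  # fragment or overly long sentence
        if longer / shorter > max_ratio:
            continue  # very different lengths hint at a misalignment
        yield src, tgt
```

With real data, the two arguments could simply be open file objects for the two languages. Whatever survives the filter would still go through the kind of crowdsourced review mentioned above.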

Another potential issue is a legal one. GlobalVoices is licensed under CC BY 3.0, whereas Tatoeba only supports the 2.0 version. I'm not sure if it matters that much, since the list of differences indicates that 3.0 only has two new restrictions: a "no endorsement" clause, which was already implicit in 2.0, and mandatory indication of adaptations (such as translations), which we're already doing. Tatoeba most likely wouldn't get sued over marking the sentences as 2.0, but it might be best to avoid that anyway, if practical.

Sobsz, Jul 21 '22 17:07