whatlanguage Library doesn't seem to take character sets into account

Open • sixtyfive opened this issue 8 years ago • 1 comment

I'm trying to distinguish between a couple of European languages and Turkish/Arabic/Aramaic. Whatlanguage does a fair job with the European languages, but beyond that it falls apart at the seams.

main » wl.language("Bilgi Teknolojileri Kurumu (BTK) tarafından 29 Nisan 2017 tarihinde.")
=> :russian

It cannot be Russian: this text is in Latin script, whereas Russian would be written with a different set of characters (Cyrillic).

main » wl.language("البرنامج ليس ذكي جدا")
=> :arabic

Works fine, even though I wasn't very nice to it. But as evidenced in #41, there are issues there, too. The reporter of #41 doesn't make it very explicit, but hits the same spot, especially with the numbers. (His first and second strings are easily recognizable as Farsi, not Arabic, because their second word, i.e. the one on the left, is not part of the Arabic dictionary but is very commonly used in Farsi.)

main » wl.language("ܣܰܪܐ: ܓܷܕ ܣܳܚܝܢܰܐ، ܓܷܕ ܡܫܰܡܣܝܢܰܐ، ܓܷܕ ܫܳܬܝܢܰܐ ܩܰܚܘܰܐ ܘܦܰܠܓܶܗ ܕܝܰܘܡܐ ܠܰܦ ܐܝ ܣܰܥܰܐ ܬܪܰܥܣܰܪ ܘܦܰܠܓܶܗ ܓܷܕ ܡܰܥܪܝܢܰܐ. ܗܰܘܟ݂ܰܐ ܓܷܕ ܫܳܦܰܥ ܐܘ ܝܰܘܡܰܝܕ݂ܰܢ.")
=> :russian

Makes one wonder if Russian is a last-resort fallback. Again, though, it cannot possibly be Russian, because the text uses a completely different character set, namely the Syriac script used to write Aramaic.
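
To make the point concrete, here is a rough sketch of the kind of check I have in mind. It is not part of whatlanguage; `SCRIPTS`, `LANGUAGE_SCRIPT`, `dominant_script` and `plausible?` are hypothetical names, and the language-to-script table is deliberately tiny and incomplete. It only relies on Ruby's built-in Unicode script properties in regular expressions (`\p{Latin}`, `\p{Cyrillic}`, `\p{Arabic}`, `\p{Syriac}`).

```ruby
# Hypothetical helper, NOT part of whatlanguage: find the dominant Unicode
# script of a string so that impossible languages can be ruled out up front.
SCRIPTS = {
  latin:    /\p{Latin}/,
  cyrillic: /\p{Cyrillic}/,
  arabic:   /\p{Arabic}/,
  syriac:   /\p{Syriac}/
}.freeze

# Illustrative and incomplete mapping (an assumption made for this sketch).
LANGUAGE_SCRIPT = {
  russian: :cyrillic,
  turkish: :latin,
  german:  :latin,
  arabic:  :arabic,
  farsi:   :arabic
}.freeze

# Counts the characters belonging to each known script and returns the
# most frequent one, or nil if none of the known scripts occurs at all.
def dominant_script(text)
  counts = SCRIPTS.transform_values { |re| text.scan(re).size }
  script, count = counts.max_by { |_, n| n }
  count.zero? ? nil : script
end

# True unless the guessed language is known to use a different script
# than the one that dominates the text.
def plausible?(language, text)
  script   = dominant_script(text)
  expected = LANGUAGE_SCRIPT[language]
  return true if script.nil? || expected.nil? # nothing known, rule nothing out
  expected == script
end

turkish = "Bilgi Teknolojileri Kurumu (BTK) tarafından 29 Nisan 2017 tarihinde."
dominant_script(turkish)      # => :latin
plausible?(:russian, turkish) # => false
```

With something like this in front of the dictionary lookup, a result such as `:russian` for Latin or Syriac text could be discarded before it is ever returned. It would not help with telling Farsi from Arabic, since both use the Arabic script; that still needs the word lists.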

I'd also like to point out #27 again at this point. I do so with a sad face. In addition I would like to point out that these matters have been noticed elsewhere as well: "[...] this project still has a way [sic] to go [...]", posted to StackExchange on July 9, 2014.

Jul 08 '17 16:07 sixtyfive

I think this is the same issue as #18 :-/

Nov 09 '17 16:11 jm3
