whatlanguage
Library doesn't seem to take character sets into account
I'm trying to distinguish between a couple of European languages and Turkish/Arabic/Aramaic. Whatlanguage does a fair job with the European languages, but beyond that it falls apart at the seams.
main » wl.language("Bilgi Teknolojileri Kurumu (BTK) tarafından 29 Nisan 2017 tarihinde.")
=> :russian
It cannot be Russian: Russian is written in Cyrillic, whereas this text consists entirely of Latin characters.
main » wl.language("البرنامج ليس ذكي جدا")
=> :arabic
Works fine, even though I wasn't very nice to it. But as #41 shows, there are issues there, too. The reporter of #41 doesn't make it very explicit, but hits the same spot, especially with the numbers. (His first and second strings are easily recognizable as Farsi, not Arabic: their second word, i.e. the one on the left, is not part of the Arabic dictionary but is very commonly used in Farsi.)
main » wl.language("ܣܰܪܐ: ܓܷܕ ܣܳܚܝܢܰܐ، ܓܷܕ ܡܫܰܡܣܝܢܰܐ، ܓܷܕ ܫܳܬܝܢܰܐ ܩܰܚܘܰܐ ܘܦܰܠܓܶܗ ܕܝܰܘܡܐ ܠܰܦ ܐܝ ܣܰܥܰܐ ܬܪܰܥܣܰܪ ܘܦܰܠܓܶܗ ܓܷܕ ܡܰܥܪܝܢܰܐ. ܗܰܘܟ݂ܰܐ ܓܷܕ ܫܳܦܰܥ ܐܘ ܝܰܘܡܰܝܕ݂ܰܢ.")
=> :russian
Makes one wonder if Russian is a last-resort fallback. Again, though, it cannot possibly be Russian, because the text is written in a completely different script, namely Syriac, which is used for Aramaic.
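To make the point concrete, here is a rough sketch of the kind of script check I have in mind. It is not part of whatlanguage: `SCRIPTS`, `EXPECTED_SCRIPT`, `dominant_script` and `consistent_with_script?` are names I made up for illustration, and the language-to-script table is deliberately tiny. It only relies on Ruby's Unicode script properties in regular expressions (`\p{Latin}`, `\p{Cyrillic}`, `\p{Arabic}`, `\p{Syriac}`).

```ruby
# Hypothetical helper, NOT part of whatlanguage: count characters per
# Unicode script using Ruby's built-in \p{...} script properties.
SCRIPTS = {
  latin:    /\p{Latin}/,
  cyrillic: /\p{Cyrillic}/,
  arabic:   /\p{Arabic}/,
  syriac:   /\p{Syriac}/
}.freeze

# Hypothetical, incomplete table of the script each language is normally
# written in. A real table would need to cover every supported language.
EXPECTED_SCRIPT = {
  russian: :cyrillic,
  turkish: :latin,
  arabic:  :arabic,
  farsi:   :arabic
}.freeze

# Returns the script with the most matching characters, or nil if the
# text contains no characters from any of the scripts listed above.
def dominant_script(text)
  counts = SCRIPTS.transform_values { |re| text.scan(re).size }
  script, count = counts.max_by { |_, n| n }
  count.zero? ? nil : script
end

# True when a guessed language is compatible with the script of the text.
# Unknown languages and script-less input (e.g. digits only) are let through.
def consistent_with_script?(language, text)
  expected = EXPECTED_SCRIPT[language]
  actual   = dominant_script(text)
  expected.nil? || actual.nil? || expected == actual
end

guess = :russian
text  = "Bilgi Teknolojileri Kurumu (BTK) tarafından 29 Nisan 2017 tarihinde."
puts "rejecting #{guess}" unless consistent_with_script?(guess, text)
```

With a filter like this in front of (or behind) the dictionary lookup, a `:russian` guess for Latin or Syriac input could be discarded instead of returned. It would not help with telling Arabic from Farsi, since both share a script.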
I'd also like to point out #27 again at this point. I do so with a sad face. In addition I would like to point out that these matters have been noticed elsewhere as well: "[...] this project still has a way [sic] to go [...]", posted to StackExchange on July 9, 2014.
I think this is the same issue as #18 :-/