Osma Suominen

374 comments by Osma Suominen

Changing the milestone, since 0.39 is going to be released soon and the developments in this issue are unlikely to all be finished by then.

It should be noted that Lingua is a fairly new library and so has a very short track record, with only two releases so far.

There is an issue asking about Python 3.10 support for pycld3: https://github.com/bsolomon1124/pycld3/issues/31

As pointed out by @adulau in [this comment](https://github.com/bsolomon1124/pycld3/issues/31#issuecomment-1165322512), Lingua can use huge amounts of memory. I tested it in the default lazy loading configuration, and detecting the language of the...

Thanks @pemistahl, that is excellent news! We will take a new look at Lingua.

@pemistahl Whoa, that's quite an improvement!

I did some testing of Lingua in draft PR #615; you may want to check that out, @pemistahl.

@adbar suggested these other language detection approaches in https://github.com/NatLibFi/Annif/issues/617#issuecomment-1234308765 :

> * Simplemma should be good enough and especially good on noisy text.
> * I've used langid.py ever since...

I created PR #626 which uses Simplemma for language detection instead of pycld3 (or Lingua in PR #615). I intend to benchmark these three approaches in the near future.
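A comparison like the one planned above could be structured roughly as follows. This is only a minimal sketch, not the benchmark actually used in the PRs: `toy_detector`, `benchmark` and `SAMPLES` are hypothetical names invented for illustration, and the stopword-based detector merely stands in for a real pycld3, Lingua or Simplemma wrapper. Note also that `tracemalloc` only tracks allocations made through Python's memory allocator, so it would not capture memory allocated inside native extensions; a process-level measurement would be needed for that.

```python
import time
import tracemalloc
from typing import Callable, Iterable, Optional, Tuple

# A detector takes a text and returns a language code, or None if unsure.
Detector = Callable[[str], Optional[str]]


def toy_detector(text: str) -> Optional[str]:
    """Hypothetical stand-in: guess Finnish or English from a few stopwords."""
    words = set(text.lower().split())
    fi_hits = len(words & {"ja", "on", "ei", "että", "se"})
    en_hits = len(words & {"the", "and", "is", "of", "to"})
    if fi_hits == en_hits:
        return None
    return "fi" if fi_hits > en_hits else "en"


def benchmark(detector: Detector, samples: Iterable[Tuple[str, str]]) -> dict:
    """Run the detector over (text, expected_language) pairs and record
    accuracy, wall-clock time and peak Python-level memory use."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    tracemalloc.start()
    start = time.perf_counter()
    correct = sum(1 for text, expected in samples if detector(text) == expected)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "accuracy": correct / len(samples),
        "seconds": elapsed,
        "peak_bytes": peak,
    }


# Made-up sample data, for illustration only.
SAMPLES = [
    ("Tämä on lyhyt lause ja se on suomea", "fi"),
    ("This is the language of the example", "en"),
]

result = benchmark(toy_detector, SAMPLES)
```

Wrapping each real library behind the same `Detector` signature would let the three approaches be run through an identical loop, so that differences in the numbers come from the detectors rather than from the harness.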

I have now redone the benchmarks described in https://github.com/NatLibFi/Annif/pull/615#issue-1352320697 with some changes. This time I used the parts of the Finto AI data set and Finnish language documents and YSO...