
Adding support for Language Identification

Open gabriele-tomassetti opened this issue 4 years ago • 4 comments

fastText would let us implement effective language identification with little space and few resources. This can be useful for the large amount of content that declares no language, and also for content that contains multiple languages.

gabriele-tomassetti avatar Mar 21 '20 07:03 gabriele-tomassetti

@gabriele-tomassetti @ftomassetti I'm the maintainer of an open-source C# NLP library that has two models for language detection: https://github.com/curiosity-ai/catalyst/. If you want, I can either port the code from there or add it as a dependency to cover this need.

theolivenbaum avatar Sep 04 '20 15:09 theolivenbaum

Thanks for your offer to help on this issue, too. Honestly, I was mostly looking at this issue as an excuse to work on an NLP library, but if your library can do it better and sooner, I see no reason not to use it.

I think we only have two requirements:

  • we need to add it as a dependency rather than port the code, because there is no need to take on more code to maintain if we can avoid it
  • we need to make sure that the library does not require much space; otherwise, I think we would need to put this functionality in a separate NuGet package

gabriele-tomassetti avatar Sep 05 '20 11:09 gabriele-tomassetti

We could expose it as a callback that the user provides, and add an example to the Wiki showing how to use it with Catalyst.

theolivenbaum avatar Sep 07 '20 09:09 theolivenbaum

That's a really smart idea. I will work on it.

gabriele-tomassetti avatar Sep 08 '20 07:09 gabriele-tomassetti
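
The callback approach suggested above could look roughly like the sketch below. It is written in Python purely for illustration, not in C#, and every name in it (`Reader`, `Article`, `LanguageDetector`, `toy_detector`) is hypothetical; none of it is SmartReader's or Catalyst's actual API. The point is only that the reader carries no detection dependency of its own: it calls whatever function the user supplies, and only when the content declares no language.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical callback type: takes plain text, returns a language code or None.
LanguageDetector = Callable[[str], Optional[str]]


@dataclass
class Article:
    text: str
    language: Optional[str] = None


class Reader:
    """Hypothetical reader used for illustration only."""

    def __init__(self, detect_language: Optional[LanguageDetector] = None):
        # The detector is optional, so the core package needs no NLP dependency.
        self.detect_language = detect_language

    def parse(self, text: str, declared_language: Optional[str] = None) -> Article:
        language = declared_language
        # Fall back to the user-supplied callback only when no language is declared.
        if language is None and self.detect_language is not None:
            language = self.detect_language(text)
        return Article(text=text, language=language)


def toy_detector(text: str) -> Optional[str]:
    # Stand-in for a real detector; a real one would wrap a trained model.
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"il", "che", "è"}:
        return "it"
    return None
```

With this shape, `Reader(detect_language=toy_detector).parse("il gatto che dorme")` fills in the language from the callback, while `Reader().parse(...)` leaves it unset. Swapping in a real detector would mean passing a different function, with no change to the reader itself.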