SmartReader Adding support for Language Identification

Open gabriele-tomassetti opened this issue 4 years ago • 4 comments

fastText would let us implement effective language identification with little space and few resources. This would be useful for the large amount of content that declares no language, and also for content that contains multiple languages.

Mar 21 '20 07:03 gabriele-tomassetti

@gabriele-tomassetti @ftomassetti I'm the maintainer of an open-source C# NLP library that has two models for language detection: https://github.com/curiosity-ai/catalyst/ If you want, I can either port the code from there or add it as a dependency to cover this need.

Sep 04 '20 15:09 theolivenbaum

Thanks for your offer to help on this issue, too. Honestly, I was mostly looking at this issue as an excuse to work on an NLP library, but if your library can do it better and sooner, I see no reason not to use it.

I think we only have two requirements:

- We need to add it as a dependency rather than port the code, because there is no reason to take on more code to maintain if we can avoid it.
- We need to make sure that the library does not require much space. Otherwise, I think we would need to ship this functionality as a separate NuGet package.

Sep 05 '20 11:09 gabriele-tomassetti

We could expose it as a callback that the user provides, and add an example to the wiki showing how to use it with Catalyst.
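
To make the callback idea concrete, here is a minimal, language-agnostic sketch written in Python. Everything in it is hypothetical: `LanguageDetector`, `Article`, `extract_article`, and `toy_detector` are illustrative names, not part of SmartReader or Catalyst, and the toy detector is only a stand-in for a real model. The point is just the shape: the library accepts an optional function from text to a language code, and calls it only when the content does not already declare a language.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical callback type: receives the extracted text and returns
# a language code such as "en", or None if it cannot decide.
LanguageDetector = Callable[[str], Optional[str]]


@dataclass
class Article:
    """Hypothetical result type holding the text and its language."""
    text: str
    language: Optional[str] = None


def extract_article(
    text: str,
    declared_language: Optional[str] = None,
    detect_language: Optional[LanguageDetector] = None,
) -> Article:
    """Return an Article, using the callback only when no language is declared."""
    language = declared_language
    if not language and detect_language is not None:
        # The library never depends on a specific detector;
        # it only calls whatever the user passed in.
        language = detect_language(text)
    return Article(text=text, language=language)


def toy_detector(text: str) -> Optional[str]:
    """Stand-in for a real detector (assumption: a real one wraps an NLP model)."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"il", "che", "è"}:
        return "it"
    return None


if __name__ == "__main__":
    article = extract_article("the cat is here", detect_language=toy_detector)
    print(article.language)
```

With this shape, the reader itself gains no new dependency and no extra size; users who want detection plug in the detector of their choice, and everyone else pays nothing.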

Sep 07 '20 09:09 theolivenbaum

That's a really smart idea. I will work on it.

Sep 08 '20 07:09 gabriele-tomassetti
