SmartReader
Adding support for Language Identification
fastText would let us implement effective language identification with little space and few resources. This would be useful for the large amount of content that declares no language, and also for content that contains multiple languages.
@gabriele-tomassetti @ftomassetti I'm the maintainer of an open-source C# NLP library that has two models for language detection: https://github.com/curiosity-ai/catalyst/. If you want, I can either port the code from there or add it as a dependency to cover this need.
Thanks for your offer to help on this issue, too. Honestly, I was mostly looking at this issue as an excuse to work on an NLP library, but if your library can do it better and sooner, I see no reason not to use it.
I think we only have two requirements:
- we need to add it as a dependency rather than port the code, because there is no reason to take on more code to maintain if we can avoid it
- we need to make sure that the library does not require much space; otherwise, I think we would need to ship this functionality as a separate NuGet package
We could add it as a callback that you need to provide, and just add an example to the Wiki showing how to use it, for instance with Catalyst.
That's a really smart idea. I will work on it.
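
For illustration only, the callback approach suggested above might look roughly like the sketch below. Every name in it (`LanguageFallback`, `Resolve`, the `detector` parameter) is hypothetical and is not part of SmartReader or Catalyst; it only assumes that the detector is a plain function from text to a language code, so the core library carries no detection dependency.

```csharp
#nullable enable
using System;

// Hypothetical sketch: none of these names exist in SmartReader or Catalyst.
public static class LanguageFallback
{
    // Returns the language declared by the content when there is one.
    // Otherwise asks the caller-supplied detector, which may wrap any library.
    // Returns null when no language is declared and none can be detected.
    public static string? Resolve(
        string? declaredLanguage,
        string? text,
        Func<string, string?>? detector)
    {
        // A declared language always wins; the detector is only a fallback.
        if (!string.IsNullOrWhiteSpace(declaredLanguage))
            return declaredLanguage;

        // No detector supplied, or nothing to analyze: language stays unknown.
        if (detector == null || string.IsNullOrWhiteSpace(text))
            return null;

        string? detected = detector(text);
        return string.IsNullOrWhiteSpace(detected) ? null : detected;
    }
}
```

A caller would pass a lambda that wraps whichever detector they prefer (Catalyst, a fastText binding, or anything else), which keeps the size concern above on the caller's side rather than in the main package.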