question about domains

Open dataf3l opened this issue 4 years ago • 1 comments

Hi guys, I love this library.

I have a question. Sometimes I get domain names as text input, such as freizeit.com, toscanamare.com, or someexample.com. Notice that the words in a domain name are not separated the way they would be in "frei zeit" or "toscana mare". When I use a tokenizer in order to detect the language of the domain, the tokenizer requires me to provide a language, e.g. en.

Is there a library that can, in a multi-language fashion, split a word that contains several words into subwords, taking a best guess at the language before splitting, so that this library can do a good job of detecting the language from the text?

I googled "multi-language text split" but I'm not finding good results. I thought maybe you guys have worked on this issue before.

Do you have any hints for me?

Sep 29 '21 10:09 dataf3l

You could try a sentencepiece model from a multilingual language processing pipeline. But these models work at the subword level, so you will get many possible combinations.
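
For comparison, here is a minimal sketch of a different, dictionary-based approach: try to split the domain label against a word list for each candidate language and keep the language that needs the fewest words. This is not sentencepiece and not part of this library. The `VOCAB` word lists, `segment`, and `guess_language` are hypothetical names made up for illustration, and a real setup would need large per-language word lists.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy vocabularies (assumption): real use needs large word lists.
VOCAB: Dict[str, Set[str]] = {
    "de": {"frei", "zeit", "haus"},
    "it": {"toscana", "mare", "casa"},
    "en": {"some", "example", "free", "time"},
}


def segment(text: str, words: Set[str]) -> Optional[List[str]]:
    """Split text into the fewest dictionary words, or None if impossible."""
    # best[i] holds the shortest segmentation found so far for text[:i].
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            piece = text[start:end]
            if prefix is not None and piece in words:
                candidate = prefix + [piece]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    return best[len(text)]


def guess_language(domain: str) -> Tuple[Optional[str], Optional[List[str]]]:
    """Try every vocabulary; keep the language with the fewest-word split."""
    # Use only the first dot-separated label, e.g. "freizeit" from "freizeit.com".
    label = domain.lower().split(".")[0]
    results = []
    for lang, words in VOCAB.items():
        parts = segment(label, words)
        if parts is not None:
            results.append((len(parts), lang, parts))
    if not results:
        return None, None
    _, lang, parts = min(results)
    return lang, parts


if __name__ == "__main__":
    for name in ("freizeit.com", "toscanamare.com", "someexample.com"):
        print(name, guess_language(name))
```

With these toy word lists, freizeit.com splits as "frei zeit" under de, and a label that matches no word list returns (None, None). Ties between languages are broken arbitrarily by language code here; a real version would probably score splits by word frequency instead.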

Oct 14 '21 10:10 Bachstelze