contextualized-topic-models
Korean Language Support
#71 Because of the structure of the Korean language, we need a different tokenizer instead of a whitespace tokenizer. Since konlpy is one of the most popular Korean NLP Python packages, I've added konlpy to handle tokenization and to remove numbers.
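The preprocessing described above (morpheme tokenization plus number removal) can be sketched roughly as follows. This is a hypothetical wiring, not the PR's actual code: `preprocess` is an illustrative helper, `Okt` is one of konlpy's real taggers, and the whitespace fallback is only there so the sketch runs without konlpy (which also needs a Java runtime):

```python
import re

def preprocess(text, tokenizer=None):
    """Remove digits, then tokenize.

    `tokenizer` should be a callable such as konlpy's Okt().morphs
    (a sketch of the idea -- the PR's actual hook may differ);
    falls back to whitespace splitting for illustration only,
    which is exactly what does NOT work well for Korean.
    """
    text = re.sub(r"\d+", "", text)   # strip numbers
    if tokenizer is None:
        tokenizer = str.split         # whitespace fallback
    return tokenizer(text)

# With konlpy installed (requires Java + `pip install konlpy`):
# from konlpy.tag import Okt
# tokens = preprocess("나는 사과 3개를 먹었다", tokenizer=Okt().morphs)

print(preprocess("hello 123 world"))  # ['hello', 'world']
```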
test.txt test_preprocessed.txt
Tested on Colab with 5,000 brief Korean articles.
Capture from colab
I did read the contributing guidelines, but since this is my first contribution on GitHub, some required information may be missing. Feel free to request anything additional.
Want to know why Korean can't be tokenized with a whitespace tokenizer? Check out https://en.wikipedia.org/wiki/Korean_grammar#Substantives
Hello @StrangeFate,
this looks great, I'd like to introduce this in a specific location of the source and to install this using a selective install. Point is, I'd also like to add tokenizers for other languages but I'd like not to force people to install all the tokenizers.
Give me a few days to think about this and thanks for the great work :)
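One common way to get the selective install described here is a setuptools extra, so konlpy (and, later, other language tokenizers) are only pulled in on request. A hypothetical sketch; the extra name and the core dependency list shown are assumptions, not the repository's actual setup.py:

```python
# setup.py (sketch) -- konlpy becomes an optional dependency,
# installed only when the "ko" extra is requested.
from setuptools import setup, find_packages

setup(
    name="contextualized-topic-models",
    packages=find_packages(),
    install_requires=["torch", "numpy"],  # core deps (illustrative)
    extras_require={
        "ko": ["konlpy"],                 # Korean tokenizer support
        # further languages could add their own extras here
    },
)
```

Users who need Korean support would then run `pip install contextualized-topic-models[ko]`, while everyone else keeps the lighter default install.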
Hello @vinid,
I agree with you that forcing people to install all the tokenizers is not the right approach; that would be too much for users. (I realized this as soon as I started modifying the code, but I'm fairly new to Python packaging, so I couldn't come up with a better solution. :/ ) Anyway, I agree with what you said, and I'd be grateful if you could work out a better way to implement this.
Thanks for the great work!
I had a problem while changing the branch name, so I'm re-opening this.