contextualized-topic-models
Korean Language Support
#71 Because of the structure of the Korean language, we need a different tokenizer instead of a whitespace tokenizer. Since konlpy is one of the most popular Korean NLP Python packages, I've added konlpy to handle tokenization and to remove numbers.
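The preprocessing described above (morpheme tokenization plus number removal) can be sketched roughly as follows. This is a hypothetical wiring, not the PR's actual code: `preprocess` is an illustrative helper, `Okt` is one of konlpy's real taggers, and the whitespace fallback is only there so the sketch runs without konlpy (which also needs a Java runtime):

```python
import re

def preprocess(text, tokenizer=None):
    """Remove digits, then tokenize.

    `tokenizer` should be a callable such as konlpy's Okt().morphs
    (a sketch of the idea -- the PR's actual hook may differ);
    falls back to whitespace splitting for illustration only,
    which is exactly what does NOT work well for Korean.
    """
    text = re.sub(r"\d+", "", text)   # strip numbers
    if tokenizer is None:
        tokenizer = str.split         # whitespace fallback
    return tokenizer(text)

# With konlpy installed (requires Java + `pip install konlpy`):
# from konlpy.tag import Okt
# tokens = preprocess("나는 사과 3개를 먹었다", tokenizer=Okt().morphs)

print(preprocess("hello 123 world"))  # ['hello', 'world']
```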
test.txt test_preprocessed.txt
Tested on Colab with 5,000 brief Korean articles.
Capture from colab
I did read the contributing guidelines, but since this is my first contribution on GitHub, some required information may be missing. Feel free to request anything additional.
Want to know why Korean can't be tokenized with a whitespace tokenizer? Check out https://en.wikipedia.org/wiki/Korean_grammar#Substantives
Hello @StrangeFate,
this looks great, I'd like to introduce this in a specific location of the source and to install this using a selective install. Point is, I'd also like to add tokenizers for other languages but I'd like not to force people to install all the tokenizers.
Give me a few days to think about this and thanks for the great work :)
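One common way to get the selective install described here is a setuptools extra, so konlpy (and, later, other language tokenizers) are only pulled in on request. A hypothetical sketch; the extra name and the core dependency list shown are assumptions, not the repository's actual setup.py:

```python
# setup.py (sketch) -- konlpy becomes an optional dependency,
# installed only when the "ko" extra is requested.
from setuptools import setup, find_packages

setup(
    name="contextualized-topic-models",
    packages=find_packages(),
    install_requires=["torch", "numpy"],  # core deps (illustrative)
    extras_require={
        "ko": ["konlpy"],                 # Korean tokenizer support
        # further languages could add their own extras here
    },
)
```

Users who need Korean support would then run `pip install contextualized-topic-models[ko]`, while everyone else keeps the lighter default install.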
Hello @vinid,
I agree with you that forcing people to install all the tokenizers is not the right approach; that would be too much for users. (I realized this as soon as I started modifying the code, but I'm fairly new to Python packaging, so I couldn't come up with a better solution. :/ ) Anyway, I agree with what you said, and I'd be grateful if you could work out a better way to implement this.
Thanks for the great work!
I had a problem while changing the branch name, so I'm re-opening this.