Korean Language Support

Open StrangeFate opened this issue 3 years ago • 3 comments

#71 Because of the structure of the Korean language, a whitespace tokenizer is not sufficient and a different tokenizer is needed. Since konlpy is one of the best-known Korean NLP packages for Python, I've added konlpy to handle tokenization and number removal.
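As a rough illustration of the idea (not the code in this PR), the preprocessing could look like the sketch below. The function names `preprocess_korean` and `konlpy_tokenizer` are hypothetical, the tokenizer is passed in as a callable so the logic can be tried without konlpy installed, and konlpy's `Okt` is only one of several analyzers that could be used.

```python
import re
from typing import Callable, Iterable, List

# Matches tokens that are purely numeric, e.g. "2022", "3.5", "1,000".
_NUMBER = re.compile(r"\d+([.,]\d+)*")


def preprocess_korean(docs: Iterable[str],
                      tokenize: Callable[[str], List[str]]) -> List[str]:
    """Tokenize each document and drop purely numeric tokens.

    Hypothetical helper: `tokenize` is any callable that maps a string
    to a list of tokens (for Korean, a morphological analyzer).
    """
    cleaned = []
    for doc in docs:
        tokens = [t for t in tokenize(doc) if not _NUMBER.fullmatch(t)]
        cleaned.append(" ".join(tokens))
    return cleaned


def konlpy_tokenizer() -> Callable[[str], List[str]]:
    """Return konlpy's Okt morpheme tokenizer.

    Imported lazily because konlpy needs a Java runtime and may not
    be installed.
    """
    from konlpy.tag import Okt
    return Okt().morphs
```

With konlpy available, this would be called as `preprocess_korean(articles, konlpy_tokenizer())`; any other callable, such as `str.split`, can be swapped in for a quick check.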

test.txt test_preprocessed.txt

Tested on Colab with 5,000 brief Korean articles.

GF1 GF2

Capture from colab

I did read the contributing guidelines, but since this is my first contribution on GitHub, some required information may be missing. Feel free to request any additional information.

Want to know why Korean can't be tokenized with a whitespace tokenizer? Check out https://en.wikipedia.org/wiki/Korean_grammar#Substantives

StrangeFate avatar Jan 14 '22 09:01 StrangeFate

Hello @StrangeFate,

this looks great. I'd like to put this in a dedicated location in the source and make it installable through a selective install. The point is that I'd also like to add tokenizers for other languages, but I don't want to force people to install all of them.

Give me a few days to think about this and thanks for the great work :)

vinid avatar Jan 17 '22 07:01 vinid

Hello @vinid,

I agree that forcing people to install all the tokenizers is not the right approach. That would be too much for users. (I realized it as soon as I began modifying the code, but I'm fairly new to Python packaging, so I couldn't come up with a better solution. :/ ) Anyway, I agree with what you said, and I would be grateful if you could consider a better way to implement this.

Thanks for the great work!

StrangeFate avatar Jan 17 '22 07:01 StrangeFate

I had a problem while changing the branch name, so I'm re-opening this.

StrangeFate avatar Jan 17 '22 07:01 StrangeFate