lingua-py

Distinguish between different variations of the same language

Open BLKSerene opened this issue 1 year ago • 5 comments

Hi, I'm wondering whether it is possible for lingua to distinguish between variations of the same language, for example Simplified Chinese and Traditional Chinese, or Norwegian Bokmål and Norwegian Nynorsk. AFAIK, langdetect can distinguish between Simplified and Traditional Chinese, while the other alternatives can't.

BLKSerene, Jul 15 '22 11:07

Hi @BLKSerene, thank you for your request.

The library already distinguishes between Bokmål and Nynorsk. As for Simplified and Traditional Chinese, I have not yet found suitable training corpora that consist solely of either Simplified or Traditional Chinese. Do you perhaps know a good source for those?

pemistahl, Jul 19 '22 20:07

There are two UD Chinese corpora:

Simplified Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSDSimp
Traditional Chinese: https://github.com/UniversalDependencies/UD_Chinese-GSD

What are the requirements for the training data? And what about the license?

BLKSerene, Jul 20 '22 03:07

Ah, those look suitable, thank you.

For LanguageModelFilesWriter to create the language models, it needs training data in plain text without any annotations. So I would first need to run the UD files through a custom parser. The license should allow the use of language models created from the training data.
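
As a rough illustration of that preprocessing step, here is a minimal sketch that pulls plain sentences out of CoNLL-U data using only the Python standard library. It assumes each sentence carries a `# text = ...` comment line, which is how UD treebanks normally store the raw sentence. The function names `iter_sentences` and `conllu_to_plain_text` and the sample lines are made up for this example and are not part of lingua or of the UD corpora.

```python
from typing import Iterable, Iterator

# Comment prefix that CoNLL-U uses for the unannotated sentence text.
TEXT_PREFIX = "# text = "


def iter_sentences(conllu_lines: Iterable[str]) -> Iterator[str]:
    """Yield the raw sentence from every `# text = ...` comment line."""
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith(TEXT_PREFIX):
            sentence = line[len(TEXT_PREFIX):].strip()
            if sentence:  # skip empty text comments
                yield sentence


def conllu_to_plain_text(conllu_lines: Iterable[str]) -> str:
    """Join all sentences into one plain-text string, one sentence per line."""
    return "\n".join(iter_sentences(conllu_lines))


# Hypothetical sample in CoNLL-U layout (not taken from the real corpora).
sample_lines = [
    "# sent_id = demo-1",
    "# text = 你好。",
    "1\t你好\t你好\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t。\t。\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
    "# sent_id = demo-2",
    "# text = 謝謝。",
    "1\t謝謝\t謝謝\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t。\t。\tPUNCT\t_\t_\t1\tpunct\t_\t_",
]

plain_text = conllu_to_plain_text(sample_lines)
print(plain_text)
```

For a real file, the same function could be fed an open file handle, for example `conllu_to_plain_text(open(path, encoding="utf-8"))`, where `path` points to one of the `.conllu` files. Whether the resulting text is clean enough for training would still need to be checked.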

pemistahl, Jul 20 '22 09:07

The conllu package should suffice for parsing UD corpora: https://github.com/EmilStenstrom/conllu

BLKSerene, Jul 20 '22 13:07

+1 on the feature request 🙏

yanqianglu, Aug 26 '23 22:08