lute-v3
Add a Mandarin Chinese Parser
This PR is for a Mandarin Chinese parser.
Credit to discord user "whosnick" for the parsing code. I added unit tests and database logic.
Summary of Changes
Added lute/parse/mandarin_parser.py, containing the MandarinParser class and its parsing logic.
Modified lute/parse/registry.py to include MandarinParser as a new type of parser.
Added test_MandarinParser.py for test coverage.
Modified test/conftest.py and added mandarin_chinese fixture for testing.
Modified lute/db/schema/baseline.sql to add Mandarin Chinese to the demo sql.
Added jieba (https://pypi.org/project/jieba/) to requirements.txt for segmentation, since Mandarin Chinese has no word delimiters and is otherwise difficult to parse. It is under the MIT License.
Added pypinyin (https://pypi.org/project/pypinyin/) to requirements.txt. It is used for readings of Chinese characters. It is under the MIT License.
Thank you @cghyzel for taking the time to do this, much appreciated!
My only concern is the new requirements -- are they heavyweight, do they work on Windows (without any annoying wheel builds etc.), do they pull in extra heavy data at runtime? I haven't looked into these, so any references etc. you can provide will help me with the review. Cheers!
@jzohrab both of these are lightweight, pure-Python modules with simple wheels. Neither pulls in additional dependencies (I ran the pip freeze > requirements.txt as instructed) and they are performant. Segmenting this text took 0.331 seconds on my machine according to the debug logs:
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/xh/ll3sxfw50jdgpd77thf7py9h0000gn/T/jieba.cache
Loading model cost 0.331 seconds.
Prefix dict has been built successfully.
I can try this on a Windows VM later.
Additionally, I opened this PR for the language def.
Super, thanks!
DEBUG:jieba:Building prefix dict from the default dictionary ...
Does this pull down a dictionary somewhere automagically, or does the user need to load some data files?
It's automagical. The default dictionary is a DAG that is included in the library. Jieba traverses the dictionary using dynamic programming techniques to segment the text.
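As a toy illustration of that idea (this is not jieba's actual implementation; the length-squared scoring is a crude stand-in for jieba's word-frequency log-probabilities, and the 4-character word cap is an assumption), dictionary-based segmentation with dynamic programming can look like:

```python
# Toy illustration only: segment text by scoring every split point
# with dynamic programming over dictionary words, then backtracking.
def segment(text, dictionary):
    n = len(text)
    # best[i] = (score, split_point) for the best segmentation of text[:i]
    best = [(0, 0)] + [(-1, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # assume words are <= 4 chars
            word = text[j:i]
            if best[j][0] >= 0 and (word in dictionary or i - j == 1):
                # Favor longer dictionary words; single chars score 0
                # so unknown characters still yield a valid segmentation.
                score = best[j][0] + (len(word) ** 2 if word in dictionary else 0)
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack from the end to recover the chosen words.
    out, i = [], n
    while i > 0:
        j = best[i][1]
        out.append(text[j:i])
        i = j
    return out[::-1]
```

With a tiny dictionary like {"我", "爱", "北京", "天安门"}, segment("我爱北京天安门", ...) recovers the four words; jieba does the same kind of search, but over its full frequency-weighted prefix dictionary.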
This is the only method used:

"The jieba.cut function accepts three input parameters: the first parameter is the string to be cut; the second parameter is cut_all, controlling the cut mode; the third parameter controls whether to use the Hidden Markov Model."
Thanks @cghyzel for the changes. When the current beta (3.3.3b1) is done and good, I'll start pulling this in to my dev environment.
I'm also taking a look at another long-running Mandarin fork from an early user: the chinese_new branch of https://github.com/fanyingfx/lute-v3. That branch has several additional changes well beyond Mandarin: showing term readings, changing the Japanese parser, adding automatic Lemma ...
I don't intend to replace this PR with the work in the other branch, as this PR is much more self-contained. I'm not even going to try to pull all of the other fork's work into Lute (e.g. the automatic Lemma stuff), but I'd like to see if these two projects can get on the same footing again, just for cohesion. This isn't critical, forks are gonna fork.
The jieba and pypinyin libs total about 50 MB, which approximately doubles the size of the install. Here's the current breakdown of my "prod" lute instance (which doesn't have jieba):
Not sure if this should be considered a dealbreaker or not, given that I have about 400 MB of data (the db, audio, and various backups).
Looks good. I loaded the sample story and things seem to be working correctly. On load of the story, the console showed:

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/hl/rc8mnsq13tj57pj8_b8v4kk40000gn/T/jieba.cache
Loading model cost 0.355 seconds.
Prefix dict has been built successfully.

That dumped cache file is 8 MB. The sample story in lang_defs was parsed quickly 👍
For me, the main question is about adding 50 MB to the pip install (doubling the size) for something that will only be used by a small segment of users. It's not a dealbreaker by any means. This question has more to do with prepping for future lang libraries and requirements, and wondering if this is something that we should consider now with this PR.
@jzohrab As much as I want this integrated into the codebase, I agree that it is best to consider the long term now.
We can discuss it in depth on Discord, but what do you think of me closing this PR and instead opening one for the plugin refactor branch I was working on earlier? It has the same Mandarin codebase and tests.
Hi @cghyzel — could you make a branch with that other plug-in code you had, and open a PR? I have some time over the next few days to get some initial thoughts down in a doc (various questions, etc) and prepare for a smarter discussion. Cheers!
@jzohrab Will do!