RFC: Use stanza model for Finnish
This PR is a request for comments about using the stanza model for Finnish and is not meant to be merged in its current state, hence it is a draft.
Unfortunately, Finnish lemmatization is not very accurate. I ran a slightly updated benchmark (https://github.com/aajanki/finnish-pos-accuracy) and found that the spacy lemmatization model used in LinguaCafe has F1=0.842, whereas the default stanza model for Finnish gives F1=0.958.
I tried using stanza through the https://github.com/explosion/spacy-stanza adapter (see the PR code). It works. Also, the code changes are generalizable to other languages (stanza supports over 70 languages).
There is a huge downside though: the size of the resulting Docker image, which I guess is mostly due to the NVIDIA libraries that are automatically downloaded with the pytorch installation.
$ podman system df -v # before
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 85a508ecb0cb 54 minutes 1.095GB 424.3MB 671.1MB 0
...
$ podman system df -v # after
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 2c95c59fdae3 26 minutes 6.906GB 5.53GB 1.376GB 0
...
In conclusion, it is possible to significantly increase accuracy for Finnish (and probably some other languages) without increasing code complexity, at the cost of image size.
What do you think about Finnish lemmatization accuracy and introducing stanza?
Before (lemma is the whole word – incorrect): [screenshot]
After (lemma is correct): [screenshot]
Oh wow, this looks great! I didn't know about this.
I would love to add this. We actually have a language install system, so the image size would not increase; it would only take up space for users who actually use this language.
Does this require a GPU? Can you please test what the size would be without the nvidia driver?
My only problem with it would be GPU dependence, plus the fact that my laptop is probably too weak to test this. After adding the 2 missing Spacy languages, my plan was to use different tokenizers; it would be VERY useful if I could keep using Spacy for more languages.
Thank you so much for working on this!
@sergiolaverde0 You may be interested in this.
@simjanos-dev I am so glad you liked it!
Yes, it works without a GPU: I just added installation of the CPU version of torch on a separate line. The size of the image dropped significantly:
$ podman system df -v # before
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 85a508ecb0cb 54 minutes 1.095GB 424.3MB 671.1MB 0
...
$ podman system df -v # after - GPU
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 2c95c59fdae3 26 minutes 6.906GB 5.53GB 1.376GB 0
...
$ podman system df -v # after - CPU
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 04233fafac2c About a minute 2.804GB 1.428GB 1.376GB 0
What are my next actions? Fix the documentation (add references to stanza in all places where spacy is mentioned), write proper commit and PR messages, and undraft the PR, or is there something else that needs to be done?
Looking at the URL for the Pytorch install, this doesn't need a GPU, since it uses the CPU as the computing platform.
I heard we can reduce the size of that install by compiling Pytorch from source without the unnecessary features, but I haven't done it before and I don't know by how much we can cut it.
If the accuracy increase is noticeable for enough languages, maybe we should consider making it the default. I'm concerned about performance when using the CPU, so that's another thing to check.
I see a future here, but it will take effort.
My enthusiasm has dropped a lot; I thought it would be much smaller. The model size is still huge compared to the 20-50MB models we used before.
A few more questions:
- How much RAM does it use compared to the old model?
- How much more space does it take if you install +1 or +2 languages? On Hugging Face the model is 350MB zipped. I'm asking because I assume there are some shared parts, and not every language will add 1.8GB.
If the accuracy increase is noticeable for enough languages, maybe we should consider making it the default.
I don't think I want to do that. Some users already had issues with RAM. I've seen attempted installs on Raspberry Pis, small free-tier hosted servers and old laptops. I myself have an old laptop. And I also want to host LinguaCafe on a VPS in the future and try to optimize it. I would rather make LinguaCafe smaller by default, if possible. However, I definitely want to add these models as an option as well.
What are my next actions?
I'm not sure, I will need some time to figure out what I would like to do. I will more than likely have a problem with testing this myself.
Since this is only needed for lemmas (except for languages that have no spaces or have readings), what if we used a huge amount of text and generated a list of lemmas that we would use for linguacafe? For most languages, that is the only value added by using a model or tokenizer other than the multilingual Spacy one.
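For illustration, a minimal sketch of how such a lemma list could be generated with stanza (the corpus and output file names are made up, and the most-frequent-lemma heuristic is just one possible choice):

```python
import collections
import json

import stanza

# Hypothetical corpus file: any large plain-text collection would do
text = open("corpus-fi.txt").read()

nlp = stanza.Pipeline("fi", processors="tokenize,mwt,pos,lemma", use_gpu=False, verbose=False)

# Count how often each surface form maps to each lemma
counts: dict[str, collections.Counter] = collections.defaultdict(collections.Counter)
for sentence in nlp(text).sentences:
    for word in sentence.words:
        if word.lemma:
            counts[word.text.lower()][word.lemma] += 1

# Keep the most frequent lemma per surface form
lemma_list = {form: lemma_counts.most_common(1)[0][0] for form, lemma_counts in counts.items()}

# Hypothetical output that LinguaCafe could ship instead of a full model
with open("lemmas-fi.json", "w") as f:
    json.dump(lemma_list, f, ensure_ascii=False)
```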
2 other options would be: adding them as extra installable languages like "Finnish (large)", or adding an API that lets people use other tokenizers. It would be easy to copy the current python container, modify it and add different models.
What do you think?
Well, seeing how we have already more or less frozen the features for v0.12, and since I have an assignment for this weekend, I suggest giving it some time.
Next week I will try to compile Pytorch from source and see what would be the absolute minimum size so we can make a better informed decision.
For the time being the option for larger models is my favourite.
@rominf sorry if we take our time, but Simjanos is right to be concerned about the accessibility of the hardware requirements.
Well, seeing how we have already more or less frozen the features for v0.12, and since I have an assignment for this weekend, I suggest giving it some time.
Implementing this will definitely take a lot of time. I want to add everything to linguacafe that I can, but I can't do it at the rate requests are coming in. It's been insane progress in the last 4 months since release.
Please take your time! I will post my results, so that you have some food for thought in the meantime.
You are right about the importance of accessible hardware requirements: my mistake, I was not thoughtful about this.
I will write about Finnish only, since I have not tried to do lemmatization in other languages.
Stanza language support is split into multiple models. For lemmatization, only the tokenize, mwt and lemma models are required; pos is optional, but it greatly improves the accuracy. The size of the tokenize, mwt and lemma models together is 6.8 MiB; the size of the tokenize, mwt, lemma and pos models is 182.7 MiB.
My PC info:
Processors: 28 × Intel® Core™ i7-14700K
Memory: 62.5 GiB of RAM
Operating System: Fedora Linux 40
Kernel Version: 6.8.8-300.fc40.x86_64 (64-bit)
Python: 3.9.19
Here are the results of lemmatization of the Universal Dependencies treebank:
| model | F1 | tokens/s |
| --- | --- | --- |
| spacy-fi_core_news_lg | 0.871 | 25191 |
| spacy-fi_core_news_md | 0.870 | 24768 |
| spacy-fi_core_news_sm | 0.842 | 27826 |
| stanza-fi (no pos) | 0.879 | 4631 |
| stanza-fi (with pos) | 0.958 | 1794 |
I also measured RAM usage on lemmatization of Alice in Wonderland in Finnish using scalene. Here is the script:
```python
import collections
import sys

text = open("pg46569.txt").read()

if sys.argv[1] == "spacy":
    import spacy

    spacy.require_cpu()
    nlp = spacy.load("fi_core_news_sm", disable=["ner", "parser"])
    # Just to be sure nothing extra happens on the first nlp object call
    nlp("")
    doc = nlp(text)
    # Consume the generator to avoid extra memory allocations
    collections.deque(((token.text, token.lemma_) for token in doc), maxlen=0)
elif sys.argv[1] == "stanza":
    import stanza

    # This will download only the needed models to ~/stanza_resources/ and store them for the next runs
    nlp = stanza.Pipeline("fi", processors="tokenize,mwt,lemma", verbose=False, use_gpu=False)
    # nlp = stanza.Pipeline("fi", processors="tokenize,mwt,pos,lemma", verbose=False, use_gpu=False)
    # Just to be sure nothing extra happens on the first nlp object call
    nlp("")
    doc = nlp(text)
    # Consume the generator to avoid extra memory allocations
    collections.deque(
        ((token.text, token.lemma) for sentence in doc.sentences for token in sentence.words),
        maxlen=0,
    )
elif sys.argv[1] == "simplemma":
    from simplemma import lemmatize, simple_tokenizer

    doc = simple_tokenizer(text, iterate=True)
    # Iterate over the tokenized doc, not the raw text
    collections.deque(((token, lemmatize(token, lang="fi")) for token in doc), maxlen=0)
```
Results:
Python 3.9.19:

| model | max RAM (GiB) | total time (s) |
| --- | --- | --- |
| spacy-fi_core_news_sm | 0.9 | 2.7 |
| stanza-fi (no pos) | 0.4 | 7.5 |
| stanza-fi (with pos) | 2.0 | 23.4 |

Python 3.12.3:

| model | max RAM (GiB) | total time (s) |
| --- | --- | --- |
| spacy-fi_core_news_sm | 0.9 | 2.7 |
| stanza-fi (no pos) | 0.4 | 5.6 |
| stanza-fi (with pos) | 2.0 | 18.2 |
| simplemma | 0.5 | 1.7 |

Python 3.11.9:

| model | max RAM (GiB) | total time (s) |
| --- | --- | --- |
| spacy-fi_core_news_sm | 0.9 | 2.8 |
| stanza-fi (no pos) | 0.4 | 6.6 |
| stanza-fi (with pos) | 2.0 | 19.2 |
| simplemma | 0.6 | 1.7 |
As you can see from the script, I no longer use the spacy_stanza library but call stanza directly: it offers no benefits for this specific task.
This is the size of the image now (without pos):
$ podman system df -v # before
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 85a508ecb0cb 54 minutes 1.095GB 424.3MB 671.1MB 0
...
$ podman system df -v # after
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 9cabe369f271 13 seconds 2.074GB 1.429GB 644.8MB 0
...
To sum up, stanza without the pos processor is a bit more accurate on Finnish than spacy and takes significantly less disk space and RAM, but is much slower. Stanza with the pos processor is much more accurate on Finnish than spacy, but takes significantly more disk space and RAM and is tremendously slower.
The proposal about having multiple variants of a language is my favorite as well!
Do you want me to do a benchmark of spacy vs stanza for other languages?
UPD: added results for Python 3.12 to the Alice in Wonderland test.
UPD: added simplemma to the Alice in Wonderland test.
UPD: added results for Python 3.11 to the Alice in Wonderland test.
This is a really detailed test report, thank you so much!
Operating System: Fedora Linux 40
Kernel Version: 6.8.8-300.fc40.x86_64 (64-bit)
Processors: 28 × Intel® Core™ i7-14700K
Memory: 62.5 GiB of RAM
Wow. I have i5-8250u and 8GB ram.
The proposal about having multiple variants of a language is my favorite as well!
I think we should go with that as well to provide the best experience possible.
At first I was thinking about it the wrong way. My first idea was to have multiple languages for different tokenizers, but I realized it would be extremely difficult to implement, since language names are used in a ton of places.
It is however reasonably simple to switch tokenizers. So we can just make the tokenizer selectable on the admin page without separating them into their own language.
Do you want me to do a benchmark of spacy vs stanza for other languages?
I'm mostly interested in whether, if we add multiple languages, the additional disk space required would decrease due to shared dependencies.
I think the latest 2GB disk size you posted is very reasonable to add as an option. But if the models themselves are so small, is there any way to decrease the disk space further? Can we remove Spacy and use Stanza by itself to save space? I know it returns a different format, but I can write a different tokenizer function for it.
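For reference, a rough sketch of what such a stanza-only tokenizer function might look like, returning the same (text, lemma) pairs as the spacy branch of the benchmark script above (the function name, caching, and return format are assumptions, not LinguaCafe's actual interface):

```python
import stanza

# Cache pipelines so each language's models are loaded only once per process
_pipelines: dict[str, stanza.Pipeline] = {}

def tokenize_with_stanza(text: str, lang: str) -> list[tuple[str, str]]:
    if lang not in _pipelines:
        _pipelines[lang] = stanza.Pipeline(
            lang, processors="tokenize,mwt,lemma", use_gpu=False, verbose=False
        )
    doc = _pipelines[lang](text)
    # Flatten sentences into (surface form, lemma) pairs, like the spacy version
    return [
        (word.text, word.lemma or word.text)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```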
The tokenizer is quite a bit slower, but the PHP side of processing the text takes time as well, so it might not be that much of an issue; plus, users can decide which one they want to use.
I would like to help implement this, but I won't be able to provide testing, or support for users who will have issues with it, because my laptop would die trying to run this.
I will think about how to implement a tokenizer selector. We should probably rebrand installable languages to installable packages or something.
What if we extend your idea about installable lemmatizers even further? Since some people want to run LinguaCafe in constrained environments, what if:
- The size of the usable `linguacafedev_python` image decreased significantly?
- Not just models, but model runners (spacy, stanza) were installable on demand in just a few seconds?
This can be done!
My proposal is to preinstall simplemma instead of spacy, so that the image is minimal. It has a low footprint and runs very fast (as can be seen from the table in my previous message – I added simplemma there) – a good fit for a Raspberry Pi. If the user selects enhanced models, spacy or stanza is installed using uv, which takes just a few seconds (5 seconds on my machine)! This is just one extra call to uv to install the package in a venv.
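A sketch of what that extra call could look like from the Python side (the function name and venv handling are assumptions; the uv command itself is the one timed below):

```python
import os
import subprocess

def install_runner(package: str, venv: str) -> None:
    """Install a model runner (e.g. "stanza") into an existing venv using uv."""
    subprocess.run(
        [
            "uv", "pip", "install", package,
            # CPU-only torch wheels, as in the timings below
            "--extra-index-url", "https://download.pytorch.org/whl/cpu",
        ],
        check=True,
        env={**os.environ, "VIRTUAL_ENV": venv},  # uv installs into this venv
    )
```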
I created four venvs using uv: empty, simplemma, spacy, and stanza. Here is what I got: [venv size comparison screenshot]
pytorch takes up the most space, as @sergiolaverde0 expected.
Here is the showcase of how fast uv is:
(stanza-pip) rominf@rominf-fedora /t/venv> time pip install stanza --extra-index-url https://download.pytorch.org/whl/cpu
...
________________________________________________________
Executed in 20.89 secs fish external
usr time 8.45 secs 0.00 micros 8.45 secs
sys time 0.84 secs 506.00 micros 0.84 secs
(stanza-uv) rominf@rominf-fedora /t/venv> time uv pip install stanza --extra-index-url https://download.pytorch.org/whl/cpu
Resolved 20 packages in 2.30s
Downloaded 20 packages in 2.52s
Installed 20 packages in 285ms
...
________________________________________________________
Executed in 5.13 secs fish external
usr time 1.52 secs 0.00 micros 1.52 secs
sys time 0.95 secs 335.00 micros 0.95 secs
Docker image building with uv becomes much, much faster, and here is the footprint:
$ podman system df -v # before
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 85a508ecb0cb 54 minutes 1.095GB 424.3MB 671.1MB 0
...
$ podman system df -v # after
REPOSITORY TAG IMAGE ID CREATED SIZE SHARED SIZE UNIQUE SIZE CONTAINERS
...
localhost/linguacafedev_python latest 5f19a370db3b 13 seconds 327.9MB 164.9MB 163MB 0
...
PS: please have a look at "UPD" in my previous message: stanza on Python 3.12 is quite a bit faster than on Python 3.9.
Hi, I have a few questions, hoping not to derail this too much:
- How is performance on Python 3.11? That is what Debian 12 currently packages, so RPis and other SBCs probably do the same, and thus it makes for a good baseline of "most users will have this or newer".
- Is there any documentation about Arm compatibility of these models? I don't seem to be finding any. Currently our Python image is only built for Amd64 because some Spacy languages have dependencies that are not available for Arm. Apple Silicon runs the image via virtualization and other devices are unsupported.
- Off topic, but I noticed you are using podman; have you encountered any issues? Some months ago users had trouble trying to run the images with rootless podman, and the only solution we had was asking them to use rootful podman or docker.
About using uv: I'm really not a big fan of using pre-1.0 software in "production" for critical tasks. However, it might seriously make installing extra components faster, and if we are having so many of those, the benefits might outweigh the issues it causes. If we go this route we will have to pin the version and update it manually, unlike the rest of the tools we use.
I see simplemma does not consume less RAM than Stanza without pos. Sure, it is faster, but I think we could skip it, at least for the time being, to reduce mental overload while planning what to do. We might also try a survey to ask users how much they care about text import times.
I also want to remark: Stanza supports languages that Spacy doesn't, so this might solve our Vietnamese issue and maybe our Tagalog issue as a side effect.
I will comment on it more later; just a few quick comments from my phone.
I also want to remark: Stanza supports languages that Spacy doesn't, so this might solve our Vietnamese issue and maybe our Tagalog issue as a side effect.
I want to check other non-spacy tokenizers as well and compare the sizes. I think Spacy is a good default option based on its size, and if there's another, smaller tokenizer for Vietnamese, I would prefer that instead of Stanza.
There's also an option of using the Spacy multilingual model and a simple lemmatizer together. It would be a really good and easy solution for Czech and Latin lemma support.
We could replace spacy with a simple lemmatizer for most languages, but there are 3 points to keep in mind:
- Some languages have or will have gender tagging support.
- We need to make sure that the simple lemmatizer is accurate enough.
- Part of speech may become an important core feature. I'm thinking about adding an option to treat the same word with different pos as 2 different unique words, so they can have more accurate lemmas and readings. This is just an idea, and won't be implemented soon.
I am thinking about it. I have no strong opinions about it, but I feel like using Spacy is a good default option when available.
The importance of tokenization speed will decrease in the future, because I want to make a queue for imports, and users will be able to start reading after the first chapter is finished.
@sergiolaverde0
- I updated the table in the message above. Python 3.11 is a bit slower than 3.12.
- pytorch is available on ARM (https://download.pytorch.org/whl/torch_stable.html; look for `cpu/torch-2.3.0`, `arm64.whl`, and `aarch64.whl`). As for stanza itself, it is fully written in Python and there is no arch in its PyPI classifiers. It should run on ARM fine. Of course, checking it on a cloud server would not harm.
- Yes, I use podman. There were issues with SELinux. Here are short instructions (disclaimer: I am not an SELinux expert and I don't know if this is the most secure way, yet it is for sure more secure than disabling SELinux):
$ git clone -b deploy https://github.com/simjanos-dev/LinguaCafe.git linguacafe && cd linguacafe
$ sudo semanage fcontext -a -t svirt_sandbox_file_t "~/linguacafe(/.*)?"
$ sudo restorecon -vR ~/linguacafe # repeat this command every time after downloading dictionaries into storage/app/dictionaries/
$ sudo setsebool container_manage_cgroup 1
$ sudo chmod 777 -R ~/linguacafe/ # as per original instruction
$ podman-compose up -d
@simjanos-dev
I would like to help implement this, but I won't be able to provide testing, or support for users who will have issues with it, because my laptop would die trying to run this.
Thank you! I can do testing and support users for this feature. Also, I do not think the pos version of stanza will behave differently in any way compared to the non-pos version (except for accuracy), and you should be able to run the non-pos version. :-)
and you should be able to run the non-pos version. :-)
I'll try it out sometime.
Thank you! I can do testing and support users for this feature.
In that case I am open to adding Stanza as an additional option for at least Finnish. If it goes well, I think we can add more languages and Stanza tokenizers. I will do everything on the Laravel and front-end side, and can also do Python if needed. (Honestly, I am a bit worried about having parts of the code that I don't test/support completely.)
What are my next actions?
Currently I think the only thing needed on the Python/docker side is to make it installable like other language packages.
I will experiment with simplemma for Czech, Latin, Ukrainian and Welsh in the future. It also has Hindi, which was a requested language.
I have wanted to split up tokenizer.py for a while, because it keeps growing. Now it will be kind of necessary. Currently it should have 3 files: tokenizer, import and models (I'm not sure if this one can be separated). I will probably do it for v0.13 or v0.14.
It might take a while for me to do my part; I will be working a bit less on linguacafe, and will work on the parts of it that I want to use, because I feel a bit burned out.
And thank you so much for working on this! Both Stanza and Simplemma are great tools for tokenizing; I didn't even know about them.
I did some really quick mockups last night and was able to reduce it to 1.81 GB by changing the base to python:slim and ensuring no cache is used when installing Pytorch.
I will add the first change for v0.13 regardless of what happens with the tokenizers, because there's no reason not to. While doing this I realized we can use Python 3.12 regardless of anything, so sorry for wasting your time with that pointless inquiry.
Later I will test how the size evolves as I replace more and more languages with the Stanza variants, and check if I can shrink Pytorch more.
Did you mean replacing them to test the image size, or did you mean you will replace all spacy packages with stanza?
Edit: I think it was the former. I'm a bit slow today and was confused.
Did you mean replacing them to test the image size, or did you mean you will replace all spacy packages with stanza?
Replace where possible to see the image size I ended up with, and also because the easy way to map the models to a language to test them was to ditch the Spacy counterparts anyways.
And after doing so, to see if space savings from shared dependencies could shrink this image, I found that:
- Eight languages, including Norwegian, Swedish, Croatian and Danish, either lack an mwt model or lack any model altogether, so if I try to install them with the generic `python3 -c 'import stanza; stanza.download("x", processors="tokenize,mwt,lemma")'` it fails. I kept those on their Spacy variants to get the image to build (see the sketch after this list).
- English also lacks a default mwt, since none of the options listed in the docs are marked as a default. It installs just fine, but I don't know how that will impact accuracy and performance; solving it should be easy enough if we dig into the docs deeper.
- Darn, those Stanza models are tiny! Most if not all of them were less than 7MB, so by using them as a replacement for the usually larger Spacy models I reduced the image down to 1.59GB. I ended up with a total of 11 Stanza languages and 9 Spacy languages, if we include the multilingual one as its own.
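If we want a single generic install routine anyway, one possible workaround is to fall back to a pipeline without mwt when the full set is unavailable; this is just a sketch, and the exact exception stanza raises may vary between versions:

```python
import stanza

def download_lemma_models(lang: str) -> None:
    """Download tokenize,mwt,lemma if possible; fall back to tokenize,lemma."""
    try:
        stanza.download(lang, processors="tokenize,mwt,lemma")
    except Exception:
        # Some languages (e.g. the eight above) ship no mwt model at all,
        # so retry with the smaller processor set
        stanza.download(lang, processors="tokenize,lemma")
```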
I'm now going to test that languages other than Finnish actually work, as in, check that they actually load and can tokenize a paragraph. I will be grateful for any help, so I have built a test image on my fork; pull it with `docker pull ghcr.io/sergiolaverde0/linguacafe-python-service:stanza`. Depending on how this goes, I will see how things behave with the languages whose models and dependencies were too big to be shipped by default, like Japanese and Russian.
If the decrease in performance is not that big of a deal, if the rest of the languages can be worked around to be usable, and if they all follow the pattern of using less RAM than their Spacy counterparts, I can vouch for this to be our new default. But those ifs are doing quite the heavy lifting.
Edit: And yes, I'm shying away from compiling Pytorch until we exhaust the alternatives; today I saw their setup.py is 1500 lines long.
A few things to keep in mind regarding replacing default tokenizers with stanza:
- I do not know yet, but part of speech may be needed for an important feature in the future. I haven't decided on it yet.
- Gender tagging is very important to keep where it's available. Some languages in spacy support it, but I haven't added support for them in linguacafe yet. I'm pretty sure it exists in Danish, Swedish, Italian and Spanish.
- Japanese has a post-processing step where I combine multiple words into one after the tokenization, which relies on the word splitting being the way it is in spacy and on having correct POS tags (a toy example of such a merge is sketched below). Chinese and Thai users would also lose data if their words were split differently. I can't speak for their accuracy, except for Japanese, which I find pretty good, except for 2 problems that my post-processing introduced.
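To illustrate why this is fragile, here is a toy version of that kind of merge; it is not LinguaCafe's actual post-processing, it just shows how the result depends on both the split points and the POS tags the tokenizer produces:

```python
def merge_japanese_tokens(tokens: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Toy rule: merge an auxiliary into the preceding verb, keeping the verb's lemma."""
    merged: list[tuple[str, str, str]] = []
    for text, lemma, pos in tokens:
        if merged and pos == "AUX" and merged[-1][2] == "VERB":
            prev_text, prev_lemma, prev_pos = merged[-1]
            merged[-1] = (prev_text + text, prev_lemma, prev_pos)
        else:
            merged.append((text, lemma, pos))
    return merged

# 食べ + た (past-tense auxiliary) -> 食べた with lemma 食べる;
# a tokenizer that splits or tags differently would break this merge
print(merge_japanese_tokens([("食べ", "食べる", "VERB"), ("た", "た", "AUX")]))
```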
~~I will work on moving this post-processing from Laravel to Python today. It is just an additional function, so I will merge it in on Friday after release if there are no PR-s touching the file. If there are, I'll modify my code to avoid creating conflicts with other people's work.~~ Did not work.
https://rominf.github.io/spacy-vs-stanza
Thank you for the tests! This is a VERY detailed list.
I'll do a few other things this week, but on the weekend or next week I am ready to do my part adding more tokenizers to linguacafe starting with Finnish.
You are welcome! It was fun and educational to work on this benchmark.
Here is the release with results in CSV format: https://github.com/rominf/spacy-vs-stanza/releases/tag/v0.1.0.
Feel free to ask any questions, but please note that from 14:00 UTC today until the morning of May 27th (so, ~one week), I will be offline.
I did some rough data analysis, and in summary:
- Stanza with `pos` is always more accurate, except when unavailable, which among the 59 languages is only Macedonian. I think Spacy is available for a couple more languages that were not benchmarked.
- Spacy with `pos` is still smaller than Stanza with `pos` in 23 of the 59 languages. Stanza without `pos` is always smaller, I think.
I have been thinking about this and I'm still unsure how to go about it. I think switching any of the existing tokenizers is technically a breaking change, since the same phrase could get tokenized differently if there was a switch between importing two texts, and I don't think the database would be happy with that.
Re-tokenizing everything when "upgrading" to a bigger model is not a good idea either, so the potential use case for Simplemma as a test bed for people to mess around until they install the "main" model is probably better discarded.
For Finnish specifically we could make it work as an extra language and remove it from the base image, but then it would be a gargantuan extra model of 2 GB. And there is still the matter of the languages exclusive to Stanza. Thoughts?
I think switching any of the existing tokenizers is technically a breaking change, since the same phrase could get tokenized differently if there was a switch between importing two texts, and I don't think the database would be happy with that.
It probably wouldn't be a problem, except for Japanese, Chinese and Thai.
For Finnish specifically we could make it work as an extra language and remove it from the base image, but then it would be a gargantuan extra model of 2 GB. And there is still the matter of the languages exclusive to Stanza. Thoughts?
I think the best option is to keep the current spacy languages as the default, and add Stanza for Finnish as an optional installable language, probably with an option to choose between Spacy and Stanza on the admin page, so already existing users won't have to install a 2GB extra package.
For new languages that spacy doesn't have, I think having stanza as an option for 20+ extra languages is great! Using other tokenizers would probably take up a lot of space as well (I have no idea, just guessing).
(Currently requested languages that I want to add next: Vietnamese, Tagalog, Swahili and Hindi.)
If we decide to add switchable tokenizers, maybe we should also think through installable languages and rename them to installable packages or something similar. I've got some tools linked like #280 and some TTS libraries that I don't think we should add to the main package either. (I don't plan on adding any of them anytime soon.)
(Sorry if I wrote something confusing, I wrote this message very late.)
I am currently working on the importing system. After that (2-3 weeks) I want to work on stanza. Would someone like to do the python side? (I would still do it myself if not.)
I can do the Python side; I will have a bit more spare time starting this week. Adding the option to install the extra-extra-large Finnish model with Pytorch, I can probably implement in one hour ~~and then spend five testing and debugging~~, since I can reuse the code we use for the regular models. How shall I name it for the API calls?
I think we should integrate it into the current api calls. If I remember correctly, I just receive a simple array of the installed languages. I think we should change it to look like this:
```json
{
    "japanese": [
        "spacy"
    ],
    "finnish": [
        "spacy",
        "stanza"
    ]
}
```
This way users can install multiple tokenizers. For the install function we could use/pass a language and a tokenizer post variable.
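A sketch of what that could look like on the python side, assuming a Flask-style service (the actual framework, routes, and install logic in the python container may differ):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory view of installed extra packages per language
installed: dict[str, list[str]] = {}

@app.get("/installed-languages")
def installed_languages():
    # Only extra languages are reported, so a default install returns {}
    return jsonify(installed)

@app.post("/install-language")
def install_language():
    language = request.form["language"]    # e.g. "finnish"
    tokenizer = request.form["tokenizer"]  # e.g. "spacy" or "stanza"
    # ... download the models for (language, tokenizer) here ...
    if tokenizer not in installed.setdefault(language, []):
        installed[language].append(tokenizer)
    return jsonify(success=True)
```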
I am currently working on the tokenizer.py file. I'll rewrite the tokenizers themselves if you want, or you can do it after I am finished. I'm changing a lot of things, so there would probably be a ton of conflicts. But I'm not touching the model functions, so you could work on those, save them somehow, and after I'm done you could just apply/copy-paste the functions into the latest tokenizer.py file.
I probably won't have time to work on this from Monday to Friday, and I think I will be finished with tokenizer.py today or tomorrow.
Is that really practical, given that the spacy model for Finnish would always be present? Currently we only check for the extra languages, so that a default install returns an empty object.
Is that really practical, given that the spacy model for Finnish would always be present? Currently we only check for the extra languages, so that a default install returns an empty object.
Sorry, I made a mistake, this is what I meant:
```json
{
    "japanese": [
        "spacy"
    ],
    "finnish": [
        "stanza"
    ]
}
```
We should only handle extra languages in python; I'll handle the selection between stanza and spacy on the webserver side. But there will be languages where multiple packages other than tokenizers can be installed in the future, which I think should be handled with the same function. So maybe it would look like this:
```json
{
    "japanese": [
        "spacy",
        "manga-ocr"
    ],
    "finnish": [
        "stanza"
    ]
}
```
We could use the current ["plain", "array"] format, but it would be more complicated on the webserver side to combine the array I get from python with the config file of already installed languages and tokenizers, because for selecting tokenizers I will have to use an array structure like the one above.
Please feel free to ask anything or suggest another method. I'm not sure if I explained it correctly.
@sergiolaverde0 I think I am done with working on the tokenizer.py file (99%). The latest version is in the feature/websockets-vue branch. I will merge this into dev probably next week, but maybe even today.
I've mostly finished working on job queues. There are a few small tasks left, I plan to work on stanza next weekend.