Adriane Boyd
As a note, I think warnings get tricky because they'd show up in `spacy train` output.
Thanks for the report! The info provided makes this look specific to the `trf` model, in particular `curated-tokenizers`. If you have a minute, could you create a new venv without...
And what happens if you also install `sentencepiece` in the new venv?
In general this seems to be a known issue related to `sentencepiece`, which is vendored in `curated-tokenizers`. I'm not currently sure exactly which conditions are necessary for you to run...
`nlp.max_length` is not a hard internal constraint, but rather a kind of clunky way to protect users from confusing OOM errors. It was set with the "core" pipelines and a...
Thanks for the suggestion! I think that this description is slightly confusing for users, since `nlp.max_length` itself will behave the same way for all languages. What we need to highlight...
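Since `nlp.max_length` is only a guard and not a hard internal constraint, a minimal sketch of the usual workarounds looks like this (the value `2_000_000` is purely illustrative, and a blank pipeline stands in for a trained one):

```python
import spacy

nlp = spacy.blank("en")  # same attribute exists on trained pipelines
# nlp.max_length exists to protect users from confusing OOM errors;
# it can be raised if you have enough RAM for your texts.
nlp.max_length = 2_000_000  # illustrative value, not a recommendation

# Alternatively, process a long text in smaller pieces, e.g. per paragraph:
long_text = "First paragraph.\n\nSecond paragraph."
docs = list(nlp.pipe(long_text.split("\n\n")))
```

Chunking is usually the safer option for very long texts, since memory use for some components grows faster than linearly with document length.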
Thanks for the examples, they'll be helpful when looking at how to improve the lemmatizers in the future!
The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. In these...
We wouldn't use the lexique data in our pipelines due to the non-commercial clause in the CC BY-NC license, but if the license works for your use case and you'd...
Overall it sounds like a lookup lemmatizer, which doesn't depend on context, might be a better fit for these kinds of examples. You can see how to switch from the...
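As a minimal sketch of that switch (assuming `spacy-lookups-data` is installed to provide the French lookup tables):

```python
import spacy

# A blank French pipeline for illustration; with a trained pipeline you
# would first call nlp.remove_pipe("lemmatizer") before re-adding it.
nlp = spacy.blank("fr")
# mode="lookup" selects the context-independent, table-based lemmatizer
# instead of the POS-dependent rule-based one.
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
# lemmatizer.initialize() would then load the lookup tables from
# spacy-lookups-data before processing any text (not run here).
```

Because the lookup lemmatizer ignores context, it gives the same lemma for a surface form regardless of the surrounding sentence, which avoids errors caused by wrong tags from the statistical tagger.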