John Bauer

1064 comments by John Bauer

If you encounter any others, we can add individual errors to the training data and hopefully improve the overall quality. Also, the transformer-based models (default_accurate) are significantly better overall....

I definitely like the idea of adding the lemmas it is getting wrong. For both German and Italian we already did some bulk downloads. For German we used some Wiktionary data,...

I don't envision `compound_lemma` happening any time soon. That'd be a whole new project, with limited training resources, and as you have observed, the information you want is already encoded...

German:
```
>>> pipe = stanza.Pipeline("de", package="default_accurate", processors="tokenize,pos,lemma,depparse")
>>> doc = pipe('Sie schneidet sich die Nägel.')
>>> print("{:C}".format(doc))
# text = Sie schneidet sich die Nägel.
# sent_id = 0
...
```

Those better results were with the "accurate" models. The snippets I pasted should show you how to use them.

For the `fr_combined` models (the default), `s'` lemmatized as `soi` occurs in all four treebanks we used:
https://github.com/UniversalDependencies/UD_French-GSD
https://github.com/UniversalDependencies/UD_French-Sequoia
https://github.com/UniversalDependencies/UD_French-Rhapsodie
https://github.com/UniversalDependencies/UD_French-ParisStories
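For anyone wanting to check claims like this against the treebank files themselves, here is a small sketch that counts how a surface form is lemmatized in CoNLL-U data. The sample string is illustrative, not an excerpt from GSD.

```python
# Count how a surface form is lemmatized in CoNLL-U text.
# The sample below is made up for illustration, not a real treebank excerpt.
from collections import Counter

def lemma_counts(conllu_text, form):
    """Return a Counter mapping lemma -> occurrences for the given form."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # cols: ID, FORM, LEMMA, ...; skip multi-word token ranges like "4-5"
        if len(cols) >= 3 and cols[0].isdigit() and cols[1] == form:
            counts[cols[2]] += 1
    return counts

sample = """# sent_id = example
1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_
2\ts'\tsoi\tPRON\t_\t_\t3\texpl\t_\t_
3\tagit\tagir\tVERB\t_\t_\t0\troot\t_\t_
"""

print(lemma_counts(sample, "s'"))  # Counter({'soi': 1})
```

Running this over the four treebanks' `.conllu` files would show the `soi` lemma count for `s'` in each.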

It looks for multiple line breaks for paragraphs. However, if the text is more than 1000 characters long, it chunks it into batches of 1000 before going through the tokenizer...

It is true that individual sentences are not split. There is a flag to separate individual sentences into their own batch: `depparse_min_length_to_batch_separately`. This might help, but it depends on just...
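The idea behind that flag can be sketched like this. This is a simplified illustration with made-up threshold and batch-size defaults, not Stanza's actual batching code:

```python
def batch_sentences(sentences, min_length_to_batch_separately=50, batch_size=4):
    """Group sentences into batches; unusually long ones go in a batch alone.

    Mimics the idea behind depparse_min_length_to_batch_separately: one very
    long sentence would otherwise force padding (and memory use) for every
    other sentence in its batch.
    """
    batches, current = [], []
    for sent in sentences:
        if len(sent) >= min_length_to_batch_separately:
            if current:
                batches.append(current)
                current = []
            batches.append([sent])  # long sentence gets its own batch
        else:
            current.append(sent)
            if len(current) == batch_size:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches

sents = [["w"] * 5, ["w"] * 60, ["w"] * 6, ["w"] * 7]
print([len(b) for b in batch_sentences(sents)])  # [1, 1, 2]
```

Here the 60-token sentence is isolated, so the short sentences around it are padded only to each other's lengths.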

Good point. I added a line for that to the depparse page.

More threads? Better hardware?

> Hello, my corpus is 700G, is there any way to speed up?