John Bauer

1064 comments by John Bauer

If you encounter any others, we can add individual errors to the training data and hopefully improve the overall quality. Also, the transformer-based models (default_accurate) are significantly better overall....

I definitely like the idea of adding the lemmas it is getting wrong. For both German and Italian we already did some bulk downloads. For German we used some Wiktionary data,...

I don't envision `compound_lemma` happening any time soon. That'd be a whole new project, with limited training resources, and as you have observed, the information you want is already encoded...

German:
```
>>> pipe = stanza.Pipeline("de", package="default_accurate", processors="tokenize,pos,lemma,depparse")
>>> doc = pipe('Sie schneidet sich die Nägel.')
>>> print("{:C}".format(doc))
# text = Sie schneidet sich die Nägel.
# sent_id = 0
...
```

Those better results were with the "accurate" models. The snippets I pasted should show you how to use them.

For the `fr_combined` models (the default), `s'` lemmatized as `soi` occurs in all four treebanks we used:
https://github.com/UniversalDependencies/UD_French-GSD
https://github.com/UniversalDependencies/UD_French-Sequoia
https://github.com/UniversalDependencies/UD_French-Rhapsodie
https://github.com/UniversalDependencies/UD_French-ParisStories
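For anyone wanting to check claims like this against the treebank files themselves, here is a small sketch that counts how a surface form is lemmatized in CoNLL-U data. The sample string is illustrative, not an excerpt from GSD.

```python
# Count how a surface form is lemmatized in CoNLL-U text.
# The sample below is made up for illustration, not a real treebank excerpt.
from collections import Counter

def lemma_counts(conllu_text, form):
    """Return a Counter mapping lemma -> occurrences for the given form."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # cols: ID, FORM, LEMMA, ...; skip multi-word token ranges like "4-5"
        if len(cols) >= 3 and cols[0].isdigit() and cols[1] == form:
            counts[cols[2]] += 1
    return counts

sample = """# sent_id = example
1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_
2\ts'\tsoi\tPRON\t_\t_\t3\texpl\t_\t_
3\tagit\tagir\tVERB\t_\t_\t0\troot\t_\t_
"""

print(lemma_counts(sample, "s'"))  # Counter({'soi': 1})
```

Running this over the four treebanks' `.conllu` files would show the `soi` lemma count for `s'` in each.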

It looks for multiple line breaks for paragraphs. However, if the text is more than 1000 characters long, it chunks it into batches of 1000 before going through the tokenizer...

It is true that individual sentences are not split. There is a flag to separate individual sentences into their own batch: `depparse_min_length_to_batch_separately`. This might help, but it depends on just...
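The idea behind that flag can be sketched like this. This is a simplified illustration with made-up threshold and batch-size defaults, not Stanza's actual batching code:

```python
def batch_sentences(sentences, min_length_to_batch_separately=50, batch_size=4):
    """Group sentences into batches; unusually long ones go in a batch alone.

    Mimics the idea behind depparse_min_length_to_batch_separately: one very
    long sentence would otherwise force padding (and memory use) for every
    other sentence in its batch.
    """
    batches, current = [], []
    for sent in sentences:
        if len(sent) >= min_length_to_batch_separately:
            if current:
                batches.append(current)
                current = []
            batches.append([sent])  # long sentence gets its own batch
        else:
            current.append(sent)
            if len(current) == batch_size:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches

sents = [["w"] * 5, ["w"] * 60, ["w"] * 6, ["w"] * 7]
print([len(b) for b in batch_sentences(sents)])  # [1, 1, 2]
```

Here the 60-token sentence is isolated, so the short sentences around it are padded only to each other's lengths.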

Good point. I added a line for that to the depparse page.

More threads? Better hardware?

> Hello, my corpus is 700G, is there any way to speed up?