John Bauer

Results: 1064 comments by John Bauer

I'm not sure when or if I'll have time to do a deep dive into this, but I will point out that fasttext should be better in general for agglutinative...

Do you have some examples for the problem of periods separated by spaces? The tokenizer is (currently) an LSTM over characters, so if it hasn't seen the data, it won't...

In double checking our tokenizer preparation code, I realize that only a limited number of datasets had the punctuation+space augmentation. I'll try to make that a bit more universal, or...

I retrained the DE tokenizer with occasional whitespace before the sentence final punctuation, and that fixed up this particular sentence splitting. I'll try to add that as a training mechanism...
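
A minimal sketch of what "occasional whitespace before the sentence final punctuation" could look like as a training-data augmentation. This is an illustration only, not the actual Stanza preparation code: the function name `add_space_before_final_punct`, the `prob` parameter, and the punctuation set are all assumptions.

```python
import random

# Assumed set of sentence-final punctuation; a real pipeline may use a larger one.
SENTENCE_FINAL = ".!?"

def add_space_before_final_punct(sentence, prob=0.1, rng=None):
    """With probability `prob`, insert a space before sentence-final punctuation.

    Hypothetical helper for illustration; sentences that do not end in
    punctuation, or that already have whitespace before it, are returned as-is.
    """
    if rng is None:
        rng = random
    ends_in_punct = len(sentence) >= 2 and sentence[-1] in SENTENCE_FINAL
    if ends_in_punct and not sentence[-2].isspace():
        if rng.random() < prob:
            return sentence[:-1] + " " + sentence[-1]
    return sentence
```

Applied to a small fraction of training sentences, this would let a character-level tokenizer see both `gut.` and `gut .` as sentence endings.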

This is now part of the 1.11.0 codebase, and the next time all models are retrained, they should pick up this improvement. DE in particular already has it.

Aware of it. There's a limitation where we are saving plenty of things other than weights in the current file. Config strings and numbers, mostly. Would those still work? On...

Some of the models can be updated to use `weights_only=True` right away, but others require resaving with enums or other data structures removed. Will have to investigate some more.
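
One way the "resaving with enums or other data structures removed" step might look, as a rough sketch: before saving, walk the config and replace enum members with plain strings, so that the saved file holds only primitives and standard containers, which a restricted loader such as `torch.load(..., weights_only=True)` generally accepts. The helper `to_plain` and the `Mode` enum are hypothetical names for illustration, not part of Stanza.

```python
from enum import Enum

class Mode(Enum):
    # Hypothetical enum standing in for whatever a model config might store.
    TRAIN = 1
    PREDICT = 2

def to_plain(value):
    """Recursively replace Enum members with their names.

    Dicts and lists are rebuilt; tuples are converted to lists.
    Everything else (str, int, float, bool, None) is returned unchanged.
    """
    if isinstance(value, Enum):
        return value.name
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value
```

On load, the string would need to be mapped back to the enum (for example `Mode[name]`), which is why existing model files would have to be resaved rather than just reloaded.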

I am finishing up some model training and will be able to make a new release with the updated models soon.

Got it, but that's the *main* branch. The merged updates are in the dev branch, which at that line has `torch.load(... weights_only=True)`: https://github.com/stanfordnlp/stanza/blob/5754ec0488636e90cdab26f43d44583d4efc99f0/stanza/models/common/pretrain.py#L60