Adriane Boyd comments

Results: 347 comments of Adriane Boyd

Training NER models on multiple GPUs (not just one)

(No need for Patreon: this is my job!) For local testing I only have one GPU, so I may not be much immediate help. The `spacy train` CLI doesn't have...

Tokenizer special cases do not work around infix punctuation

Thanks for the report! There are a number of changes coming soon for spaCy v3 that make the tokenizer more consistent, in particular for special cases that contain prefix/suffix/infix punctuation that...

Tokenizer special cases do not work around infix punctuation

It turned out that this has too much of an effect on the existing tokenizer settings, which were designed without this infix special case checking. It might be possible to...

It's sometimes difficult to initialize pipeline components in code

There is the same issue for the `lemmatizer` with its lookup tables. It doesn't call `validate_get_examples`, though; it just ignores it, so you can call `nlp.get_pipe("lemmatizer").initialize()`. The warning isn't helpful...

"Value Error: bytes object is too large" when using to_disk on large model.

Thanks for the report! This seems to be a hard-coded limit in msgpack: https://github.com/explosion/srsly/blob/03e8861eb08b3c33cc86e7c2e049e5b126538dff/srsly/msgpack/_packer.pyx#L44 We'll look into it, but since I'm not sure why msgpack has this limit, I'm not...
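
One possible explanation, offered as a sketch rather than a confirmed cause: the msgpack format's largest binary type, `bin 32`, stores the payload length in a 4-byte unsigned field, so a single bytes object longer than 2**32 - 1 bytes cannot be described by the header. The helper name `bin32_header` and the constant `BIN32_MAX_LEN` are hypothetical and are not part of srsly or msgpack; whether this is why the limit in the linked file exists is an assumption.

```python
import struct

# Largest length a 4-byte unsigned field can hold (assumed to mirror the limit).
BIN32_MAX_LEN = 2**32 - 1


def bin32_header(payload_len: int) -> bytes:
    """Build a msgpack 'bin 32' header: type byte 0xc6 + big-endian uint32 length.

    Hypothetical helper for illustration only; not an srsly or msgpack API.
    """
    if payload_len < 0 or payload_len > BIN32_MAX_LEN:
        # The length no longer fits in the 4-byte field.
        raise ValueError("bytes object is too large")
    return b"\xc6" + struct.pack(">I", payload_len)
```

Under this reading, a large model would have to be split into several smaller bytes objects before serialization, since no header in the format can carry a longer length.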

Suffix doesn't match for sentence ending in uppercase.

I can't think of anything major, but to be on the safe side we should test it with all the internal training corpora. Let me see...

Use mmap to share models across processes and speed up loading

Here is a related discussion: https://github.com/explosion/spaCy/discussions/5051

Issue resuming training on transformer based NER

Oops, yeah that should be done differently. (But I don't understand why this ends up different in the second round than in the first?)

Issue resuming training on transformer based NER

It isn't just for timing purposes because you're not actually running the final component (which is the NER model you're trying to train) unless you iterate over that generator. (Earlier...
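
The point about the generator can be illustrated with a minimal, library-free sketch. `lazy_pipe` and `final_component` are hypothetical stand-ins, not spaCy functions; the only assumption is that the pipeline call in question returns a generator, as the comment describes.

```python
from typing import Callable, Iterable, Iterator, List


def lazy_pipe(
    texts: Iterable[str], components: List[Callable[[str], str]]
) -> Iterator[str]:
    """Hypothetical streaming pipeline: applies each component lazily."""
    for text in texts:
        for component in components:
            text = component(text)
        yield text


calls: List[str] = []


def final_component(text: str) -> str:
    """Stand-in for the last pipeline component; records each invocation."""
    calls.append(text)
    return text.upper()


docs = lazy_pipe(["a", "b"], [final_component])
ran_before = len(calls)  # creating the generator has not run the component
results = list(docs)     # iterating is what actually drives the pipeline
ran_after = len(calls)   # now the component has run once per text
```

Creating the generator does no work; the final component only runs once something consumes the results.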

Handle sentence boundaries from multiple components

Suggestions from @DomHudson in https://github.com/explosion/spaCy/issues/5050#issuecomment-590235869:

> In my opinion the combination of `{None, True, False}` is not transparent or flexible enough to provide the information that it is currently trying...