
[WIP] Support adding pipeline component instances

Open · honnibal opened this issue 1 year ago · 4 comments

Add a method Language.add_pipe_instance that allows you to add component instances to the pipeline. This is primarily intended to make it easier to assemble a pipeline dynamically in a script, which is useful for exploration or for situations where you don't care so much about serialisation.

Components that are added this way cannot be reconstructed automatically during deserialisation. Instead, a pipe_instances argument is added to spacy.load().

Example usage 1: Creating an NLP object with a function, and using nlp.to_disk() and nlp.from_disk()

If users wish to assemble the pipeline themselves in a function, they don't need to worry about the config system. They just have to make sure they use the same function to construct the pipeline each time.

import spacy
# Imports assumed for this sketch; exact paths may vary across spaCy versions.
from spacy.pipeline.tagger import Tagger, DEFAULT_TAGGER_MODEL, tagger_score

def make_nlp():
    nlp = spacy.blank("en")
    tagger = Tagger(nlp.vocab, model=DEFAULT_TAGGER_MODEL, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)
    nlp.add_pipe_instance(tagger)
    return nlp

nlp = make_nlp()
nlp.to_disk("/tmp/my_nlp")
# Later
nlp = make_nlp()
nlp.from_disk("/tmp/my_nlp")

Example usage 2: Passing component instances to spacy.load()

If pipe instances are added directly, we won't be able to reconstruct them in spacy.load(). Instead of just making the pipeline impossible to load this way, you can pass the missing instances using the pipe_instances argument. This approach is kind of impractical for components that need the Vocab instance, but I definitely think it's better than just not supporting it.

vocab = Vocab()  # importable via `from spacy.vocab import Vocab`
tagger = Tagger(vocab, model=DEFAULT_TAGGER_MODEL, overwrite=False, scorer=tagger_score, neg_prefix="!", label_smoothing=0.0)
nlp = spacy.load("/tmp/my_nlp", vocab=vocab, pipe_instances={"tagger": tagger})

I added some logic to fetch the vocab instance from a pipe instance, so that nlp = spacy.load("/tmp/my_nlp", pipe_instances={"tagger": tagger}) would work. The value of that logic is a bit dubious.
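
Roughly, that logic amounts to something like the following (an illustrative sketch, not the exact implementation):

def _vocab_from_pipe_instances(pipe_instances):
    # Check each instance's vocab attribute, as discussed below. This is
    # not watertight: nested objects may hold their own Vocab references.
    for pipe in pipe_instances.values():
        vocab = getattr(pipe, "vocab", None)
        if vocab is not None:
            return vocab
    return None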

Questions for discussion

  • We could overload add_pipe, so that you can just pass the pipe instance in there. I've kept this separate to avoid interaction with existing features, at least until we're happy with it. In the long term I think the overload is a bit more consistent with the general design of the API (see the sketch after this list).
  • We could support pipe metadata in add_pipe_instance, but then how would we serialise it? I felt it was better to avoid the complication, and if you want those things to be correct you can assemble the pipeline via the config system. I could easily be missing implications of this metadata not being there though.
  • Naming of the pipe_instances argument?
  • Fishing the vocab instance out of the pipe instances seems dubious. Probably we should just not do this, and ask that the vocab argument be set?
  • Wording of the warnings about vocab instances?
  • Should we check for vocab instances more robustly? We could go through the __dict__, but that still doesn't cover nested objects. I felt that the vocab member was a decent thing to check, albeit not watertight.
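
For concreteness, the overload discussed in the first point might look roughly like this (an illustrative sketch; _add_pipe_from_factory is a hypothetical helper, not part of this PR):

def add_pipe(self, component, name=None, **kwargs):
    if isinstance(component, str):
        # Existing behaviour: the string names a registered factory.
        return self._add_pipe_from_factory(component, name=name, **kwargs)  # hypothetical helper
    # New behaviour: add the already-constructed component instance directly.
    return self.add_pipe_instance(component, name=name)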

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [x] I ran the tests, and all new and existing tests passed.
  • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

honnibal · Jun 10 '23 16:06

> They just have to make sure they use the same function to construct the pipeline each time.

To continue a bit on the usage 1 example, this could become problematic in between spaCy versions, no? If we change the default tagger model, this make_nlp function bypasses the versioning and could break or produce different results?

svlandeg · Jun 21 '23 08:06

The "Example usage 1" to me is crucial to understand how this functionality would/should look like in terms of UX.

I used a tagger for this, which is admittedly a bit of a fiddly use-case for it. Assembling a matcher or something by hand might be more common, or if you're just putting together some arbitrary components.

> Clearly this isn't user-friendly either, because we've now lost a way to just go with the default settings when defining the tok2vec. Also, if we want to better support this approach, we'll need to better document the functions underlying the registry names (right now docs focus on "spacy.HashEmbedCNN.v2" and don't even mention build_hash_embed_cnn_tok2vec)

These are good points to keep in mind. I would suggest getting the core functionality in first, and then at least theres some way to do this, even if the support for the overall workflow in the docs and the library is shaky.

We don't necessarily need to document every one of these registered functions. What we do need are good ways for users to navigate from registry names to the location in the source code that the name resolves to. The VSCode extension I think does this, right? We need something like that on the command-line as well.
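
As a rough illustration, resolving a registry name to its source location is already possible from Python (a minimal sketch using the catalogue-backed registry):

import inspect
from spacy import registry

# Resolve the registered name to the underlying function, then report
# where that function lives in the source tree.
func = registry.architectures.get("spacy.HashEmbedCNN.v2")
print(inspect.getsourcefile(func))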

> To continue a bit on the usage 1 example, this could become problematic in between spaCy versions, no? If we change the default tagger model, this make_nlp function bypasses the versioning and could break or produce different results?

Well, let's say people go and reference undocumented stuff with this workflow. Yes, their code can break between versions because of it. That's a risk they should take in a well-informed way.

Trying to prevent every mistake or shield users from potential problems can be at the expense of flexibility. If we ask what we expect the Python API to be able to do, from a simple flexibility standpoint, to me it definitely should be able to add components by instance.

honnibal · Jun 21 '23 08:06

> I used a tagger for this, which is admittedly a bit of a fiddly use-case for it. Assembling a matcher or something by hand might be more common, or if you're just putting together some arbitrary components.

So which use-cases are we really aiming to support here? The matcher doesn't seem applicable either - at least the built-in Matcher can't actually be used as a pipeline component, as it returns a list of matches on __call__, not the annotated docs like other components do.
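
For reference, using the Matcher this way would need a wrapper along these lines; a hypothetical sketch, not something the library ships:

from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

class MatcherComponent:
    # Hypothetical wrapper: annotates the Doc and returns it, as pipeline
    # components are expected to, instead of returning the raw matches.
    def __init__(self, vocab, patterns):
        self.matcher = Matcher(vocab)
        for label, pats in patterns.items():
            self.matcher.add(label, pats)

    def __call__(self, doc):
        spans = [Span(doc, start, end, label=match_id)
                 for match_id, start, end in self.matcher(doc)]
        doc.ents = filter_spans(spans)
        return doc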

If we're thinking about stateless functions like the evil_component test case added here, it feels like a small effort to add a one-liner to allow for nlp.add_pipe("evil_instance"):

from spacy.language import Language

@Language.component("evil_instance")
def evil_component(doc):
    ...  # component logic elided; a pipeline component must return the Doc
    return doc
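
With that in place, the existing factory-based API covers the case directly:

nlp.add_pipe("evil_instance")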

I can imagine other use-cases where you want to instantiate the component class directly from within code though, like for spacy-llm's (v0.3.0) LLM component:

import spacy
from spacy_llm.pipeline import LLMWrapper
from spacy_llm.tasks import NERTask
from spacy_llm.backends.rest.openai import OpenAIBackend
from spacy_llm.cache import BatchCache

nlp = spacy.blank("en")
cache = BatchCache(path=None, batch_size=342, max_batches_in_mem=42)
backend = OpenAIBackend(config={"model": "text-davinci-003"}, strict=False, max_tries=4, interval=1.0, max_request_time=40)
instance = LLMWrapper(vocab=nlp.vocab, task=NERTask, backend=backend, cache=cache, save_io=False)

nlp.add_pipe_instance(instance, name="my_llm")

Like with the tagger use-case I tried to work through in my first example, you end up spelling out every value for the Python objects you use, because we're currently handling all the defaults through the registered functions & the config.

Does the above code feel like a satisfying solution to offer to users? Or do we need to rethink again how we define these default values? I just want to understand the level of support we want to offer here.

> We don't necessarily need to document every one of these registered functions. What we do need are good ways for users to navigate from registry names to the location in the source code that the name resolves to. The VSCode extension I think does this, right? We need something like that on the command-line as well.

Agreed - this would be good functionality to have!

svlandeg · Jun 26 '23 10:06

> Does the above code feel like a satisfying solution to offer to users? Or do we need to rethink again how we define these default values? I just want to understand the level of support we want to offer here.

I think it's better than nothing, yeah.

I'm interested in whether there's a way we can somehow provide the defaults in Python without duplicating them in the configs. Perhaps we could inspect the function to build the default config? Anyway, we can leave that for a separate discussion; I'd still want add_pipe_instance even if we don't do it.
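
To sketch that idea (purely illustrative; nothing like this exists yet):

import inspect

def defaults_from_function(func):
    # Read keyword defaults off the registered function's signature; these
    # could seed a default config without duplicating the values by hand.
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }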

If people build a suite of their own functions and things, they can manage their own defaults, and they should be able to express a reasonable Python API for assembling a pipeline without having to go through our config system.

honnibal · Jun 26 '23 12:06