[Documentation] Serializing Pipeline unclear

Open DomHudson opened this issue 1 year ago • 2 comments

Summary

This page says that to serialize a pipeline, you use the following:

config = nlp.config
bytes_data = nlp.to_bytes()

and that you must take care of storing both yourself and then loading them back from disk.
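For context, here is a minimal sketch of that documented round trip as I understand it. It assumes spaCy v3 and uses a blank English pipeline with a sentencizer as a stand-in for a real trained pipeline; the name `restored` is only for illustration.

```python
import spacy

# Assumption: a blank English pipeline stands in for a real trained one.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The two pieces the docs say you have to store yourself.
config = nlp.config
bytes_data = nlp.to_bytes()

# Rebuild: look up the language class, recreate the pipeline from the
# config, then load the serialized state into it.
lang_cls = spacy.util.get_lang_class(config["nlp"]["lang"])
restored = lang_cls.from_config(config)
restored.from_bytes(bytes_data)

print(restored.pipe_names)
```

Note that persisting `config` and `bytes_data` to disk is still left to the caller here, which is the part that feels like extra work.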

However, it also appears that:

nlp.to_disk('directory_name')

coupled with:

spacy.load('directory_name')

works, and this is a lot simpler. The code executes and I can successfully call the loaded nlp object on text.
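A minimal sketch of what I tried, assuming spaCy v3. The blank pipeline, the temporary directory, and the sample text are placeholders I picked for illustration. It only compares tokenization on a toy pipeline, so it does not prove the two approaches are equivalent in general.

```python
import tempfile
from pathlib import Path

import spacy

# Assumption: a blank English pipeline stands in for a real trained one.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "directory_name"
    # Single call that writes the config and the component data to disk.
    nlp.to_disk(model_dir)
    saved_files = sorted(p.name for p in model_dir.iterdir())
    # Single call that reads everything back into a new pipeline.
    reloaded = spacy.load(model_dir)

text = "This is a sentence. This is another one."
original_tokens = [t.text for t in nlp(text)]
reloaded_tokens = [t.text for t in reloaded(text)]

print(saved_files)
print(reloaded.pipe_names)
print(original_tokens == reloaded_tokens)
```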

Questions

  1. Does this approach actually work identically?

    1. If so, can we update the documentation? nlp.config and nlp.to_bytes() seem like implementation details rather than the intended API for serializing.
    2. I didn't see any mention on this page that you can load the persisted pipeline from disk with spacy.load. Should this be added?
  2. If this approach doesn't work, I think we should call this out and build a function/method that handles loading and saving to disk with a single call - this seems better than having to write your own disk persistence for the config and bytes object. What do you think?

Thanks!

Which page or section is this issue related to?

https://spacy.io/usage/saving-loading

DomHudson, Sep 30 '24 13:09