spaCy [Documentation] Serializing Pipeline unclear

Open DomHudson opened this issue 1 year ago • 2 comments

This page (linked below) says that, to serialize a pipeline, you use the following:

config = nlp.config
bytes_data = nlp.to_bytes()

and that you must take care of storing both yourself and then loading them back from disk.
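For reference, here is a minimal sketch of the full round trip as I understand it from that page. The blank English pipeline and the sentencizer component are stand-ins I chose for illustration, not something the docs prescribe, and storing `config` and `bytes_data` on disk is left out.

```python
import spacy
from spacy.util import get_lang_class

# Illustrative pipeline (an assumption): blank English plus a rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The two pieces that have to be stored separately.
config = nlp.config
bytes_data = nlp.to_bytes()

# Restoring: rebuild the pipeline structure from the config, then load the binary data.
lang_cls = get_lang_class(config["nlp"]["lang"])
restored = lang_cls.from_config(config)
restored.from_bytes(bytes_data)

doc = restored("Hello world. Second sentence.")
```

Writing `config` and `bytes_data` to files and reading them back would still be on the caller here.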

However, it also appears that:

nlp.to_disk('directory_name')

coupled with:

spacy.load('directory_name')

works, and this is a lot simpler. The code executes, and I can call the loaded nlp object on text successfully.
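A minimal sketch of what I mean, using a temporary directory so it can be run as-is. The blank English pipeline and sentencizer are again just placeholders for a real pipeline, and this only shows that the round trip runs, not that it is equivalent to the config and bytes approach in every case.

```python
import tempfile
from pathlib import Path

import spacy

# Illustrative pipeline (an assumption): blank English plus a rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "directory_name"
    # One call writes the pipeline, including its config, to the directory.
    nlp.to_disk(model_dir)
    saved_files = sorted(p.name for p in model_dir.iterdir())
    # One call rebuilds the pipeline from that directory.
    reloaded = spacy.load(model_dir)

doc = reloaded("Hello world. Second sentence.")
```

Listing `saved_files` shows that a `config.cfg` is written next to the component data, which is presumably what lets `spacy.load` rebuild the pipeline without a separate config step.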

Does this approach actually work identically?
1. If so, can we update the documentation? nlp.config and to_bytes seem like implementation details rather than the API for serializing.
2. I didn't see a mention on this page that you can load the persisted pipeline from disk with spacy.load. Should this be added?

If this approach doesn't work, I think we should call this out and build a function/method that handles saving to and loading from disk with a single call. This seems better than having to write your own disk persistence for the config and bytes objects. What do you think?

Thanks!

https://spacy.io/usage/saving-loading

Sep 30 '24 13:09 DomHudson