spaCy
Make it possible to merge Vocab/StringStore instances
Feature description
Here is the background: to use a spaCy document, you need the matching Vocab/StringStore. But when documents are created/processed in a distributed or multiprocessing setup, each process handles a different subset of documents, and those documents get linked to that process's vocab (in nlp). To save a processed document efficiently, one uses "to_disk" to save it without the vocab. When resuming processing, or simply loading a document later, one needs a vocab that combines the entries from all the parallel processes, so that any of the documents can be deserialised.
Could the feature be a custom component?
I do not think so.
To at least partly answer my own question here: it looks like this could be trivially achieved by running several from_disk calls, one per stored vocab, on a new vocab instance. The result appears to be the merged vocab.
Can somebody confirm that this is the proper way to do this and there are no unforeseen consequences of this approach before I close this?
Hmm, I don't think that will work because it looks like the strings get reset with .from_disk(). (I don't know why, maybe it's a carry-over from the old strings implementation before hashes were used for everything?)
I think the best way currently is to iterate over the strings for each stored vocab and add them to your master vocab instance with strings.add().
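That suggestion could be sketched as follows. This is a minimal, self-contained illustration, not code from the thread: it builds two small vocabs with made-up contents standing in for the per-process vocabs, writes them to a temporary directory, and then copies their strings into a master vocab with strings.add().

```python
import os
import tempfile

from spacy.vocab import Vocab

# Two small vocabs with overlapping strings, standing in for the
# per-process vocabs (the contents here are just illustrative).
v1 = Vocab(strings=["This", "is", "Porsche"])
v2 = Vocab(strings=["This", "is", "three"])

with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, "vocab1.vocab")
    path2 = os.path.join(tmp, "vocab2.vocab")
    v1.to_disk(path1)
    v2.to_disk(path2)

    # Merge: load each stored vocab into its own fresh instance, then
    # copy its strings into the master vocab with strings.add().
    master = Vocab()
    for path in (path1, path2):
        part = Vocab().from_disk(path)
        for s in part.strings:
            master.strings.add(s)

print("Porsche" in master.strings)  # True
print("three" in master.strings)    # True
```

This avoids calling from_disk repeatedly on the same object, so any reset behaviour in from_disk doesn't matter.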
Hmm, it would be good to find out for sure. (And maybe document in more detail what from_disk is supposed to do when called with several different files on the same object!)
@adrianeboyd what exactly do you mean by "strings get reset" and how could that get tested to illustrate that something is done wrong?
I tested this with these three Python scripts, run in separate Python processes (to avoid any effect of in-memory caching etc.):
createvocab1.py:

```python
import spacy

nlp1 = spacy.load("en_core_web_sm")
txt1 = "This is a document. It has four sentences. John Smith works for Apple in San Francisco. He drives a Porsche."
doc1 = nlp1(txt1)
vocab1 = doc1.vocab
vocab1.to_disk("vocab1.vocab")
```
createvocab2.py:

```python
import spacy

nlp2 = spacy.load("en_core_web_sm")
txt2 = "This is a document. It has three sentences. John Smith works for Apple in San Francisco"
doc2 = nlp2(txt2)
vocab2 = doc2.vocab
vocab2.to_disk("vocab2.vocab")
```
combinevocabs.py:

```python
from spacy.vocab import Vocab

vocab1 = Vocab()
vocab2 = Vocab()
vocab1.from_disk("vocab1.vocab")
vocab2.from_disk("vocab2.vocab")
print("Size vocab1", len(vocab1))
print("Size vocab2", len(vocab2))
vocab12 = Vocab()
vocab12.from_disk("vocab1.vocab")
vocab12.from_disk("vocab2.vocab")
print("Size vocab12", len(vocab12))
print("is in vocab1", vocab1["is"])
print("is in vocab2", vocab2["is"])
print("is in vocab12", vocab12["is"])
print("Porsche in vocab1", "Porsche" in vocab1)
print("Porsche in vocab2", "Porsche" in vocab2)
print("Porsche in vocab12", "Porsche" in vocab12)
print("four in vocab1", "four" in vocab1)
print("four in vocab2", "four" in vocab2)
print("four in vocab12", "four" in vocab12)
print("three in vocab1", "three" in vocab1)
print("three in vocab2", "three" in vocab2)
print("three in vocab12", "three" in vocab12)
```
This gives the following output:

```
Porsche in vocab1 True
Porsche in vocab2 False
Porsche in vocab12 True
four in vocab1 True
four in vocab2 False
four in vocab12 True
three in vocab1 False
three in vocab2 True
three in vocab12 True
```
which I think is what I should expect if this works correctly.
Could somebody with a better understanding of the inner workings of spaCy please give feedback on whether the method outlined in the previous comment is a proper way to merge vocabs? Since we work a lot with parallel processing of corpora, each process often ends up with its own vocab instance; loading an arbitrary processed document is then easiest with a merged vocab that works for ALL documents, rather than having to know which vocab file goes with each document. Also, a merged vocab would be far smaller than keeping all those largely overlapping separate vocabs in memory!
See also https://stackoverflow.com/questions/58303670/how-to-merge-spacy-vocab-instances/63204731#63204731
Very late reply here, but just to follow up: the method of merging vocabs by calling from_disk on the same vocab multiple times may work, but it's not the way the API is intended to be used. As Adriane recommended, I think it's better to iterate over the strings.
Note that since this issue was created we have added dev docs on the StringStore and Vocab that should help with understanding the internals.
I cannot imagine that I am the only one running processing pipelines in parallel these days, so I think a dedicated API method for merging vocabs properly would be a good idea, rather than letting people discover the problem and implement the merging themselves. That would also have the advantage that whatever future changes are made to the vocab logic and content, the merge method could always be adapted accordingly.
Sorry again for not following up on this sooner. It hadn't occurred to me last time I commented, but looking at your specific case of serializing and deserializing docs, you can use the DocBin for that - it includes strings in the serialized data, so when deserializing, it doesn't matter if the input vocab already contains them or not - they'll be added if necessary. That should remove any issues with vocab inconsistency you have.
The use of DocBins is separate from having a merge function for Vocabs, but I think it should cover most related use cases. If you (or anyone reading this) has an example where that's insufficient, do let us know.
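For illustration, here is a minimal DocBin round trip. This is a sketch, not code from the thread; it uses a blank pipeline rather than en_core_web_sm so it runs without a downloaded model:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # blank pipeline; no trained model needed
doc = nlp("John Smith works for Apple in San Francisco.")

# DocBin serializes the docs together with the strings they use
data = DocBin(docs=[doc]).to_bytes()

# Deserialize against a completely fresh vocab: strings missing from
# that vocab are added as needed during loading.
nlp2 = spacy.blank("en")
docs = list(DocBin().from_bytes(data).get_docs(nlp2.vocab))
print(docs[0].text)  # John Smith works for Apple in San Francisco.
```

Because the strings travel with the serialized data, no separate vocab merging step is needed on the loading side.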
So you mean the recommended approach would be: for each processed document, create a new DocBin, add just that document to it, and then serialize the whole DocBin as the single document representation on disk?
Of course, my own approach of combining vocabularies would have the advantage of being more storage efficient, as the strings only get saved once for all documents. But the DocBin approach would not require any post-processing hassle, so it would be the much easier way to do it.
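The one-DocBin-per-document approach discussed above could be sketched like this (a self-contained illustration with a blank pipeline; the file names and the .spacy extension are just a convention, not anything the thread specified):

```python
import os
import tempfile

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["He drives a Porsche.", "It has three sentences."]

with tempfile.TemporaryDirectory() as tmp:
    # One DocBin (and one file) per document
    for i, doc in enumerate(nlp.pipe(texts)):
        DocBin(docs=[doc]).to_disk(os.path.join(tmp, f"doc{i}.spacy"))

    # Later, a fresh vocab can load any of the documents independently,
    # since each file carries its own strings.
    nlp2 = spacy.blank("en")
    docs = []
    for name in sorted(os.listdir(tmp)):
        docs.extend(DocBin().from_disk(os.path.join(tmp, name)).get_docs(nlp2.vocab))

print(len(docs))  # 2
```

The per-file overhead is the duplicated strings, but no vocab files need to be tracked or merged at all.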
It wasn't clear to me from your previous questions that you were serializing Docs one at a time. In that case it's true you won't get the efficiency benefits of the DocBin, as each one will have to save all the strings in the Doc (though only once per unique token). But it shouldn't add significant other overhead, and as you noted it should be very easy to do.
The potential for extra efficiency in that use case is a good point though, so thanks for mentioning it, as it is a potential use case for a merge function - we'll keep considering it.