Memory leak of MorphAnalysis object.
I have encountered a critical bug that makes continuous tokenization with the Japanese tokenizer close to impossible. It is all due to a memory leak of MorphAnalysis.
How to reproduce the behaviour
import gc
import tracemalloc

import spacy

tracemalloc.start()
tokenizer = spacy.blank("ja")
tokenizer.add_pipe("sentencizer")

for _ in range(1000):
    text = " ".join(["a"] * 1000)
    snapshot = tracemalloc.take_snapshot()
    with tokenizer.memory_zone():
        doc = tokenizer(text)
    tokenizer.max_length = len(text) + 10
    gc.collect()
    snapshot2 = tracemalloc.take_snapshot()
    # Compare the two snapshots
    p_stats = snapshot2.compare_to(snapshot, "lineno")
    # Pretty print the top 10 differences
    print("[ Top 10 ]")
    # Stop here with pdb
    for stat in p_stats[:10]:
        if stat.size_diff > 0:
            print(stat)
Run this script and observe how memory keeps growing.
It all happens due to this line:
token.morph = MorphAnalysis(self.vocab, morph)
I have checked the implementation itself: it has no deallocation code, and it does not support memory_zone.
We have observed similar issues in our pipeline. As you can see in this minimal example with the da_core_news_md model, the vocab keeps growing:
import spacy

nlp = spacy.load("da_core_news_md")
test_texts = [
    "Varmere vintre: Flere trækfugle forurener søerne",
    "De højere vintertemperaturer giver problemer for landets søer.",
    "Blandt andet fordi flere trækfugle sover på vandet.",
    "I 1980'erne var der omkring 200 grågæs i Danmark om vinteren.",
    "I dag kan der være helt op mod 100.000.",
]
for text in test_texts:
    print("Vocab size before nlp:", len(nlp.vocab))
    with nlp.memory_zone():
        doc = nlp(text)
        print("Vocab size after nlp:", len(nlp.vocab))
    print("Vocab size out of memory zone:", len(nlp.vocab))
Output:
Vocab size before nlp: 2269
Vocab size after nlp: 2275
Vocab size out of memory zone: 2275
Vocab size before nlp: 2275
Vocab size after nlp: 2283
Vocab size out of memory zone: 2283
Vocab size before nlp: 2283
Vocab size after nlp: 2291
Vocab size out of memory zone: 2291
Vocab size before nlp: 2291
Vocab size after nlp: 2300
Vocab size out of memory zone: 2300
Vocab size before nlp: 2300
Vocab size after nlp: 2308
Vocab size out of memory zone: 2308
When trying to modify and then access MorphAnalysis, a KeyError is raised for a hash in the StringStore:
for text in test_texts:
    with nlp.memory_zone():
        doc = nlp(text)
        for token in doc:
            morph_str = str(token.morph)
            if "Definite" in morph_str:
                definite = token.morph.get("Definite")[0]
                new_morph_str = morph_str.replace(definite, "foo")
                token.set_morph(new_morph_str)
                token.morph.get("Definite")
Output:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[24], line 20
     18 new_morph_str = morph_str.replace(definite, "foo")
     19 token.set_morph(new_morph_str)
---> 20 token.morph.get("Definite")
File ~/.venv/lib/python3.11/site-packages/spacy/tokens/morphanalysis.pyx:71, in spacy.tokens.morphanalysis.MorphAnalysis.get()
File ~/.venv/lib/python3.11/site-packages/spacy/strings.pyx:162, in spacy.strings.StringStore.__getitem__()
KeyError: "[E018] Can't retrieve string for hash '6324204924076910789'. This usually refers to an issue with the `Vocab` or `StringStore`."
@hynky1999 Are the Japanese morphological tags open-class, or are they a closed set? I've assumed that the morphology tags are a closed set and can be added to the string-store without problems.
Regarding deallocation, the MorphAnalysis object doesn't need deallocation code. It's a Python object with a C struct, and the C struct doesn't make any heap allocations. So the memory is freed as normal by Python's reference counting.
@lise-brinck Thanks for the example code. I've found a bug in the memory zone handling that causes this. I'll release a patch shortly.
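On the reference-counting point above, here is a small sketch of how a plain Python object is released as soon as its last strong reference goes away. It assumes CPython, and `Analysis` is a hypothetical stand-in class, not spaCy's MorphAnalysis.

```python
import weakref


class Analysis:
    """Hypothetical stand-in for a small wrapper object; not spaCy's MorphAnalysis."""

    def __init__(self, key):
        self.key = key


obj = Analysis(42)
ref = weakref.ref(obj)   # a weak reference does not keep obj alive
assert ref() is obj      # the object is still reachable here
del obj                  # drop the last strong reference
freed = ref() is None    # on CPython the object is released immediately
print("freed:", freed)
```

This only shows that the wrapper object itself does not need explicit deallocation code. It says nothing about data the object registers in longer-lived tables when it is created.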
Hi @honnibal, I expressed myself incorrectly.
Yes, you are right, the MorphAnalysis object is indeed a struct. The issue is rather with its creation, as it calls self.vocab.morphology.add(features).
This results in allocating new tags without any deallocation here. They would only get deallocated if the self.vocab.morphology object were deleted, but I don't think that ever happens, and certainly not with respect to memory zones.
https://github.com/explosion/spaCy/blob/master/spacy/morphology.pyx#L135-L136
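To illustrate the distinction being described, here is a toy model of an interning table whose entries are either permanent or scoped to a zone. Everything in it (`InternTable`, its `add` and `memory_zone` methods) is hypothetical and deliberately simplified; it does not reproduce spaCy's Morphology or StringStore internals.

```python
from contextlib import contextmanager


class InternTable:
    """Toy interning table; a hypothetical model, not spaCy's implementation."""

    def __init__(self):
        self._permanent = set()
        self._transient = None  # becomes a set while a zone is active

    def add(self, features):
        """Register a feature string and return its hash."""
        if features not in self._permanent:
            # Outside a zone new entries are permanent; inside they are transient.
            target = self._permanent if self._transient is None else self._transient
            target.add(features)
        return hash(features)

    def __len__(self):
        return len(self._permanent) + len(self._transient or ())

    @contextmanager
    def memory_zone(self):
        self._transient = set()
        try:
            yield self
        finally:
            self._transient = None  # entries added inside the zone are dropped


table = InternTable()
for i in range(100):
    table.add(f"Feat={i}")        # no zone: every new entry stays forever
size_without_zone = len(table)

with table.memory_zone():
    for i in range(100, 200):
        table.add(f"Feat={i}")    # inside a zone: entries are transient
    size_inside_zone = len(table)
size_after_zone = len(table)

print(size_without_zone, size_inside_zone, size_after_zone)
```

In this model the table holds 100 entries before the zone, 200 inside it, and 100 again after it exits. A table whose `add` always writes to the permanent set, which is the behaviour described in this comment, would stay at 200.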