tomotopy
Inference against a corpus is segfaulting
I am migrating away from `model.make_doc` to `tp.utils.Corpus`, and am finding that using `Corpus` segfaults. My tiny repro is below:
```python
#!/usr/bin/env python3
import time

import numpy as np
import tomotopy as tp


# Workaround for "`str.split` received unknown kwarg user_data"
class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()


def get_highest_lda_list(model, N, docs):
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, ll = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]


def get_highest_lda_corpus(model, N, docs):
    corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
    corpus.process(doc for doc in docs)
    inferred, ll = model.infer(corpus)
    k = np.argmax([doc.get_topic_dist() for doc in inferred], axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]


if __name__ == "__main__":
    docs = [line.strip() for line in open('10_line_pretokenized_corpus.tsv')]
    lda = tp.LDAModel.load('tm_model.bin')
    N = 10
    t0 = time.time()
    list_res = get_highest_lda_list(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (list)")
    t0 = time.time()
    corpus_res = get_highest_lda_corpus(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (corpus)")
    assert all(e == f for e, f in zip(corpus_res, list_res))
```
When I run this, I see:

```
Took 19.61503529548645 seconds (list)
Segmentation fault (core dumped)
```
Running this with `catchsegv` shows these relevant lines:

```
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f171bd43210]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_ZNSt6vectorIjSaIjEE12emplace_backIJRjEEEvDpOT_+0x7c)[0x7f16d662705c]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z10makeCorpusP16TopicModelObjectP7_objectS2_+0x681)[0x7f16d6db8f51]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z9LDA_inferP16TopicModelObjectP7_objectS2_+0x25a)[0x7f16d6d71b8a]
```
which seems to point to `makeCorpus`... maybe `d.get()` is null?
It seems like my `WSTok` is the issue: it doesn't meet the expected tokenizer interface (`__call__` should return `(tok, start, stop)` tuples rather than bare strings). If I use `tp.utils.SimpleTokenizer(pattern=r"\w+")` instead, it seems to be OK. This is somewhat unexpected, though, so maybe the documentation can be slightly improved.
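For reference, a minimal whitespace tokenizer that yields span tuples might look like the sketch below. The class name `WSSpanTok` and the exact `(token, start, stop)` convention are assumptions inferred from the behavior above, not taken from the documented API:

```python
class WSSpanTok:
    """Whitespace tokenizer yielding (token, start, stop) tuples.

    Hypothetical sketch: the tuple shape Corpus expects is assumed
    from the observation above, not confirmed by the tomotopy docs.
    """

    def __call__(self, raw, **kwargs):
        pos = 0
        for tok in raw.split():
            # Locate each token in the raw string so spans are correct
            # even when runs of whitespace separate the tokens.
            start = raw.index(tok, pos)
            stop = start + len(tok)
            pos = stop
            yield tok, start, stop
```

With something like this, `Corpus` would receive positioned spans the way `SimpleTokenizer` provides them, instead of bare strings.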
Hi @erip, could you share some lines of `10_line_pretokenized_corpus.tsv` for reproducing? A similar error does not occur with the sample text I have, so it is not easy to determine the cause. If you can share a file where the problem reproduces, it will be a great help in finding the cause.
@bab2min are you using the `WSTok` here? It should trigger the error.
Oops, sorry @erip, I forgot this thread entirely. Yes, I used `WSTok` and it worked well. Since I don't have `tm_model.bin` and `10_line_pretokenized_corpus.tsv`, I ran the code, modified like:
```python
class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

docs = ["this is test text", "this is another text", "somewhat long text...."]
corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
corpus.process(doc for doc in docs)
for doc in corpus:
    print(doc)
# it will print
# <tomotopy.Document with words="this is test text">
# <tomotopy.Document with words="this is another text">
# <tomotopy.Document with words="somewhat long text....">
```
I suspect that some lines in `10_line_pretokenized_corpus.tsv` corrupt the inner C++ code.