
Inference against a corpus is segfaulting

Open erip opened this issue 1 year ago • 4 comments

I am migrating away from model.make_doc to tp.utils.Corpus and am finding that using Corpus segfaults. My tiny repro is here:

#!/usr/bin/env python3

import time

import numpy as np
import tomotopy as tp

# Workaround for `str.split` received unknown kwarg user_data
class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

def get_highest_lda_list(model, N, docs):
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, ll = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]

def get_highest_lda_corpus(model, N, docs):
    corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
    corpus.process(doc for doc in docs)
    # infer on a Corpus returns an inferred corpus, not a distribution array
    result_corpus, ll = model.infer(corpus)
    k = np.argmax([doc.get_topic_dist() for doc in result_corpus], axis=1)
    return [" ".join(e[0] for e in model.get_topic_words(k_, top_n=N)) for k_ in k]

if __name__ == "__main__":
    docs = [line.strip() for line in open('10_line_pretokenized_corpus.tsv')]
    lda = tp.LDAModel.load('tm_model.bin')
    N = 10
    t0 = time.time()
    list_res = get_highest_lda_list(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (list)")
    t0 = time.time()
    corpus_res = get_highest_lda_corpus(lda, N, docs)
    print(f"Took {time.time() - t0} seconds (corpus)")
    assert all(e == f for e, f in zip(corpus_res, list_res))

When I run this, I see:

Took 19.61503529548645 seconds (list)
Segmentation fault (core dumped)

Running this with catchsegv shows these relevant lines:

/usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f171bd43210]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_ZNSt6vectorIjSaIjEE12emplace_backIJRjEEEvDpOT_+0x7c)[0x7f16d662705c]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z10makeCorpusP16TopicModelObjectP7_objectS2_+0x681)[0x7f16d6db8f51]
/home/erip/.venv/lib/python3.8/site-packages/_tomotopy_avx2.cpython-38-x86_64-linux-gnu.so(_Z9LDA_inferP16TopicModelObjectP7_objectS2_+0x25a)[0x7f16d6d71b8a]

which seems to point here... maybe d.get() is null?

erip avatar Aug 01 '22 14:08 erip

It seems like my WSTok is the issue: it doesn't meet the expected interface (__call__ should return (tok, start, stop) tuples, not bare strings). If I use tp.utils.SimpleTokenizer(pattern=r"\w+") instead, it seems to be OK... this is somewhat unexpected, though, so maybe the documentation could be slightly improved.
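For anyone else hitting this, here is a minimal sketch of a whitespace tokenizer that returns position spans instead of bare strings, matching the (tok, start, stop) shape described above. Whether the third element is an end offset or a length is an assumption worth checking against the output of tp.utils.SimpleTokenizer:

```python
import re

class SpanWSTok:
    """Whitespace tokenizer yielding (token, start, stop) spans."""

    def __call__(self, raw, **kwargs):
        # re.finditer gives each token's character offsets in the raw string
        return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", raw)]
```

This would then be passed as tp.utils.Corpus(tokenizer=SpanWSTok(), stopwords=[]) in place of WSTok above.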

erip avatar Aug 01 '22 15:08 erip

Hi @erip, could you share some lines of the file 10_line_pretokenized_corpus.tsv so I can reproduce this? I cannot reproduce a similar error with my own sample text, so it is not easy to determine the cause. If you share a file where the problem is reproducible, it would be a great help in finding the cause.

bab2min avatar Aug 08 '22 16:08 bab2min

@bab2min are you using the WSTok class here? That should trigger the error.

erip avatar Aug 08 '22 17:08 erip

Oops, sorry @erip, I forgot about this thread entirely. Yes, I used WSTok and it worked well. Since I don't have tm_model.bin and 10_line_pretokenized_corpus.tsv, I ran the code, modified like this:

class WSTok:
    def __call__(self, raw, **kwargs):
        return raw.split()

docs = ["this is test text", "this is another text", "somewhat long text...."]

corpus = tp.utils.Corpus(tokenizer=WSTok(), stopwords=[])
corpus.process(doc for doc in docs)
for doc in corpus:
    print(doc)
# it will print
# <tomotopy.Document with words="this is test text">
# <tomotopy.Document with words="this is another text">
# <tomotopy.Document with words="somewhat long text....">

I suspect that some lines in 10_line_pretokenized_corpus.tsv corrupt the inner C++ code.
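One quick way to look for such lines: empty or whitespace-only rows yield zero-token documents, which is one plausible (but unconfirmed) kind of problematic input. A hypothetical helper to scan the file:

```python
def find_suspect_lines(path):
    """Return 1-based numbers of lines that tokenize to zero tokens."""
    suspects = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            # line.split() is empty for blank or whitespace-only lines
            if not line.split():
                suspects.append(i)
    return suspects
```

If this reports any line numbers for 10_line_pretokenized_corpus.tsv, those lines would be the first place to look.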

bab2min avatar Sep 14 '22 12:09 bab2min