tomotopy
Inference is extremely slow
I have a large corpus (30M docs) and a pretrained, inference-only tomotopy model. I want to find the argmax topic for each document in the corpus, and I found through benchmarking (see script here) that list-based inference is roughly 2x faster than corpus-based inference. Even so, with default settings on a 40-core machine, inference is projected to take 125 days. This seems extremely slow considering that training the model took 3 h on a 10M-document corpus.
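For scale, the 125-day estimate for 30M documents works out to a throughput of roughly 2.8 documents per second:

```python
# Back-of-the-envelope check of the quoted estimate:
# 30M documents spread over 125 days of wall-clock time.
corpus_size = 30_000_000
seconds = 125 * 86_400          # seconds in 125 days
docs_per_second = corpus_size / seconds
print(round(docs_per_second, 1))  # → 2.8
```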
My inference script is as follows:
import numpy as np
import tomotopy as tp
from math import ceil
from functools import partial
from tqdm import tqdm

def get_highest_lda(model, topic_words, docs):
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, _ = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [topic_words[k_] for k_ in k]

def chunk(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    docs = [line.strip() for line in open('corpus.txt')]
    lda = tp.LDAModel.load('model.bin')
    # get top 100 words from each topic
    N = 100
    topic_words = [" ".join(word for word, _ in lda.get_topic_words(i, top_n=N)) for i in range(lda.k)]
    # batch size for batched inference
    chunk_size = 512
    map_fn = partial(get_highest_lda, lda, topic_words)
    results = tqdm(map(map_fn, chunk(docs, chunk_size)), total=ceil(len(docs) / chunk_size))
    for batch in results:  # renamed from `chunk` to avoid shadowing the generator above
        for doc in batch:
            print(doc)
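One direction worth trying (not a confirmed fix): `infer` accepts `workers` and `parallel` arguments, and fewer sampling iterations trade accuracy for speed. The sketch below assumes the same `model.bin` and `corpus.txt` as the script above; the batch size of 4096 and `iter=50` are illustrative values, not benchmarked ones.

```python
# Sketch: tuning tomotopy inference throughput. The tomotopy calls are
# guarded so the sketch degrades gracefully when the library or the
# input files are absent.
import os

def batches(seq, size):
    # Yield fixed-size slices of seq (same role as chunk() in the script above).
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

try:
    import tomotopy as tp

    lda = tp.LDAModel.load('model.bin')
    docs = [line.strip() for line in open('corpus.txt')]
    for batch in batches(docs, 4096):       # larger batches amortize per-call overhead
        corpus = [lda.make_doc(d.split()) for d in batch]
        dists, _ = lda.infer(
            corpus,
            iter=50,                        # fewer sampling iterations than the default 100
            workers=os.cpu_count(),         # use all cores explicitly
            parallel=tp.ParallelScheme.PARTITION,  # partition scheme can scale better for many docs
        )
except (ImportError, FileNotFoundError):
    pass  # tomotopy or the input files may not be available in this environment
```

Whether `ParallelScheme.PARTITION` beats the default depends on the model size and document lengths, so it is worth benchmarking both on a small sample first.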
Hi, have you found a solution to this?
I'm hitting the same problem: 100 GB of memory in use, 40 cores enabled, and inference on documents under 5,000 characters runs at about 2 docs/s.
Any luck on this?