tomotopy
Inference is extremely slow
I have a large corpus (30M docs) and a pretrained, inference-only tomotopy model. I want to find the argmax topic for each document in the corpus, and I found through benchmarking (see script here) that list-based inference is roughly 2x faster than corpus-based inference. Even so, with default settings on a 40-core machine, inference is projected to take 125 days. This seems extremely slow considering that training the model took 3 h on a 10M-document corpus.
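For scale, the 125-day estimate for 30M documents works out to a throughput of roughly 2.8 documents per second:

```python
# Back-of-the-envelope check of the quoted estimate:
# 30M documents spread over 125 days of wall-clock time.
corpus_size = 30_000_000
seconds = 125 * 86_400          # seconds in 125 days
docs_per_second = corpus_size / seconds
print(round(docs_per_second, 1))  # → 2.8
```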
My inference script is as follows:
import numpy as np
import tomotopy as tp
from math import ceil
from functools import partial
from tqdm import tqdm

def get_highest_lda(model, topic_words, docs):
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, _ = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [topic_words[k_] for k_ in k]

def chunk(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    docs = [line.strip() for line in open('corpus.txt')]
    lda = tp.LDAModel.load('model.bin')
    # get top 100 words from each topic
    N = 100
    topic_words = [" ".join(word for word, _ in lda.get_topic_words(i, top_n=N)) for i in range(lda.k)]
    # batch size for batched inference
    chunk_size = 512
    map_fn = partial(get_highest_lda, lda, topic_words)
    results = tqdm(map(map_fn, chunk(docs, chunk_size)), total=ceil(len(docs) / chunk_size))
    for batch in results:  # renamed from `chunk` to avoid shadowing the generator above
        for doc in batch:
            print(doc)
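One direction worth trying (not a confirmed fix): `infer` accepts `workers` and `parallel` arguments, and fewer sampling iterations trade accuracy for speed. The sketch below assumes the same `model.bin` and `corpus.txt` as the script above; the batch size of 4096 and `iter=50` are illustrative values, not benchmarked ones.

```python
# Sketch: tuning tomotopy inference throughput. The tomotopy calls are
# guarded so the sketch degrades gracefully when the library or the
# input files are absent.
import os

def batches(seq, size):
    # Yield fixed-size slices of seq (same role as chunk() in the script above).
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

try:
    import tomotopy as tp

    lda = tp.LDAModel.load('model.bin')
    docs = [line.strip() for line in open('corpus.txt')]
    for batch in batches(docs, 4096):       # larger batches amortize per-call overhead
        corpus = [lda.make_doc(d.split()) for d in batch]
        dists, _ = lda.infer(
            corpus,
            iter=50,                        # fewer sampling iterations than the default 100
            workers=os.cpu_count(),         # use all cores explicitly
            parallel=tp.ParallelScheme.PARTITION,  # partition scheme can scale better for many docs
        )
except (ImportError, FileNotFoundError):
    pass  # tomotopy or the input files may not be available in this environment
```

Whether `ParallelScheme.PARTITION` beats the default depends on the model size and document lengths, so it is worth benchmarking both on a small sample first.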
Hi, have you found a solution to this?
I'm hitting the same problem: 100 GB of memory in use, 40 cores enabled, and inference on documents under 5,000 characters runs at about 2 docs/s.
Any luck on this?