gensim Sequential lda optimization

Hi team,

This pull request addresses the issue of the slow computational time of LdaSeqModel #1545 . In the current stage, it contains cythonized methods of SSLM and LdaPost classes. It is not the final version, I plan to add the following changes:

[x] cythonize sslm_counts_init function of sslm class
[x] refactor code (e.g. chain.sslm_counts_init(topic_obs_variance, topic_chain_variance, sstats) in LdaSeqModel class, here class method is used, however I don't know why) + many others
[x] add more documentation for functions
[ ] implement DIM model
[ ] tune memory usage and allocation
[ ] fix bug which I found out in the update_obs function of sslm class. You can compare the logic in the gensim implementation and Blei's lab implementation

By now the performance improvement is 3-5 times on toy datasets e.g. https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_ldaseqmodel.py. I still haven't tried to benchmark for larger datasets as it takes a while to train original implementation, however, the speed up seems to be much higher based on synthetic data I tried.

Extracted topics of optimized model differ a bit from the original one (gensim tests are passed). The exact places in the code which causes it are

update_zeta function of sslm class
np.negative at the end of df_obs Everything except these two places works absolutely the same. From my perspective, it is not a bug in the code, just issues with the precision. (I hope so)

@gojomo @mpenkov @piskvorky What do you think about the idea of such changes in gensim in general, code structure and code itself particularly? I plan to finish with the optimization and bug fixes I mentioned above by the end of this month.

P.S. I've created the pull request before ending to be sure that these changes are valuable for the project :)

Jun 14 '21 19:06 Sharganov

I've run a test using setup from ldaseqmodel example notebook in gensim docs. As for now, the current time distribution is present in the picture below. Almost all time is spent in scipy Python optimization code. The time which is spent in Cython marked with red rectangles. The time in rectangles (right up corner) is in nanoseconds.

The test was done using Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz

ldaseq_optimized_news

The raw output of a profiler + file for kcachegrind is on gdrive

Jun 23 '21 20:06 Sharganov

Thanks! Do you have any high-level numbers, on a larger dataset? To get a sense of the scale of these optimizations, now-vs-before.

Jun 23 '21 20:06 piskvorky

I trained both implementations using the first 10 years of the UN General Debate dataset . The code I used:

from gensim.corpora import Dictionary
from gensim import utils
from gensim.models import LdaSeqModel

import numpy as np
import pandas as pd

data = pd.read_csv("datasets/un-general-debates.csv")
data["year"] = data["year"].apply(lambda x: int(x))
data = data[data["year"] < 1980]

paragraphs = [[(p,year) for p in speech.split("\n") if len(p) > 10] for speech, year  in zip(data.text, data.year)]
paragraphs = pd.DataFrame(data=sorted([p for s in paragraphs for p in s], key=lambda x: x[1]), columns=["text", "year"])

time_slices = paragraphs.year.value_counts().values
paragraphs = [utils.simple_preprocess(p) for p in paragraphs.text]

dictionary = Dictionary(paragraphs)
corpus = [dictionary.doc2bow(text) for text in paragraphs]

del data, paragraphs

model = LdaSeqModel(corpus=corpus,id2word=dictionary, num_topics=15, time_slice=time_slices)

The results: Number of docs: 61674 Vocabulary Length: 29485 Number of topics: 15 Number of time slices: 10

Old Implementation : ~47 hours New Implementation ~3 hours 40 mins

Jul 03 '21 13:07 Sharganov

Thanks for the timings. That's a massive speedup!

And are the new-vs-old results comparable, quality-wise?

Jul 03 '21 13:07 piskvorky

Sure... I accidentally rewrote a trained old model :(

So for now I just trained with two years of data using the same code + removed stopwords using nltk set. I've put the resulting topics in the excel table . As for me topics are very similar.

Regarding the model trained with 10 years of data, let me train it one more time :) I'll back in a couple of days. Meanwhile, I'll find a more formal way to compare two SeqLda models.

Jul 04 '21 15:07 Sharganov

I've trained the model one more time. Pickled models are available on gdrive. I calculated the coherence score for both models as it was done in the docs tutorial. I am not sure that it is the right choice to compare classic vs optimized models using a coherence score, I think they should be equal in a perfect case. Nevertheless, they are very close for all time slices.

The code I used:

from gensim.models.coherencemodel import CoherenceModel
import pickle

[CoherenceModel(topics=model.dtm_coherence(time=i), corpus=corpus, 
                dictionary=dictionary, coherence='u_mass').get_coherence() 
 for i in range(len(time_slices))]

Results:

Time slice	Old	New
1	-2.390386933137473	-2.3849280478010373
2	-2.3854065916265434	-2.38503617012231
3	-2.412550400281885	-2.412624753824853
4	-2.403873956103539	-2.402234764724972
5	-2.4228081694247994	-2.416472944689486
6	-2.462690192130912	-2.466086114010153
7	-2.4189737189539	-2.4169199318713814,
8	-2.484822837691009	-2.488989346989372
9	-2.5564440634792547	-2.557404323965962
10	-2.551624889028128	-2.54892377159758]

Jul 07 '21 20:07 Sharganov

Thanks so much. This is helpful. How can we use the new version? Is it built in the package, or we need to revise any of the codes ourselves?

Sep 01 '21 15:09 lkcao

Good point. @Sharganov can you fix the bug you found above & tune the memory usage? Let's get your work merged & released!

Sep 01 '21 16:09 piskvorky

@Sharganov Ping

Sep 28 '21 14:09 mpenkov

Thats some impressive work @Sharganov. Any news on whether this is anywhere close to being merged/released?

Jan 12 '22 14:01 florianlorisch

Hi everyone, sorry for a long break. I fixed the bug which I mentioned. Also I'll look at memory usage.

Jan 15 '22 19:01 Sharganov

@Sharganov what did you find about the memory? We're planning a new release, and this looks a solid candidate for including if finished up.

Feb 19 '22 19:02 piskvorky

@piskvorky I am planning to look at memory on aprox. 28/02 - 04/03, I'll have a lot of free time on these dates. Hope it fits your plans for release.

Feb 22 '22 13:02 Sharganov

No problem, thanks. If we don't manage to finish the optimization in this release, we can include it in the next.

Feb 22 '22 13:02 piskvorky

This is great. Looking forward to the release! Can this be used in its current state?

Mar 08 '22 17:03 jlevy44

Really looking forward to the faster release, @piskvorky @Sharganov ! Great job!

Mar 25 '22 08:03 bhargavvader

Hi everyone, any updates here?

Jun 16 '22 20:06 jlevy44

@Sharganov how did your memory optimization go? Looks like there is quite some demand for this feature!

Jun 16 '22 20:06 piskvorky

What's the status? Do you need any help/volunteer to bridge gaps against changes in the code base?

Jun 23 '23 07:06 maciejskorski

Please, any updates so far? I really need to enhance the efficiency of the LdaSeqmodel. If we set aside concerns about memory optimization for the time being, would it be possible for me to make local file modifications to load this version? Or does this version have any errors or conflicts with the current one? Thank you for your assistance.

Mar 31 '24 05:03 Fan-chen04

I think it's fair to say this PR is stale / dead. Unless someone picks it up to push it over the line, it's not happening, sorry.

Mar 31 '24 10:03 piskvorky

I think it's fair to say this PR is stale / dead. Unless someone picks it up, it's not happening, sorry.

So pity. I try to make local file modifications to load this version. I am facing the same problem detailed in #3491 when trying to compile gensim-4.3.2.tar.gz from the source. I've followed the recommended steps, including ensuring all prerequisites are met, and using the python setup.py build_ext --inplace command, but I still encounter compilation errors. Any guidance or suggestions based on that issue would be greatly appreciated.

Apr 02 '24 05:04 Fan-chen04

gensim gensim copied to clipboard

Sequential lda optimization

gensim
gensim copied to clipboard