gensim
                                
                                 gensim copied to clipboard
                                
                                    gensim copied to clipboard
                            
                            
                            
                        Sequential lda optimization
Hi team,
This pull request addresses the issue of the slow computational time of LdaSeqModel #1545 . In the current stage, it contains cythonized methods of SSLM and LdaPost classes. It is not the final version, I plan to add the following changes:
- [x] cythonize sslm_counts_initfunction of sslm class
- [x] refactor code (e.g. chain.sslm_counts_init(topic_obs_variance, topic_chain_variance, sstats)inLdaSeqModelclass, here class method is used, however I don't know why) + many others
- [x] add more documentation for functions
- [ ] implement DIM model
- [ ] tune memory usage and allocation
- [ ] fix bug which I found out in the update_obsfunction ofsslmclass. You can compare the logic in the gensim implementation and Blei's lab implementation
By now the performance improvement is 3-5 times on toy datasets e.g. https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_ldaseqmodel.py. I still haven't tried to benchmark for larger datasets as it takes a while to train original implementation, however, the speed up seems to be much higher based on synthetic data I tried.
Extracted topics of optimized model differ a bit from the original one (gensim tests are passed). The exact places in the code which causes it are
- update_zetafunction of- sslmclass
- np.negativeat the end of- df_obsEverything except these two places works absolutely the same. From my perspective, it is not a bug in the code, just issues with the precision. (I hope so)
@gojomo @mpenkov @piskvorky What do you think about the idea of such changes in gensim in general, code structure and code itself particularly? I plan to finish with the optimization and bug fixes I mentioned above by the end of this month.
P.S. I've created the pull request before ending to be sure that these changes are valuable for the project :)
I've run a test using setup from ldaseqmodel example notebook in gensim docs. As for now, the current time distribution is present in the picture below. Almost all time is spent in scipy Python optimization code. The time which is spent in Cython marked with red rectangles. The time in rectangles (right up corner) is in nanoseconds.
The test was done using Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
The raw output of a profiler + file for kcachegrind is on gdrive
Thanks! Do you have any high-level numbers, on a larger dataset? To get a sense of the scale of these optimizations, now-vs-before.
I trained both implementations using the first 10 years of the UN General Debate dataset . The code I used:
from gensim.corpora import Dictionary
from gensim import utils
from gensim.models import LdaSeqModel
import numpy as np
import pandas as pd
data = pd.read_csv("datasets/un-general-debates.csv")
data["year"] = data["year"].apply(lambda x: int(x))
data = data[data["year"] < 1980]
paragraphs = [[(p,year) for p in speech.split("\n") if len(p) > 10] for speech, year  in zip(data.text, data.year)]
paragraphs = pd.DataFrame(data=sorted([p for s in paragraphs for p in s], key=lambda x: x[1]), columns=["text", "year"])
time_slices = paragraphs.year.value_counts().values
paragraphs = [utils.simple_preprocess(p) for p in paragraphs.text]
dictionary = Dictionary(paragraphs)
corpus = [dictionary.doc2bow(text) for text in paragraphs]
del data, paragraphs
model = LdaSeqModel(corpus=corpus,id2word=dictionary, num_topics=15, time_slice=time_slices)
The results: Number of docs: 61674 Vocabulary Length: 29485 Number of topics: 15 Number of time slices: 10
Old Implementation : ~47 hours New Implementation ~3 hours 40 mins
Thanks for the timings. That's a massive speedup!
And are the new-vs-old results comparable, quality-wise?
Sure... I accidentally rewrote a trained old model :(
So for now I just trained with two years of data using the same code + removed stopwords using nltk set. I've put the resulting topics in the excel table . As for me topics are very similar.
Regarding the model trained with 10 years of data, let me train it one more time :) I'll back in a couple of days. Meanwhile, I'll find a more formal way to compare two SeqLda models.
I've trained the model one more time. Pickled models are available on gdrive. I calculated the coherence score for both models as it was done in the docs tutorial. I am not sure that it is the right choice to compare classic vs optimized models using a coherence score, I think they should be equal in a perfect case. Nevertheless, they are very close for all time slices.
The code I used:
from gensim.models.coherencemodel import CoherenceModel
import pickle
[CoherenceModel(topics=model.dtm_coherence(time=i), corpus=corpus, 
                dictionary=dictionary, coherence='u_mass').get_coherence() 
 for i in range(len(time_slices))]
Results:
| Time slice | Old | New | 
|---|---|---|
| 1 | -2.390386933137473 | -2.3849280478010373 | 
| 2 | -2.3854065916265434 | -2.38503617012231 | 
| 3 | -2.412550400281885 | -2.412624753824853 | 
| 4 | -2.403873956103539 | -2.402234764724972 | 
| 5 | -2.4228081694247994 | -2.416472944689486 | 
| 6 | -2.462690192130912 | -2.466086114010153 | 
| 7 | -2.4189737189539 | -2.4169199318713814, | 
| 8 | -2.484822837691009 | -2.488989346989372 | 
| 9 | -2.5564440634792547 | -2.557404323965962 | 
| 10 | -2.551624889028128 | -2.54892377159758] | 
Thanks so much. This is helpful. How can we use the new version? Is it built in the package, or we need to revise any of the codes ourselves?
Good point. @Sharganov can you fix the bug you found above & tune the memory usage? Let's get your work merged & released!
@Sharganov Ping
Thats some impressive work @Sharganov. Any news on whether this is anywhere close to being merged/released?
Hi everyone, sorry for a long break. I fixed the bug which I mentioned. Also I'll look at memory usage.
@Sharganov what did you find about the memory? We're planning a new release, and this looks a solid candidate for including if finished up.
@piskvorky I am planning to look at memory on aprox. 28/02 - 04/03, I'll have a lot of free time on these dates. Hope it fits your plans for release.
No problem, thanks. If we don't manage to finish the optimization in this release, we can include it in the next.
This is great. Looking forward to the release! Can this be used in its current state?
Really looking forward to the faster release, @piskvorky @Sharganov ! Great job!
Hi everyone, any updates here?
@Sharganov how did your memory optimization go? Looks like there is quite some demand for this feature!
What's the status? Do you need any help/volunteer to bridge gaps against changes in the code base?
Please, any updates so far? I really need to enhance the efficiency of the LdaSeqmodel. If we set aside concerns about memory optimization for the time being, would it be possible for me to make local file modifications to load this version? Or does this version have any errors or conflicts with the current one? Thank you for your assistance.
I think it's fair to say this PR is stale / dead. Unless someone picks it up to push it over the line, it's not happening, sorry.
I think it's fair to say this PR is stale / dead. Unless someone picks it up, it's not happening, sorry.
So pity. I try to make local file modifications to load this version. I am facing the same problem detailed in #3491 when trying to compile gensim-4.3.2.tar.gz from the source. I've followed the recommended steps, including ensuring all prerequisites are met, and using the python setup.py build_ext --inplace command, but I still encounter compilation errors. Any guidance or suggestions based on that issue would be greatly appreciated.