NiMARE
Optimize GCLDAModel with `numba` and `Parallel`
Closes None.
Currently, the GCLDA model is slow, mainly when the vocabulary is large (for example, the 3228 terms from Neurosynth). `_update_word_topic_assignments` takes ~73 seconds per iteration and `_update_peak_assignments` takes ~52 seconds per iteration, for a total of ~125 seconds per iteration in `_update()`. Compiling and running these functions with `numba` reduces these times to roughly 2, 4, and 6 seconds per iteration, respectively.
Changes proposed in this pull request:
- Convert `_update_word_topic_assignments` and `_update_peak_assignments` to static methods, and add a decorator to compile and run them with `numba` in `nopython` mode (see the sketch below).
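To make the proposed change concrete, here is a minimal sketch of the pattern (the function `_update_counts` and its arguments are hypothetical stand-ins, not NiMARE's actual signatures): a former instance method becomes a function that only touches NumPy arrays and scalars, so numba can compile it in `nopython` mode.

```python
import numpy as np
from numba import njit  # njit is equivalent to jit(nopython=True)


@njit
def _update_counts(word_topic_counts, word_idx, topic_idx):
    """Toy stand-in for a Gibbs-sampling count update, compiled in nopython mode."""
    for i in range(word_idx.shape[0]):
        word_topic_counts[word_idx[i], topic_idx[i]] += 1
    return word_topic_counts


# The first call triggers JIT compilation; later calls reuse the compiled code.
counts = np.zeros((5, 3), dtype=np.int64)
words = np.array([1, 1, 4, 3], dtype=np.int64)
topics = np.array([0, 1, 2, 0], dtype=np.int64)
_update_counts(counts, words, topics)
```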
Codecov Report
Base: 88.56% // Head: 87.68% // Decreases project coverage by -0.87% :warning:
Coverage data is based on head (fc7c5e6) compared to base (1957869). Patch coverage: 77.21% of modified lines in pull request are covered.
:exclamation: Current head fc7c5e6 differs from pull request most recent head 0d2939c. Consider uploading reports for the commit 0d2939c to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main #744 +/- ##
==========================================
- Coverage 88.56% 87.68% -0.88%
==========================================
Files 38 35 -3
Lines 4371 3913 -458
==========================================
- Hits 3871 3431 -440
+ Misses 500 482 -18
| Impacted Files | Coverage Δ | |
|---|---|---|
| nimare/annotate/gclda.py | 92.04% <77.21%> (-5.27%) | :arrow_down: |
| nimare/meta/utils.py | 95.73% <0.00%> (-0.03%) | :arrow_down: |
| nimare/dataset.py | 89.96% <0.00%> (ø) | |
| nimare/__init__.py | | |
| nimare/base.py | | |
| nimare/utils.py | | |
With the changes to the two update methods, it seems like they could be converted to functions instead. Is there any benefit to using @staticmethod over making them functions?
You're right. The two update methods can be converted to functions. I don't see any benefit in using staticmethod in this case.
Given these limitations (precision issue with sum, and codecov of jitted functions), do you think we should try Cython for these two functions?
I think @jdkent or @adelavega would be better able to answer that.
@adelavega `_update_peak_assignments` is taking a total of 132.477 s per iteration in pure Python (3-4 s with numba). The line (236) that is having issues in numba is also the one that takes ~20% of the running time.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
.
.
.
170 # Seed random number generator
171 2 168.0 84.0 0.0 np.random.seed(randseed)
172
173 # Iterate over all peaks x, and sample a new y and r assignment for each
174 768919 904087.0 1.2 0.7 for i_ptoken, doc in enumerate(doc_idx):
175 768918 646397.0 0.8 0.5 topic = peak_topic_idx[i_ptoken]
176 768918 622581.0 0.8 0.5 region = peak_region_idx[i_ptoken]
177
178 # Decrement count-matrices to remove current ptoken_topic_idx
179 # because ptoken will be reassigned to a new topic
180 768918 1234560.0 1.6 0.9 ny_region_topic[region, topic] -= 1 # Decrement count in Subregion x Topic count matx
181 768918 1064501.0 1.4 0.8 ny_doc_topic[doc, topic] -= 1 # Decrement count in Document x Topic count matrix
182
183 # Retrieve the probability of generating current x from all
184 # subregions: [R x T] array of probs
185 768918 1719597.0 2.2 1.3 p_x_subregions = (peak_probs[i_ptoken, :, :]).transpose()
186
187 # Compute the probabilities of all subregions given doc
188 # p(r|d) ~ p(r|t) * p(t|d)
189 # Counts of subregions per topic + prior: p(r|t)
190 768918 5259387.0 6.8 4.0 p_region_g_topic = ny_region_topic + delta
191
192 # Normalize the columns such that each topic's distribution over subregions sums to 1
193 768918 12123463.0 15.8 9.2 p_region_g_topic_sum = np.sum(p_region_g_topic, axis=0)
194 768918 3210778.0 4.2 2.4 p_region_g_topic = np.divide(p_region_g_topic, p_region_g_topic_sum)
195
196 # Counts of topics per document + prior: p(t|d)
197 768918 3648791.0 4.7 2.8 p_topic_g_doc = ny_doc_topic[doc, :] + alpha
198
199 # Reshape from (ntopics,) to (nregions, ntopics) with duplicated rows
200 768918 2823998.0 3.7 2.1 p_topic_g_doc = p_topic_g_doc.repeat(n_regions).reshape((-1, n_regions)).T
201
202 # Compute p(subregion | document): p(r|d) ~ p(r|t) * p(t|d)
203 # [R x T] array of probs
204 768918 2890817.0 3.8 2.2 p_region_g_doc = p_topic_g_doc * p_region_g_topic
205
206 # Compute the multinomial probability: p(z|y)
207 # Need the current vector of all z and y assignments for current doc
208 # The multinomial from which z is sampled is proportional to number
209 # of y assigned to each topic, plus constant gamma
210 # Compute the proportional probabilities in log-space
211 1537836 3591206.0 2.3 2.7 logp = nz_doc_topic[doc, :] * np.log(
212 768918 6299922.0 8.2 4.8 (ny_doc_topic[doc, :] + gamma + 1) / (ny_doc_topic[doc, :] + gamma)
213 )
214 # Add a constant before exponentiating to avoid any underflow issues
215 768918 14197485.0 18.5 10.7 p_peak_g_topic = np.exp(logp - np.max(logp))
216
217 # Reshape from (ntopics,) to (nregions, ntopics) with duplicated rows
218 768917 2708038.0 3.5 2.0 p_peak_g_topic = p_peak_g_topic.repeat(n_regions).reshape((-1, n_regions)).T
219
220 # Get the full sampling distribution:
221 # [R x T] array containing the proportional probability of all y/r combinations
222 768917 4631186.0 6.0 3.5 probs_pdf = p_x_subregions * p_region_g_doc * p_peak_g_topic
223
224 # Convert from a [R x T] matrix into a [R*T x 1] array we can sample from
225 768917 5906090.0 7.7 4.5 probs_pdf = np.reshape(probs_pdf, (n_regions * n_topics))
226
227 # Normalize the sampling distribution
228 768917 12366888.0 16.1 9.3 probs_pdf = probs_pdf / np.sum(probs_pdf)
229 # Float precision issue in Numba: https://github.com/numba/numba/issues/3426
230 768917 4950658.0 6.4 3.7 probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)
231
232 # Sample a single element (corresponding to a y_i and c_i assignment for the ptoken)
233 # from the sampling distribution
234 # Returns a binary [1 x R*T] vector with a '1' in location that was sampled
235 # and zeros everywhere else
236 768917 25842795.0 33.6 19.5 assignment_vec = np.random.multinomial(1, probs_pdf)
237
238 # Reshape 1D back to [R x T] 2D
239 768917 5398268.0 7.0 4.1 assignment_arr = np.reshape(assignment_vec, (n_regions, n_topics))
240 # Transform the linear index of the sampled element into the
241 # subregion/topic (r/y) assignment indices
242 768917 4236942.0 5.5 3.2 assignment_idx = np.where(assignment_arr)
243 # Subregion sampled (r)
244 768917 1139169.0 1.5 0.9 region = assignment_idx[0][0]
245 # Topic sampled (y)
246 768917 737100.0 1.0 0.6 topic = assignment_idx[1][0]
247
248 # Update the indices and the count matrices using the sampled y/r assignments
249 # Increment count in Subregion x Topic count matrix
250 768917 1799035.0 2.3 1.4 ny_region_topic[region, topic] += 1
251 # Increment count in Document x Topic count matrix
252 768917 1148576.0 1.5 0.9 ny_doc_topic[doc, topic] += 1
253 # Update y->topic assignment
254 768917 669923.0 0.9 0.5 peak_topic_idx[i_ptoken] = topic
255 # Update y->subregion assignment
256 768917 704382.0 0.9 0.5 peak_region_idx[i_ptoken] = region
257
258 1 1.0 1.0 0.0 return ny_region_topic, ny_doc_topic, peak_topic_idx, peak_region_idx
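As a side note on the truncation at profiled lines 228-230: after normalization, floating-point round-off can leave the probabilities summing to slightly more than 1, which the multinomial sampler rejects (the linked numba issue). Here is a plain-NumPy illustration of that workaround, with made-up array sizes:

```python
import numpy as np

n_regions, n_topics = 2, 200
probs_pdf = np.random.rand(n_regions * n_topics)        # stand-in for the R*T proportional probs
probs_pdf = probs_pdf / np.sum(probs_pdf)               # normalize the sampling distribution
probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)   # truncate so the sum cannot exceed 1
assignment_vec = np.random.multinomial(1, probs_pdf)    # one-hot sample of a y/r combination
```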
@adelavega Running times per iteration for 200 topics:
| Function | Branch | Neurosynth (14369 studies, 3228 words) | NeuroQuery (13459 studies, 6145 words) |
|---|---|---|---|
| `_update_word_topic_assignments` | master | ~96s | ~1342s |
| `_update_word_topic_assignments` | gclda-numba | ~10s | ~149s |
| `_update_peak_assignments` | master | ~112s | ~96s |
| `_update_peak_assignments` | gclda-numba | ~55s | ~50s |
Typically we run 10,000 iterations, so training GCLDA on Neurosynth with master would take ~25 days; gclda-numba reduces that to ~7.5 days.
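For what it's worth, the back-of-the-envelope arithmetic behind those estimates, using the per-iteration times from the table above (the exact totals depend on the rest of `_update`):

```python
iterations = 10_000
seconds_per_day = 86_400

master_days = iterations * (96 + 112) / seconds_per_day  # ~24 days on master
numba_days = iterations * (10 + 55) / seconds_per_day    # ~7.5 days on gclda-numba
print(round(master_days, 1), round(numba_days, 1))
```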
The latest changes showed an improvement in `_update_peak_assignments`.
Creating chunks of the data (n_chunks = n_cores) and running each chunk on a separate processor reduces the overhead and the number of `delayed()` calls made by the embarrassingly parallel routine. Right now we have two nested for loops: the outer one iterates over chunks and is embarrassingly parallelized, while the inner one iterates within a chunk, extending the duration of each task submitted to `Parallel`. Besides reducing the overhead, these changes cut the time it takes to dispatch so many calls (in our case ~500,000) of very fast tasks to the workers.
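A minimal sketch of this chunking pattern (the function and variable names are hypothetical, not NiMARE's actual code): one `delayed()` call is submitted per chunk, and each worker loops over the peaks inside its chunk.

```python
import numpy as np
from joblib import Parallel, delayed


def _process_chunk(chunk):
    """Inner loop: iterate over all peaks assigned to this chunk."""
    return [int(peak) * 2 for peak in chunk]  # stand-in for the per-peak update


n_cores = 4
peaks = np.arange(500_000)
chunks = np.array_split(peaks, n_cores)  # n_chunks == n_cores

# Outer loop: one task per chunk, embarrassingly parallel across workers.
results = Parallel(n_jobs=n_cores)(
    delayed(_process_chunk)(chunk) for chunk in chunks
)
```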
However, this is only scalable to a small number of CPUs: I got roughly linear scaling only for <= 4 CPUs on a local machine and <= 8 CPUs on an HPC.
What do you all think? Should we:
1. keep `numba` support for `_update_word_topic_assignments` and wait for numba to fix the issue with the multinomial generator before adding support for `_update_peak_assignments`, or
2. keep both `numba` support for `_update_word_topic_assignments` and `Parallel` for `_update_peak_assignments`, with a warning to users that the algorithm only scales to a small number of CPUs (4-8)?
I think number 1 is good, since using numba for _update_word_topic_assignments already makes GCLDA 10x faster.
@JulioAPeraza awesome! q: why is it only scalable to a small number of CPUs? I don't quite understand that.
But given that limitation, I agree with your reasoning and think #1 is a good option. That's an order of magnitude, so not bad.
@adelavega
I'm not sure. My guess is that the scalability problem comes from the communication cost added by having all the input arrays written to disk (or copied to memory) for the parallel workers. Although increasing the number of CPUs reduces the calculation time, it may also increase the communication cost, making the algorithm not scalable.
I just realized that, by default, `Parallel` will dump to a file any array larger than 1 megabyte. I ran a couple of tests with `max_nbytes=None`, and now the algorithm is at least weakly scalable to all the cores available on my laptop (8 CPUs). I still need to test this on the HPC with a larger number of CPUs. With 8 CPUs, the calculation time per iteration of `_update` goes down from ~112s to ~15s.
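For reference, this is roughly what that change looks like in joblib terms; `Parallel` memory-maps any input array larger than `max_nbytes` (default "1M") to disk for the workers, and `max_nbytes=None` disables that behavior. The array and worker function here are illustrative only.

```python
import numpy as np
from joblib import Parallel, delayed

big_array = np.random.rand(2000, 2000)  # well above the default 1M memmapping threshold

# max_nbytes=None keeps the array in memory instead of dumping it to a memmapped file on disk.
results = Parallel(n_jobs=8, max_nbytes=None)(
    delayed(np.sum)(big_array[i::8]) for i in range(8)
)
```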
Ah, disk I/O would certainly explain it. I wonder if using "threads" would also help: https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism
I tested the scalability of the function with more CPUs on the HPC and it looks good. I also ran some tests using threads, but it was slower than the default `loky` backend.
This PR is ready for review. Thanks!
Do you know if explicitly setting `parallel=True` would help the numba-optimized `_update_peak_assignments`?
https://numba.readthedocs.io/en/stable/user/parallel.html
It looks like the main concern would be that the same matrices are being accessed by the document loop, but it looks like different parts of the arrays would be accessed?
Otherwise, I took a closer look at the GCLDA algorithm, and this all looks reasonable to me. I may have some more nit-picky comments later, but perhaps @jdkent and @tsalo can help with those as well.
Thanks, @adelavega.
> Do you know if explicitly setting `parallel=True` would help the numba-optimized `_update_peak_assignments`?
I tried `parallel=True` before in `_update_word_topic_assignments` and also in `_update_peak_assignments` running on our fork of numba, and I got a segmentation fault. I will look into it early tomorrow.
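For context, a toy example of what `parallel=True` with `prange` looks like; this is not the NiMARE code. The caveat raised above applies: `prange` iterations must be independent, and concurrent writes to shared count matrices (e.g., the document-by-topic counts) would be a race condition.

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def _toy_parallel_double(values):
    """Each iteration writes to a distinct output element, so prange is safe here."""
    out = np.empty_like(values)
    for i in prange(values.shape[0]):
        out[i] = values[i] * 2.0
    return out


_toy_parallel_double(np.arange(10, dtype=np.float64))
```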
just want to make note of the current gclda example and the one in this pull request:
current: https://nimare.readthedocs.io/en/latest/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py
pull request: https://nimare--744.org.readthedocs.build/en/744/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py
current decoding ROI:

| Term | Weight |
|---|---|
| motor | 24.666614 |
| seed | 20.181775 |
| structural | 17.939355 |
| cortex | 15.722663 |
| behavioral | 15.696936 |
| methods | 11.237680 |
| speech | 11.218747 |
| connectivity modeling | 11.212097 |
| sca17 | 11.212097 |
| vbm | 8.969678 |

pull request decoding ROI:

| Term | Weight |
|---|---|
| insula | 11.288984 |
| connectivity | 10.711503 |
| approaches | 10.034653 |
| network | 8.780633 |
| anterior insula | 7.525990 |
| nachr | 7.525990 |
| functional | 6.869626 |
| voice | 6.271658 |
| smokers | 5.017326 |
| nachr agonists | 5.017326 |
I suppose I do not know how stochastic the results are on this test dataset, since it is very small.
EDIT: the results have been consistent since the first change in this pull request: https://nimare--744.org.readthedocs.build/en/744/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py