
Optimize GCLDAModel with `numba` and `Parallel`

JulioAPeraza opened this pull request 3 years ago • 16 comments

Closes None.

Currently, the GCLDA model is slow, mainly when the vocabulary is large (for example, the 3,228 terms from Neurosynth). _update_word_topic_assignments takes ~73 seconds per iteration and _update_peak_assignments takes ~52 seconds per iteration, for a total of ~125 seconds per _update() call. Compiling and running these functions with numba reduces these times to ~2, ~4, and ~6 seconds per iteration, respectively.

Changes proposed in this pull request:

  • Convert _update_word_topic_assignments and _update_peak_assignments to static methods, and add a decorator to compile and run them with numba in nopython mode (see the sketch below).
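
For illustration, here is a minimal sketch of the pattern, with hypothetical names and a toy loop body rather than NiMARE's actual signature: the update loop lives in a function that touches only NumPy arrays and scalars, so numba can compile it in nopython mode.

```python
import numpy as np
from numba import jit

@jit(nopython=True, cache=True)
def _update_assignments(token_doc_idx, doc_topic_counts, n_topics, randseed):
    # Toy Gibbs-style loop: assign each token to a random topic and keep
    # the Document x Topic count matrix in sync.
    np.random.seed(randseed)
    token_topic_idx = np.empty(token_doc_idx.shape[0], dtype=np.int64)
    for i in range(token_doc_idx.shape[0]):
        topic = np.random.randint(0, n_topics)
        doc_topic_counts[token_doc_idx[i], topic] += 1
        token_topic_idx[i] = topic
    return token_topic_idx, doc_topic_counts

counts = np.zeros((5, 10), dtype=np.int64)
docs = np.random.randint(0, 5, size=1_000)
topics, counts = _update_assignments(docs, counts, 10, 42)
```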

JulioAPeraza • Aug 24 '22 15:08

Codecov Report

Base: 88.56% // Head: 87.68% // Decreases project coverage by -0.87% :warning:

Coverage data is based on head (fc7c5e6) compared to base (1957869). Patch coverage: 77.21% of modified lines in pull request are covered.

:exclamation: Current head fc7c5e6 differs from pull request most recent head 0d2939c. Consider uploading reports for the commit 0d2939c to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #744      +/-   ##
==========================================
- Coverage   88.56%   87.68%   -0.88%     
==========================================
  Files          38       35       -3     
  Lines        4371     3913     -458     
==========================================
- Hits         3871     3431     -440     
+ Misses        500      482      -18     
| Impacted Files | Coverage Δ |
|---|---|
| nimare/annotate/gclda.py | 92.04% <77.21%> (-5.27%) :arrow_down: |
| nimare/meta/utils.py | 95.73% <0.00%> (-0.03%) :arrow_down: |
| nimare/dataset.py | 89.96% <0.00%> (ø) |
| nimare/__init__.py | |
| nimare/base.py | |
| nimare/utils.py | |


codecov[bot] • Aug 24 '22 16:08

With the changes to the two update methods, it seems like they could be converted to functions instead. Is there any benefit to using @staticmethod over making them functions?

tsalo • Aug 24 '22 16:08

You're right. The two update methods can be converted to functions. I don't see any benefit in using staticmethod in this case.

JulioAPeraza • Aug 24 '22 18:08

Given these limitations (the float-precision issue with sum, and code coverage not being reported for jitted functions), do you think we should try Cython for these two functions?

JulioAPeraza • Aug 25 '22 15:08

I think @jdkent or @adelavega would be better able to answer that.

tsalo • Aug 25 '22 15:08

@adelavega _update_peak_assignments takes a total of 132.477 s per iteration in pure Python (3-4 s in numba). The line that is having issues in numba (line 236, the np.random.multinomial call) is also the one that takes ~20% of the running time.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
.
.
.
   170                                               # Seed random number generator
   171         2        168.0     84.0      0.0      np.random.seed(randseed)
   172                                           
   173                                               # Iterate over all peaks x, and sample a new y and r assignment for each
   174    768919     904087.0      1.2      0.7      for i_ptoken, doc in enumerate(doc_idx):
   175    768918     646397.0      0.8      0.5          topic = peak_topic_idx[i_ptoken]
   176    768918     622581.0      0.8      0.5          region = peak_region_idx[i_ptoken]
   177                                           
   178                                                   # Decrement count-matrices to remove current ptoken_topic_idx
   179                                                   # because ptoken will be reassigned to a new topic
   180    768918    1234560.0      1.6      0.9          ny_region_topic[region, topic] -= 1  # Decrement count in Subregion x Topic count matx
   181    768918    1064501.0      1.4      0.8          ny_doc_topic[doc, topic] -= 1  # Decrement count in Document x Topic count matrix
   182                                           
   183                                                   # Retrieve the probability of generating current x from all
   184                                                   # subregions: [R x T] array of probs
   185    768918    1719597.0      2.2      1.3          p_x_subregions = (peak_probs[i_ptoken, :, :]).transpose()
   186                                           
   187                                                   # Compute the probabilities of all subregions given doc
   188                                                   #     p(r|d) ~ p(r|t) * p(t|d)
   189                                                   # Counts of subregions per topic + prior: p(r|t)
   190    768918    5259387.0      6.8      4.0          p_region_g_topic = ny_region_topic + delta
   191                                           
   192                                                   # Normalize the columns such that each topic's distribution over subregions sums to 1
   193    768918   12123463.0     15.8      9.2          p_region_g_topic_sum = np.sum(p_region_g_topic, axis=0)
   194    768918    3210778.0      4.2      2.4          p_region_g_topic = np.divide(p_region_g_topic, p_region_g_topic_sum)
   195                                           
   196                                                   # Counts of topics per document + prior: p(t|d)
   197    768918    3648791.0      4.7      2.8          p_topic_g_doc = ny_doc_topic[doc, :] + alpha
   198                                           
   199                                                   # Reshape from (ntopics,) to (nregions, ntopics) with duplicated rows
   200    768918    2823998.0      3.7      2.1          p_topic_g_doc = p_topic_g_doc.repeat(n_regions).reshape((-1, n_regions)).T
   201                                           
   202                                                   # Compute p(subregion | document): p(r|d) ~ p(r|t) * p(t|d)
   203                                                   # [R x T] array of probs
   204    768918    2890817.0      3.8      2.2          p_region_g_doc = p_topic_g_doc * p_region_g_topic
   205                                           
   206                                                   # Compute the multinomial probability: p(z|y)
   207                                                   # Need the current vector of all z and y assignments for current doc
   208                                                   # The multinomial from which z is sampled is proportional to number
   209                                                   # of y assigned to each topic, plus constant gamma
   210                                                   # Compute the proportional probabilities in log-space
   211   1537836    3591206.0      2.3      2.7          logp = nz_doc_topic[doc, :] * np.log(
   212    768918    6299922.0      8.2      4.8              (ny_doc_topic[doc, :] + gamma + 1) / (ny_doc_topic[doc, :] + gamma)
   213                                                   )
   214                                                   # Add a constant before exponentiating to avoid any underflow issues
   215    768918   14197485.0     18.5     10.7          p_peak_g_topic = np.exp(logp - np.max(logp))
   216                                           
   217                                                   # Reshape from (ntopics,) to (nregions, ntopics) with duplicated rows
   218    768917    2708038.0      3.5      2.0          p_peak_g_topic = p_peak_g_topic.repeat(n_regions).reshape((-1, n_regions)).T
   219                                           
   220                                                   # Get the full sampling distribution:
   221                                                   # [R x T] array containing the proportional probability of all y/r combinations
   222    768917    4631186.0      6.0      3.5          probs_pdf = p_x_subregions * p_region_g_doc * p_peak_g_topic
   223                                           
   224                                                   # Convert from a [R x T] matrix into a [R*T x 1] array we can sample from
   225    768917    5906090.0      7.7      4.5          probs_pdf = np.reshape(probs_pdf, (n_regions * n_topics))
   226                                           
   227                                                   # Normalize the sampling distribution
   228    768917   12366888.0     16.1      9.3          probs_pdf = probs_pdf / np.sum(probs_pdf)
   229                                                   # Float precision issue in Numba: https://github.com/numba/numba/issues/3426
   230    768917    4950658.0      6.4      3.7          probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)
   231                                           
   232                                                   # Sample a single element (corresponding to a y_i and c_i assignment for the ptoken)
   233                                                   # from the sampling distribution
   234                                                   # Returns a binary [1 x R*T] vector with a '1' in location that was sampled
   235                                                   # and zeros everywhere else
   236    768917   25842795.0     33.6     19.5          assignment_vec = np.random.multinomial(1, probs_pdf)
   237                                           
   238                                                   # Reshape 1D back to [R x T] 2D
   239    768917    5398268.0      7.0      4.1          assignment_arr = np.reshape(assignment_vec, (n_regions, n_topics))
   240                                                   # Transform the linear index of the sampled element into the
   241                                                   # subregion/topic (r/y) assignment indices
   242    768917    4236942.0      5.5      3.2          assignment_idx = np.where(assignment_arr)
   243                                                   # Subregion sampled (r)
   244    768917    1139169.0      1.5      0.9          region = assignment_idx[0][0]
   245                                                   # Topic sampled (y)
   246    768917     737100.0      1.0      0.6          topic = assignment_idx[1][0]
   247                                           
   248                                                   # Update the indices and the count matrices using the sampled y/r assignments
   249                                                   # Increment count in Subregion x Topic count matrix
   250    768917    1799035.0      2.3      1.4          ny_region_topic[region, topic] += 1
   251                                                   # Increment count in Document x Topic count matrix
   252    768917    1148576.0      1.5      0.9          ny_doc_topic[doc, topic] += 1
   253                                                   # Update y->topic assignment
   254    768917     669923.0      0.9      0.5          peak_topic_idx[i_ptoken] = topic
   255                                                   # Update y->subregion assignment
   256    768917     704382.0      0.9      0.5          peak_region_idx[i_ptoken] = region
   257                                           
   258         1          1.0      1.0      0.0      return ny_region_topic, ny_doc_topic, peak_topic_idx, peak_region_idx
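
For reference, one numba-friendly alternative to the np.random.multinomial(1, probs_pdf) call at line 236 (not what this PR implements) is inverse-CDF sampling. It also sidesteps the precision check that the np.trunc workaround at line 230 is guarding against, because the uniform draw is scaled by the (possibly imperfect) total.

```python
import numpy as np
from numba import njit

@njit
def sample_flat_index(probs_pdf):
    # Draw one index i with probability probs_pdf[i] / probs_pdf.sum().
    cdf = np.cumsum(probs_pdf)
    # Scaling by cdf[-1] tolerates probabilities that do not sum to
    # exactly 1, which is the issue behind numba/numba#3426.
    u = np.random.random() * cdf[-1]
    return np.searchsorted(cdf, u)
```

The sampled flat index maps back to the subregion/topic pair via region = idx // n_topics and topic = idx % n_topics, matching the reshape at line 225.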

JulioAPeraza • Aug 25 '22 21:08

@adelavega Running times per iteration for 200 topics:

| Function | Branch | Neurosynth (14,369 studies, 3,228 words) | NeuroQuery (13,459 studies, 6,145 words) |
|---|---|---|---|
| _update_word_topic_assignments | master | ~96 s | ~1342 s |
| _update_word_topic_assignments | gclda-numba | ~10 s | ~149 s |
| _update_peak_assignments | master | ~112 s | ~96 s |
| _update_peak_assignments | gclda-numba | ~55 s | ~50 s |

Typically we run 10,000 iterations, so training GCLDA on Neurosynth with master would take ~25 days ((96 s + 112 s) × 10,000 iterations ≈ 2.1 × 10^6 s), which gclda-numba reduces to ~7.5 days ((10 s + 55 s) × 10,000 iterations ≈ 6.5 × 10^5 s).

JulioAPeraza • Aug 26 '22 21:08

The latest changes showed improvement in _update_peak_assignments.

Creating chunks of the data (n_chunks = n_cores) and running each chunk on a separate processor reduced the overhead and the number of delayed() calls made by the embarrassingly parallel routine. Right now we have two nested for loops: the outer one iterates over chunks and is embarrassingly parallelized, and the inner one iterates within each chunk, extending the duration of the task that is submitted to Parallel. Besides reducing the overhead, these changes reduce the time it takes to dispatch so many calls (in our case ~500,000) of very fast tasks to the workers.

However, this is only scalable to a small number of CPUs: I saw roughly linear scaling only up to <= 4 CPUs on a local machine and <= 8 CPUs on an HPC.
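
As a minimal sketch of that chunking pattern (the per-chunk work here is a placeholder, not the real peak-assignment update):

```python
import numpy as np
from joblib import Parallel, delayed

def _process_chunk(chunk):
    # Stand-in for the serial inner loop over the peaks in one chunk;
    # making each task long-lived is what cuts the dispatch overhead.
    return np.sqrt(chunk)

n_cores = 4
data = np.arange(500_000, dtype=float)

# One chunk per core: a handful of long tasks instead of ~500,000 tiny
# delayed() calls.
chunks = np.array_split(data, n_cores)
results = Parallel(n_jobs=n_cores)(
    delayed(_process_chunk)(chunk) for chunk in chunks
)
out = np.concatenate(results)
```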

What do you all think? Should we:

  1. keep numba support for _update_word_topic_assignments and wait for numba to fix the issue with the multinomial generator before adding support for _update_peak_assignments, or
  2. keep both numba support for _update_word_topic_assignments and Parallel for _update_peak_assignments, with a warning to users that the algorithm is only scalable to a small number of CPUs (4-8)?

I think number 1 is good, since using numba for _update_word_topic_assignments already makes GCLDA 10x faster.

JulioAPeraza • Oct 04 '22 14:10

@JulioAPeraza awesome! q: why is it only scalable to a small number of CPUs? I don't quite understand that.

but given that limitation, I agree with your reasoning and think #1 is a good option. That's an order of magnitude so not bad.

adelavega • Oct 06 '22 16:10

@adelavega I'm not sure. I guess the scalability problem comes from the communication cost of having all the input arrays written to disk (or copied to memory) in parallel. Although increasing the number of CPUs reduces the calculation time, it may also increase the communication, making the algorithm not scalable. I just realized that by default Parallel will dump to a file any array larger than 1 megabyte. I ran a couple of tests with max_nbytes=None, and now the algorithm is at least weakly scalable to all the cores available on my laptop (8 CPUs). I still need to test this on the HPC with a larger number of CPUs. With 8 CPUs, the calculation time per iteration of _update goes down from ~112 s to ~15 s.
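
For reference, a minimal sketch of the setting in question (toy workload, not the GCLDA code):

```python
import numpy as np
from joblib import Parallel, delayed

big = np.random.rand(2_000_000)  # ~16 MB, well above the default 1 MB threshold

# By default, Parallel memmaps any argument larger than max_nbytes ("1M")
# to disk before handing it to the workers; max_nbytes=None disables that
# memmapping, so large arrays skip the extra disk round-trip.
results = Parallel(n_jobs=4, max_nbytes=None)(
    delayed(np.mean)(big[i::4]) for i in range(4)
)
```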

JulioAPeraza • Oct 06 '22 19:10

Ah, disk I/O would certainly explain it. I wonder whether using "threads" would also help: https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism
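
A minimal sketch of that alternative (toy workload):

```python
import numpy as np
from joblib import Parallel, delayed

chunks = np.array_split(np.arange(1_000_000), 8)

# prefer="threads" shares memory between workers (no pickling or
# memmapping), but pure-Python work stays GIL-bound; only NumPy calls
# that release the GIL run concurrently.
results = Parallel(n_jobs=8, prefer="threads")(
    delayed(np.sum)(chunk) for chunk in chunks
)
```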

adelavega • Oct 06 '22 19:10

I tested the scalability of the function with more CPUs on the HPC, and it looks good. I also ran some tests using threads, but it was slower than the default loky backend.

This PR is ready for review. Thanks!

JulioAPeraza • Oct 12 '22 14:10

Do you know if explicitly setting parallel=True would help the numba-optimized _update_peak_assignments?

https://numba.readthedocs.io/en/stable/user/parallel.html

It looks like the main concern would be that the same matrices are being accessed by the document loop, but it looks like it would be different parts of the arrays?
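
For context, a minimal illustration of parallel=True with prange (a toy function, not the GCLDA update); it is safe here only because each iteration writes a distinct output element, which is exactly the aliasing question above:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def row_sums(mat):
    out = np.empty(mat.shape[0])
    # prange parallelizes the outer loop; each iteration writes only
    # out[i], so there are no cross-iteration races. Iterations that
    # touched the same rows of a shared count matrix would not be safe.
    for i in prange(mat.shape[0]):
        out[i] = mat[i].sum()
    return out

print(row_sums(np.random.rand(1000, 50)))
```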

adelavega • Oct 12 '22 22:10

Otherwise, I took a closer look at the GCLDA algorithm, and this all looks reasonable to me. I may have some more nit-picky comments later, but perhaps @jdkent and @tsalo can help with those as well.

adelavega • Oct 12 '22 22:10

Thanks, @adelavega.

Do you know if explicitly setting parallel=True would help the numba optimized _update_peak_assignments?

I tried parallel=True before in _update_word_topic_assignments, and also in _update_peak_assignments running on our fork of numba, and I got a segmentation fault. I will look into it early tomorrow.

JulioAPeraza • Oct 12 '22 23:10

Just want to make a note of the current GCLDA example and the one in this pull request:

current: https://nimare.readthedocs.io/en/latest/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py

pull request: https://nimare--744.org.readthedocs.build/en/744/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py

current decoding ROI:

| Term | Weight |
|---|---|
| motor | 24.666614 |
| seed | 20.181775 |
| structural | 17.939355 |
| cortex | 15.722663 |
| behavioral | 15.696936 |
| methods | 11.237680 |
| speech | 11.218747 |
| connectivity modeling | 11.212097 |
| sca17 | 11.212097 |
| vbm | 8.969678 |

pull request decoding ROI:

| Term | Weight |
|---|---|
| insula | 11.288984 |
| connectivity | 10.711503 |
| approaches | 10.034653 |
| network | 8.780633 |
| anterior insula | 7.525990 |
| nachr | 7.525990 |
| functional | 6.869626 |
| voice | 6.271658 |
| smokers | 5.017326 |
| nachr agonists | 5.017326 |

I suppose I do not know how stochastic the results are on this test dataset since it is very small.

EDIT: the results have been consistent since the first change in this pull request: https://nimare--744.org.readthedocs.build/en/744/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py

jdkent • Oct 20 '22 19:10