Performance regression in RxInfer code on current master
We have a set of small benchmarks to quickly test our code in RxInfer. The aim of the package is to run efficient Bayesian inference, potentially on low-power, low-memory devices like the Raspberry Pi. We just noticed that on Julia 1.10 we have quite a noticeable GC regression. Consider this notebook. It is not an MWE, but it computes Bayesian posteriors in a simple linear Gaussian state-space probabilistic model (sketched below). There are two settings:
- Filtering: for each time step $t$, use observations up to time step $t$.
- Smoothing: for each time step $t$, use observations up to a later time step $T > t$.
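For concreteness, here is the model behind the benchmarks, reconstructed from the scripts later in this thread (a sketch; $c$ is a constant drift and $v$ the observation-noise variance):

$$x_0 \sim \mathcal{N}(0,\, 10^4), \qquad x_t = x_{t-1} + c, \qquad y_t \sim \mathcal{N}(x_t,\, v),$$

with $c = 1$ and $v = 100$ in the benchmarks. Filtering computes the posteriors $p(x_t \mid y_{1:t})$ online, while smoothing computes $p(x_t \mid y_{1:T})$ given the full record of $T$ observations.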
Here are the results on the current Julia release:
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1504 samples with 1 evaluation.
Range (min … max): 2.633 ms … 13.932 ms ┊ GC (min … max): 0.00% … 69.28%
Time (median): 3.073 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.319 ms ± 1.058 ms ┊ GC (mean ± σ): 7.08% ± 13.05%
▅▇▇██▇▅▃▂ ▁
██████████▇▇▅▇█▇▇▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇██▇▆█▇▆▄▄▅ █
2.63 ms Histogram: log(frequency) by time 7.92 ms <
Memory estimate: 2.35 MiB, allocs estimate: 63823.
julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 288 samples with 1 evaluation.
Range (min … max): 13.868 ms … 29.987 ms ┊ GC (min … max): 0.00% … 35.63%
Time (median): 15.545 ms ┊ GC (median): 0.00%
Time (mean ± σ): 17.411 ms ± 3.975 ms ┊ GC (mean ± σ): 10.81% ± 14.33%
▄▃█▁▄▅▁
▇███████▇▆▅▅▃▃▄▂▄▃▂▁▃▃▁▁▁▁▁▁▁▁▁▂▃▅▃▃▅▅▄▃▂▃▃▃▃▃▂▂▁▄▂▁▃▁▁▂▂▂▂ ▃
13.9 ms Histogram: frequency by time 28.4 ms <
Memory estimate: 10.05 MiB, allocs estimate: 220417.
And here are the results on 1.10-beta1:
julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)
julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1308 samples with 1 evaluation.
Range (min … max): 3.260 ms … 78.207 ms ┊ GC (min … max): 0.00% … 94.71%
Time (median): 3.479 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.818 ms ± 3.293 ms ┊ GC (mean ± σ): 6.64% ± 7.41%
▄▆██▅▁
▂▃▄▇██████▇▅▅▃▃▃▃▃▂▂▃▃▃▃▃▃▂▃▃▂▂▂▁▂▂▂▂▁▂▁▂▂▁▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂ ▃
3.26 ms Histogram: frequency by time 4.94 ms <
Memory estimate: 2.51 MiB, allocs estimate: 69824.
julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 291 samples with 1 evaluation.
Range (min … max): 15.160 ms … 88.841 ms ┊ GC (min … max): 0.00% … 79.71%
Time (median): 15.757 ms ┊ GC (median): 0.00%
Time (mean ± σ): 17.336 ms ± 7.862 ms ┊ GC (mean ± σ): 7.05% ± 11.57%
█▅▁
█████▇▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▄ ▅
15.2 ms Histogram: log(frequency) by time 57.9 ms <
Memory estimate: 10.12 MiB, allocs estimate: 222915.
As you can see, in the case of `run_filtering` the maximum time jumped from 13ms to 78ms, and the GC max also jumped, from 69% to 94%. In the case of `run_smoothing` the situation is similar: the maximum time jumped from 29ms to 88ms, and the GC max from 35% to 79%.
The inference procedure allocates a lot of intermediate "messages" in the form of distributions from the Distributions.jl package, but it does not use any sampling. Instead, it computes analytical solutions for the posteriors. These analytical solutions also rely on dynamic multiple dispatch in many places. Eliminating dynamic multiple dispatch is not really an option; it is just how the algorithm works, and it was quite efficient until now.
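To illustrate the allocation pattern (a minimal sketch with hypothetical names, not RxInfer's actual code): every message is a small, short-lived heap object whose concrete type is only known at runtime, so each update goes through dynamic dispatch.

using Distributions

# Precision-weighted product of two Gaussian messages (the standard analytical update).
combine(a::Normal, b::Normal) = Normal(
    (mean(a) / var(a) + mean(b) / var(b)) / (1 / var(a) + 1 / var(b)),
    sqrt(inv(1 / var(a) + 1 / var(b))),
)

function pass_messages(n)
    acc = Any[Normal(0.0, 100.0)]          # abstractly typed container: every element is boxed
    for _ in 1:n
        msg = Normal(randn(), 1.0)         # short-lived heap allocation per message
        push!(acc, combine(acc[end], msg)) # dynamic dispatch: the element type is Any
    end
    return acc[end]
end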
The major difference between the two functions is that `run_filtering` allocates a lot of information (messages) that it does not use afterwards, so it can probably be freed right away, while `run_smoothing` retains this information until the end of the procedure. You can also see that the resulting minimum execution time is worse in both cases.
I think this is quite a severe regression, especially for the filtering case, which is supposed to run real-time Bayesian inference with as few GC pauses as possible. We can of course refine our code base, but in the meantime, can it be improved in general? What can cause this? How should we proceed and debug this? How can we help figure this out further?
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 12 virtual cores
Thanks for the report and the nice reproducers. The GC heuristics were significantly changed in https://github.com/JuliaLang/julia/pull/50144 so it isn't too surprising that there might be cases where it needs to be further refined.
Putting this on the milestone for now so it isn't forgotten at least.
Thanks, glad to help!
Since the reproducers are not an MWE, to give you an idea of what might be happening: I think one potential problem here is that the program (as part of the algorithm) allocates A LOT of small, short-lived objects. The algorithm, at a high level, involves creating a sparse graph and passing messages between the nodes in the graph. These messages are needed only for a short period of time. In the benchmark example the messages are encoded as `Normal` distributions, each consisting of two `Float64` numbers. These messages are allocated on the heap and are part of the dynamic multiple dispatch, because the algorithm does not (and cannot) know the types of the messages in advance.
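As a rough illustration of why this pattern stresses the GC (a sketch with a hypothetical `Message` type, not RxInfer code): an immutable pair of `Float64`s is only 16 bytes, but once its concrete type is hidden behind `Any`, every instance becomes a separate heap object that the GC has to track.

struct Message
    mean::Float64
    var::Float64
end

function store!(xs)
    for i in eachindex(xs)
        xs[i] = Message(i, 1.0)
    end
    return xs
end

boxed  = Vector{Any}(undef, 10_000)      # every element is a pointer to a heap box
inline = Vector{Message}(undef, 10_000)  # isbits elements are stored inline

# Run each line twice so compilation is excluded from the counts:
@allocated store!(boxed)   # ≈ one heap allocation per message
@allocated store!(inline)  # 0 bytes: nothing new for the GC to do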
As a side note, I could not run the benchmark on `master` because some of the downstream packages failed to compile there.
So I ran your code (1.10 + https://github.com/JuliaLang/julia/pull/50682) and I see a wall-time regression, but not necessarily a GC regression, because max GC time might not be the best way to quantify it: if in one case we do 100 runs and only one of them triggers a collection, but it is a big collection, we might see 90% max GC usage while the median/mean stays much lower. The more concerning thing is that there seems to be an overall regression in the performance of the code, and I am not sure it is just GC related.
1.9
julia> @benchmark run_filtering(datastream, n, $v)
BenchmarkTools.Trial: 2421 samples with 1 evaluation.
Range (min … max): 1.847 ms … 5.287 ms ┊ GC (min … max): 0.00% … 60.24%
Time (median): 1.906 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.064 ms ± 612.807 μs ┊ GC (mean ± σ): 6.81% ± 12.77%
▃█
██▅▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▂ ▂
1.85 ms Histogram: frequency by time 4.66 ms <
Memory estimate: 2.35 MiB, allocs estimate: 63824.
julia> @benchmark run_smoothing(data, n, $v)
BenchmarkTools.Trial: 476 samples with 1 evaluation.
Range (min … max): 8.658 ms … 28.003 ms ┊ GC (min … max): 0.00% … 29.64%
Time (median): 9.164 ms ┊ GC (median): 0.00%
Time (mean ± σ): 10.508 ms ± 2.647 ms ┊ GC (mean ± σ): 10.46% ± 13.81%
▆█▅▄▄▂▁ ▁▁
███████▇███▇▆▄▅▁▅▁▄▁▁▄▄▄▄▇▇▇▆▇███▇█▇█▇█▆▇▇▅▅▆▄▅▄▅▁▁▁▁▁▁▄▁▄▄ ▇
8.66 ms Histogram: log(frequency) by time 18 ms <
Memory estimate: 10.05 MiB, allocs estimate: 220418.
1.10 + that PR
julia> @benchmark run_filtering(datastream, n, $v)
BenchmarkTools.Trial: 1642 samples with 1 evaluation.
Range (min … max): 2.746 ms … 23.783 ms ┊ GC (min … max): 0.00% … 87.46%
Time (median): 2.908 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.044 ms ± 1.398 ms ┊ GC (mean ± σ): 4.46% ± 8.07%
▃▆▅▁▆▂▄▄▃▅▄▇▆█▅▇▆▆▇██▅▆▂▅▂
▃▇███████████████████████████▇▇▇▆▅▄▃▃▃▂▂▃▂▂▁▂▁▂▂▁▁▁▁▁▁▁▁▁▂ ▅
2.75 ms Histogram: frequency by time 3.3 ms <
Memory estimate: 2.51 MiB, allocs estimate: 69825.
julia> @benchmark run_smoothing(data, n, $v)
BenchmarkTools.Trial: 416 samples with 1 evaluation.
Range (min … max): 10.883 ms … 33.700 ms ┊ GC (min … max): 0.00% … 65.24%
Time (median): 11.343 ms ┊ GC (median): 0.00%
Time (mean ± σ): 12.041 ms ± 3.190 ms ┊ GC (mean ± σ): 4.34% ± 10.21%
█▇█
███▆▁▁▁▄▆▇▄▁▄▄▁▁▁▁▁▁▄▁▄▁▄▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▄▇▄ ▆
10.9 ms Histogram: log(frequency) by time 28 ms <
Memory estimate: 10.12 MiB, allocs estimate: 222916.
So the mean GC went slightly down. But notice how the minimum time went up, meaning that there is another regression somewhere, especially because on the minimum run the GC did not run at all.
@gbaraldi I agree with you that it could be some other regression or a combination of regressions. I tend not to use mean statistics for comparison, specifically because they can be affected by background processes on a computer. I also do not immediately understand how to compare the mean statistics in percentages (for GC), because the base (wall time) is different. It could be the case that the mean percentage is lower simply because the absolute wall time is bigger.
For example, for the filtering case, the GC mean is essentially the same, with only a small improvement once you account for the wall time:
julia> 2.064 * 0.0681  # 1.9: mean wall time (ms) × mean GC fraction
0.1405584
julia> 3.044 * 0.0446  # 1.10: mean wall time (ms) × mean GC fraction
0.1357624
But the mean values in benchmarks are tricky anyway. I tend to compare the min/max values instead. I recall an article about this, but I read it long ago and do not have a reference; the rough idea is:
- min reflects the theoretical best performance of an algorithm, if everything goes perfectly;
- max reflects the worst-case scenario, if everything goes wrong;
- mean/median statistics are heavily affected by background tasks and cannot be reliably compared until you account for the skewness of the benchmark distributions (and I can see that the skewness differs considerably between the 1.9 and 1.10 benchmarks).
I do not want to say that comparing mean/median is always a bad idea, just that it is safer to compare min/max values in the majority of cases, unless you invest much more time in a proper analysis.
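If one does want to compare the full distributions, the raw samples are stored on the `Trial` object, so the skewness is easy to check directly (a sketch against the reproducer above; `b.times` is in nanoseconds):

using BenchmarkTools, Statistics

b = @benchmark run_smoothing($data, $n, $v)
t = b.times ./ 1e6                            # per-sample wall times in ms
skew = mean(((t .- mean(t)) ./ std(t)) .^ 3)  # third standardized moment
(minimum(t), median(t), mean(t), maximum(t), skew)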
The max GC time is very consistent across different platforms and computers (I checked my two computers several times, plus your benchmark). It does not look like an outlier to me. But you are also correct that GC can be only one part of the story.
I checked the benchmarks with GC disabled. These are my results:
GC.enable(false)  # disable collections so the timings exclude GC entirely
benchmark = @benchmark run_filtering($datastream, $n, $v) samples=100
GC.enable(true)
GC.gc(true)       # run a full collection to reclaim the accumulated garbage
benchmark
# 1.9
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 1.829 ms … 3.831 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.064 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.106 ms ± 271.007 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂▃ ▇▅█ ▃▅▂▄▂
▅▆███▃███▅█████▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
1.83 ms Histogram: frequency by time 3.44 ms <
Memory estimate: 2.35 MiB, allocs estimate: 63823.
# 1.10-beta1
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 2.568 ms … 4.568 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.749 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.796 ms ± 319.726 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁ ▁▄█▆▃
███▅█████▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▃ ▃
2.57 ms Histogram: frequency by time 4.31 ms <
Memory estimate: 2.51 MiB, allocs estimate: 69824.
GC.enable(false)
benchmark = @benchmark run_smoothing($data, $n, $v) samples=100
GC.enable(true)
GC.gc(true)
benchmark
# 1.9
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 8.132 ms … 11.818 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.832 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.997 ms ± 758.615 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ █ ▂ ▃
█▅▇█▃▃▁▅▆███▆█▄▇▄▄▃▃▁▃▆▃▅▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃▄▁▃▁▃▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
8.13 ms Histogram: frequency by time 11.8 ms <
Memory estimate: 10.05 MiB, allocs estimate: 220417.
# 1.10-beta1
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 9.918 ms … 12.661 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 10.534 ms ┊ GC (median): 0.00%
Time (mean ± σ): 10.644 ms ± 460.803 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▅▃▃
▅▁▁▅▃▅▁▅▆▇▇██████▆▅█▇▆▆▃▁▅▁▁▁▁▁▁▁▃▁▁▁▃▁▃▃▃▁▁▅▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
9.92 ms Histogram: frequency by time 12.4 ms <
Memory estimate: 10.12 MiB, allocs estimate: 222915.
So yes, there is some sort of regression, which looks like it is further escalated by the GC timings. Something happens that triggers the GC to spend even more time in the worst-case scenario. But I am not sure how to narrow it down further. The output of `Profile` is very different between the versions and basically unreadable.
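For what it is worth, when the tree view is unreadable, a flat, count-sorted listing from the `Profile` stdlib is sometimes easier to compare across versions (a sketch against the same reproducer):

using Profile

Profile.clear()
@profile run_smoothing(data, n, v)
# Flat listing, heaviest frames first, hiding frames with few samples:
Profile.print(format = :flat, sortedby = :count, mincount = 10)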
The GC might decide to coalesce the garbage collections to bring down overall time while having potentially longer pauses.
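One way to see that directly is GC logging (available since Julia 1.8), which prints the pause length and the amount reclaimed for every collection, so coalescing shows up as fewer but longer pauses:

GC.enable_logging(true)    # one log line per collection: pause time and space reclaimed
run_smoothing(data, n, v)  # the workload from the reproducer above
GC.enable_logging(false)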
I imagine I'm having the same issue, but I'm not sure.
1.9.2
julia> @benchmark compute_fracture_polys($user_input)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 160.110 s (26.37% GC) to evaluate,
with a memory estimate of 689.02 GiB, over 548291410 allocations.
1.10.0-beta1
julia> @benchmark compute_fracture_polys($user_input)
# Still running 17 hours later
1.10 beta makes my existing program unusable.
So there is a bug on 1.10 beta that should be fixed in the next beta, but in any case, is there a way for you to share an open-source version of your code? Or try to run it on a recent nightly?
I downloaded the nightly release from https://julialang.org/downloads/nightlies/ and got the same bad behavior as 1.10.0-beta1. I created a repository here for testing: https://github.com/nathanrboyer/JuliaNightlyTestCode
Oddly enough, I think this might not be a GC bug at all; this is where perf says we are spending time, which makes me think this is a subtyping bug.
The other two calls are about 4x faster on master than on 1.9 for me.
@vtjnash it hangs while calling
ijl_matching_methods (types=0x7f6a96ddbb90, mt=0x7f6a8c1c78a0 <jl_system_image_data+28221664>, lim=-1, include_ambiguous=0, world=31777, min_valid=0x7f69267e4ea0, max_valid=0x7f69267e4eb0, ambig=0x0)
with `types` being `Tuple{typeof(Base.:(*)), Any, Any}`. Any clue?
I really have no clue. I haven't done any profiling or benchmarking on the code. It is basically a brute force computation that used to be fast enough. I can open a separate issue if not related to GC.
EDIT: Moving discussion of my issue to #50859.
Could you open a separate issue? Also, I wonder if you can make an example that runs a bit quicker on 1.9?
I am not sure I understand that. Calling methods with `lim=-1` is not something that our compiler does, because that causes somewhat obvious performance problems, as noted there.
I believe `return_type` does: https://github.com/JuliaLang/julia/blob/a889575cc168d5efca211766a6ddad7b3069594a/base/compiler/typeinfer.jl#L1125
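For intuition on why that particular query can be so expensive: a fully abstract signature such as `Tuple{typeof(Base.:(*)), Any, Any}` gives the matcher nothing to prune with, so every two-argument method of `*` is a candidate, and `lim=-1` puts no cap on the result:

# With many packages loaded, this set is large:
length(methods(*, (Any, Any)))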
I hid those comments because they actually belong to https://github.com/JuliaLang/julia/issues/50859
Based on the discussion in https://github.com/JuliaLang/julia/issues/50704#issuecomment-1656333964 I don't think this should really be on the milestone. It seems to be a fairly marginal performance regression, and there isn't really any analysis yet of what has gotten slower. I'll remove it from the milestone for now; depending on the results of some extra profiling work, it could potentially be added back.
(Also, the link to the repro no longer works)
Ok, just in case, here is a permanent link to the repro.
@KristofferC The performance regression is pretty hard to bisect. It is not very consistent between runs and heavily depends on the percentage of time spent in GC during the run. I did, however, bisect the increase in the number and total size of allocations to 51e3bc33dec9be9ebe5ad5fe5f4da56a3bab33c9, but I am not sure what this commit did and how it could cause a regression in our case.
@KristofferC On the current `master` I get significantly worse results; basically, there is a ~15x regression for the second benchmark. It does not look GC related, as it is pretty consistent (and it allocates less than on 1.10).
1.10
BenchmarkTools.Trial: 2694 samples with 1 evaluation.
Range (min … max): 1.671 ms … 3.392 ms ┊ GC (min … max): 0.00% … 46.05%
Time (median): 1.777 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.855 ms ± 323.736 μs ┊ GC (mean ± σ): 4.24% ± 10.19%
▂▇██▇█▇▂ ▁ ▁ ▁
████████▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▇▇████ █
1.67 ms Histogram: log(frequency) by time 3.27 ms <
Memory estimate: 2.46 MiB, allocs estimate: 67408.
BenchmarkTools.Trial: 582 samples with 1 evaluation.
Range (min … max): 7.600 ms … 12.065 ms ┊ GC (min … max): 0.00% … 32.69%
Time (median): 7.926 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.589 ms ± 1.282 ms ┊ GC (mean ± σ): 7.81% ± 11.24%
▇█▇▅▂
▄███████▆▇▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▅▅▆▆▇▆▇▇▇█▅▇█▇▅▇▇▆▇█▇▅█▆▆▅▄▆▇▇▇ ▇
7.6 ms Histogram: log(frequency) by time 11.9 ms <
Memory estimate: 10.39 MiB, allocs estimate: 228968.
master
BenchmarkTools.Trial: 2277 samples with 1 evaluation.
Range (min … max): 1.939 ms … 10.002 ms ┊ GC (min … max): 0.00% … 79.32%
Time (median): 2.090 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.195 ms ± 748.385 μs ┊ GC (mean ± σ): 5.23% ± 10.61%
▅█▆
████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▆▇ █
1.94 ms Histogram: log(frequency) by time 6.91 ms <
Memory estimate: 2.27 MiB, allocs estimate: 61943.
BenchmarkTools.Trial: 43 samples with 1 evaluation.
Range (min … max): 115.886 ms … 128.093 ms ┊ GC (min … max): 0.00% … 8.04%
Time (median): 117.313 ms ┊ GC (median): 0.00%
Time (mean ± σ): 117.835 ms ± 2.359 ms ┊ GC (mean ± σ): 0.37% ± 1.58%
▂▅██▂ ▅▂
█████▁▅█████▅▁▁▅▅▅▁▅▁▁▅▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
116 ms Histogram: frequency by time 128 ms <
Memory estimate: 10.16 MiB, allocs estimate: 234019.
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 11 × Apple M3 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 on 5 virtual cores
The script I used for testing:
using RxInfer, BenchmarkTools, Random
seed = 123
rng = MersenneTwister(seed)
v = 100.0
n = 500
hidden = collect(1:n)
data = hidden + rand(rng, Normal(0.0, sqrt(v)), n);
datastream = from(data) |> map(NamedTuple{(:y, ), Tuple{Float64}}, (d) -> (y = d, ));
@model function filtering(; c::ConstVariable, v::ConstVariable)
x_prior_mean = datavar(Float64)
x_prior_var = datavar(Float64)
x_prior ~ Normal(mean = x_prior_mean, variance = x_prior_var)
x_next ~ x_prior + c
y = datavar(Float64)
y ~ Normal(mean = x_next, variance = v)
end
function run_filtering(datastream, n, v)
autoupdates = @autoupdates begin
x_prior_mean, x_prior_var = mean_var(q(x_next))
end
return infer(
model = filtering(c = 1.0, v = v),
datastream = datastream,
autoupdates = autoupdates,
initmarginals = (x_next = NormalMeanVariance(0.0, 10000.0), ),
keephistory = n,
historyvars = (x_next = KeepLast(), ),
free_energy = true,
autostart = true,
postprocess = UnpackMarginalPostprocess()
)
end
result_filtering = run_filtering(datastream, n, v)
@model function smoothing(n; c::ConstVariable, v::ConstVariable)
x_prior ~ Normal(mean = 0.0, variance = 10000.0)
x = randomvar(n)
y = datavar(Float64, n)
x_prev = x_prior
for i in 1:n
x[i] ~ x_prev + c
y[i] ~ Normal(mean = x[i], variance = v)
x_prev = x[i]
end
return x, y
end
function run_smoothing(data, n, v)
return infer(
model = smoothing(n, c = 1.0, v = v),
data = (y = data, ),
returnvars = KeepLast(),
free_energy = true,
postprocess = UnpackMarginalPostprocess()
)
end
result_smoothing = run_smoothing(data, n, v)
benchmark1 = @benchmark run_filtering($datastream, $n, $v)
benchmark2 = @benchmark run_smoothing($data, $n, $v)
show(benchmark1)
show(benchmark2)
display(benchmark1)
display(benchmark2)
> On the current `master` I get significantly worse results, basically there is a ~15x regression for the second benchmark.
Bisected that to #50927 (cc @vtjnash).
I reduced the example; it seems the problem is not in the execution of the algorithm but in the model creation. From that perspective, model creation for both models has regressed by a factor of roughly 20-40.
# 1.10.1
→ julia --project=. ~/.julia/dev/julia/script.jl
Trial(21.125 μs)Trial(3.653 ms)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 21.125 μs … 1.522 ms ┊ GC (min … max): 0.00% … 94.81%
Time (median): 23.375 μs ┊ GC (median): 0.00%
Time (mean ± σ): 24.275 μs ± 29.652 μs ┊ GC (mean ± σ): 2.37% ± 1.91%
▃▃▇▇█▃▃▁
▁▁▁▁▁▁▁▁▂▆████████▆▄▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
21.1 μs Histogram: frequency by time 30.3 μs <
Memory estimate: 18.21 KiB, allocs estimate: 433.
BenchmarkTools.Trial: 1274 samples with 1 evaluation.
Range (min … max): 3.653 ms … 7.132 ms ┊ GC (min … max): 0.00% … 36.47%
Time (median): 3.721 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.924 ms ± 547.249 μs ┊ GC (mean ± σ): 3.79% ± 8.58%
▂█▅▃▁▁
██████▇▆▇▆▅▅▄▅▃▅▅▃▄▃▃▁▄▁▃▃▁▅▃▁▃▃▁▅▅▁▅▅▆▆▅▆▅▆▆▄▆▆▇▇█▆▆▄▅▄▃▁▄ ▇
3.65 ms Histogram: log(frequency) by time 5.83 ms <
Memory estimate: 4.05 MiB, allocs estimate: 103123
master
→ ./julia --startup-file=no script.jl
Trial(384.250 μs)Trial(119.892 ms)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 384.250 μs … 4.570 ms ┊ GC (min … max): 0.00% … 88.93%
Time (median): 420.312 μs ┊ GC (median): 0.00%
Time (mean ± σ): 448.273 μs ± 97.839 μs ┊ GC (mean ± σ): 0.09% ± 0.89%
▃█▇▅▃▇▆▃ ▂
▄▄██████████▇▇▆▆▇▇▆▅▆▅▄▅▆▄▅▅▄▄▄▅▄▆▄▄▅▂▅▃▄▄▅▄▄▄▇████▇▇▇▇▇▆▆▆▆ █
384 μs Histogram: log(frequency) by time 849 μs <
Memory estimate: 16.95 KiB, allocs estimate: 412.
BenchmarkTools.Trial: 38 samples with 1 evaluation.
Range (min … max): 119.892 ms … 150.380 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 134.175 ms ┊ GC (median): 0.00%
Time (mean ± σ): 134.351 ms ± 8.712 ms ┊ GC (mean ± σ): 0.10% ± 0.66%
█ ▃ ▃ █ ▃ ▃
▇▇▇▇▇▁▁▁▁▁▇▁▇▇▁▇█▇▇▇▁▇▇▁▁▁▇▁▇█▇▁▁▁█▁▁▁▁▁▇▇▇█▁▁█▇▁▇▁▁▁▁▁█▇▁▁▁▇ ▁
120 ms Histogram: frequency by time 150 ms <
Memory estimate: 3.83 MiB, allocs estimate: 99130.
script.jl
using RxInfer, BenchmarkTools, Random
@model function filtering(; c::ConstVariable, v::ConstVariable)
x_prior_mean = datavar(Float64)
x_prior_var = datavar(Float64)
x_prior ~ Normal(mean = x_prior_mean, variance = x_prior_var)
x_next ~ x_prior + c
y = datavar(Float64)
y ~ Normal(mean = x_next, variance = v)
end
function run_filtering()
return RxInfer.create_model(filtering(c = 1.0, v = 1.0))
end
@model function smoothing(n; c::ConstVariable, v::ConstVariable)
x_prior ~ Normal(mean = 0.0, variance = 10000.0)
x = randomvar(n)
y = datavar(Float64, n)
x_prev = x_prior
for i in 1:n
x[i] ~ x_prev + c
y[i] ~ Normal(mean = x[i], variance = v)
x_prev = x[i]
end
return x, y
end
function run_smoothing(n)
return RxInfer.create_model(smoothing(n, c = 1.0, v = 1.0))
end
benchmark1 = @benchmark run_filtering()
benchmark2 = @benchmark run_smoothing(500)
show(benchmark1)
show(benchmark2)
display(benchmark1)
display(benchmark2)
I revisited this. RxInfer.jl had a new major release last week (`v3.0.0`), and the script I provided previously does not work with the new version. Here is the new script:
script.jl
using RxInfer, BenchmarkTools, Random

seed = 123
rng = MersenneTwister(seed)
v = 100.0
n = 500
hidden = collect(1:n)
data = hidden + rand(rng, Normal(0.0, sqrt(v)), n);
datastream = from(data) |> map(NamedTuple{(:y, ), Tuple{Float64}}, (d) -> (y = d, ));
@model function filtering(y, x_prior_mean, x_prior_var, c, v)
x_prior ~ Normal(mean = x_prior_mean, variance = x_prior_var)
x_next ~ x_prior + c
y ~ Normal(mean = x_next, variance = v)
end
@model function smoothing(y, c, v)
x_prior ~ Normal(mean = 0.0, variance = 10000.0)
x_prev = x_prior
for i in eachindex(y)
x[i] ~ x_prev + c
y[i] ~ Normal(mean = x[i], variance = v)
x_prev = x[i]
end
end
function filtering_model(y)
return RxInfer.create_model(filtering(c = 1.0, v = 1.0) | (y = y, x_prior_mean = 1.0, x_prior_var = 1.0))
end
function smoothing_model(y)
return RxInfer.create_model(smoothing(c = 1.0, v = 1.0) | (y = y,))
end
filtering_model(data)
smoothing_model(data)
benchmark1 = @benchmark filtering_model($data)
benchmark2 = @benchmark smoothing_model($data)
println("Filtering model creation") display(benchmark1) println("Smoothing model creation") display(benchmark2)
function run_filtering(datastream, n, v)
autoupdates = @autoupdates begin
x_prior_mean, x_prior_var = mean_var(q(x_next))
end
init = @initialization begin
q(x_next) = NormalMeanVariance(0.0, 10000.0)
end
return infer(
model = filtering(c = 1.0, v = v),
datastream = datastream,
autoupdates = autoupdates,
initialization = init,
keephistory = n,
historyvars = (x_next = KeepLast(), ),
free_energy = true,
autostart = true,
postprocess = UnpackMarginalPostprocess()
)
end
function run_smoothing(data, n, v)
return infer(
model = smoothing(c = 1.0, v = v),
data = (y = data, ),
returnvars = KeepLast(),
free_energy = true,
postprocess = UnpackMarginalPostprocess()
)
end
run_filtering(datastream, n, v)
run_smoothing(data, n, v)
benchmark3 = @benchmark run_filtering($datastream, $n, $v)
benchmark4 = @benchmark run_smoothing($data, $n, $v)
println("Filtering inference") display(benchmark3) println("Smoothing inference") display(benchmark4)
The good news is that performance is actually better on `1.11-beta1` (installed with `juliaup`): it is about 30% faster in the smoothing regime. I am not entirely sure whether this is because of the major version of RxInfer (we refactored a significant part of the codebase) or whether it was truly fixed on the Julia side.
Benchmark comparison between `1.10.2` and `1.11-beta1`
Here are the results for `1.10` (installed via `juliaup`):
→ julia +release --startup-file=no --project=julia1.10 julia1.10/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 7.271 μs … 539.958 μs ┊ GC (min … max): 0.00% … 96.43%
Time (median): 7.615 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.163 μs ± 12.567 μs ┊ GC (mean ± σ): 6.04% ± 3.94%
▅▂▃█▃▁▄
▁▁▁▁▁▂▂▄▆▇███████▇▇▄▄▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
7.27 μs Histogram: frequency by time 8.73 μs <
Memory estimate: 19.81 KiB, allocs estimate: 360.
Smoothing model creation
BenchmarkTools.Trial: 1650 samples with 1 evaluation.
Range (min … max): 2.752 ms … 6.942 ms ┊ GC (min … max): 0.00% … 54.15%
Time (median): 2.836 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.031 ms ± 509.820 μs ┊ GC (mean ± σ): 5.80% ± 10.60%
▃▇█▆▄▁
███████▇▆▄▅▆▄▄▄▄▁▁▁▄▄▁▁▁▁▁▄▄▁▄▅▄▁▄▄▇▆▆▇▇█▇▇▆▇█▇▇██▇█▁▇▆▅▆▅▆ █
2.75 ms Histogram: log(frequency) by time 4.69 ms <
Memory estimate: 5.69 MiB, allocs estimate: 114772.
Filtering inference
BenchmarkTools.Trial: 2588 samples with 1 evaluation.
Range (min … max): 1.848 ms … 3.529 ms ┊ GC (min … max): 0.00% … 44.64%
Time (median): 1.886 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.931 ms ± 238.478 μs ┊ GC (mean ± σ): 1.94% ± 6.81%
▇█▆▅
█████▇▆▄▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆█▇ █
1.85 ms Histogram: log(frequency) by time 3.4 ms <
Memory estimate: 1.12 MiB, allocs estimate: 32633.
Smoothing inference
BenchmarkTools.Trial: 175 samples with 1 evaluation.
Range (min … max): 24.887 ms … 104.396 ms ┊ GC (min … max): 0.00% … 75.94%
Time (median): 27.753 ms ┊ GC (median): 9.27%
Time (mean ± σ): 28.711 ms ± 8.320 ms ┊ GC (mean ± σ): 11.20% ± 10.16%
▃█▇ ▄▁▁
▃▇████▃▃▁▃▁▄▅▄███▅▅▃▃▃▅▄▃▅▃▄▄▃▅▄▇▅▄▄▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
24.9 ms Histogram: frequency by time 36.5 ms <
Memory estimate: 26.59 MiB, allocs estimate: 613635.
And here are the results for `1.11-beta1` (installed via `juliaup`):
→ julia +1.11 --startup-file=no --project=julia1.11 julia1.11/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 6.808 μs … 1.758 ms ┊ GC (min … max): 0.00% … 98.98%
Time (median): 7.133 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.006 μs ± 24.727 μs ┊ GC (mean ± σ): 8.76% ± 3.37%
▅▆▇█▁
▁▁▂▃▆██████▆▄▄▃▃▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
6.81 μs Histogram: frequency by time 8.79 μs <
Memory estimate: 19.81 KiB, allocs estimate: 413.
Smoothing model creation
BenchmarkTools.Trial: 1632 samples with 1 evaluation.
Range (min … max): 2.770 ms … 7.357 ms ┊ GC (min … max): 0.00% … 58.78%
Time (median): 2.843 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.064 ms ± 608.889 μs ┊ GC (mean ± σ): 6.98% ± 12.24%
▅█▇▂
█████▇▆▁▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▁▄▃▅▅▄▅▆▇▆▆▅▆▅█▆█▇▇█▆▆▆▇▄▇▅▆▅▅▆ █
2.77 ms Histogram: log(frequency) by time 5.09 ms <
Memory estimate: 5.75 MiB, allocs estimate: 125717.
Filtering inference
BenchmarkTools.Trial: 3090 samples with 1 evaluation.
Range (min … max): 1.506 ms … 14.943 ms ┊ GC (min … max): 0.00% … 88.95%
Time (median): 1.568 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.618 ms ± 620.070 μs ┊ GC (mean ± σ): 3.08% ± 6.71%
▁▂▄▆▅▅▄▂▁▂▂▃▄▄▇█▇▅▃
▂▃▃▅████████████████████▇▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂ ▄
1.51 ms Histogram: frequency by time 1.73 ms <
Memory estimate: 1.07 MiB, allocs estimate: 30647.
Smoothing inference
BenchmarkTools.Trial: 227 samples with 1 evaluation.
Range (min … max): 20.095 ms … 39.775 ms ┊ GC (min … max): 0.00% … 45.73%
Time (median): 20.692 ms ┊ GC (median): 0.00%
Time (mean ± σ): 22.080 ms ± 3.530 ms ┊ GC (mean ± σ): 6.15% ± 10.73%
▃▇█▅
█████▆▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▄▁▆▄▇▁▆▁▄▆▄▆▁▆▆▁▆▁▆▁▄▁▁▁▁▁▁▁▁▁▁▄ ▆
20.1 ms Histogram: log(frequency) by time 34.8 ms <
Memory estimate: 23.68 MiB, allocs estimate: 595580.
I've also checked the recent `master` (compiled from source with `make all`). The results are pretty similar to `1.11-beta1`.
Benchmark comparison between `1.11-beta1` and `master#53f452a7216d5e8af9ccadb4c8a86bcf61466a87`
Here are the results for `1.11-beta1` (installed via `juliaup`):
→ julia +1.11 --startup-file=no --project=julia1.11 julia1.11/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 6.808 μs … 1.758 ms ┊ GC (min … max): 0.00% … 98.98%
Time (median): 7.133 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.006 μs ± 24.727 μs ┊ GC (mean ± σ): 8.76% ± 3.37%
▅▆▇█▁
▁▁▂▃▆██████▆▄▄▃▃▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
6.81 μs Histogram: frequency by time 8.79 μs <
Memory estimate: 19.81 KiB, allocs estimate: 413.
Smoothing model creation
BenchmarkTools.Trial: 1632 samples with 1 evaluation.
Range (min … max): 2.770 ms … 7.357 ms ┊ GC (min … max): 0.00% … 58.78%
Time (median): 2.843 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.064 ms ± 608.889 μs ┊ GC (mean ± σ): 6.98% ± 12.24%
▅█▇▂
█████▇▆▁▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▁▄▃▅▅▄▅▆▇▆▆▅▆▅█▆█▇▇█▆▆▆▇▄▇▅▆▅▅▆ █
2.77 ms Histogram: log(frequency) by time 5.09 ms <
Memory estimate: 5.75 MiB, allocs estimate: 125717.
Filtering inference
BenchmarkTools.Trial: 3090 samples with 1 evaluation.
Range (min … max): 1.506 ms … 14.943 ms ┊ GC (min … max): 0.00% … 88.95%
Time (median): 1.568 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.618 ms ± 620.070 μs ┊ GC (mean ± σ): 3.08% ± 6.71%
▁▂▄▆▅▅▄▂▁▂▂▃▄▄▇█▇▅▃
▂▃▃▅████████████████████▇▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂ ▄
1.51 ms Histogram: frequency by time 1.73 ms <
Memory estimate: 1.07 MiB, allocs estimate: 30647.
Smoothing inference
BenchmarkTools.Trial: 227 samples with 1 evaluation.
Range (min … max): 20.095 ms … 39.775 ms ┊ GC (min … max): 0.00% … 45.73%
Time (median): 20.692 ms ┊ GC (median): 0.00%
Time (mean ± σ): 22.080 ms ± 3.530 ms ┊ GC (mean ± σ): 6.15% ± 10.73%
▃▇█▅
█████▆▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▄▁▆▄▇▁▆▁▄▆▄▆▁▆▆▁▆▁▆▁▄▁▁▁▁▁▁▁▁▁▁▄ ▆
20.1 ms Histogram: log(frequency) by time 34.8 ms <
Memory estimate: 23.68 MiB, allocs estimate: 595580.
And here are the results for `master` (commit 53f452a7216d5e8af9ccadb4c8a86bcf61466a87):
→ ~/Projects/Julia/julia/julia --startup-file=no julia1.12/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 6.906 μs … 1.597 ms ┊ GC (min … max): 0.00% … 98.97%
Time (median): 7.260 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.114 μs ± 28.223 μs ┊ GC (mean ± σ): 8.69% ± 2.60%
▁▅▇██▆▃▂▁
▁▁▂▄▇█████████▆▅▄▄▃▄▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
6.91 μs Histogram: frequency by time 8.75 μs <
Memory estimate: 19.81 KiB, allocs estimate: 413.
Smoothing model creation
BenchmarkTools.Trial: 1604 samples with 1 evaluation.
Range (min … max): 2.802 ms … 8.084 ms ┊ GC (min … max): 0.00% … 62.53%
Time (median): 2.880 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.118 ms ± 670.430 μs ┊ GC (mean ± σ): 7.31% ± 12.73%
▅█▆▂▁
██████▆▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▅▁▁▄▄▅▇▄▆▆▆▆▇▆▇▇▇▆▆▆▇▆▆▆█▇▆▅▄▅ █
2.8 ms Histogram: log(frequency) by time 5.35 ms <
Memory estimate: 5.75 MiB, allocs estimate: 125587.
Filtering inference
BenchmarkTools.Trial: 3092 samples with 1 evaluation.
Range (min … max): 1.528 ms … 15.176 ms ┊ GC (min … max): 0.00% … 89.22%
Time (median): 1.559 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.616 ms ± 637.183 μs ┊ GC (mean ± σ): 3.03% ± 6.47%
▂▃▆██▆▅▄
▂▃▄▇████████▇▇▅▅▃▃▃▃▄▄▄▅▅▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▂▁▂▁▁▂▂▁▁▁▂▂▂▂▂▁▂ ▃
1.53 ms Histogram: frequency by time 1.72 ms <
Memory estimate: 1.07 MiB, allocs estimate: 30647.
Smoothing inference
BenchmarkTools.Trial: 228 samples with 1 evaluation.
Range (min … max): 19.823 ms … 41.105 ms ┊ GC (min … max): 0.00% … 48.46%
Time (median): 20.509 ms ┊ GC (median): 0.00%
Time (mean ± σ): 21.933 ms ± 3.727 ms ┊ GC (mean ± σ): 6.33% ± 11.13%
▁▅█▇▄
█████▆▆▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄▄▆▆▁▁▆▆▁▆▆▄▆▁▁▆▄▄▁▆▁▁▆▁▁▁▁▆ ▆
19.8 ms Histogram: log(frequency) by time 33.4 ms <
Memory estimate: 23.67 MiB, allocs estimate: 595451.
I think the issue can be closed.