StochasticDiffEq.jl

Try VectorizedRNG.jl

Open ChrisRackauckas opened this issue 5 years ago • 8 comments

I just seem to get segfaults though.

ChrisRackauckas avatar May 18 '20 01:05 ChrisRackauckas

using StochasticDiffEq
using BenchmarkTools

α = 1
β = 1
u0 = [1/2]
# Linear drift and diffusion: du = u dt + u dW (geometric Brownian motion)
f(du,u,p,t) = (du .= u)
g(du,u,p,t) = (du .= u)
dt = 1//2^(4)
tspan = (0.0, 1.0)
prob = SDEProblem(f, g, u0, tspan)
sol = solve(prob, EM(), dt=dt)
@code_warntype solve(prob, EM(), dt=dt)
@btime sol = solve(prob, EM(), dt=dt)

This is the test case.

ChrisRackauckas avatar May 18 '20 01:05 ChrisRackauckas

Note this is on top of https://github.com/SciML/StochasticDiffEq.jl/pull/291

ChrisRackauckas avatar May 18 '20 01:05 ChrisRackauckas

I just seem to get segfaults though.

Yikes. I'll look into it. EDIT: Travis says:

signal (11): Segmentation fault
in expression starting at /home/travis/build/SciML/StochasticDiffEq.jl/test/first_rand_test.jl:23
macro expansion at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:131 [inlined]
vload at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:106 [inlined]
vloada at /home/travis/.julia/packages/SIMDPirates/ShWZm/src/memory.jl:295 [inlined]
getstate at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/xoshiro.jl:98 [inlined]
random_sample! at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/api.jl:140
randn! at /home/travis/.julia/packages/VectorizedRNG/65mRL/src/api.jl:180 [inlined]
wiener_randn! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/wiener.jl:9 [inlined]
INPLACE_WHITE_NOISE_DIST at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/wiener.jl:42
calculate_step! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/noise_interfaces/simple_noise_process_interface.jl:49 [inlined]
setup_next_step! at /home/travis/.julia/packages/DiffEqNoiseProcess/w4DdT/src/noise_interfaces/simple_noise_process_interface.jl:44 [inlined]
setup_next_step! at /home/travis/build/SciML/StochasticDiffEq.jl/src/integrators/integrator_utils.jl:2 [inlined]
#__init#48 at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:488
unknown function (ip: 0x7f60045dbaef)
__init##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:64 [inlined]
__init##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:64 [inlined]
#__solve#47 at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6 [inlined]
__solve##kw at /home/travis/build/SciML/StochasticDiffEq.jl/src/solve.jl:6
unknown function (ip: 0x7f60045d7595)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2158 [inlined]

chriselrod avatar May 18 '20 01:05 chriselrod

What is _seed? My fault for not documenting it:

help?> local_rng()
  No documentation found.

But local_rng's argument is the thread id - 1. If you pass it an integer greater than or equal to the number of threads (or less than 0), it'll segfault.

VectorizedRNG will allocate memory for Threads.nthreads() states, and then local_rng returns a state that just wraps a pointer:

julia> local_rng(11212351235412351)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x9f56a135d27e6d00)

julia> local_rng(0)
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)

julia> local_rng()
VectorizedRNG.Xoshift{4}(Ptr{UInt64} @0x00005563e3387100)

This is mostly to let it control alignment. It also means that, while the states are close together in memory, cache line boundaries should be aligned with the different thread-local parts, so false sharing shouldn't be an issue.
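
Given that, the safe pattern is to let local_rng() pick the index itself rather than passing one in. A minimal sketch of multithreaded use under that assumption (threaded_randn! is just an illustrative helper, not VectorizedRNG API):

using VectorizedRNG, Random, Base.Threads

# Calling local_rng() with no argument inside each task resolves to the
# current thread's state, so the index stays in the valid
# 0:(nthreads() - 1) range and no RNG state is shared between tasks.
function threaded_randn!(columns::Vector{Vector{Float64}})
    @threads for i in eachindex(columns)
        randn!(local_rng(), columns[i])  # thread-local state, no locking needed
    end
    return columns
end

threaded_randn!([Vector{Float64}(undef, 64) for _ in 1:8])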

chriselrod avatar May 18 '20 02:05 chriselrod

If I remove the _seed argument as a hack, the Convergence Test on 2D Linear tests fail locally:

Test Summary:                    | Pass  Fail  Total
Two-dimensional Linear SDE Tests |    1    37     38

Benchmarks. Latest release:

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  23.88 KiB
  allocs estimate:  118
  --------------
  minimum time:     5.051 μs (0.00% GC)
  median time:      5.311 μs (0.00% GC)
  mean time:        6.844 μs (20.10% GC)
  maximum time:     503.748 μs (96.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

This branch:

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.22 KiB
  allocs estimate:  95
  --------------
  minimum time:     3.863 μs (0.00% GC)
  median time:      4.177 μs (0.00% GC)
  mean time:        4.991 μs (14.98% GC)
  maximum time:     709.685 μs (99.14% GC)
  --------------
  samples:          10000
  evals/sample:     8

small_overhead (291):

julia> @benchmark sol = solve($prob,EM(),dt=$dt)
BenchmarkTools.Trial:
  memory estimate:  6.17 KiB
  allocs estimate:  95
  --------------
  minimum time:     2.902 μs (0.00% GC)
  median time:      3.150 μs (0.00% GC)
  mean time:        3.923 μs (18.48% GC)
  maximum time:     636.300 μs (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     9

So it looks like this regresses performance.

What are the lengths of the involved arrays? Any good way to investigate this regression?

chriselrod avatar May 18 '20 02:05 chriselrod

Awesome, that got it working.

What are the lengths of the involved arrays?

Here, just 1. It's of course problem dependent, from 1 to 100,000.

Any good way to investigate this regression?

Let me get that small-optimizations branch merged and it should be easier. All this PR adds on top of it, though, is swapping out the RNG, so it's really this vs. Xorshifts.Xoroshiro128Plus.

ChrisRackauckas avatar May 18 '20 02:05 ChrisRackauckas

I assume that VectorizedRNG.jl only supports Arrays of Float64? Could it also do out-of-place with static-array-like things?

ChrisRackauckas avatar May 18 '20 02:05 ChrisRackauckas

Awesome, that got it working. All this PR adds on top of it, though, is swapping out the RNG, so it's really this vs. Xorshifts.Xoroshiro128Plus.

I'm worried about the test failure that I got. And, while not failures, these warnings did not appear when testing the small_overhead branch:

Solve and Plot
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329
┌ Warning: Interrupted. Larger maxiters is needed.
└ @ DiffEqBase ~/.julia/packages/DiffEqBase/SMlFm/src/integrator_interface.jl:329

VectorizedRNG does reasonably well on the SmallCrush RNG test suite, but the sin approximation might need to be a little more accurate. I'm sure a numerical analyst like @YingboMa can do better than the Remez algorithm. Or I could, given the time to find out how people normally come up with SIMD special-function implementations.
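
For context on how these kernels are usually built: SIMD special functions are written branch-free by reducing the argument to a small interval and evaluating a fixed-degree polynomial with fused multiply-adds. A toy sketch of that evaluation style, using plain Taylor coefficients rather than VectorizedRNG's actual (Remez-fitted) kernel:

# Toy sketch of the branch-free polynomial style used in SIMD special
# functions: Horner's rule via muladd (which can compile to fma). The
# coefficients are Taylor terms for sin, accurate only on a reduced range
# like |x| <= pi/4; a Remez/minimax fit would replace them in practice.
function sin_poly(x::Float64)
    x2 = x * x
    c1, c2, c3 = -1/6, 1/120, -1/5040
    p = muladd(x2, c3, c2)  # c2 + x^2*c3
    p = muladd(x2, p, c1)   # c1 + x^2*(c2 + x^2*c3)
    p = muladd(x2, p, 1.0)  # 1 + x^2*(...)
    return x * p            # sin(x) ≈ x * p(x^2)
end

sin_poly(0.3) ≈ sin(0.3)  # true on the reduced range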

Also, it's using a Xoshiro256plus. Maybe I should implement the Xoroshiro128Plus.

Here, just 1. It's of course problem dependent, from 1 to 100,000.

Ah. 1 is a problem for performance.

julia> rng = Xorshifts.Xoroshiro128Plus()
RandomNumbers.Xorshifts.Xoroshiro128Plus(0xef4459b15a5cab2a, 0x33d487d5296eb0e2)

julia> x = [0.0];

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.429 ns (0.00% GC)
  median time:      9.852 ns (0.00% GC)
  mean time:        9.880 ns (0.00% GC)
  maximum time:     41.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.980 ns (0.00% GC)
  median time:      22.052 ns (0.00% GC)
  mean time:        22.084 ns (0.00% GC)
  maximum time:     48.035 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.235 ns (0.00% GC)
  median time:      11.645 ns (0.00% GC)
  mean time:        11.665 ns (0.00% GC)
  maximum time:     21.735 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> x = Vector{Float64}(undef, 15);

julia> @benchmark randn!($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     50.670 ns (0.00% GC)
  median time:      52.744 ns (0.00% GC)
  mean time:        52.787 ns (0.00% GC)
  maximum time:     88.798 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark randn!(local_rng(), $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.716 ns (0.00% GC)
  median time:      29.022 ns (0.00% GC)
  mean time:        29.019 ns (0.00% GC)
  maximum time:     52.158 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark randn!($rng, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     48.997 ns (0.00% GC)
  median time:      50.915 ns (0.00% GC)
  mean time:        50.944 ns (0.00% GC)
  maximum time:     67.167 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     989

That should be addressable though:

  1. Checking array size to generate smaller batches:
julia> @btime randn(local_rng(), NTuple{1,NTuple{2,Core.VecElement{Float64}}})
  13.260 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.057132286810525984), VecElement{Float64}(0.7160102533566447)),)

That's for very short arrays. It can also generate a single uniform at a time:

julia> @btime rand(local_rng(), NTuple{1,NTuple{1,Core.VecElement{Float64}}})
  2.269 ns (0 allocations: 0 bytes)
((VecElement{Float64}(0.30066402503431744),),)

But that isn't an option right now for normals because of the Box-Muller algorithm. Since Box-Muller is a really slow way to generate just 2 random normals anyway, we should fall back to the Ziggurat algorithm for small (non-SIMD) batches of normals.
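
For reference, a minimal scalar Box-Muller sketch, which shows why the algorithm naturally produces normals in pairs, so asking it for a single normal wastes half the work:

using Random

# One pair of uniforms (u1, u2) is transformed into exactly two
# independent standard normals; there is no cheap way to get just one.
function box_muller(rng::AbstractRNG)
    u1, u2 = rand(rng), rand(rng)
    r = sqrt(-2 * log(u1))  # radius from the exponential tail
    θ = 2π * u2             # uniformly distributed angle
    return r * cos(θ), r * sin(θ)
end

box_muller(Random.default_rng())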

As an alternative to generating small batches on demand, store a buffer of perhaps 256 random numbers and simply load from it, refilling the buffer every time it reaches the end.
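
A sketch of that buffering idea, with hypothetical names (BufferedNormals and nextnormal are illustrative, not VectorizedRNG API):

using VectorizedRNG, Random

# Refill 256 normals in one fast SIMD batch, then hand them out one at a
# time, amortizing the batch cost across individual draws.
mutable struct BufferedNormals
    buffer::Vector{Float64}
    pos::Int
end

function BufferedNormals(n::Integer = 256)
    buf = Vector{Float64}(undef, n)
    randn!(local_rng(), buf)  # one vectorized pass fills the whole buffer
    return BufferedNormals(buf, 0)
end

function nextnormal(b::BufferedNormals)
    if b.pos == length(b.buffer)   # exhausted: refill in one SIMD batch
        randn!(local_rng(), b.buffer)
        b.pos = 0
    end
    b.pos += 1
    return @inbounds b.buffer[b.pos]
end

b = BufferedNormals()
nextnormal(b)  # a single normal, cheap once the batch cost is amortized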

I assume that VectorizedRNG.jl only supports Arrays of Float64? Could it also do out-of-place with static-array-like things?

Currently, it only supports DenseArrays that let you get pointers.

It typically generates NTuple{N,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}} where N is 1, 2, or 4, and then uses the pointer to store them into the array (masking off any remainder, which should be efficient with AVX [via vmaskmov] and especially AVX512 [via opmask registers]).

julia> randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
((VecElement{Float64}(-0.35070219008784215), VecElement{Float64}(-1.1500644689621624), VecElement{Float64}(-1.6140966900130058), VecElement{Float64}(-0.13218650868339507), VecElement{Float64}(-1.145717731296182), VecElement{Float64}(-1.4488616660755809), VecElement{Float64}(-1.219831252656312), VecElement{Float64}(0.26954048537845754)), (VecElement{Float64}(-0.8649042896705302), VecElement{Float64}(1.1491223266582307), VecElement{Float64}(-0.8821573749577012), VecElement{Float64}(0.19394881631501515), VecElement{Float64}(1.957643704171434), VecElement{Float64}(-2.927466363335935), VecElement{Float64}(-0.14172465407366736), VecElement{Float64}(-1.5318948796751526)), (VecElement{Float64}(-0.34632071461043773), VecElement{Float64}(0.17695680200558128), VecElement{Float64}(0.34350477272227437), VecElement{Float64}(1.2389961224532435), VecElement{Float64}(1.1252296319202757), VecElement{Float64}(1.8952112260924336), VecElement{Float64}(0.7779111642563302), VecElement{Float64}(-1.0682395626203594)), (VecElement{Float64}(-0.40499368481842707), VecElement{Float64}(-0.5697825884055833), VecElement{Float64}(0.9163026476719056), VecElement{Float64}(-1.6460932788182563), VecElement{Float64}(-1.071672933005311), VecElement{Float64}(0.6215884503640314), VecElement{Float64}(-0.09387081715702705), VecElement{Float64}(0.07085548151904485)))

julia> @benchmark randn(local_rng(), NTuple{4,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     35.995 ns (0.00% GC)
  median time:      36.122 ns (0.00% GC)
  mean time:        36.191 ns (0.00% GC)
  maximum time:     76.804 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark randn(local_rng(), NTuple{2,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.060 ns (0.00% GC)
  median time:      21.076 ns (0.00% GC)
  mean time:        21.101 ns (0.00% GC)
  maximum time:     30.223 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark randn(local_rng(), NTuple{1,NTuple{VectorizedRNG.W64,Core.VecElement{Float64}}})
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.668 ns (0.00% GC)
  median time:      16.484 ns (0.00% GC)
  mean time:        16.315 ns (0.00% GC)
  maximum time:     44.946 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

So for out-of-place support, you just need another way to get these into your preferred data structure. For something like a StaticArray, that would probably mean getindex and .value as necessary to turn these into regular NTuples. At small enough sizes, the compiler should eliminate all of that and more or less generate the code we wanted anyway.
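
A sketch of what that could look like (randn_svector is hypothetical, and it assumes the VectorizedRNG.W64 constant and the tuple-returning randn methods shown above):

using VectorizedRNG, StaticArrays

# Draw a batch as NTuple{B,NTuple{W,VecElement{Float64}}}, then strip the
# VecElement wrappers with getindex/.value and rebuild as an SVector.
function randn_svector(::Val{N}) where {N}
    W = VectorizedRNG.W64
    B = cld(N, W)  # number of SIMD batches needed to cover N normals
    z = randn(local_rng(), NTuple{B,NTuple{W,Core.VecElement{Float64}}})
    flat = ntuple(i -> z[cld(i, W)][mod1(i, W)].value, Val(N))
    return SVector{N,Float64}(flat)
end

randn_svector(Val(3))  # SVector{3,Float64}; the compiler should elide the unpacking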

chriselrod avatar May 18 '20 03:05 chriselrod

VectorizedRNG is only really good at batched sampling. Even though Base's default RNG also offers a vectorized Xoshiro implementation, VectorizedRNG.jl is still a decent bit faster:

julia> using VectorizedRNG, BenchmarkTools, Random

julia> x = Vector{Float64}(undef, 1024);

julia> @btime rand!($(Random.default_rng()), $x);
  274.723 ns (0 allocations: 0 bytes)

julia> @btime rand!($(VectorizedRNG.local_rng()), $x);
  178.311 ns (0 allocations: 0 bytes)

julia> @btime randn!($(Random.default_rng()), $x);
  1.660 μs (0 allocations: 0 bytes)

julia> @btime randn!($(VectorizedRNG.local_rng()), $x);
  1.213 μs (0 allocations: 0 bytes)

Here, just 1. It's of course problem dependent, from 1 to 100,000.

So if it makes sense to do larger batches, we could consider this again.

chriselrod avatar Nov 17 '23 12:11 chriselrod

Yeah, this is just so stale at this point, and it's not generally the right thing to do, so there's no reason to keep the branch around. A special noise process can be built around this if big SPDEs ever run into RNG sampling as the time-limiting part.

ChrisRackauckas avatar Nov 17 '23 14:11 ChrisRackauckas