
Is there any example for multi-architecture support [JuliaGPU]?

wystephen opened this issue 2 years ago · 7 comments

As mentioned in the document,

> Caesar.jl can utilize a combination of four styles of multiprocessing: i) separate memory multi-process; ii) shared memory multi-threading; iii) asynchronous shared-memory (forced-atomic) co-routines; and iv) multi-architecture such as JuliaGPU. As of Julia 1.4, the most reliable method of loading all code into all contexts (for multi-processor speedup) is as follows.

Is there an example of solving a factor graph optimization problem using the GPU?

wystephen · Apr 05 '22 15:04

Hi, we have not published a public template for GPU-based computation yet. This refers to the design of how factor sampling and residuals are calculated. We worked to avoid restrictions as far as possible regarding where computations are done, so if the sampling process or the residual computation can benefit from the GPU, that is the place to start.

We previously did GPU work on factors, computing large FFTs on the GPU during factor residual computations. That was clunky before the upgrades; we have since significantly improved the API, but have not yet updated the legacy GPU implementation from that project. We are now mostly settled on the residual and sampling API, so at least that should be stable for a good while.
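For orientation, those two extension points look roughly like this. A minimal placeholder sketch only: `MyFactor` and its residual are hypothetical, following the same `CalcFactor` convention as the factor shown later in this thread.

```julia
# Sketch of the two extension points where GPU work could live.
using Distributions, Manifolds
import IncrementalInference as IIF
import DistributedFactorGraphs as DFG

struct MyFactor <: IIF.AbstractManifoldMinimize
    Z::MvNormal   # measurement noise model
end

# the factor's manifold, here plain Euclidean 3-space
DFG.getManifold(::MyFactor) = Manifolds.Euclidean(3)

# Hook 1: sampling -- draws the measurement particles for this factor.
IIF.getSample(cf::IIF.CalcFactor{<:MyFactor}) = rand(cf.factor.Z)

# Hook 2: residual -- called repeatedly per particle by the solver's
# optimization routines; heavy work inside here is what could go to a GPU.
function (cf::IIF.CalcFactor{<:MyFactor})(meas, x1, x2)
    return meas .- (x2 .- x1)
end
```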

dehann · Apr 06 '22 15:04

First of all, thanks for your reply. As you described, the primary GPU support is that users can use the GPU when calculating a residual function. Adopting the GPU will significantly reduce computation time if I have a few residuals that are each computationally expensive. But if I need to calculate lots of residuals, it is hard to benefit from the GPU support. Is this description correct?

wystephen · Apr 07 '22 01:04

> thanks for your reply.

Yeah, of course -- we try to answer as soon as possible, though we're not always able to :-P

> need to calculate lots of residuals, it is hard to benefit from the GPU support. Is this description correct?

Not quite: when a factor computation is done, many particles are computed for each factor calculation, and each of those particles is updated by optimization routines that call the residual functions many, many times over. Residual functions are called something like 10^5-10^8 times per solve, so in some cases the GPU can help a lot.

Let's say you have a residual computation that is intensive and takes 100 ms on the CPU with threading etc. If you were to port it to the GPU and get 20 ms for the equivalent computation, that's good.

Next step: when solving the factor graph, the Caesar.jl packages make extensive use of the residual and sampling functions. So even if the GPU implementation adds roughly 10-20% overhead across all the calculations, you will still be at ~30 ms rather than 100 ms per residual on the CPU, so everything will be a lot faster.

The flip side is when the CPU computation for a residual or sampling takes around 10 µs but the GPU takes longer due to memory management, etc.; then the residual computation will just be thrashing and the GPU will not help.

Where the GPU does work well is when you have dense data that needs to be processed in a data-intensive way. Many of the current factors in the Caesar.jl system compute on the order of 500 ns, so there is no point trying to make them work on the GPU. There are cases where the GPU will help with some of the standard factors; we just haven't gotten around to making an example for the community yet.

The caveat, as I alluded to above, is when you have large computations within each residual or sampling cycle; then the GPU is great. This is how we used it in a previous project.
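To make that concrete, here is a rough sketch only -- the CUDA.jl usage and the `DenseScanFactor` name are illustrative assumptions, not a published Caesar.jl template -- of a factor that keeps large dense data resident on the GPU and reduces it inside the residual:

```julia
# Rough sketch: heavy dense work inside the residual, offloaded with CUDA.jl.
# `DenseScanFactor` and its fields are hypothetical.
using CUDA, Distributions, Manifolds
import IncrementalInference as IIF
import DistributedFactorGraphs as DFG

struct DenseScanFactor <: IIF.AbstractManifoldMinimize
    Z::MvNormal
    scan::CuMatrix{Float32}   # large dense data, kept resident on the device
end

DFG.getManifold(::DenseScanFactor) = Manifolds.Euclidean(3)

IIF.getSample(cf::IIF.CalcFactor{<:DenseScanFactor}) = rand(cf.factor.Z)

function (cf::IIF.CalcFactor{<:DenseScanFactor})(meas, x)
    # GPU-parallel reduction over the dense data; only a scalar crosses back to the CPU.
    score = sum(abs2, cf.factor.scan)
    return meas .- (x .+ score)
end
```

Whether this wins depends entirely on the numbers above: the on-device work has to be expensive enough to amortize kernel-launch and transfer overhead, given the residual runs on the order of 10^5-10^8 times per solve.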

Can you perhaps say more about your application to see if we can point you in a good direction? What is the main issue you're trying to get around?

dehann · Apr 07 '22 06:04

I want to build two applications using graph optimization. First, I intend to use a factor graph to solve the long-term AHRS problem. Each factor is quite simple and may cost 1-2 ms, so it is hard to benefit from GPU-based factor calculation. By the way, I found that solving the tree spends 96% of its time compiling, and recompiles every time the shape of the graph changes. Is this normal? Second, I intend to solve a standard large multi-robot trajectory fusion problem. In detail, I have several trajectories, each several hours long, with several distance constraints (loop closures) between keyframes of different trajectories. So this problem may not benefit from the GPU.

wystephen · Apr 07 '22 08:04

> use a factor graph to solve the long-term AHRS problem

Cool, I think I follow.

> By the way, I found that solving the tree takes 96% time to compile. And recompile every time when the shape of the graph changes. Is this phenomenon normal?

No, that is not normal. Compiling should really only happen the first time, during JIT warm-up or precompile -- there should be no compiling during tree construction, even if the graph changes. Will have to look at that more closely:

  • JuliaRobotics/IncrementalInference.jl#1513
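As a quick sanity check -- a sketch, assuming one of IIF's canonical test graph generators such as `generateGraph_Kaess` -- a repeat solve on an unchanged graph should report essentially zero compilation time after the first warm-up call:

```julia
using IncrementalInference

fg = generateGraph_Kaess()   # small canonical test graph (assumed available)
@time solveGraph!(fg)        # first call: dominated by JIT compilation
@time solveGraph!(fg)        # repeat call: compilation fraction should be near 0%
```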

> Second, I intend to solve a standard large multi-robot trajectory fusion. In detail, I have several trajectories, which are about several hours. And there are several distance constraints (loop-closure) between different keyframes.

Got it -- that is something we are interested in, and we have some capabilities there. Let us know how it goes, or if that is something you might want to offload -- likely a longer conversation. Cc'ing @GearsAD.

> So, this problem may not benefit from GPU.

Probably not in this way directly. Indirectly, however, there are improvements to be had by upgrading some of the backend non-Gaussian computation steps to use the GPU, but that is on the medium- to long-term roadmap. So there is potential for "GPU makes the SLAM solve faster", but likely not in the timeframe of your current project, hence we're not advertising it too loudly at this stage.

dehann · Apr 08 '22 00:04

First, regarding the first answer: as I am not familiar with Julia, I timed the solves and checked whether compilation happens every time using the following code:

```julia
function test(n::Integer)  # n variables
    # ... build a factor graph `fg` with n variables ...
    @time solveGraph!(fg)
end

@time test(2)
@time test(3)
@time test(4)
@time test(5)
@time test(50)
```

and get the following output:


```
 25.603175 seconds (55.72 M allocations: 3.287 GiB, 3.23% gc time, 99.65% compilation time)  # inner @time: solveGraph!
 31.207331 seconds (70.92 M allocations: 4.275 GiB, 3.60% gc time, 99.63% compilation time)  # outer @time: test()

  1.667911 seconds (3.45 M allocations: 193.822 MiB, 3.02% gc time, 96.59% compilation time)
  2.498532 seconds (5.70 M allocations: 320.521 MiB, 4.49% gc time, 97.13% compilation time)

  0.914279 seconds (2.43 M allocations: 138.633 MiB, 5.36% gc time, 93.99% compilation time)
  1.678358 seconds (4.69 M allocations: 265.906 MiB, 2.92% gc time, 95.78% compilation time)

  1.188344 seconds (2.79 M allocations: 161.149 MiB, 4.10% gc time, 95.38% compilation time)
  2.018074 seconds (5.06 M allocations: 288.978 MiB, 5.30% gc time, 96.44% compilation time)

 26.730427 seconds (62.77 M allocations: 3.406 GiB, 3.70% gc time, 98.01% compilation time)
 62.106138 seconds (163.34 M allocations: 8.934 GiB, 3.97% gc time, 98.29% compilation time)
```

After the first run the compile time is much shorter, but why do subsequent runs still report any compilation time at all?

Furthermore, why does it recompile whenever a new factor is added? The output looks like this:

```
[ Info: try doautoinit! of x57
[ Info: try doautoinit! of bm
[ Info: init with useinitfct Symbol[]
[ Info: try doautoinit! of mn
[ Info: init with useinitfct Symbol[]
  0.382486 seconds (1.12 M allocations: 62.910 MiB, 98.27% compilation time)
```

The factor mentioned above looks like this:

```julia
# assumed imports for this snippet
using Caesar, Distributions, LinearAlgebra, StaticArrays, Manifolds
import IncrementalInference as IIF
import DistributedFactorGraphs as DFG

Base.@kwdef struct AbsMagFactor{T <: AbstractFloat} <: IIF.AbstractManifoldMinimize
    Z::Caesar.Distribution # = Distributions.MvNormal(LinearAlgebra.diagm(5.0 * ones(3))) # zero-mean Gaussian distribution
    gyr_data::Matrix{T}
    mag_data::Matrix{T}
    m0_A::Vector{T}
    m0_B::SMatrix{3, 3, T}
    function AbsMagFactor{T}(gyr_data::Matrix{T}, mag_data::Matrix{T}, dt::T = 0.01) where {T <: AbstractFloat}
        # integrate_avg_mag does not recompile and takes less than 1 ms
        m0_A, m0_B = integrate_avg_mag(gyr_data, mag_data, dt)
        new(MvNormal(zeros(T, 3), Matrix{T}(I, 3, 3)), gyr_data, mag_data, m0_A, m0_B)
    end
end

DFG.getManifold(::AbsMagFactor{T}) where {T <: AbstractFloat} = Manifolds.Euclidean(3)

# X is the sampled measurement from getSample; factor fields are reached via cf.factor
function (cf::CalcFactor{<:AbsMagFactor{T}})(X, R, bm, mag_field_vec) where {T <: AbstractFloat}
    return mag_field_vec - embed(SpecialOrthogonal(3), R, cf.factor.m0_A - cf.factor.m0_B * bm)
end

function IIF.getSample(cf::CalcFactor{<:AbsMagFactor})
    return rand(cf.factor.Z)
end
```

I need more time to familiarize myself with this framework for my second target [2D SLAM with loop closure].

wystephen · Apr 10 '22 03:04

Hi @wystephen ,

Thanks for pointing out the slow compile times when growing the graph. That was just found and fixed here:

  • JuliaRobotics/IncrementalInference.jl#1565

dehann · Jul 15 '22 16:07