RoME.jl
Running RoME in multiple Julia instances freezes solving
Hi,
This is an issue very specific to the experiments for my thesis/application:
As I want to perform a lot of individual solves for my thesis, I run multiple Julia instances (each using multiple workers and threads) on a pretty beefy machine. This seems to be faster than running them sequentially, because parallelism can't be exploited everywhere. It apparently worked fine until the RoME v0.13 release (I think). Now the solver just "hangs up" in all but one instance (at a different pose every time I try).
This is the last thing that gets written to stdout (aka my logs ;)):
Solve Progress: approx max 1752, at iter 711 Time: 0:01:50
[ Info: CSM-5 Clique 47 finished
Solve Progress: approx max 1752, at iter 716 Time: 0:01:51
I run each Julia instance in its own screen session, like this:
screen -S <sessionname>
Then I fire up my evaluation script, which reads my dataset from a matfile and takes some other arguments (the command gets auto-generated by my MATLAB frontend):
julia -t auto -p 8 --project=Masterarbeit/svn/julia -J ~/.julia/sysimage_RoME.so -- Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl --path tmp --file sigmaTrajTf-0-001-0-000-0-005_sigmaLmBearingAndRanging-0-004-0-001_wrongRatio-1-000_minDist-0-200_maxDist-2-000.mat --trajKeys tf --lmKeys bearingAndRanging --startPoseIdx 1 --endPoseIdx 150 --startPoseVal "[7.15793412168699;3.38794983135837;-1.36840501375598]" --plotSaveFinal 1 --plotSaveIter 1 --nRuns 1 --suffix variables --useMsgLikelihoods 0 --nullhypo 0.000000 --tukey 15.000000 --nKernels 100 --spreadNH 3.000000 --inflation 3.000000
I will resort to running things sequentially (as any online application would), but especially for tuning parameters like spreadNH or inflation, running things in parallel was very useful, as I don't have access to an infinite number of machines ;) As multiprocess/multithread performance increased in the last releases, I suspect that the instances get in each other's way somewhere (maybe not even directly in RoME, but in Julia in general) and some lock does not get released.
Best, Leo
Hi @lemauee ,
Oops, that is odd. This might be something upstream. We'd have to look a bit more at what's happening. In the meantime, perhaps we can help compute results on local servers. Do you perhaps want to send tar.gz files of the graphs you want solved, so I can run them on compute power here? I will return the results in a new tar.gz. Perhaps in the process we start getting the same error here. I'm quite surprised a multi-Julia issue might be happening. If it's not there, then perhaps it's something in Distributed.jl. Either way, we'd likely have to fix it locally in the short term.
I'm sure you are already doing this, but just in case: you can generate the graphs with fg = initfg(); getSolverParams(fg).graphinit=false and then the generation of graphs should be pretty quick. Solving can then use either the graphinit method (might not follow the exact sequence) or the treeinit method (our desired but still experimental one) at a later solve.
Lastly, I just merged (what I expect) the fix for your issue #378. Will tag IIF v0.21.3 in a few hours.
cc: it might be worth @jim-hill-r (hi!) looking out for something like this multiple-Julia-instances issue too.
Best, Dehann
Hi @dehann,
I don't know if sending you the tar.gz would help to reproduce the issue as expected, as I'm still doing a solve for every pose (I know that this is slow, but consistency with the online approach we use in classical parametric factor graphs was important to us).
However, I'd happily send you the Julia project needed to run my stuff, the matfile and the commands to invoke the solve. Otherwise I could also set things up through ssh on your server and then see if I can reproduce it there.
Best, Leo
The same hangup just happened when running only one Julia instance, so this might not even be multi-instance-related (although yesterday's test suggested so...). Also, it occurred on one of my private PCs, not one of the university's. I recently did a fresh Ubuntu install there, so nothing strange in the environment should get in the way. Maybe it's also worth trying without a precompiled sysimage, as I once had problems with RoMEPlotting, as seen in JuliaRobotics/RoMEPlotting.jl#140, as you know.
Best, Leo
What's also notable is that absolutely nothing is being executed when the hangup occurs (no CPU load can be seen in top or anything similar).
Just a sanity check question, did you update the precompiled sysimage with the latest tags?
Other than that, do you know what clique is stalling?
One way to find out is to draw the tree (drawtree=true, showtree=true) and see at what clique it is waiting on.
Then you can try turning on debug messages for that clique CSM, say for 2:
ENV["JULIA_DEBUG"] = :csm_2
And post results here please.
I just compiled the sysimage this morning using RoME v0.13.0 and IIF v0.21.2. The problem is that I think the clique it stopped on was a different one every time, but I still need to confirm that. Some examples even ran through, although they differed only in the amount of noise or even just the setting of the spreadNH parameter, and the failing ones "made it differently far". I just started a run without the compiled sysimage to rule out that source of failure. To draw the tree I'll need a machine with x-server to start some pdf viewer, right?
To draw the tree I'll need a machine with x-server to start some pdf viewer, right?
You should be able to set drawtree=true and then open the tree over SSH. I haven't done it before, so I'm just guessing.
Edit: The tree should be at the logger file location or /tmp/caesar
You could also try setting a timeout and then seeing if there is some more information on where it timed out, eg:
solveTree!(fg, timeout=70)
timeout is in seconds, I suppose? Then I should give it half an hour (1800s) or so for my example to be safe, I'd say ;) I will now start an example with and without the sysimage, including drawtree, showtree and timeout, and report back.
Seconds, yes. It applies per step, for every clique individually, so a lower value should be fine.
So let's make it 5 minutes (300s), to be safe and not kill something prematurely.
There are a few tools that we use to debug the CSM, you can also try this one: https://github.com/JuliaRobotics/IncrementalInference.jl/issues/443#issuecomment-699843926
Just got an error when trying to run with drawtree, showtree and timeout:
ERROR: LoadError: MethodError: no method matching solveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree; multithread=true, drawtree=true, showtree=true, timeout=300)
Closest candidates are:
solveTree!(::AbstractDFG, ::AbstractBayesTree; timeout, storeOld, verbose, verbosefid, delaycliqs, recordcliqs, limititercliqs, injectDelayBefore, skipcliqids, eliminationOrder, variableOrder, eliminationConstraints, variableConstraints, smtasks, dotreedraw, runtaskmonitor, algorithm, multithread) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:258 got unsupported keyword arguments "drawtree", "showtree"
solveTree!(::AbstractDFG) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:258 got unsupported keyword arguments "multithread", "drawtree", "showtree", "timeout"
Stacktrace:
[1] kwerr(::NamedTuple{(:multithread, :drawtree, :showtree, :timeout),Tuple{Bool,Bool,Bool,Int64}}, ::Function, ::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree) at ./error.jl:157
Sorry should have been clearer:
getSolverParams(fg).drawtree = true
getSolverParams(fg).showtree = true
There are a few tools that we use to debug the CSM, you can also try this one: JuliaRobotics/IncrementalInference.jl#443 (comment)
This will be difficult if Julia "hangs up", right? I don't know how to do the fetching step after the REPL is frozen (I would even need to find a way to run my script in the REPL with command line arguments first).
If the timeout is triggered it should stop everything and continue with the next lines.
How hard are the performance hits of enabling this? Is there a good way to write this debugging info to file to include it into my normal workflow? Or how to trigger this best on the timeout kicking in?
I don't think you want it in normal operation. You can remove the recordcliqs=ls(fg) or just record a few cliques if you know where the problem is.
You can add this without a performance hit: https://github.com/JuliaRobotics/IncrementalInference.jl/issues/443#issuecomment-785895964
Opening the bt.pdf (using evince) to see the solve progress currently fails on my machine:
File type Graphviz DOT graph (text/vnd.graphviz) is not supported
Seems like it's not a PDF but a dot file.
The dot file using the dot viewer seems to be working though :)
Also updates fine through sshfs
You need graphviz and xdot, https://juliarobotics.org/Caesar.jl/latest/installation_environment/#Local-Dependencies
It's clearly reproducible when running only one Julia instance. I suspect using more workers makes it appear more often.
Tree:
cliq4_stacktrace:
InterruptException:
Stacktrace:
[1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
[2] wait at ./task.jl:713 [inlined]
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] _wait(::Task) at ./task.jl:238
[5] sync_end(::Channel{Any}) at ./task.jl:294
[6] macro expansion at ./task.jl:333 [inlined]
[7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
[8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
[9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
[10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
[11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
[12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
[13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169
cliq4_csm.txt is empty
cliq14_stacktrace:
InterruptException:
Stacktrace:
[1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
[2] wait at ./task.jl:713 [inlined]
[3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
[4] take_unbuffered(::Channel{LikelihoodMessage}) at ./channels.jl:405
[5] take!(::Channel{LikelihoodMessage}) at ./channels.jl:381
[6] takeBeliefMessageDown!(::MetaBayesTree, ::LightGraphs.SimpleGraphs.SimpleEdge{Int64}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/TreeMessageAccessors.jl:115
[7] waitForDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:467
[8] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
[9] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
[10] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
[11] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169
cliq14_csm.txt is empty
cliq29_stacktrace:
InterruptException:
Stacktrace:
[1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
[2] wait at ./task.jl:713 [inlined]
[3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
[4] take_unbuffered(::Channel{LikelihoodMessage}) at ./channels.jl:405
[5] take!(::Channel{LikelihoodMessage}) at ./channels.jl:381
[6] takeBeliefMessageDown!(::MetaBayesTree, ::LightGraphs.SimpleGraphs.SimpleEdge{Int64}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/TreeMessageAccessors.jl:115
[7] waitForDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:467
[8] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
[9] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
[10] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
[11] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169
cliq29_csm.txt is empty
last lines of stdout:
f3b00c5b/etc/fonts/conf.d/80-delicious.conf", line 6: invalid attribute 'version'
Fontconfig warning: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 4: unknown element "its:rules"
Fontconfig warning: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: unknown element "its:translateRule"
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: invalid attribute 'translate'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: invalid attribute 'selector'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 6: invalid attribute 'xmlns:its'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 6: invalid attribute 'version'
Fontconfig error: Cannot load config file from /home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/fonts.conf
┌ Error: Task 14 failed, sending error to all cliques
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:183
[ Info: All cliques should have exited
┌ Error: Task 29 failed, sending error to all cliques
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:183
[ Info: All cliques should have exited
InterruptException:
Stacktrace:
[1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
[2] wait at ./task.jl:713 [inlined]
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] _wait(::Task) at ./task.jl:238
[5] sync_end(::Channel{Any}) at ./task.jl:294
[6] macro expansion at ./task.jl:333 [inlined]
[7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
[8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
[9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
[10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
[11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
[12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
[13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169
┌ Warning: printCliqHistorySummary -- No CSM history found.
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/TreeDebugTools.jl:211
ERROR: LoadError: TaskFailedException:
InterruptException:
Stacktrace:
[1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
[2] wait at ./task.jl:713 [inlined]
[3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
[4] _wait(::Task) at ./task.jl:238
[5] sync_end(::Channel{Any}) at ./task.jl:294
[6] macro expansion at ./task.jl:333 [inlined]
[7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
[8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
[9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
[10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
[11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
[12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
[13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169
...and 2 more exception(s).
Stacktrace:
[1] sync_end(::Channel{Any}) at ./task.jl:314
[2] macro expansion at ./task.jl:333 [inlined]
[3] taskSolveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64; oldtree::MetaBayesTree, drawtree::Bool, verbose::Bool, verbosefid::Base.TTY, limititers::Int64, limititercliqs::Array{Pair{Symbol,Int64},1}, downsolve::Bool, incremental::Bool, multithread::Bool, skipcliqids::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, delaycliqs::Array{Symbol,1}, smtasks::Array{Task,1}, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:49
[4] solveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree; timeout::Int64, storeOld::Bool, verbose::Bool, verbosefid::Base.TTY, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, limititercliqs::Array{Pair{Symbol,Int64},1}, injectDelayBefore::Nothing, skipcliqids::Array{Symbol,1}, eliminationOrder::Nothing, variableOrder::Nothing, eliminationConstraints::Array{Symbol,1}, variableConstraints::Nothing, smtasks::Array{Task,1}, dotreedraw::Array{Int64,1}, runtaskmonitor::Bool, algorithm::Symbol, multithread::Bool) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:371
[5] macro expansion at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:163 [inlined]
[6] macro expansion at ./timing.jl:233 [inlined]
[7] top-level scope at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:159
[8] include(::Function, ::Module, ::String) at ./Base.jl:380
[9] include(::Module, ::String) at ./Base.jl:368
[10] exec_options(::Base.JLOptions) at ./client.jl:296
[11] _start() at ./client.jl:506
in expression starting at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:128
lemau@joule:~$
I will run it again and see if it fails at a different spot.
Can you upload the saved factor graph for the above example and I'll see if I can reproduce it.
It is clique 4 that is timing out on the down solve. We haven't seen something like this before.
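For context on the traces above: the cliq14 and cliq29 stacks both end in take! on an unbuffered Channel{LikelihoodMessage}, i.e. those cliques are parked waiting for a message that never arrives, which would also fit the observation that no CPU load is visible during the hangup. The following is only a minimal sketch of that waiting pattern in plain Julia; the channel and task here are illustrative stand-ins and are not the IncrementalInference code.

```julia
# Unbuffered channel: take! blocks until some other task calls put!.
ch = Channel{Int}(0)

# Consumer, loosely analogous to a clique waiting for a down message.
consumer = @async take!(ch)

# No producer ever sends, so the consumer stays parked and burns no CPU.
# timedwait polls the predicate and gives up after the given number of seconds.
status = timedwait(() -> istaskdone(consumer), 0.5)
@show status

# Closing the channel wakes the waiting task with an exception,
# which fetch then rethrows wrapped in a TaskFailedException.
close(ch)
result = try
    fetch(consumer)
catch err
    err
end
@show typeof(result)
```

The point of the sketch is that a blocked take! is silent: nothing fails until something external (a timeout, an interrupt, or closing the channel) wakes the task, which matches the InterruptException entries in the stack traces.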
Unfortunately as I ran this as a script and not from the REPL (as I would have to look into how to supply all the command line arguments to my script in the REPL) I only have the "pose-iteration" before the failure. I can inform myself (or do you know an easy way how to do this) and do this in a later run and save it out, but I'd wait for the repetition I run now (should be finished in 1-2 hours if it breaks at the same spot) to see if it breaks at the same spot.
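On supplying command line arguments from the REPL: one generic approach in Julia is to overwrite the contents of the global ARGS vector and then include the script. The sketch below uses a throwaway script that only copies ARGS; the file name echo_args.jl and the variable seen_args are made up for illustration, and the real evaluation script may parse its arguments differently.

```julia
# Hypothetical stand-in for the evaluation script: it only reads ARGS.
script = joinpath(mktempdir(), "echo_args.jl")
write(script, "seen_args = copy(ARGS)\n")

# ARGS is an ordinary Vector{String}, so its contents can be replaced
# in place before the script is included in the running session.
empty!(ARGS)
append!(ARGS, ["--path", "tmp", "--nRuns", "1"])
include(script)

@show seen_args
```

Because the script then runs inside the live session, variables such as the factor graph would remain accessible afterwards, provided the call returns or is interrupted.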
Another thing you could try is to turn on debug. I think it doesn't work well with multithread=true yet, so perhaps switch that off.
getSolverParams(fg).dbg=true
It's also possible that multithread=true is causing your problem.
Unfortunately as I ran this as a script and not from the REPL (as I would have to look into how to supply all the command line arguments to my script in the REPL) I only have the "pose-iteration" before the failure. I can inform myself (or do you know an easy way how to do this) and do this in a later run and save it out, but I'd wait for the repetition I run now (should be finished in 1-2 hours if it breaks at the same spot) to see if it breaks at the same spot.
Perhaps save the fg before every solve in the script (if I'm understanding you correctly?)
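A minimal sketch of the "snapshot before every solve" idea follows. It deliberately uses Julia's generic Serialization standard library rather than the library's own graph-saving function, and the names state and step! are hypothetical stand-ins for the factor graph and for one incremental solve. Whether plain serialization round-trips a real factor graph is an assumption that has not been checked here.

```julia
using Serialization

# Hypothetical stand-ins: `state` plays the role of the factor graph,
# `step!` the role of one incremental solve at pose i.
state = Dict{Symbol,Float64}()
step!(s, i) = (s[Symbol("x", i)] = float(i); s)

snapdir = mktempdir()
for i in 1:3
    # Write the snapshot *before* the step that might hang.
    serialize(joinpath(snapdir, "before_pose_$(i).jls"), state)
    step!(state, i)
end

# After a hang, the last snapshot can be reloaded in a fresh session.
restored = deserialize(joinpath(snapdir, "before_pose_3.jls"))
@show sort(collect(keys(restored)))
```

Saving before the step rather than after it means the file on disk is always the input to the solve that stalled, which is what is needed to try reproducing it elsewhere.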
The second run has already got past pose 50, which is as far as the first one got, so my suspicion that this is not related to a particular graph seems to be true.
Not using my sysimage did not help either; the timeout/hangup also occurs there.