
Running Rome in multiple julia instances freezes solving

Open lemauee opened this issue 3 years ago • 73 comments

Hi,

This is an issue very specific to experiments for my thesis/application:

As I want to perform a lot of individual solves for my thesis, I run multiple Julia instances (each using multiple workers and threads) on a pretty beefy machine. This seems to be faster than running them sequentially, because parallelism can't be exploited everywhere. It apparently worked fine until the RoME v0.13 release (I think). Now the solver just "hangs up" in all but one instance (at a different pose every time I try).

This is the last thing that gets written to stdout (aka my logs ;)):

Solve Progress: approx max 1752, at iter 711 	 Time: 0:01:50
[ Info: CSM-5 Clique 47 finished

Solve Progress: approx max 1752, at iter 716 	 Time: 0:01:51

I run the multiple Julia instances each in their own screen session, like

screen -S <sessionname>

And then fire up my evaluation script, which reads my dataset from a .mat file and takes some other arguments (the command gets auto-generated by my MATLAB frontend)

julia -t auto -p 8 --project=Masterarbeit/svn/julia -J ~/.julia/sysimage_RoME.so -- Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl --path tmp --file sigmaTrajTf-0-001-0-000-0-005_sigmaLmBearingAndRanging-0-004-0-001_wrongRatio-1-000_minDist-0-200_maxDist-2-000.mat --trajKeys tf --lmKeys bearingAndRanging --startPoseIdx 1 --endPoseIdx 150 --startPoseVal "[7.15793412168699;3.38794983135837;-1.36840501375598]" --plotSaveFinal 1 --plotSaveIter 1 --nRuns 1 --suffix variables --useMsgLikelihoods 0 --nullhypo 0.000000 --tukey 15.000000 --nKernels 100 --spreadNH 3.000000 --inflation 3.000000

I will resort to running things sequentially (as any online application would), but especially for tuning parameters like spreadNH or inflation, running things in parallel was very useful, as I don't have access to an infinite number of machines ;) As multiprocess/multithread performance increased in the last releases, I suspect that the instances get in each other's way somewhere (maybe not even directly in RoME, but in Julia in general), and some lock does not get released.

Best, Leo

lemauee avatar Feb 25 '21 10:02 lemauee

Hi @lemauee ,

Oops, that is odd. This might be something upstream. We'd have to look a bit more at what's happening. In the meantime, perhaps we can help compute results on local servers. Do you perhaps want to send tar.gz files of the graphs you want solved, so I can run them on compute power here? I will return the results in a new tar.gz. Perhaps in the process we start getting the same error here. I'm quite surprised a multi-Julia issue might be happening. If it is not in our packages, then perhaps something in Distributed.jl. Either way, we'd likely have to fix it locally in the short term.

You are probably doing this already, but just in case: you can generate the graphs with fg = initfg(); getSolverParams(fg).graphinit=false, and then graph generation should be pretty quick. The later solve can then use either the graphinit method (might not follow the exact sequence) or the treeinit method (our desired but still experimental one).

Lastly, I just merged (what I expect) the fix for your issue #378. Will tag IIF v0.21.3 in a few hours.

cc: it might be worth @jim-hill-r (hi!) looking out for something like this multiple-Julia-instances issue too.

Best, Dehann

dehann avatar Feb 25 '21 11:02 dehann

Hi @dehann,

I don't know if sending you the tar.gz would help to reproduce the issue as expected, as I'm still doing a solve for every pose (I know that this is slow, but consistency with the online approach we use in classical parametric factor graphs was important to us).

However, I'd happily send you the Julia project needed to run my stuff, the .mat file and the commands to invoke the solve. Otherwise I could also set things up through SSH on your server and then see if I can reproduce it there.

Best, Leo

lemauee avatar Feb 25 '21 11:02 lemauee

The same hangup just happened when running only one Julia instance, so this might not even be multi-instance-related (although yesterday's test suggested so...). Also, it occurred on one of my private PCs, not one of the university's. I recently did a fresh Ubuntu install there, so nothing strange in the environment should interfere. Maybe it's also worth trying without a precompiled sysimage, as I once had problems with RoMEPlotting, as seen in JuliaRobotics/RoMEPlotting.jl#140, as you know.

Best, Leo

lemauee avatar Feb 25 '21 12:02 lemauee

What's also notable is that absolutely nothing is being executed when the hangup occurs: no CPU load can be seen in top or similar tools.

lemauee avatar Feb 25 '21 12:02 lemauee

Just a sanity check question, did you update the precompiled sysimage with the latest tags?

Other than that, do you know what clique is stalling? One way to find out is to draw the tree (drawtree=true, showtree=true) and see at what clique it is waiting on. Then you can try turning on debug messages for that clique CSM, say for 2: ENV["JULIA_DEBUG"] = :csm_2 And post results here please.
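As a rough illustration of how JULIA_DEBUG selects debug output, here is a minimal sketch using only Julia's standard Logging library. It assumes the clique state machine tags its debug messages with a log group such as :csm_2; that naming is inferred from the setting above and not confirmed here, and the messages themselves are made up.

```julia
using Logging

# Enable debug output only for the (assumed) log group :csm_2.
ENV["JULIA_DEBUG"] = "csm_2"

# Log into a buffer so the effect can be inspected; the logger's own
# minimum level stays at Info, JULIA_DEBUG overrides it per group.
io = IOBuffer()
with_logger(ConsoleLogger(io, Logging.Info)) do
    @debug "clique 2 state change" _group = :csm_2   # group is enabled
    @debug "clique 3 state change" _group = :csm_3   # group is not enabled
end

output = String(take!(io))
print(output)
```

Only the message tagged with the enabled group should reach the buffer, which is why narrowing JULIA_DEBUG to one clique keeps the log readable.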

Affie avatar Feb 25 '21 12:02 Affie

I just compiled the sysimage this morning using RoME v0.13.0 and IIF v0.21.2. The problem is that, I think, it stopped on a different clique every time, but I still need to confirm that. Some examples that differed only in the amount of noise, or even just in the spreadNH setting, ran through completely, while the others "made it differently far". I just started a run without the compiled sysimage to rule out that source of failure. To draw the tree I'll need a machine with x-server to start some pdf viewer, right?

lemauee avatar Feb 25 '21 12:02 lemauee

To draw the tree I'll need a machine with x-server to start some pdf viewer, right?

You should be able to set drawtree=true and then open the tree over SSH, I haven't done it before, so just guessing.

Edit: The tree should be at the logger file location or /tmp/caesar

Affie avatar Feb 25 '21 12:02 Affie

You could also try setting a timeout and then seeing if there is some more information on where it timed out, eg: solveTree!(fg, timeout=70)

Affie avatar Feb 25 '21 12:02 Affie

timeout is in seconds, I suppose? Then I should give it half an hour (1800s) or so for my example to be safe, I'd say ;) I will start an example with and without the sysimage, including drawtree, showtree and timeout, now and report back.

lemauee avatar Feb 25 '21 12:02 lemauee

Seconds, yes. It is per step for every clique individually, so should be fine lower.

Affie avatar Feb 25 '21 13:02 Affie

So let's make it 5 minutes (300s), to be safe and not kill something prematurely.

lemauee avatar Feb 25 '21 13:02 lemauee

There are a few tools that we use to debug the CSM, you can also try this one: https://github.com/JuliaRobotics/IncrementalInference.jl/issues/443#issuecomment-699843926

Affie avatar Feb 25 '21 13:02 Affie

Just got an error when trying to run with drawtree, showtree and timeout:

ERROR: LoadError: MethodError: no method matching solveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree; multithread=true, drawtree=true, showtree=true, timeout=300)
Closest candidates are:
  solveTree!(::AbstractDFG, ::AbstractBayesTree; timeout, storeOld, verbose, verbosefid, delaycliqs, recordcliqs, limititercliqs, injectDelayBefore, skipcliqids, eliminationOrder, variableOrder, eliminationConstraints, variableConstraints, smtasks, dotreedraw, runtaskmonitor, algorithm, multithread) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:258 got unsupported keyword arguments "drawtree", "showtree"
  solveTree!(::AbstractDFG) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:258 got unsupported keyword arguments "multithread", "drawtree", "showtree", "timeout"
Stacktrace:
 [1] kwerr(::NamedTuple{(:multithread, :drawtree, :showtree, :timeout),Tuple{Bool,Bool,Bool,Int64}}, ::Function, ::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree) at ./error.jl:157

lemauee avatar Feb 25 '21 13:02 lemauee

Sorry should have been clearer:

getSolverParams(fg).drawtree = true
getSolverParams(fg).showtree = true

Affie avatar Feb 25 '21 13:02 Affie

There are a few tools that we use to debug the CSM, you can also try this one: JuliaRobotics/IncrementalInference.jl#443 (comment)

This will be difficult if Julia "hangs up", right? I don't know how to do the fetching step after the REPL is frozen (I would even need to find a way to run my script in the REPL with command-line arguments first).

lemauee avatar Feb 25 '21 13:02 lemauee

If the timeout is triggered it should stop everything and continue with the next lines.

Affie avatar Feb 25 '21 13:02 Affie

How hard are the performance hits of enabling this? Is there a good way to write this debugging info to file to include it into my normal workflow? Or how to trigger this best on the timeout kicking in?

lemauee avatar Feb 25 '21 13:02 lemauee

I don't think you want it in normal operation. You can remove the recordcliqs=ls(fg) or just record a few if you know where the problem is.

Affie avatar Feb 25 '21 13:02 Affie

You can add this without a performance hit: https://github.com/JuliaRobotics/IncrementalInference.jl/issues/443#issuecomment-785895964

Affie avatar Feb 25 '21 13:02 Affie

Opening the bt.pdf (using evince) to see the solve progress currently fails on my machine:

The file type Graphviz DOT graph (text/vnd.graphviz) is not supported

Seems like it's not a pdf but a dot file.

lemauee avatar Feb 25 '21 13:02 lemauee

The dot file using the dot viewer seems to be working though :)

lemauee avatar Feb 25 '21 13:02 lemauee

Also updates fine through sshfs

lemauee avatar Feb 25 '21 13:02 lemauee

You need graphviz and xdot, https://juliarobotics.org/Caesar.jl/latest/installation_environment/#Local-Dependencies

Affie avatar Feb 25 '21 13:02 Affie

It's clearly reproducible when running only one Julia instance; I suspect using more workers makes it appear more often. Tree: (screenshot from 2021-02-25 15-46-11)

cliq4_stacktrace:

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Channel{Any}) at ./task.jl:294
 [6] macro expansion at ./task.jl:333 [inlined]
 [7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
 [8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
 [9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
 [10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

cliq4_csm.txt is empty

cliq14_stacktrace:

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_unbuffered(::Channel{LikelihoodMessage}) at ./channels.jl:405
 [5] take!(::Channel{LikelihoodMessage}) at ./channels.jl:381
 [6] takeBeliefMessageDown!(::MetaBayesTree, ::LightGraphs.SimpleGraphs.SimpleEdge{Int64}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/TreeMessageAccessors.jl:115
 [7] waitForDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:467
 [8] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [9] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [10] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [11] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

cliq14_csm.txt is empty

cliq29_stacktrace:

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{ReentrantLock}) at ./condition.jl:106
 [4] take_unbuffered(::Channel{LikelihoodMessage}) at ./channels.jl:405
 [5] take!(::Channel{LikelihoodMessage}) at ./channels.jl:381
 [6] takeBeliefMessageDown!(::MetaBayesTree, ::LightGraphs.SimpleGraphs.SimpleEdge{Int64}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/TreeMessageAccessors.jl:115
 [7] waitForDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:467
 [8] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [9] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [10] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [11] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

cliq29_csm.txt is empty

last lines of stdout:

f3b00c5b/etc/fonts/conf.d/80-delicious.conf", line 6: invalid attribute 'version'
Fontconfig warning: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 4: unknown element "its:rules"
Fontconfig warning: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: unknown element "its:translateRule"
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: invalid attribute 'translate'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 5: invalid attribute 'selector'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 6: invalid attribute 'xmlns:its'
Fontconfig error: "/home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/conf.d/90-synthetic.conf", line 6: invalid attribute 'version'
Fontconfig error: Cannot load config file from /home/lemau/.julia/artifacts/69ab5e1318fa87cac480350ccc9faffff3b00c5b/etc/fonts/fonts.conf
┌ Error: Task 14 failed, sending error to all cliques
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:183
[ Info: All cliques should have exited
┌ Error: Task 29 failed, sending error to all cliques
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:183
[ Info: All cliques should have exited

InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Channel{Any}) at ./task.jl:294
 [6] macro expansion at ./task.jl:333 [inlined]
 [7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
 [8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
 [9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
 [10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169
┌ Warning: printCliqHistorySummary -- No CSM history found.
└ @ IncrementalInference ~/.julia/packages/IncrementalInference/nmcd8/src/TreeDebugTools.jl:211
ERROR: LoadError: TaskFailedException:
InterruptException:
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled)) at ./task.jl:656
 [2] wait at ./task.jl:713 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Channel{Any}) at ./task.jl:294
 [6] macro expansion at ./task.jl:333 [inlined]
 [7] solveCliqDownFrontalProducts!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::IncrementalInference.TreeClique, ::SolverParams, ::Base.CoreLogging.SimpleLogger; MCIters::Int64) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:491
 [8] solveCliqDownFrontalProducts! at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqStateMachineUtils.jl:469 [inlined]
 [9] solveDown_StateMachine(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:654
 [10] (::StateMachine{CliqStateMachineContainer})(::CliqStateMachineContainer{BayesTreeNodeData,LightDFG{SolverParams,DFGVariable,DFGFactor},LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree}, ::Int64; pollinterval::Float64, breakafter::Function, verbose::Bool, verbosefid::Base.TTY, verboseXtra::IncrementalInference.CliqStatus, iterlimit::Int64, injectDelayBefore::Nothing, recordhistory::Bool, housekeeping_cb::IncrementalInference.var"#382#384"{IncrementalInference.TreeClique}) at /home/lemau/.julia/packages/FunctionalStateMachine/2JZFG/src/StateMachine.jl:94
 [11] initStartCliqStateMachine!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::IncrementalInference.TreeClique, ::Int64; oldcliqdata::BayesTreeNodeData, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, show::Bool, incremental::Bool, limititers::Int64, upsolve::Bool, downsolve::Bool, recordhistory::Bool, delay::Bool, logger::Base.CoreLogging.SimpleLogger, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/CliqueStateMachine.jl:63
 [12] tryCliqStateMachineSolve!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64, ::Int64; oldtree::MetaBayesTree, verbose::Bool, verbosefid::Base.TTY, drawtree::Bool, limititers::Int64, downsolve::Bool, incremental::Bool, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, solve_progressbar::ProgressMeter.ProgressUnknown, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:110
 [13] (::IncrementalInference.var"#438#441"{MetaBayesTree,Bool,Bool,Base.TTY,Int64,Bool,Bool,Array{Symbol,1},Array{Symbol,1},Symbol,LightDFG{SolverParams,DFGVariable,DFGFactor},MetaBayesTree,Int64,ProgressMeter.ProgressUnknown,Int64})() at ./threadingconstructs.jl:169

...and 2 more exception(s).

Stacktrace:
 [1] sync_end(::Channel{Any}) at ./task.jl:314
 [2] macro expansion at ./task.jl:333 [inlined]
 [3] taskSolveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree, ::Int64; oldtree::MetaBayesTree, drawtree::Bool, verbose::Bool, verbosefid::Base.TTY, limititers::Int64, limititercliqs::Array{Pair{Symbol,Int64},1}, downsolve::Bool, incremental::Bool, multithread::Bool, skipcliqids::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, delaycliqs::Array{Symbol,1}, smtasks::Array{Task,1}, algorithm::Symbol) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:49
 [4] solveTree!(::LightDFG{SolverParams,DFGVariable,DFGFactor}, ::MetaBayesTree; timeout::Int64, storeOld::Bool, verbose::Bool, verbosefid::Base.TTY, delaycliqs::Array{Symbol,1}, recordcliqs::Array{Symbol,1}, limititercliqs::Array{Pair{Symbol,Int64},1}, injectDelayBefore::Nothing, skipcliqids::Array{Symbol,1}, eliminationOrder::Nothing, variableOrder::Nothing, eliminationConstraints::Array{Symbol,1}, variableConstraints::Nothing, smtasks::Array{Task,1}, dotreedraw::Array{Int64,1}, runtaskmonitor::Bool, algorithm::Symbol, multithread::Bool) at /home/lemau/.julia/packages/IncrementalInference/nmcd8/src/SolverAPI.jl:371
 [5] macro expansion at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:163 [inlined]
 [6] macro expansion at ./timing.jl:233 [inlined]
 [7] top-level scope at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:159
 [8] include(::Function, ::Module, ::String) at ./Base.jl:380
 [9] include(::Module, ::String) at ./Base.jl:368
 [10] exec_options(::Base.JLOptions) at ./client.jl:296
 [11] _start() at ./client.jl:506
in expression starting at /home/lemau/Masterarbeit/svn/julia/mmiSAM/evaluation/solve2DIncremental.jl:128
lemau@joule:~$ 

I will run it again and see if it fails at a different spot.

lemauee avatar Feb 25 '21 14:02 lemauee

Can you upload the saved factor graph for the above example and I'll see if I can reproduce it.

Affie avatar Feb 25 '21 15:02 Affie

It is clique 4 that is timing out on the down solve. We haven't seen something like this before.
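For context, the traces for cliques 14 and 29 above end in take! on an unbuffered Channel, which would also fit the earlier observation that there is no CPU load during the hang. Below is a minimal sketch of that waiting pattern in plain Julia; it is not the solver's code, and the consumer task and the value 42 are hypothetical stand-ins for a clique waiting on a down message.

```julia
# Unbuffered channel: take! blocks until another task calls put!.
ch = Channel{Int}(0)

# Hypothetical consumer, standing in for a clique waiting on its parent.
consumer = @async take!(ch)

# Poll for up to one second. The consumer is parked on the channel's
# condition, so it uses no CPU while it waits.
status = timedwait(() -> istaskdone(consumer), 1.0)
println("after 1 s: ", status)

# As soon as a producer sends a message, the waiting task wakes up.
put!(ch, 42)
println("received: ", fetch(consumer))
```

If the put! never happens, for example because the sending task is itself stuck, the consumer stays blocked forever and only a timeout or an interrupt ends the wait.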

Affie avatar Feb 25 '21 15:02 Affie

Unfortunately as I ran this as a script and not from the REPL (as I would have to look into how to supply all the command line arguments to my script in the REPL) I only have the "pose-iteration" before the failure. I can inform myself (or do you know an easy way how to do this) and do this in a later run and save it out, but I'd wait for the repetition I run now (should be finished in 1-2 hours if it breaks at the same spot) to see if it breaks at the same spot.

lemauee avatar Feb 25 '21 15:02 lemauee

Another thing you could try is to turn on debug, I think it doesn't work well with multithread=true yet, so perhaps switch it off.

getSolverParams(fg).dbg=true

It's also possible that multithread=true is causing your problem.

Affie avatar Feb 25 '21 15:02 Affie

Unfortunately as I ran this as a script and not from the REPL (as I would have to look into how to supply all the command line arguments to my script in the REPL) I only have the "pose-iteration" before the failure. I can inform myself (or do you know an easy way how to do this) and do this in a later run and save it out, but I'd wait for the repetition I run now (should be finished in 1-2 hours if it breaks at the same spot) to see if it breaks at the same spot.

Perhaps save the fg before every solve in the script (if I'm understanding you correctly?)

Affie avatar Feb 25 '21 15:02 Affie

The second run already got past pose 50 it reached in the first one, so my suspicion that this is not related to a particular graph seems true.

Not using my sysimage also did not help, timeout/hangup also occur there.

lemauee avatar Feb 25 '21 16:02 lemauee