patRoon
Parallelism and timeout errors
patRoon 1.2.0
As part of my analysis, I was getting random errors while executing this line:
formulas <- generateFormulas(fGroups, "genform", mslists, relMzDev = 5,
adduct = adduct, elements = "CHNOClBrSP",
calculateFeatures = FALSE, featThreshold = 0.75)
and this line:
compounds <- generateCompounds(fGroups, mslists, "metfrag", method = "CL",
                               dbRelMzDev = 5, fragRelMzDev = 5,
                               fragAbsMzDev = 0.002, adduct = "[M+H]+",
                               database = "csv",
                               extraOpts = list(LocalDatabasePath = myLocalDatabasePath),
                               scoreTypes = c("fragScore", "metFusionScore", "score",
                                              "individualMoNAScore"),
                               maxCandidatesToStop = 100)
The progress bars would stop at a random point; rerunning the experiment would sometimes work fine and other times crash with a NULL message. The crashes were much more frequent when running on an HPC with many cores. After setting patRoon.MP.maxProcs = 1 the random behavior was gone, but I would still get some crashes. I only managed to fix those by noticing that the genform binary would sometimes take too long (fixed with timeout = 5000) and that MetFrag would also take too long (fixed with errorRetries = 200, timeoutRetries = 20).
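Putting the workarounds together, my setup now looks roughly like this (a sketch: the timeout/retry parameters are the ones mentioned above; check the patRoon documentation for their exact semantics and units in your version):

```r
# Disable patRoon's multiprocessing: run external tools one at a time.
options(patRoon.MP.maxProcs = 1)

# Allow GenForm runs more time before they are considered stuck.
formulas <- generateFormulas(fGroups, "genform", mslists, relMzDev = 5,
                             adduct = adduct, elements = "CHNOClBrSP",
                             calculateFeatures = FALSE, featThreshold = 0.75,
                             timeout = 5000)

# Retry MetFrag runs that error out or time out instead of aborting.
compounds <- generateCompounds(fGroups, mslists, "metfrag", method = "CL",
                               dbRelMzDev = 5, fragRelMzDev = 5,
                               fragAbsMzDev = 0.002, adduct = "[M+H]+",
                               database = "csv",
                               extraOpts = list(LocalDatabasePath = myLocalDatabasePath),
                               scoreTypes = c("fragScore", "metFusionScore", "score",
                                              "individualMoNAScore"),
                               maxCandidatesToStop = 100,
                               errorRetries = 200, timeoutRetries = 20)
```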
In general I think there is some bug in the implementation of the parallelism, but a more pressing issue would be to have much better error messages when external commands fail; in the current state it is really hard to find out what the error was. Is it documented anywhere how to debug such issues?
Hello,
This will probably need a bit of investigation to see what is happening. To start, first a few questions back:
- Did you look at the log files? This is the mechanism used to report any errors with parallel workflow steps. The files are generated in the log directory inside the current working directory.
- I have seen random failures on CI, usually as a result of running out of memory. I guess you are limited in the amount of RAM you can use? This could explain why disabling parallelization improves things.
- I see that you are using patRoon 1.2.0? Perhaps you could try with the latest version (2.0)?
- I also notice that you are using a custom MetFrag database? It is possible that MetFrag is struggling here, but you would hopefully find that back in the log files. One thing you could try is to see if the situation improves with e.g. the PubChemLite library.
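As a concrete starting point for the first question, the per-step logs can be listed and inspected directly from R (a sketch in base R; the exact directory layout under log may differ between patRoon versions):

```r
# patRoon writes logs under the "log" directory in the current working
# directory, with subdirectories per workflow step (e.g. formulas, compounds).
logFiles <- list.files("log", recursive = TRUE, full.names = TRUE)
print(logFiles)

# Dump the contents of any non-empty log, which should contain the
# stdout/stderr of the corresponding external tool run.
for (f in logFiles) {
    if (file.size(f) > 0)
        writeLines(c(paste0("==== ", f, " ===="), readLines(f, warn = FALSE)))
}
```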
Thanks, Rick
Thanks for your prompt reply:
For now I cannot test patRoon 2.0 as we started this project with 1.2, but in the future I can try that. For neither the generateFormulas step nor the generateCompounds step could I see any log file, as the only info shown in stdout was NULL, as if a crash happened in an unexpected place. I think I had more than enough RAM (100 GB). In our experiments we are trying several databases, which is why we need custom ones, although I could try with PubChemLite and see if the same error appears.
Ah yes, 100 GB is definitely quite a lot. Although, if I remember correctly, every MetFrag process can take ~1-2 GB of RAM, so if your HPC has many cores then the default maxProcs setting, which equals the core count, can still eat up quite a bit. I would be curious whether lower values for maxProcs (e.g. <10-20) would also avoid the random errors you are seeing.
Furthermore, are all the log files empty, or just the ones for which the tool execution failed? Note that it should report both stdout and stderr, so I am a bit surprised it is completely empty ... :-( If you cannot pinpoint which input fails then you could fall back to running subsets of data, e.g.
# only process first 50 fGroups
compounds <- generateCompounds(fGroups[, 1:50], MSPeakLists = mslists, "metfrag", ...)
Hopefully this allows you to find where it goes wrong, and ideally I could then try to reproduce it locally.
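To automate that subsetting, one could process the feature groups in chunks and record which chunk fails (a hypothetical sketch; the chunk size and tryCatch wrapper are my additions, not part of the patRoon API; note that a hard crash of the R process itself will still abort the loop, but the last chunk announced before the crash tells you where it died):

```r
chunkSize <- 50
starts <- seq(1, length(fGroups), by = chunkSize)

for (s in starts) {
    idx <- s:min(s + chunkSize - 1, length(fGroups))
    message("processing fGroups ", s, "-", max(idx))
    res <- tryCatch(
        generateCompounds(fGroups[, idx], MSPeakLists = mslists, "metfrag"),
        error = function(e) e
    )
    if (inherits(res, "error"))
        message("chunk ", s, "-", max(idx), " failed: ", conditionMessage(res))
}
```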
PS: I will be away until April after tomorrow, so it might take me some more time to reply...
I did some work to make the problems I found reproducible, so I hope this time they are. Now I am using publicly available data and the latest versions of every tool (especially patRoon 2.0.1).
You can find all the details in the repo gilcu3/patRoonTests. I am running that code on my university's HPC, which has 128 GB of RAM and 128 cores.
In the folder results you can find all logs. I have run it three times, and all three times it crashes while computing the formulas. I am not using the cache, so it always fails in more or less the same place, but not exactly the same, which indicates some kind of randomness. When I used the cache I could get past that step and then obtained similar errors in the MetFrag step, but to not overload the issue, I will focus on the genform errors for now. Basically, the error obtained is double free or corruption (fasttop) and no log explains why. Let me know if you need additional data.
Hi @gilcu3,
The errors you are seeing indicate that R itself is crashing, which could be related to faulty memory handling or perhaps a race condition somewhere. I would still be interested to see you re-run your tests with a lower value for patRoon.MP.maxProcs (e.g. 4, 8, 16). In theory, with the default setting, patRoon could launch 128 processes simultaneously, which would leave only ~1 GB at most for each process, which could be tight.
Unfortunately, I couldn't find much time to test myself, and I don't have access to a machine with as many cores as you do. However, your test repo is definitely appreciated. So far I couldn't reproduce the errors you see, but I would like to investigate more...
Thanks, Rick
Hi @rickhelmus,
I just tried with 16 processes, and it didn't crash in generateFormulas, but it is running forever there. I am not passing the timeout parameter, so it should use the default, but it seems it is not. Is this the correct behavior in version 2.0.1?
On the other hand, I still think there may be a problem with the parallelism. With 128 cores there is a much higher likelihood of hitting a race condition, and that would be harder to reproduce using only 16. I also checked the slurm log for some of the previous tests, and the maximum amount of memory used was just ~4 GB.
JobID JobName ExitCode Group MaxRSS Comment Partition NNodes NCPUS
------------ ---------- -------- --------- ---------- -------------- ---------- -------- ----------
198131 script.R 0:0 clusteru+ batch 1 128
198131.batch batch 0:0 45113436K 1 128
198131.exte+ extern 0:0 0 1 128
If I can help you reproduce this problem in any other way, let me know. Thanks