
parallelism and timeout errors

Open gilcu3 opened this issue 2 years ago • 6 comments

patRoon 1.2.0

As part of my analysis, I was getting random errors while executing this line:

formulas <- generateFormulas(fGroups, "genform", mslists, relMzDev = 5,
                             adduct = adduct, elements = "CHNOClBrSP",
                             calculateFeatures = FALSE, featThreshold = 0.75)

and this line:

compounds <- generateCompounds(fGroups, mslists, "metfrag", method = "CL",
                               dbRelMzDev = 5, fragRelMzDev = 5,
                               fragAbsMzDev = 0.002, adduct = "[M+H]+",
                               database = "csv",
                               extraOpts = list(LocalDatabasePath = myLocalDatabasePath),
                               scoreTypes = c("fragScore", "metFusionScore", "score",
                                              "individualMoNAScore"),
                               maxCandidatesToStop = 100)

The progress bars would stop at a random point, and rerunning the experiment would sometimes work fine and other times crash with a NULL message. The crashes were much more frequent when running on an HPC with many cores. After setting patRoon.MP.maxProcs = 1 the random behavior was gone, but I would still get some crashes. I only managed to fix those by noticing that the genform binary would sometimes take too much time (fixed with timeout = 5000), and that MetFrag would also take too much time (fixed with errorRetries = 200, timeoutRetries = 20).
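
For completeness, the combination that eventually made the runs stable looked roughly like this (a sketch only; the unchanged arguments from the calls above are elided with ...):

# run the external tools one at a time instead of one process per core
options(patRoon.MP.maxProcs = 1)

# genform: allow each command far more time before it is killed
formulas <- generateFormulas(fGroups, "genform", mslists, ..., timeout = 5000)

# metfrag: retry failing or timed-out commands many more times
compounds <- generateCompounds(fGroups, mslists, "metfrag", ...,
                               errorRetries = 200, timeoutRetries = 20)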

In general I think there is some bug in the parallelism implementation, but a more pressing issue is the error messages when external commands fail: in the current state it is really hard to find out what the error was. Is it documented anywhere how to debug such issues?

gilcu3 commented Mar 03 '22 08:03

Hello,

This will probably need a bit of investigation to see what is happening. To start, a few questions back:

  • Did you look at the log files? This is the mechanism used to report any errors with parallel workflow steps. The files are generated in the log directory inside the current working directory (a quick way to inspect them from R is sketched after this list).
  • I have seen random failures on CI, usually as a result of running out of memory. I guess you are limited in the amount of RAM you can use? This could explain why disabling parallelization improves things.
  • I see that you are using patRoon 1.2.0? Perhaps you could try the latest version (2.0)?
  • I also notice that you are using a custom MetFrag database? It is possible that MetFrag is struggling here, but you would hopefully find that back in the log files. One thing you could try is to see if the situation improves with e.g. the PubChemLite library.
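
As a quick way to check the first point from R (plain base R; the sub-directory layout inside log depends on which workflow step wrote the files):

# list all log files written by parallel workflow steps
logFiles <- list.files("log", recursive = TRUE, full.names = TRUE)
print(logFiles)

# print the contents of the most recently modified log file
if (length(logFiles) > 0)
    cat(readLines(logFiles[which.max(file.mtime(logFiles))]), sep = "\n")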

Thanks, Rick

rickhelmus commented Mar 03 '22 13:03

Thanks for your prompt reply. For now I cannot test patRoon 2.0 as we started this project with 1.2, but I can try that in the future. I could not see any log file for either the generateFormulas step or the generateCompounds step; the only info shown in stdout was NULL, as if a crash happened in an unexpected place. I think I had more than enough RAM (100 GB). In our experiments we are trying several databases, which is why we need custom ones, although I could try with PubChemLite and see if the same error appears.

gilcu3 commented Mar 03 '22 13:03

Ah yes, 100 GB is definitely quite a lot. Although, if I remember correctly, every MetFrag process can take ~1-2 GB RAM, so if your HPC has many cores then the default maxProcs setting, which equals the number of cores, can still eat up quite a bit. I would be curious whether lower values for maxProcs (e.g. <10-20) also avoid the random errors you are seeing.
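
For reference, limiting the number of parallel processes is just a matter of setting the option before the generateFormulas()/generateCompounds() calls, for example:

# cap the number of simultaneously running GenForm/MetFrag commands
options(patRoon.MP.maxProcs = 8)  # try e.g. 4, 8 or 16 instead of all available cores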

Furthermore, are all the log files empty or just the ones for which the tool execution failed? Note that it should report both stdout and stderr, so I am a bit surprised it is completely empty ... :-( If you cannot pinpoint which input fails then you could fall back to running subsets of data, e.g.

# only process first 50 fGroups
compounds <- generateCompounds(fGroups[, 1:50], MSPeakLists = mslists, "metfrag", ...)

Hopefully this allows you to find where it goes wrong, and ideally I could then try to reproduce it locally.

PS: From tomorrow I will be away until April, so it might take me some more time to reply...

rickhelmus commented Mar 08 '22 15:03

I did some work to make the problems I found reproducible, so I hope this time they are. Now I am using publicly available data and the latest versions of every tool (especially patRoon 2.0.1).

You can find all the details in the repo gilcu3/patRoonTests. I am running that code on the HPC of my university, which has 128 GB RAM and 128 cores.

In the results folder you can find all logs. I have run it three times, and all three times it crashes while computing the formulas. I am not using the cache, so it always fails in more or less the same part, but not exactly the same place, which indicates some kind of randomness. When I used the cache I could get past that step and then obtained similar errors in the MetFrag step, but to not overload the issue I will focus on the genform errors for now. Basically the error obtained is double free or corruption (fasttop), and no log explains why. Let me know if you need additional data.
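
For clarity, "not using the cache" means every run starts completely fresh. A sketch of one way to achieve that is below; the patRoon.cache.mode option name is my assumption here and not something discussed in this thread, and removing the cache file (cache.sqlite by default, if I recall correctly) before each run has the same effect:

# never load from or save to the cache database, so each run recomputes everything
options(patRoon.cache.mode = "none")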

gilcu3 commented Mar 16 '22 14:03

Hi @gilcu3,

The errors you are seeing indicate that R itself is crashing, which could be related to faulty memory handling or perhaps a race condition somewhere. I would still be interested if you could re-run your tests with a lower value for patRoon.MP.maxProcs (e.g. 4, 8, 16). In theory, with the default setting, patRoon could launch 128 processes simultaneously, leaving at most ~1 GB for each process, which could be tight.

Unfortunately, I couldn't find much time to test myself and I don't have access to a machine with as many cores as you do. However, your test repo is definitely appreciated. So far I couldn't reproduce the errors you see, but I would like to investigate more...

Thanks, Rick

rickhelmus commented Apr 21 '22 10:04

Hi @rickhelmus, I just tried with 16 processes, and it didn't crash in generateFormulas, but it has been running forever there. I am not using the timeout parameter, so it should use the default, but it seems it is not being applied. Is this the correct behavior in version 2.0.1?
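
To be explicit about what I expect, forcing it manually would look roughly like this (same timeout argument and value as in my earlier workaround; the remaining arguments are unchanged and elided):

formulas <- generateFormulas(fGroups, "genform", mslists, ...,
                             timeout = 5000)  # explicit value instead of relying on the default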

On the other hand, I still think there may be a problem with the parallelism. With 128 cores there is a much higher likelihood of hitting a race condition, and that would be harder to reproduce using only 16. I also checked the slurm log for some of the previous tests, and the maximum amount of memory used was only ~45 GB:

       JobID    JobName ExitCode     Group     MaxRSS        Comment  Partition   NNodes      NCPUS 
------------ ---------- -------- --------- ---------- -------------- ---------- -------- ---------- 
198131         script.R      0:0 clusteru+                                batch        1        128 
198131.batch      batch      0:0            45113436K                                  1        128 
198131.exte+     extern      0:0                    0                                  1        128

If I can help you reproduce this problem in any other way, let me know. Thanks

gilcu3 commented Apr 21 '22 18:04