
FCS-GX bypasses memory limits set in the bsub command on the LSF platform

Open eeaunin opened this issue 9 months ago • 1 comments

Hello. This issue is a follow-up to issue #69. When running FCS-GX, the LSF logs always underreport how much memory FCS-GX uses. Below is an example of an LSF log from an FCS-GX run on a tiny FASTA file; the run completed successfully.

Successfully completed.

Resource usage summary:

    CPU time :                                   510.08 sec.
    Max Memory :                                 49 MB
    Average Memory :                             42.99 MB
    Total Requested Memory :                     512000.00 MB
    Delta Memory :                               511951.00 MB
    Max Swap :                                   -
    Max Processes :                              16
    Max Threads :                                33
    Run time :                                   518 sec.
    Turnaround time :                            520 sec.

The output (if any) is above this job summary.

The log reports a maximum memory use of 49 MB, but I don't think this is accurate: FCS-GX actually uses at least 470 GB of memory per run. When submitting LSF jobs, a memory limit is set in the bsub command, e.g. bsub -n1 -R"span[hosts=1]" -M5000 -R 'select[mem>5000] rusage[mem=5000]', and LSF normally terminates jobs that exceed this limit with the message "TERM_MEMLIMIT: job killed after reaching LSF memory usage limit". This does not work with FCS-GX: it ignores the memory limit set in the bsub command and uses ~470 GB of the compute node's memory regardless of whether the job's limit permits it. When this causes the compute node to run out of memory, the Linux kernel's out-of-memory (OOM) killer ends up killing the FCS-GX process. I don't know of any other software that behaves like this on the LSF platform. Do you know what causes this, and is there a way to fix it?
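
For context, this is roughly how I submit the jobs. The bsub flags match the example above; the memory value of 5000 MB and the fcs.py arguments are only placeholders for my real inputs:

    # Illustrative bsub submission with an explicit memory limit (5000 MB here;
    # real FCS-GX runs request far more, e.g. the 512000 MB shown in the log above)
    bsub -n1 -R"span[hosts=1]" -M5000 -R 'select[mem>5000] rusage[mem=5000]' \
         -o fcsgx.%J.out \
         "python3 fcs.py screen genome --fasta input.fasta --out-dir gx_out/ --gx-db /path/to/gxdb --tax-id 9606"

With other tools, exceeding -M5000 results in TERM_MEMLIMIT; with FCS-GX it does not.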

These are the software versions used in my recent runs:

    OS: Ubuntu 22.04.4 LTS
    Singularity: v3.11.4
    FCS image: 0.5.0
    Python: 3.8.12
    Platform: LSF

eeaunin avatar May 09 '24 04:05 eeaunin

The job summary shows correct statistics.

Although GX memory-maps the large .gxi and .gxs files, which need to be in physical memory to avoid thrashing (major page faults), any particular execution of GX accesses only a portion of those files, and which portion depends on the input genome. So only that portion ends up mapped into the GX process's virtual address space, and that is what we see in the LSF job summary.
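
If it helps, you can compare the mapped size with the resident size of the GX process while a screen is running. A rough sketch, assuming the process shows up as gx and that pmap from procps is available on the node:

    # Compare total mapped address space (VmSize) with resident pages (VmRSS)
    GX_PID=$(pgrep -n gx)                  # assumes the binary is named "gx"
    grep -E '^(VmSize|VmRSS):' /proc/"$GX_PID"/status

    # Per-mapping view: how much of each memory-mapped .gxi/.gxs file is resident
    pmap -x "$GX_PID" | grep -E '\.gxi|\.gxs'

VmRSS (and the RSS column from pmap) is the figure the LSF summary appears to track.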

etvedte avatar May 10 '24 18:05 etvedte

Hi @etvedte. However, FCS is then sometimes killed by the kernel's out-of-memory killer, with the logs stating that it was selected because it was using >460 GB of RAM. So there is a discrepancy between the RAM usage reported by VmRSS/LSF and what the kernel sees.

muffato avatar May 14 '24 15:05 muffato

We are not sure exactly how to interpret the discrepancy that LSF is showing. It is possible that it measures the RAM used by the Python runner script rather than the entire process tree. In any case, GX is not aware of, and does not adhere to, memory limits set in LSF or by system policies.
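
One way to check that on the compute node is to sum resident memory over the whole process tree while a run is in progress. A rough sketch, assuming the Python runner is the root of the tree and that pstree and ps are available:

    # Sum RSS (reported in KB by ps) over the runner and all of its descendants,
    # then compare the total with "Max Memory" in the LSF job summary
    RUNNER_PID=$(pgrep -f -n fcs.py)              # assumes fcs.py is the tree root
    TREE_PIDS=$(pstree -p -T "$RUNNER_PID" | grep -o '([0-9]*)' | tr -d '()')
    ps -o rss= -p "$(echo $TREE_PIDS | tr ' ' ',')" \
        | awk '{sum += $1} END {printf "process tree RSS: %.1f MB\n", sum/1024}'

If the tree total is still small, the gap is more likely down to how the memory-mapped database pages are accounted than to missed child processes.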

etvedte avatar May 14 '24 18:05 etvedte

We've added some memory stats printouts that can be triggered by running fcs.py --debug. This might help to clarify any issues you are having.
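
For example (the screen arguments below are placeholders; substitute your usual ones):

    # Run with the extra memory statistics printouts enabled
    python3 fcs.py --debug screen genome \
        --fasta input.fasta \
        --out-dir gx_out/ \
        --gx-db /path/to/gxdb \
        --tax-id 9606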

Hopefully with your adjusted configuration and v0.5.4 you are seeing fewer jobs terminated.

etvedte avatar Jun 26 '24 15:06 etvedte