GARD fails due to MPI setup (?)
Hi!
I just tried to run the pipeline with the `local` and `singularity` profiles, using the test data `bats_mx1_small.fasta`. However, GARD fails, apparently due to some MPI setup issue. I'm not sure whether that should all happen inside the container, or whether I have to configure something myself on the server/cluster?
This is `gard.log`:
```
libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):
Hostname: host
Requested max number of outstanding WRs in the SQ: 1
Requested max number of outstanding WRs in the RQ: 2
Requested max number of SGEs in a WR in the SQ: 1023
Requested max number of SGEs in a WR in the RQ: 1023
Requested max number of data that can be posted inline to the SQ: 0
Error: File exists
Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
Hostname: host
--------------------------------------------------------------------------
libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):
Hostname: host
Requested max number of outstanding WRs in the SQ: 1
Requested max number of outstanding WRs in the RQ: 2
Requested max number of SGEs in a WR in the SQ: 1023
Requested max number of SGEs in a WR in the RQ: 1023
Requested max number of data that can be posted inline to the SQ: 0
Error: File exists
Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
Hostname: host
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: host
Device name: i40iw0
Device vendor ID: 0x8086
Device vendor part ID: 14290
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: host
Local device: i40iw0
Local port: 1
CPCs attempted: udcm
--------------------------------------------------------------------------
[ERROR] This analysis requires an MPI environment to run
[host:1017209] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[host:1017209] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[host:1017209] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
```
Hi @Shellfishgene, thanks for your interest in the pipeline!
Everything should happen inside the container, but it seems there is some issue with the Singularity container version for GARD+MPI. I will try to look into it ASAP.
I guess you have no way to run the Docker profile on your cluster?
No Docker on the cluster, but I can run it on a workstation. It's not urgent anyway... Thanks for having a look!
Getting a similar problem with Singularity, different log:
```
Failed to create a completion queue (CQ):
Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory
Check the CQE attribute.
Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component in this run.
Hostname: endeavour2
Failed to create a completion queue (CQ):
Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory
Check the CQE attribute.
Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component in this run.
Hostname: endeavour2
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.
Local host: endeavour2
Local device: mlx4_0
Local port: 1
CPCs attempted: udcm
[ERROR] This analysis requires an MPI environment to run
[endeavour2.hpc.usc.edu:161337] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
```
Hey @Shellfishgene!
Am I understanding it right that this issue occurred when you were running poseidon on your local machine with the `singularity` profile? Because then I can't seem to recreate it: it runs fine for me with `bats_mx1_small.fasta`.
Did you try to just run the pipeline again, or with the `-resume` flag turned on? Also, are you running the latest release of poseidon?
I figured out what the problem was: I forgot to set the `local` profile in Nextflow and ran it with `-profile singularity --cores 4`. However, that seems to set `${task.cpus}` to 1 for the GARD task, and `mpirun -np 1` causes the error; it needs to be >1. The error message from mpirun is not exactly clear... With `-profile local,singularity` it works.
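For context, the GARD step presumably wraps its HYPHY call in `mpirun` with the Nextflow-provided CPU count, roughly like this (a hypothetical sketch, not the actual `poseidon.nf` code; the process name, inputs, and `hyphy` invocation are assumptions):

```nextflow
process gard {
    // without an execution profile, this can silently fall back to 1
    cpus params.cores

    input:
    path fasta

    script:
    """
    # with task.cpus == 1 this becomes `mpirun -np 1`, and GARD aborts
    # with "[ERROR] This analysis requires an MPI environment to run"
    mpirun -np ${task.cpus} hyphy gard --alignment ${fasta}
    """
}
```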
@Shellfishgene ah great, thanks for letting us know!
So it seems that when no "execution" profile is defined, the default core number defined here: https://github.com/hoelzer/poseidon/blob/master/nextflow.config#L15 is not distributed to the processes.
With `-profile local,singularity` the default value is passed to the GARD process:
https://github.com/hoelzer/poseidon/blob/master/configs/local.config#L14
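The mechanism would look roughly like this (a hedged sketch of how such profile configs are typically wired up, not the literal contents of the linked files; the default value is a placeholder):

```nextflow
// nextflow.config: a default core count exists only as a parameter
params.cores = 4  // hypothetical default

// configs/local.config: only the local execution profile actually maps
// that parameter onto the processes, so without it task.cpus stays at 1
process {
    cpus = params.cores
}
```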
@fischer-hub maybe we can just add a check to `poseidon.nf` that `task.cpus` must be >1?
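A minimal sketch of such a guard, failing fast before any GARD process is submitted (hypothetical; the exact parameter names in `poseidon.nf` may differ):

```nextflow
// abort early with a clear message instead of letting mpirun -np 1
// fail deep inside the GARD process
if ( params.cores.toInteger() < 2 ) {
    exit 1, "GARD runs via MPI and needs more than one CPU; " +
            "set --cores 2 or higher (e.g. -profile local,singularity)."
}
```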
@hoelzer Yes, good idea. I also ran into some other issues with the GARD process when running with `-profile slurm,singularity`; might as well fix all of that together!