
Installing and running through singularity

Open rresendepinto opened this issue 3 years ago • 29 comments

I understand how to install and run this program with docker but how can I install and run the software without using docker on a HPC cluster which uses singularity?

rresendepinto avatar Feb 28 '22 15:02 rresendepinto

Hey @rresendepinto thx for your interest! This should be fairly simple: -profile slurm,singularity in case your HPC runs SLURM.

Please also check the README! When running on an HPC, I also recommend using --cachedir to store the Singularity images and --workdir or -w to store the work directories. It can be dangerous on an HPC to store such stuff in, for example, your home directory w/ limited space.
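A minimal sketch of how such a call could be put together. The scratch paths and input.fasta are hypothetical placeholders, not paths from this thread; the flags are the ones mentioned above. The command is only printed, so the paths can be checked before anything is submitted.

```shell
# Hypothetical scratch locations outside of $HOME; adjust to your cluster.
CACHE_DIR="/scratch/$USER/singularity_cache"
WORK_DIR="/scratch/$USER/poseidon_work"

# Assemble the call with the flags mentioned above (input.fasta is a placeholder).
CMD="nextflow run hoelzer/poseidon --fasta input.fasta -profile slurm,singularity --cachedir $CACHE_DIR --workdir $WORK_DIR"

# Print instead of executing, so the paths can be reviewed first.
echo "$CMD"
```

Once the printed line looks right, it can be pasted into the shell on the node you launch from.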

hoelzer avatar Feb 28 '22 15:02 hoelzer

Thank you for the quick answer! What if the HPC doesn't run SLURM, but SGE instead?

rresendepinto avatar Feb 28 '22 16:02 rresendepinto

Uh, I think SGE is not implemented yet, only SLURM and LSF. However, SGE is supported by nextflow: https://www.nextflow.io/docs/latest/executor.html

We can provide a hot fix and then you can give it a try. Unfortunately, we cannot test this because we have no access to an SGE cluster.

@fischer-hub can you add a -profile sge to the nextflow.config please?

hoelzer avatar Feb 28 '22 16:02 hoelzer

If you could do that, it would help me a lot. Thank you!

rresendepinto avatar Feb 28 '22 16:02 rresendepinto

Hey @rresendepinto ! I just added the SGE profile to the workflow on a new branch sge_profile! If you could checkout the branch and tell me if everything is running as you'd expect that would be great, since we don't have access to any machines running SGE as @hoelzer said.

You can activate the SGE profile by adding -profile sge,singularity, -profile sge,conda or -profile sge,docker to your nextflow call depending on which container engine you would like to use!

fischer-hub avatar Mar 01 '22 07:03 fischer-hub

Thx @fischer-hub for the fast fix!

@rresendepinto you can also pull and run the new code on the branch easily via

nextflow pull hoelzer/poseidon
nextflow run hoelzer/poseidon -r sge_profile ...

hoelzer avatar Mar 01 '22 09:03 hoelzer

Hi! I tried the new version but I think there was a conflict with SGE.

Error executing process > 'check_fasta_format (1)'

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  qsub -terse .command.run

Command exit status:
  1

Command output:
  Unable to run job: denied: host "compute-0-12.local" is not a submit host
  Exiting.

The system is CentOS 7 and the SGE version is 8.1.9.

rresendepinto avatar Mar 03 '22 15:03 rresendepinto

Hi @rresendepinto, sadly I cannot test this, but from your command output I'm wondering whether you are executing the pipeline from a compute node:

Unable to run job: denied: host "compute-0-12.local" is not a submit host

In that case, could you try running the pipeline from a node where the qsub command is available (e.g. the cluster's head node)? The nextflow documentation mentions this for SGE:

Nextflow manages each process as a separate grid job that is submitted to the cluster by using the qsub command.

Being so, the pipeline must be launched from a node where the qsub command is available, that is, in a common usage scenario, the cluster head node.

Nextflow will then submit the individual processes to the compute nodes itself :)
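A quick way to check the launch host before starting the pipeline might look like the sketch below. The helper names have_cmd and check_launch_host are made up for illustration. Note that finding qsub is necessary but not sufficient: in the error above qsub was present, but the host was not registered as a submit host.

```shell
# Return success if the given command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether qsub is available on the current host.
check_launch_host() {
  host="$(uname -n)"
  if have_cmd qsub; then
    echo "qsub found on ${host}"
  else
    echo "qsub not found on ${host}; log in to a submit host first"
    return 1
  fi
}

check_launch_host || true
# On SGE, `qconf -ss` lists the hosts that are allowed to submit jobs;
# the current host name should appear in that list.
```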

fischer-hub avatar Mar 03 '22 15:03 fischer-hub

It just hangs indefinitely on the cluster's head node. I can't figure out why.

rresendepinto avatar Mar 07 '22 14:03 rresendepinto

Hi! Could you post the command you used and maybe the .nextflow.log file? At what point does the pipeline stop running?

fischer-hub avatar Mar 07 '22 14:03 fischer-hub

Command: NXF_JAVA_HOME=/home/rpinto/Documents/bin/jre1.8.0_321 NXF_VER=20.09.0-edge nextflow run hoelzer/poseidon -r sge_profile --fasta poseidon/test_data/bats_mx1_small.fasta --cores 4 -profile sge,singularity >> qsub2.out &

Nextflow log file nextflow_0703.log

This is the output:

Tree root species: NA
Reference species: NA

Use KH-insignificant breakpoints: no

[- ] process > check_fasta_format -
[- ] process > translatorx -
[- ] process > check_aln -

[- ] process > check_fasta_format -
[- ] process > translatorx -
[- ] process > check_aln -
[- ] process > remove_gaps -
[- ] process > raxml_nt -
[- ] process > raxml_aa -
[- ] process > raxml2drawing -
[- ] process > nw_display -
[- ] process > barefoot -
[- ] process > model_selection -
[- ] process > gard_detect -
________________________
Execution status: failed
Results are reported here: results/<prefix_of_your_fasta>/html/full_aln/index.html

No results folder is created

rresendepinto avatar Mar 07 '22 15:03 rresendepinto

Thanks @rresendepinto! Your command looks fine to me and works as expected here (with SLURM). However, I noticed you are running Nextflow version 20.09.0-edge; you can update your Nextflow installation to version 21.10.6 with nextflow self-update. Maybe that will already fix the issue. Otherwise I'm not quite sure what's going on. Have you successfully run other nextflow pipelines in the past? Also, is there anything interesting in the qsub.out log files? Btw, do you see any of the singularity images being downloaded?

fischer-hub avatar Mar 07 '22 18:03 fischer-hub

This is the first nextflow pipeline I am trying to run on this cluster. The qsub file only presents what I reported before as output. The same thing happens with version 21.10.6. The singularity images are being downloaded

rresendepinto avatar Mar 08 '22 11:03 rresendepinto

Alright, you could try to run the nextflow test pipeline, that way we can see if this issue is restricted to poseidon or an issue with the HPC and the nextflow execution:

nextflow run hello

You can expect an output similar to this if everything is working correctly:

N E X T F L O W  ~  version 21.10.6
Launching `nextflow-io/hello` [nice_payne] - revision: ec11eb0ec7 [master]
executor >  local (4)
[fa/20fe7a] process > sayHello (1) [100%] 4 of 4 ✔
Ciao world!

Hello world!

Hola world!

Bonjour world!

Also I found this issue with nextflow on SGE HPCs which kind of sounds like your issue. The problem here was that SGE defaulted to a shell that was not bash, resulting in nextflow crashing. I added their recommended fix to the SGE profile in poseidon, so you could also try pulling the last commit again and running your command. Do you know what kind of shell your HPC is running as default?

fischer-hub avatar Mar 08 '22 12:03 fischer-hub

The hello program works. The HPC runs bash as default.

The pipeline doesn't hang anymore but throws a different error related to the job scheduler:

qsub4.out.txt

rresendepinto avatar Mar 08 '22 14:03 rresendepinto

Okay, great! That means nextflow is running now. From the qsub log file it seems your cluster requires the definition of the parallel environment to use. I added smp as the standard parallel environment to use with the SGE profile here, however it is possible that your cluster is using a different parallel environment. If you are still getting an error similar to

Command output:
  Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly
  Exiting.

that refers to the parallel environment, you might have to change the penv variable here (in your local installation) from smp to the environment your cluster is using, e.g. something like mpi. But you can just try running the new commit first; maybe your cluster is using smp!
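A rough sketch of how the penv value could be changed in a local copy. It assumes the setting is written on one line as penv = 'smp' with single quotes; the real config in the repository may look different. The set_penv helper and the configs/sge.config path are made up for illustration.

```shell
# Replace the value of a `penv = '...'` line in a Nextflow config file.
# Assumption: the setting sits on one line and uses single quotes.
set_penv() {
  config_file="$1"
  new_penv="$2"
  tmp_file="${config_file}.tmp"
  sed "s/penv *= *'[^']*'/penv = '${new_penv}'/" "$config_file" > "$tmp_file" &&
    mv "$tmp_file" "$config_file"
}

# List the parallel environments defined on the cluster (SGE only):
#   qconf -spl
# Find where the pipeline sets penv in your local copy:
#   grep -rn "penv" .
# Then, with a hypothetical config path:
#   set_penv configs/sge.config mpi
```

Running grep first shows which file actually holds the setting, so the path passed to set_penv does not have to be guessed.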

fischer-hub avatar Mar 08 '22 15:03 fischer-hub

I think it uses mpi. Where is the pipeline code stored?

rresendepinto avatar Mar 08 '22 15:03 rresendepinto

I changed to mpi but it won't run because it has uncommitted changes.

rresendepinto avatar Mar 08 '22 15:03 rresendepinto

I changed to mpi but it won't run because it has uncommitted changes.

Maybe you are trying to run the pipeline from this branch here. Can you try to change to your local directory where your poseidon installation is (I think in your log file it was /home/rpinto/.nextflow/assets/hoelzer/poseidon/) and then start the pipeline with:

nextflow run poseidon.nf --fasta poseidon/test_data/bats_mx1_small.fasta --cores 4 -profile sge,singularity

nextflow run hoelzer/poseidon -r sge_profile ... will try to run the sge_profile branch, but you actually want to run with your local changes.

fischer-hub avatar Mar 08 '22 16:03 fischer-hub

Yes exactly, or first clone your own local copy of the PoSeiDon code, make the changes, and then run like described above via nextflow run poseidon.nf ...

git clone https://github.com/hoelzer/poseidon.git
cd poseidon
git checkout sge_profile
# now you are on the code branch w/ the changes David introduced
git pull origin sge_profile
# just in case, check that you really have the changes in this branch
# now modify your local copy according to your needs
nextflow run poseidon.nf --fasta test_data/bats_mx1_small.fasta --cores 4 -profile sge,singularity

hoelzer avatar Mar 08 '22 16:03 hoelzer

I ran the pipeline from the local directory and it seems to have solved the issues with the job scheduler. However, it threw an error on the "gard_detect" step. The output is in the following file. I am running the pipeline on the test_data/bats_mx1_small.fasta file.

qsub7.out.txt

rresendepinto avatar Mar 08 '22 16:03 rresendepinto

Glad you got it to run! Yes, there are a few known issues regarding the gard_detect process. There is already a PR addressing them; however, it does not contain the SGE profile changes. Could you provide the .command.log and gard.log files from the working directory so we can figure out if it's a known issue? (/home/rpinto/.nextflow/assets/hoelzer/poseidon/work/84/428977a49ec560ffb8bdddfe7eb1b1)

If it's the same issue, I can just rebase the sge_profile branch so the fix is on there too.

fischer-hub avatar Mar 08 '22 17:03 fischer-hub

.command.log is empty

gard.log

rresendepinto avatar Mar 08 '22 18:03 rresendepinto

.command.log is empty

gard.log

Yep, that's the issue targeted in the gard PR. I rebased the sge_profile branch, so the change should be available there too now. Just pull and try again! And don't forget to set your penv in case it got overwritten because of the pull :)

fischer-hub avatar Mar 08 '22 19:03 fischer-hub

I can not pull the pipeline again.

hoelzer/poseidon contains uncommitted changes -- cannot pull from repository

I tried changing the penv variable back to 'smp' but it didn't work

rresendepinto avatar Mar 08 '22 19:03 rresendepinto

Can you try:

git clone https://github.com/hoelzer/poseidon.git
cd poseidon
git checkout sge_profile
# now you are on the code branch w/ the changes David introduced
git pull origin sge_profile
# just in case, check that you really have the changes in this branch
# now modify your local copy according to your needs
nextflow run poseidon.nf --fasta test_data/bats_mx1_small.fasta --cores 4 -profile sge,singularity

that way you pull the changes directly via Git.
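For context on the "contains uncommitted changes" message: nextflow keeps pulled pipelines as Git checkouts under ~/.nextflow/assets, so local edits there block a later pull. A small sketch to see what was edited; the has_local_changes helper is made up for illustration.

```shell
# Return success if the Git checkout at the given path has uncommitted
# or untracked changes.
has_local_changes() {
  [ -n "$(git -C "$1" status --porcelain)" ]
}

# Location of the pulled pipeline, as seen in the log file mentioned earlier.
ASSET_DIR="$HOME/.nextflow/assets/hoelzer/poseidon"

if [ -d "$ASSET_DIR" ] && has_local_changes "$ASSET_DIR"; then
  echo "local edits present in $ASSET_DIR:"
  git -C "$ASSET_DIR" status --short
fi
```

Keeping edits in a separate clone, as in the commands above, avoids touching the copy that nextflow manages.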

fischer-hub avatar Mar 08 '22 19:03 fischer-hub

It finally worked! I can't wait to use this pipeline on my data.

Thank you so much for your help! :)

rresendepinto avatar Mar 09 '22 12:03 rresendepinto

Great! I'm glad to hear that,

thanks for also testing the SGE profile!

fischer-hub avatar Mar 09 '22 13:03 fischer-hub

Awesome! Thanks for your patience @rresendepinto and thanks for the support @fischer-hub !

hoelzer avatar Mar 09 '22 15:03 hoelzer