Slurm Cluster Migration for Python Infrastructure

Open sprintell opened this issue 1 year ago • 19 comments

  • [ ] gwas-sumstats-harmoniser

  • [x] Summary statistics with HDF5

  • [x] Summary Statistics File Validator

  • [x] gwas-sumstats-tools

  • [x] sum-stats-formatter

  • [x] eQTL-SumStats

  • [x] gwas-template-services

  • [x] gwas-sumstats-service

  • [x] gwas-utils

  • [x] gwas-curation-utils

  • [x] gwas-ebi-search-index

  • [x] gwas-solr-slim

sprintell avatar Oct 26 '23 10:10 sprintell

@karatugo should have a session with @jdhayhurst before commencing this

sprintell avatar Nov 01 '23 11:11 sprintell

Do harmoniser last so Yue can have time to complete her work

ljwh2 avatar Nov 15 '23 11:11 ljwh2

| Repo | PR | Status | Notes |
| --- | --- | --- | --- |
| gwas-sumstats-harmoniser | https://github.com/EBISPOT/gwas-sumstats-harmoniser/pull/82 & https://github.com/EBISPOT/gwas-sumstats-harmoniser/pull/83 & https://github.com/EBISPOT/gwas-utils/pull/159 | Done. Release needed for harmoniser & PRE_GWAS-SSF harmoniser. | 1) Yue suggested a 48h time limit in SLURM. 2) Development done, testing done in sandbox by Yue, pull requests for harmoniser and pre-gwas-ssf harmoniser merged to their respective main branches. Glue scripts migrated to SLURM and added to GitHub for better tracking. |
| Summary statistics with HDF5 | | Skipped | Discussed with Yomi and we agreed not to invest time in this, as it will be replaced by another technology soon. |
| Summary Statistics File Validator | | Skipped | Skipped as it was deprecated. |
| gwas-sumstats-tools | | Done | No LSF usage was found. |
| sum-stats-formatter | https://github.com/EBISPOT/sum-stats-formatter/pull/86 | Done | Merged with the temporary sbatch script file implementation and created the following backlog item: https://github.com/EBISPOT/sum-stats-formatter/issues/88 |
| eQTL-SumStats | | Skipped | Postponed. Will check in the next release cycle if it needs an update. |
| gwas-template-services | | Done | No LSF usage was found. |
| gwas-sumstats-service | https://github.com/EBISPOT/gwas-sumstats-service/pull/273 & https://github.com/EBISPOT/gwas-sumstats-service/pull/274 & https://github.com/EBISPOT/gwas-sumstats-service/pull/275 & https://github.com/EBISPOT/gwas-sumstats-service/pull/276 | Done. Need to do tag release for the migration. | Test OK for Celery workers start and refresh with scrontab. Created new START_CELERY_WORKERS_SLURM.sh in dev and prod. Also new start_celery_worker_slurm.sh script in dev and prod. Tested OK in the sandbox env. |
| gwas-utils | https://github.com/EBISPOT/gwas-utils/pull/158 | Done | LSF is not used anymore; cleaned up the old LSF code. |
| gwas-curation-utils | | Done | No LSF usage was found. |
| gwas-ebi-search-index | | Done | No LSF usage was found. |
| gwas-solr-slim | https://github.com/EBISPOT/gwas-solr-slim/pull/52 | Done. AFAIK no releases used but the new script start_slurm.sh. | Test OK in dev. Also, created ${bamboo.sw_dir}/${bamboo.env_dir}/scripts/gwas-solr-slim/start_slurm.sh. |
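
The 48h time limit suggested for the harmoniser maps onto the `--time` option of `sbatch`. As a rough sketch only, this is one way a submission command with that limit could be assembled from Python. The script name `start_harmonisation_slurm.sh` and the job name are hypothetical placeholders, not the names used in the merged glue scripts.

```python
import shlex


def build_sbatch_command(script_path, time_limit="48:00:00", job_name="harmonisation"):
    """Return an sbatch argument list with a wall-clock limit.

    time_limit uses Slurm's hours:minutes:seconds form; 48:00:00 is the
    48h limit mentioned in the table above. job_name is a placeholder.
    """
    return [
        "sbatch",
        f"--time={time_limit}",
        f"--job-name={job_name}",
        script_path,
    ]


# Hypothetical script name, for illustration only.
cmd = build_sbatch_command("start_harmonisation_slurm.sh")
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd, check=True)` on a login node where `sbatch` is available.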

karatugo avatar Nov 21 '23 08:11 karatugo

All done. Releases needed for the migration to SLURM.

karatugo avatar Jan 15 '24 16:01 karatugo

This will be released with the metadata YAML update feature.

sprintell avatar Feb 21 '24 10:02 sprintell

Error in SLURM - waiting for input from TSC

ljwh2 avatar Mar 06 '24 10:03 ljwh2

Released https://github.com/EBISPOT/gwas-sumstats-harmoniser/releases/tag/v1.0.5 and https://github.com/EBISPOT/gwas-sumstats-harmoniser/releases/tag/v1.1.4

karatugo avatar Mar 18 '24 17:03 karatugo

Prepared scrontab entries for harmoniser.

  • [ ] Enable them before deployment
  • [ ] Disable crontab entries also

karatugo avatar Mar 18 '24 18:03 karatugo

For gwas-sumstats-harmoniser migration:

  • Moved crontab items to scrontab
  • Released https://github.com/EBISPOT/gwas-sumstats-harmoniser/releases/tag/v1.0.5 and https://github.com/EBISPOT/gwas-sumstats-harmoniser/releases/tag/v1.1.4
  • Updated harmonisation scripts https://github.com/EBISPOT/gwas-utils/pull/167
  • Updated harmonisation wrappers https://github.com/EBISPOT/gwas-utils/pull/168
  • Created new container configs
  • Updated NXF asset values in scripts

karatugo avatar Mar 19 '24 14:03 karatugo

For gwas-sumstats-harmoniser migration:

Test submitted to codon-slurm but failed. @jiyue1214 is helping me to investigate the problem.

karatugo avatar Mar 19 '24 14:03 karatugo

Released https://github.com/EBISPOT/gwas-sumstats-harmoniser/releases/tag/v1.1.5 and https://github.com/EBISPOT/gwas-sumstats-harmoniser/releases/tag/v1.0.6 and submitted the test files to codon-slurm again.

karatugo avatar Mar 19 '24 16:03 karatugo

For gwas-sumstats-harmoniser migration:

Test submitted to codon-slurm and it's successful. There's one small mistake in meta.yaml files. @jiyue1214 is helping me to investigate the problem.

karatugo avatar Mar 20 '24 13:03 karatugo

Thanks to @jiyue1214's fix, I have now released v1.1.7 and v1.0.7 and am testing again in codon-slurm.

[gwas_lsf@codon-dm-06 cron]$ ./start_harmonisation_slurm_test_goci1179.sh 
Submitted batch job 65232999

karatugo avatar Mar 27 '24 16:03 karatugo

For gwas-sumstats-harmoniser migration:

I compared the output of the harmonisation pipeline in SLURM and LSF.

  • .h.tsv.gz, .h.tsv.gz.tbi, md5sum.txt files are identical.
  • In running.log, we have a higher percentage of sites that are carried forward.
  • In meta yaml, @jiyue1214 fixed a few bugs (coordinate system and samples). (thanks @jiyue1214 !)
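
A comparison like the one above can be scripted. The following is a minimal sketch, assuming the LSF and SLURM outputs sit in two separate directories with matching file names; the function names and directory layout are assumptions for illustration, not the procedure actually used.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_outputs(lsf_dir, slurm_dir, suffixes=(".h.tsv.gz", ".h.tsv.gz.tbi")):
    """Compare harmonised outputs by checksum.

    Returns a dict mapping each matching file name in lsf_dir to
    True (identical), False (different) or None (missing in slurm_dir).
    """
    results = {}
    for lsf_file in sorted(Path(lsf_dir).iterdir()):
        if not lsf_file.name.endswith(suffixes):
            continue
        slurm_file = Path(slurm_dir) / lsf_file.name
        if not slurm_file.exists():
            results[lsf_file.name] = None
        else:
            results[lsf_file.name] = md5_of_file(lsf_file) == md5_of_file(slurm_file)
    return results
```

Calling `compare_outputs("lsf_out", "slurm_out")` with two hypothetical output directories would return one entry per harmonised file, which makes mismatches easy to spot.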

I suggest we deploy this after the Easter long weekend. I'll coordinate it with Yue.

karatugo avatar Mar 28 '24 12:03 karatugo

This is waiting for final update from @jiyue1214

sprintell avatar Apr 04 '24 10:04 sprintell

Issue: In running.log, we have a higher percentage of sites that are carried forward.

Primary investigation: Percentage of sites that are carried forward = Carried forward variants / (Carried forward variants + Unmapped variants). Based on the log files, the number of sites that are carried forward is the same, which means the difference is caused by the unmapped variants. To find out why the unmapped variant counts differ, I need to rerun the pipeline and inspect the intermediate files.
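
To illustrate the formula, here is a small sketch showing that the percentage changes when only the unmapped count differs. The counts are made-up numbers for illustration and are not taken from running.log.

```python
def carried_forward_percentage(carried_forward, unmapped):
    """Percentage of sites carried forward, per the formula above."""
    total = carried_forward + unmapped
    if total == 0:
        raise ValueError("no carried-forward or unmapped variants to compare")
    return 100.0 * carried_forward / total


# Made-up counts: same carried-forward number, different unmapped counts.
run_a_pct = carried_forward_percentage(carried_forward=9_000, unmapped=1_000)
run_b_pct = carried_forward_percentage(carried_forward=9_000, unmapped=500)
print(f"run A: {run_a_pct:.2f}%  run B: {run_b_pct:.2f}%")
```

With the numerator fixed, a smaller unmapped count always yields a higher percentage, which is why the investigation focuses on the unmapped files.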

jiyue1214 avatar Apr 18 '24 21:04 jiyue1214

I reran the pipeline with the intermediate files and found:

  1. The intermediate files are the same (the md5sums of the two unmapped files are identical).
  2. I can reproduce the slight difference between LSF and Slurm; the Slurm result is the correct number.
  3. In LSF, Nextflow read GCST90293086's unmapped file into the GCST90293085 log work folder. In Slurm, however, it read the correct one.

The code difference does not explain this. However, to double-check, Yue can switch the LSF code to Slurm (changing only the executor).

jiyue1214 avatar Apr 30 '24 21:04 jiyue1214

I confirm the Slurm result is correct. We can close this ticket. As for the cause of the problem on LSF (the harmonisation result is correct; only the unmapped file did not match the GCST), I will open another ticket to look into it in more detail.

jiyue1214 avatar May 01 '24 09:05 jiyue1214

@karatugo to release

ljwh2 avatar May 01 '24 10:05 ljwh2

@jiyue1214 added an additional feature; waiting for Yue before releasing

ljwh2 avatar May 22 '24 09:05 ljwh2

  1. All scripts are ready and will start to run today via crontab.
  2. One small follow-up action: I will activate scrontab instead of crontab, based on the ITSC info.

jiyue1214 avatar Jun 05 '24 09:06 jiyue1214

The Nextflow pipeline is running on Slurm and can be monitored daily via Nextflow Tower. Question: @karatugo, according to the scrontab, we have not activated the following entries: refresh harmonisation queue, queue GWAS-SSF files for harmonisation, and queue pre-GWAS-SSF files for harmonisation. Should we activate them as well?

(Screenshot 2024-06-11 at 21.47.49.png)

jiyue1214 avatar Jun 11 '24 20:06 jiyue1214

We migrated all crontab jobs to Slurm this morning. This ticket can be moved to Done. We just need to double-check tomorrow that they are running successfully.

jiyue1214 avatar Jun 12 '24 08:06 jiyue1214

This has now been released; the ticket is due to be closed at the end of the sprint.

sprintell avatar Jun 19 '24 10:06 sprintell