OTAR prioritised splitting of old GWAS/QTL studies to match summary statistics design

Open buniello opened this issue 4 years ago • 13 comments

James and Annalisa met with Jeremy S. and Mohd on 8/10/21 and started working on a plan to split the PMIDs in the list below to match their summary statistics design. At the moment, each of these publications has all of its sumstats nested in a single folder and linked to one study ID. All the files, however, have already been formatted and harmonised for OTAR, and are therefore ready for submission (with James) once the curation part is completed via the depo curation app (from Annalisa).

Annalisa and Laura have discussed a curation plan for these moving forward, including how to streamline EFO requests. Following that discussion, I am creating this ticket.

  • KETTUNEN - 2016 Nat Comm - GCST003664 (123 files) - done https://www.ebi.ac.uk/gwas/publications/27005778
  • DRAISMA - 2015 Nat Comm - GCST002959 (126 files) https://www.ebi.ac.uk/gwas/publications/26068415
  • SHIN - 2014 Nat Genet - http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST002001- (529 files) (PMID 24816252)
  • SUN - 2018 Nat Genet - GCST005806, one giant tar file (1,478 files) https://www.ebi.ac.uk/gwas/publications/29875488
  • SUHRE - 2017 Nat Comm - 1 GCST but many files within harmonised folder (1,124 protein levels) - done https://www.ebi.ac.uk/gwas/publications/28240269

buniello avatar Oct 12 '21 10:10 buniello

We discussed a pipeline to handle these tasks efficiently: 1) Annalisa is creating submissions and filling in submission forms for the above publications; 2) Sajo is handling GCST creation as per goci #505 (https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/ebispot/goci/505); 3) James is in charge of the summary statistics part at the end of the pipeline.

buniello avatar Mar 09 '22 15:03 buniello

@jdhayhurst, for Suhre 28240269 my submission form with metadata is ready: https://www.ebi.ac.uk/gwas/deposition/submission/622a122378a7f500016a5342 Please let me know how you would like me to proceed: shall I submit with NA for all sumstats info, or shall I wait and fill in all the info when file names etc. are ready?

buniello avatar Mar 23 '22 15:03 buniello

Hi @buniello, the Suhre file IDs are in this text file. These need to be added to the template in the sumstats file field, along with the correct genome assembly. In the md5 field, just put "28240269" for all. The file needs to get through the basic field validation, and we need the file identifiers in there so that we can map the files to the traits and the GCSTs when they come.
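The template rows described above could be generated along these lines. This is only a minimal sketch: the column names (`summary_statistics_file`, `md5_sum`, `genome_assembly`), the example file IDs, and the `GRCh37` value are illustrative assumptions, not the confirmed headers or values of the deposition template.

```python
import csv
import io

# Hypothetical column names; the real template headers may differ.
COLUMNS = ["summary_statistics_file", "md5_sum", "genome_assembly"]


def build_sumstats_rows(file_ids, pmid, assembly):
    """Return one row per file ID, using the PMID as the md5 placeholder."""
    return [
        {
            "summary_statistics_file": file_id,
            "md5_sum": pmid,  # placeholder only, not a real checksum
            "genome_assembly": assembly,
        }
        for file_id in file_ids
    ]


def rows_to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # File IDs and assembly below are made up for illustration.
    example = build_sumstats_rows(["fileA.tsv", "fileB.tsv"], "28240269", "GRCh37")
    print(rows_to_tsv(example))
```

The output can then be pasted into the sumstats columns of the template, so every row carries a file identifier that can later be mapped to a trait and a GCST.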

jdhayhurst avatar Mar 25 '22 11:03 jdhayhurst

Thanks @jdhayhurst, I have now submitted the complete form: https://www.ebi.ac.uk/gwas/deposition/submission/622a122378a7f500016a5342

buniello avatar Apr 04 '22 14:04 buniello

I've forced the validation to pass for the sumstats (callback id vKAzm5Qu). Will get it to go all the way through and then manually release the files.

jdhayhurst avatar Apr 04 '22 15:04 jdhayhurst

option 1:

  • download raw files
  • make submission - it will fail
  • get gcsts - make sure you can map them to the files
  • create dirs on staging (need to think about the binning)
  • copy raw files and rename
  • run/wait for nightly sumstats sync
  • sync harmonised data to ftp
  • md5sums for harmonised
  • harmonised data are not stored on staging
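The "map GCSTs to files", "create dirs" and "copy raw files and rename" steps in option 1 could be planned with something like the sketch below. The binning scheme (ranges of 1,000 accessions, e.g. `GCST004001-GCST005000`), the staging root, and the `<GCST>.tsv` destination name are assumptions for illustration; the actual binning was still undecided at this point in the thread.

```python
import re
from pathlib import PurePosixPath


def gcst_bin(accession, bin_size=1000):
    """Return the range directory name for an accession (assumed scheme)."""
    match = re.fullmatch(r"GCST(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GCST accession: {accession!r}")
    digits = match.group(1)
    number = int(digits)
    if number < 1:
        raise ValueError(f"accession number must be positive: {accession!r}")
    width = len(digits)  # keep the zero padding of the input
    low = ((number - 1) // bin_size) * bin_size + 1
    high = low + bin_size - 1
    return f"GCST{low:0{width}d}-GCST{high:0{width}d}"


def plan_copies(file_to_gcst, staging_root):
    """Build (source, destination) pairs; refuse ambiguous mappings."""
    seen = {}
    plan = []
    for raw_name, accession in sorted(file_to_gcst.items()):
        if accession in seen:
            raise ValueError(
                f"{accession} is mapped to both {seen[accession]} and {raw_name}"
            )
        seen[accession] = raw_name
        # Hypothetical layout: <root>/<bin>/<GCST>/<GCST>.tsv
        dest = (
            PurePosixPath(staging_root)
            / gcst_bin(accession)
            / accession
            / f"{accession}.tsv"
        )
        plan.append((raw_name, str(dest)))
    return plan
```

Producing the plan as plain pairs first makes it possible to review the mapping before anything is copied, and the duplicate check catches two files pointing at the same GCST.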

option 2:

  • download raw files
  • format files - and validate
  • create submission - upload files to globus
  • submission needs to succeed
  • copy harmonised files when ftp sync has run (remove the studies from the queue so they don't get harmonised again)

option 3:

  • download raw files
  • format files - and validate
  • create submission - upload files to globus
  • submission can be coerced through if ftp issues etc
  • copy harmonised files when ftp sync has run (remove the studies from the queue so they don't get harmonised again)

option 4:

  • download raw files
  • make submission - it will fail
  • force through submission
  • run/wait for nightly sumstats sync
  • copy harmonised files when ftp sync has run (remove the studies from the queue so they don't get harmonised again)
  • md5sums for harmonised
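The "md5sums for harmonised" step that appears in options 1 and 4 can be sketched with the standard library as follows. The flat directory layout and the `md5sum`-style output format (`<hash>  <filename>`) are assumptions; nothing here reflects the scripts actually used for the release.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5sum_lines(directory):
    """Return one '<hash>  <filename>' line per regular file, sorted by name."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return [f"{md5_of_file(p)}  {p.name}" for p in files]
```

Reading in chunks keeps memory use flat for large harmonised files, and sorting by name makes the resulting listing reproducible between runs.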

jdhayhurst avatar Apr 07 '22 13:04 jdhayhurst

Ready to sync, but IS_PUBLISH is set to False, so they won't sync yet. Still to do:

  • run sumstats sync
  • sync harmonised data to ftp
  • md5sums for harmonised

Waiting on curation of traits.

jdhayhurst avatar Apr 07 '22 13:04 jdhayhurst

All complete. The only thing remaining is to clean up and fix permissions on GCST004365. In the sumstats directory there is a hidden version of GCST004365 that should overwrite the existing GCST004365 dir once all of the studies have been published.

jdhayhurst avatar Apr 12 '22 11:04 jdhayhurst

This is done, but waiting for data release before it can be closed

sprintell avatar Apr 20 '22 10:04 sprintell

SUHRE et al -- splitting completed.

buniello avatar May 13 '22 08:05 buniello

@jdhayhurst will finish this up

sprintell avatar Jun 01 '22 10:06 sprintell

Suhre has been completed. The rest are outstanding (will require curation) - @ljwh2 was the curation/splitting handed over to anyone?

jdhayhurst avatar Jun 07 '22 08:06 jdhayhurst

I will be doing these. I'm just waiting on some EFOs being created for Shin at the moment.

earlEBI avatar Jun 07 '22 08:06 earlEBI

closing as all publications complete

ljwh2 avatar May 15 '23 12:05 ljwh2