goci
OTAR prioritised splitting of old GWAS/QTL studies to match summary statistics design
James and Annalisa met with Jeremy S. and Mohd on 8/10/21 and began drawing up a plan to split the PMIDs in the list below to match their summary statistics design. At the moment, each of these publications has all of its sumstats nested in a single folder and linked to one study ID. All the files, however, have already been formatted and harmonised for OTAR and are therefore ready for submission (with James) once the curation part is completed via the depo curation app (by Annalisa).
Annalisa and Laura have discussed a curation plan for these moving forward, including how to streamline EFO requests. Following that discussion, I am creating this ticket.
- KETTUNEN - 2016 Nat Comm - GCST003664 (123 files) - done https://www.ebi.ac.uk/gwas/publications/27005778
- DRAISMA - 2015 Nat Comm - GCST002959 (126 files) https://www.ebi.ac.uk/gwas/publications/26068415
- SHIN - 2014 Nat Genet - http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST002001- (529 files) (PMID 24816252)
- SUN - 2018 Nat Genet - GCST005806, one giant tar file, 1,478 files https://www.ebi.ac.uk/gwas/publications/29875488
- SUHRE - 2017 Nat Comm - 1 GCST but many files within harmonised folder, 1,124 protein levels - done https://www.ebi.ac.uk/gwas/publications/28240269
We discussed a pipeline to handle these tasks efficiently:
1) Annalisa is creating submissions and filling in submission forms for the above publications
2) Sajo is handling the creation of GCSTs as per goci #505 (https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/ebispot/goci/505)
3) James is in charge of the summary statistics part at the end of the pipeline
@jdhayhurst, for Suhre 28240269 my submission form with metadata is ready: https://www.ebi.ac.uk/gwas/deposition/submission/622a122378a7f500016a5342 Please let me know how you would like me to proceed: shall I submit with NA for all sumstats info, or shall I wait and fill in all the info once file names etc. are ready?
Hi @buniello, the Suhre file IDs are in this text file. These need to be added to the template in the sumstats file field, along with the correct genome assembly. In the md5 field, just put "28240269" for all. The file needs to get through the basic field validation, and we need the file identifiers in there so that we can map the files to the traits and the GCSTs when they come.
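A minimal sketch of filling those template fields from a list of file IDs, in case it is useful. The column names (`summary_statistics_file`, `md5_sum`, `assembly`), the helper names, the example file IDs and the example assembly value are all assumptions for illustration; the real template headers and the correct assembly should be taken from the deposition template and the data.

```python
import csv
import io

# Hypothetical column names; check them against the actual submission template.
COLUMNS = ["summary_statistics_file", "md5_sum", "assembly"]


def build_sumstats_rows(file_ids, assembly, md5_placeholder):
    """Return one row per file ID, with the same placeholder in the md5 field."""
    return [
        {
            "summary_statistics_file": file_id,
            "md5_sum": md5_placeholder,
            "assembly": assembly,
        }
        for file_id in file_ids
    ]


def rows_to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Illustrative file IDs and assembly only, not the real Suhre values.
    example_rows = build_sumstats_rows(
        ["file_id_1", "file_id_2"], assembly="GRCh37", md5_placeholder="28240269"
    )
    print(rows_to_tsv(example_rows))
```

The resulting columns can then be pasted into the study sheet of the template, so that every study row carries a file identifier that can later be mapped to its trait and GCST.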
Thanks @jdhayhurst, I have now submitted the complete form: https://www.ebi.ac.uk/gwas/deposition/submission/622a122378a7f500016a5342
I've forced the validation to pass for the sumstats (callback id vKAzm5Qu). Will get it to go all the way through and then manually release the files.
option 1:
- download raw files
- make submission - it will fail
- get gcsts - make sure you can map them to the files
- create dirs on staging (need to think about the binning)
- copy raw files and rename
- run/wait for nightly sumstats sync
- sync harmonised data to ftp
- md5sums for harmonised
- harmonised data are not stored on staging
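The "map GCSTs to files", "create dirs on staging" and "copy raw files and rename" steps of option 1 could be planned along these lines. This is a sketch only: the bin size of 1000 accessions per directory, the `<bin>/<GCST>/<GCST><extension>` naming scheme, and the function names are assumptions, since the binning was still to be decided. It only computes the copy plan and does not touch the filesystem.

```python
from pathlib import PurePosixPath

BIN_SIZE = 1000  # assumption: 1000 accessions per staging directory


def gcst_bin(accession, bin_size=BIN_SIZE):
    """Return the bin directory name for an accession such as 'GCST004365'."""
    prefix, digits = accession[:4], accession[4:]
    if prefix != "GCST" or not digits.isdigit() or int(digits) < 1:
        raise ValueError(f"not a GCST accession: {accession!r}")
    number, width = int(digits), len(digits)
    start = ((number - 1) // bin_size) * bin_size + 1
    end = start + bin_size - 1
    return f"GCST{start:0{width}d}-GCST{end:0{width}d}"


def plan_copies(file_to_gcst, staging_root):
    """Return (source, destination) pairs; refuse a many-to-one mapping."""
    accessions = list(file_to_gcst.values())
    if len(set(accessions)) != len(accessions):
        raise ValueError("each raw file must map to exactly one distinct GCST")
    plan = []
    for source, accession in sorted(file_to_gcst.items()):
        # Keep the full extension (e.g. '.tsv.gz') when renaming to the GCST.
        extension = "".join(PurePosixPath(source).suffixes)
        destination = (
            PurePosixPath(staging_root)
            / gcst_bin(accession)
            / accession
            / f"{accession}{extension}"
        )
        plan.append((source, str(destination)))
    return plan


if __name__ == "__main__":
    # Illustrative mapping only; real mappings come from the submission.
    for src, dst in plan_copies({"raw/protein_1.tsv.gz": "GCST004365"}, "/staging"):
        print(src, "->", dst)
```

Checking the plan before copying makes it easy to confirm that every raw file has exactly one GCST and that no two files would land on the same destination.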
option 2:
- download raw files
- format files - and validate
- create submission - upload files to globus
- submission needs to succeed
- copy harmonised files when ftp sync has run (remove the studies from the queue so they don't get harmonised again)
option 3:
- download raw files
- format files - and validate
- create submission - upload files to globus
- submission can be coerced through if ftp issues etc
- copy harmonised files when ftp sync has run (remove the studies from the queue so they don't get harmonised again)
option 4:
- download raw files
- make submission - it will fail
- force through submission
- run/wait for nightly sumstats sync
- copy harmonised files when ftp sync has run (remove the studies from the queue so they don't get harmonised again)
- md5sums for harmonised
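For the "md5sums for harmonised" step that appears in the options above, a minimal sketch using only the standard library. The output file name `md5sum.txt` and the `<digest>  <filename>` line format (the same layout the `md5sum` command prints) are assumptions about what is wanted next to the harmonised files.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5sums(directory, out_name="md5sum.txt"):
    """Write one '<digest>  <filename>' line per file in the directory."""
    directory = Path(directory)
    lines = [
        f"{md5_of_file(path)}  {path.name}"
        for path in sorted(directory.iterdir())
        if path.is_file() and path.name != out_name
    ]
    (directory / out_name).write_text("".join(line + "\n" for line in lines))
    return lines
```

Calling `write_md5sums` on a harmonised directory after the copy has finished gives a checksum file that can be compared against the FTP copy once the sync has run.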
Ready to sync, but IS_PUBLISH is set to False so they won't sync. Still todo:
- run sumstats sync
- sync harmonised data to ftp
- md5sums for harmonised
Waiting on curation of traits.
All complete - the only thing remaining is to clean up and fix permissions on GCST004365. The sumstats directory contains a hidden version of GCST004365 that should overwrite the existing GCST004365 dir once all of the studies have been published.
This is done, but waiting for data release before it can be closed
SUHRE et al -- splitting completed.
@jdhayhurst will finish this up
Suhre has been completed. The rest are outstanding (will require curation) - @ljwh2 was the curation/splitting handed over to anyone?
I will be doing these. I'm just waiting on some EFOs being created for Shin at the moment.
closing as all publications complete