Reharmonising a file with error

Open ljwh2 opened this issue 1 year ago • 10 comments

GCST002047 was not harmonised successfully because our harmonisation pipeline does not recognise the column “Effect_Allele”. The pipeline reads the “effect_allele” column in the input file to harmonise each variant, but all data in this column is NA, which is why every variant gets an hm code of 14. If we change the header of this file, it should harmonise correctly (the same applies to other_allele).
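The header fix described above could be sketched roughly as follows. This is only an illustration, not the pipeline's own code: `HEADER_MAP` and `rename_header` are hypothetical names, the "Other_Allele" spelling is an assumption inferred from the comment, and the file is assumed to be tab-separated with Unix line endings.

```python
# Hypothetical mapping from the headers in this file to the lowercase names
# the pipeline is described as reading. "Other_Allele" is an assumed spelling.
HEADER_MAP = {
    "Effect_Allele": "effect_allele",
    "Other_Allele": "other_allele",
}


def rename_header(text, sep="\t"):
    """Return the file content with the column names on the first line remapped.

    Only the header line is touched; data rows are passed through unchanged.
    """
    header, newline, body = text.partition("\n")
    columns = [HEADER_MAP.get(col, col) for col in header.split(sep)]
    return sep.join(columns) + newline + body
```

Columns that are not in the mapping keep their original names, so the function is safe to run on a file that has already been fixed.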

Please fix the file and re-queue it for harmonisation.

ljwh2 avatar Jan 10 '24 16:01 ljwh2

When unzipped, two files appeared. I fixed the header for both, but the metadata YAML files are missing. Since the study is old, the data is not available from the Ingest API.

karatugo avatar Mar 19 '24 15:03 karatugo

  • [x] Create metadata yaml files that contain GCST ID and genome assembly for the harmonisation.
  • [x] Submit the harmonisation
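A minimal sketch of writing such a stub metadata file, assuming plain `key: value` YAML is enough for these two fields. The `genome_assembly` key appears later in this thread; the `gwas_id` key and the function name `minimal_metadata_yaml` are assumptions for illustration, not confirmed names from the sumstats tools.

```python
def minimal_metadata_yaml(gcst_id, genome_assembly):
    """Build the text of a two-field metadata stub for one study."""
    lines = [
        f"gwas_id: {gcst_id}",  # key name is an assumption
        f"genome_assembly: {genome_assembly}",  # key name as used in this thread
    ]
    return "\n".join(lines) + "\n"
```

The returned string can be written next to the summary statistics file before the job is submitted.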

karatugo avatar Mar 19 '24 15:03 karatugo

Submitted to codon with the submission script /hps/software/users/parkinso/spot/gwas/prod/scripts/cron/start_harmonisation_pre_standard_goci1226.sh

Job <92779129> is submitted to default queue <standard>.

karatugo avatar Mar 19 '24 16:03 karatugo

  • [x] Compare two studies in FTP ("EduYears" and "College")
  • [x] If they are the same, replace the zipped file with the correct one - Moved SSGAC_College_Rietveld2013_publicrelease.txt to GCST002001-GCST003000/GCST002047 and removed the zipped file
  • [x] Harmonise with pre_gwas_ssf
  • [x] Also, upload only one harmonised file with the correct title

karatugo avatar Mar 20 '24 10:03 karatugo

  • [x] Use variant_id in the header (as per pre_gwas_ssf standard)
  • [x] Rename the file with their GCST_ID.txt for harmonisation

karatugo avatar Mar 20 '24 16:03 karatugo

Using the script at /hps/software/users/parkinso/spot/gwas/prod/scripts/cron/start_harmonisation_pre_standard_goci1226.sh

Job <93028342> is submitted to default queue <standard>.

karatugo avatar Mar 20 '24 16:03 karatugo

Added chromosome and base_pair_location columns filled with NA and submitted again.
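For reference, padding a file with NA-filled `chromosome` and `base_pair_location` columns (the GWAS-SSF column names) could look roughly like this. It is a sketch under the assumption of a tab-separated file with a single header line; `add_na_columns` is a hypothetical helper, not part of the harmonisation scripts.

```python
def add_na_columns(text, new_columns=("chromosome", "base_pair_location"), sep="\t"):
    """Append the given columns to every row, filling the data rows with NA."""
    lines = text.splitlines()
    if not lines:
        return text
    # Extend the header with the new column names.
    out = [sep.join([lines[0], *new_columns])]
    # Every non-empty data row gets one NA per added column.
    filler = sep.join(["NA"] * len(new_columns))
    out.extend(sep.join([row, filler]) for row in lines[1:] if row)
    return "\n".join(out) + "\n"
```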

Job <93126943> is submitted to default queue <standard>.

karatugo avatar Mar 21 '24 10:03 karatugo

Harmonised files, metadata files, running logs and .tbi files are copied to the respective harmonised directories.

  • http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST002001-GCST003000/GCST002047/harmonised/
  • http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST008001-GCST009000/GCST008396/harmonised/

karatugo avatar Mar 27 '24 11:03 karatugo

This is confirmed done; @earlEBI will double-check.

sprintell avatar Apr 04 '24 09:04 sprintell

Reopening as the YAML files do not look quite right (is_harmonised = false). Also, the .tbi files should be renamed .tbi.gz.

earlEBI avatar Apr 04 '24 15:04 earlEBI

Fixed the following fields:

genome_assembly: GRCh38
is_harmonised: true
is_sorted: true

@earlEBI Could you check again please? Thanks.

karatugo avatar May 17 '24 14:05 karatugo

@earlEBI please confirm

ljwh2 avatar May 22 '24 09:05 ljwh2

The yamls are only five lines long. Should they not contain more detail?

[Screenshot: 2024-05-22 at 10.21.14]

earlEBI avatar May 22 '24 09:05 earlEBI

I thought that's because it's a very old submission. And also they are not available in the ingest api. @sajo-ebi

  • https://www.ebi.ac.uk/gwas/ingest/api/v2/studies/GCST008396
  • https://www.ebi.ac.uk/gwas/ingest/api/v2/studies/GCST002047

karatugo avatar May 22 '24 12:05 karatugo

Old studies are meant to be retrieved from the public REST API: https://www.ebi.ac.uk/gwas/rest/api/studies/GCST008396

sprintell avatar May 23 '24 09:05 sprintell

TODO: Update sumstats tools so that we fetch the REST API if Ingest API does not return any data.
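One possible shape for that fallback, as a rough sketch only. The two URL templates are taken from the links in this thread; `fetch_json`, `fetch_study` and the "empty response means no data" rule are assumptions, and the real sumstats tools may be structured quite differently.

```python
import json
import urllib.error
import urllib.request

# URL patterns copied from the links posted in this thread.
INGEST_URL = "https://www.ebi.ac.uk/gwas/ingest/api/v2/studies/{gcst_id}"
REST_URL = "https://www.ebi.ac.uk/gwas/rest/api/studies/{gcst_id}"


def fetch_json(url, timeout=30):
    """Return the decoded JSON body, or None on an error or an empty body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
    except (urllib.error.URLError, TimeoutError):
        return None
    if not body:
        return None
    try:
        return json.loads(body)
    except ValueError:
        return None


def fetch_study(gcst_id, fetch=fetch_json):
    """Try the Ingest API first, then fall back to the public REST API."""
    for template in (INGEST_URL, REST_URL):
        data = fetch(template.format(gcst_id=gcst_id))
        if data:  # None, {} and [] all count as "no data" here (an assumption)
            return data
    return None
```

Passing the fetcher in as a parameter keeps the fallback logic testable without network access.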

karatugo avatar May 24 '24 15:05 karatugo

Harmonization done, but yaml file has some missing data.

sprintell avatar May 29 '24 09:05 sprintell

Regenerated YAML files for GCST002047 and GCST008396. Expect them in the public ftp in 2 days.

karatugo avatar Jun 03 '24 16:06 karatugo

YAML files are in staging FTP but not in public FTP. They didn't sync because our ftp-sync code only picks up files whose names start with 'GCST*'. See https://github.com/EBISPOT/gwas-utils/blob/6fbf2c7a6d6fdfc79e0b8c2d1e74539bb1073303/ftpSummaryStatsScript/ftp_sync.py#L186-L188

Will rename the files; expect them in the public FTP in 2 days.
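To illustrate the effect of a 'GCST*' prefix filter, here is a small stand-alone sketch. It is not the code at the linked lines of ftp_sync.py; `files_to_sync` is a hypothetical function, and the two example file names are the ones mentioned earlier in this thread.

```python
from fnmatch import fnmatchcase


def files_to_sync(filenames, pattern="GCST*"):
    """Keep only the names matching the pattern (case-sensitive glob)."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


candidates = [
    "GCST002047.txt",
    "SSGAC_College_Rietveld2013_publicrelease.txt",
]
selected = files_to_sync(candidates)
```

With a filter of this kind, any file that keeps its original (non-GCST) name is silently skipped rather than reported as an error.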

karatugo avatar Jun 10 '24 13:06 karatugo

Agreed to keep the original files as per the old guidelines.

ljwh2 avatar Jun 12 '24 09:06 ljwh2