goci
Reharmonising a file with error
GCST002047 was not harmonised successfully because our harmonisation pipeline cannot recognise the column “Effect_Allele”. The harmonisation pipeline reads the “effect_allele” column in the input file to harmonise the variant. However, all data in this column is NA. This is the reason why all variants give hm 14. If we change the header of this file, it should be able to be harmonised. (same as other_allele)
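The header fix described above can be sketched as follows. This is a minimal illustration, not the script that was actually used: it assumes a tab-separated file, and `Other_Allele` as the original spelling of the second column is a guess based on the "same as other_allele" remark. The sample row and the `MarkerName` and `Pvalue` columns are made up for the example.

```python
# Hypothetical mapping from the headers found in the file to the names
# the harmonisation pipeline reads. "Other_Allele" is an assumed spelling.
HEADER_FIXES = {
    "Effect_Allele": "effect_allele",
    "Other_Allele": "other_allele",
}


def fix_header(text, delimiter="\t"):
    """Return `text` with the column names on the first line normalised.

    Only the header line is touched; all data rows are passed through as-is.
    """
    header, newline, body = text.partition("\n")
    columns = [HEADER_FIXES.get(column, column) for column in header.split(delimiter)]
    return delimiter.join(columns) + newline + body


# Made-up sample content, for illustration only.
sample = "MarkerName\tEffect_Allele\tOther_Allele\tPvalue\nrs1\tA\tG\t0.5\n"
print(fix_header(sample), end="")
```

Columns that are not in the mapping are left unchanged, so the same function can be run on both unzipped files.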
Please fix the file and re-queue it for harmonisation.
When unzipped, two files appeared. I fixed the header for both, but the metadata yaml files are missing. Since the study is old, the data is not available from the ingest api.
- [x] Create metadata yaml files that contain GCST ID and genome assembly for the harmonisation.
- [x] Submit the harmonisation
Submitted to codon with the submission script /hps/software/users/parkinso/spot/gwas/prod/scripts/cron/start_harmonisation_pre_standard_goci1226.sh
Job <92779129> is submitted to default queue <standard>.
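A minimal sketch of generating such a stub metadata file, for reference. The key `genome_assembly` appears later in this thread; the key name `gwas_id` for the GCST ID is an assumption, and the assembly value passed in below is illustrative only.

```python
def render_metadata(gcst_id, genome_assembly):
    """Render a two-field metadata stub as YAML text.

    `gwas_id` is an assumed key name for the GCST ID; check it against the
    metadata schema the harmonisation pipeline expects.
    """
    fields = {"gwas_id": gcst_id, "genome_assembly": genome_assembly}
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


# Illustrative values; the assembly must match the input file's build.
print(render_metadata("GCST002047", "GRCh38"), end="")
```

The values are plain strings without special characters, so writing the YAML by hand avoids a dependency on a YAML library.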
- [x] Compare two studies in FTP ("EduYears" and "College")
- [x] If they are the same, replace the zipped file with the correct one
  - Moved `SSGAC_College_Rietveld2013_publicrelease.txt` to `GCST002001-GCST003000/GCST002047` and removed the zipped file
- [x] Harmonise with `pre_gwas_ssf`
- [x] Also, upload only one harmonised file with the correct title
- [x] Use `variant_id` in the header (as per `pre_gwas_ssf` standard)
- [x] Rename the file with their `GCST_ID.txt` for harmonisation
Using the script at /hps/software/users/parkinso/spot/gwas/prod/scripts/cron/start_harmonisation_pre_standard_goci1226.sh
Job <93028342> is submitted to default queue <standard>.
Added `chromosome` and `base_pair_location` columns filled with NA and submitted again.
Job <93126943> is submitted to default queue <standard>.
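For the record, adding the two position columns can be sketched like this. It is a minimal example assuming a tab-separated file that fits in memory; the sample rows are made up.

```python
def add_na_columns(text, new_columns=("chromosome", "base_pair_location"), delimiter="\t"):
    """Append any missing columns to a delimited table, filled with NA."""
    lines = text.splitlines()
    if not lines:
        return text
    header = lines[0].split(delimiter)
    # Only add columns that are not already present in the header.
    missing = [column for column in new_columns if column not in header]
    output = [delimiter.join(header + missing)]
    for line in lines[1:]:
        output.append(delimiter.join(line.split(delimiter) + ["NA"] * len(missing)))
    return "\n".join(output) + "\n"


# Made-up sample content, for illustration only.
sample = "variant_id\teffect_allele\nrs1\tA\n"
print(add_na_columns(sample), end="")
```

Running it a second time on its own output changes nothing, because columns already in the header are skipped.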
Harmonised files, metadata files, running logs and .tbi files are copied to the respective harmonised directories.
- http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST002001-GCST003000/GCST002047/harmonised/
- http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST008001-GCST009000/GCST008396/harmonised/
This is confirmed done, @earlEBI will double check
Reopening as the yaml files do not look quite right. (is_harmonised = false). Also, the .tbi files should be renamed .tbi.gz.
Fixed the following fields:
genome_assembly: GRCh38
is_harmonised: true
is_sorted: true
@earlEBI Could you check again please? Thanks.
@earlEBI please confirm
The yamls are only five lines long. Should they not contain more detail?
I thought that's because it's a very old submission. And also they are not available in the ingest api. @sajo-ebi
https://www.ebi.ac.uk/gwas/ingest/api/v2/studies/GCST008396 https://www.ebi.ac.uk/gwas/ingest/api/v2/studies/GCST002047
Old studies are meant to be retrieved from the public REST API: https://www.ebi.ac.uk/gwas/rest/api/studies/GCST008396
TODO: Update sumstats tools so that we fetch the REST API if Ingest API does not return any data.
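A rough sketch of what that fallback could look like, assuming both endpoints return JSON and that an error or empty response from the Ingest API means "no data". The function names are hypothetical and this is not the sumstats tools code; only the two URL patterns come from this thread.

```python
import json
import urllib.error
import urllib.request

INGEST_URL = "https://www.ebi.ac.uk/gwas/ingest/api/v2/studies/{gcst_id}"
REST_URL = "https://www.ebi.ac.uk/gwas/rest/api/studies/{gcst_id}"


def http_get_json(url):
    """Return the decoded JSON body, or None on an error or empty response."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            body = response.read()
    except (urllib.error.URLError, TimeoutError):
        return None
    if not body.strip():
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None


def fetch_study(gcst_id, get_json=http_get_json):
    """Try the Ingest API first, then fall back to the public REST API."""
    for template in (INGEST_URL, REST_URL):
        data = get_json(template.format(gcst_id=gcst_id))
        if data:  # treat None and empty payloads as "no data"
            return data
    return None
```

Passing `get_json` as a parameter keeps the fallback order testable without network access. Note that the two APIs may not use the same field names, so the metadata mapping would still need to handle both shapes.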
Harmonization done, but yaml file has some missing data.
Regenerated YAML files for GCST002047 and GCST008396. Expect them in the public ftp in 2 days.
YAML files are in staging FTP but not in public FTP. They did not sync because our ftp-sync code only picks up files whose names start with 'GCST*'. See https://github.com/EBISPOT/gwas-utils/blob/6fbf2c7a6d6fdfc79e0b8c2d1e74539bb1073303/ftpSummaryStatsScript/ftp_sync.py#L186-L188
Will rename the files; expect them in the public ftp in 2 days.
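To illustrate the effect of that prefix filter (this is a sketch of the behaviour described above, not the code in `ftp_sync.py`; the file names are hypothetical):

```python
from fnmatch import fnmatchcase


def files_to_sync(filenames, pattern="GCST*"):
    """Keep only the file names that match the sync pattern."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Hypothetical file names, for illustration only.
candidates = [
    "GCST002047.h.tsv.gz",
    "SSGAC_College_Rietveld2013_publicrelease.txt-meta.yaml",
]
print(files_to_sync(candidates))
```

Any file whose name does not begin with `GCST` is silently skipped, which is why files named after the original upload stay in staging.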
Agreed to keep original files as per old guidelines