pgsc_calc
Error: results missing for single sample
Description of the bug
In the current dev build, the report is generated, but every score column except SUM is NA:
sampleset IID PGS SUM Z_MostSimilarPop Z_norm1 Z_norm2
<chr> <chr> <chr> <dbl> <lgl> <lgl> <lgl>
1 testfile testfile.txt PGS00075… -0.220 NA NA NA
2 reference HG00096 PGS00075… 0.258 NA NA NA
3 reference HG00097 PGS00075… 0.0698 NA NA NA
4 reference HG00099 PGS00075… 0.0588 NA NA NA
5 reference HG00100 PGS00075… 0.382 NA NA NA
6 reference HG00101 PGS00075… 0.525 NA NA NA
7 reference HG00102 PGS00075… -0.327 NA NA NA
8 reference HG00103 PGS00075… -0.285 NA NA NA
9 reference HG00105 PGS00075… 0.214 NA NA NA
10 reference HG00106 PGS00075… -1.13 NA NA NA
This is in both the .html report as well as the raw testfile_pgs.txt.gz file.
In that file, only the SUM column is populated.
However, with the next most recent version (alpha 4), all columns are correctly populated, even though the run technically fails at the report-generation step (https://github.com/PGScatalog/pgsc_calc/issues/242).
I know the build is dev and not released yet, but it might happen on alpha 5 too (I'm unable to test it because of the _vcf filename error).
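For anyone who wants to confirm the same symptom on their own output, here is a minimal sketch that lists the columns of a score file in which every value is missing. It assumes the `*_pgs.txt.gz` file is tab-delimited with a header row and that missing values are written as `NA` or left empty. The function names are hypothetical helpers, not part of pgsc_calc.

```python
import csv
import gzip


def empty_columns(handle, missing=("", "NA", "NaN", "nan")):
    """Return the names of columns in which every value is missing.

    Assumes a tab-delimited file with a header row (an assumption about
    the score file layout, not something pgsc_calc guarantees).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    # Start by treating every column as empty, then drop a column as soon
    # as a non-missing value is seen in it.
    candidates = set(header)
    for row in reader:
        candidates -= {
            name for name in candidates if (row[name] or "") not in missing
        }
    # Report the surviving columns in their original file order.
    return [name for name in header if name in candidates]


def empty_columns_in_gz(path):
    """Same check, reading a gzip-compressed file from disk."""
    with gzip.open(path, "rt", newline="") as handle:
        return empty_columns(handle)
```

Calling `empty_columns_in_gz("testfile_pgs.txt.gz")` on the file described above should list every column except `sampleset`, `IID`, `PGS` and `SUM` if the problem is present.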
Command used and terminal output
nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r v2.0.0-alpha.4e
Or
nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r dev
Relevant files
No response
System information
Ubuntu, Docker, Singularity, current Nextflow
Thanks for the bug report! Sorry, I can't reproduce on the dev branch. Here's what I tried:
$ cd /path/to/pgsc_calc
$ rm -r work results # guarantee a fresh run
$ nextflow run main.nf -profile docker,arm \
--run_ancestry ../pgsc_1000G_v1.tar.zst \
--input ../hgdp/split/samplesheet.csv \
--pgs_id PGS000758 \
--target_build GRCh38
$ head <(gzcat results/hgdp/score/hgdp_pgs.txt.gz) | column -t
sampleset IID PGS SUM Z_MostSimilarPop Z_norm1 Z_norm2 percentile_MostSimilarPop
hgdp HGDP00001 PGS000758_hmPOS_GRCh38 -0.3588901000000001 1.664053980073892 0.8584426674256617 0.8134837998289269 95.49902152641879
hgdp HGDP00003 PGS000758_hmPOS_GRCh38 -0.40938197 1.573172743793694 0.7932776666255601 0.7547724248426887 94.71624266144813
hgdp HGDP00005 PGS000758_hmPOS_GRCh38 0.0158892699999999 2.338626192945269 1.6154600095825224 1.527703433039602 98.4344422700587
hgdp HGDP00007 PGS000758_hmPOS_GRCh38 -0.9636511 0.5755336433533662 -0.2936306959068118 -0.2642672147215713 72.40704500978474
I noticed --min_overlap is 0 on your run. What kind of variant matching rates do you normally get?
@smlmbrt is the expert (and out of office 🌴 until next week), but perhaps low variant match rates could contribute to the NA values. Some changes were made to the ancestry normalisation steps in the most recent release to handle low-variance cases.
Thanks for testing it. I'll try again. The match rates were high (99.x%) in the alpha 4 run, and the genome is imputed (between 20 and 30 million variants).
I tried again after creating a clean new set up, and got the same results.
$ nextflow run pgscatalog/pgsc_calc -profile singularity --input \
/home/ubuntu/custom/dev/ca/samplesheet.csv --pgs_id PGS000758 \
--target_build GRCh37 --min_overlap 0.0 --run_ancestry \
/home/ubuntu/custom/data/pgsc_1000G_v1.tar.zst -c \
/home/ubuntu/custom/references/custom.config -r dev
Version 2.0.0-alpha.5
| | reference | n target | N variants in panel | n (matched) | % matched |
|---|---|---|---|---|---|
| 1 | 1000G | 27904794 | 85277655 | 27904796 | 32.72 |
| Sampleset | Scoring file | Number of variants | Passed matching | Match % | Total matched | Total unmatched |
|---|---|---|---|---|---|---|
| newautosomal | PGS000758_hmPOS_GRCh37 | 33938 | TRUE | 99.2 | 33668 | 270 |
sampleset IID PGS SUM Z_MostSimilarPop Z_norm1 Z_norm2
<chr> <chr> <chr> <dbl> <lgl> <lgl> <lgl>
1 newautosomal autosomal.txt PGS00075… -0.220 NA NA NA
2 reference HG00096 PGS00075… 0.258 NA NA NA
3 reference HG00097 PGS00075… 0.0698 NA NA NA
4 reference HG00099 PGS00075… 0.0588 NA NA NA
5 reference HG00100 PGS00075… 0.382 NA NA NA
6 reference HG00101 PGS00075… 0.525 NA NA NA
7 reference HG00102 PGS00075… -0.327 NA NA NA
8 reference HG00103 PGS00075… -0.285 NA NA NA
9 reference HG00105 PGS00075… 0.214 NA NA NA
10 reference HG00106 PGS00075… -1.13 NA NA NA
# ℹ 2,481 more rows
# ℹ 1 more variable: percentile_MostSimilarPop <lgl>
I think @nebfield is right: this has probably triggered this exception, which should only be applied to the target when there are more than 3 samples: https://github.com/PGScatalog/pgscatalog_utils/blob/b5962bf5f12bb2aba9d51a3c569a0d831072ecf0/pgscatalog_utils/ancestry/tools.py#L250-L253
@smlmbrt Great! I'll try a test with more samples sometime.
Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?
My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.
Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).
> Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?

The percentiles and z-scores for the most similar population are not normalised; they just use that population as the reference distribution.

> My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.

Correct.

> Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).

I think this will depend on your use case. The reason for using reference populations is that the mean and variance of the PGS distribution depend on allele frequency and LD. If these are unmatched, then an individual's relative place in a distribution will be incorrect.
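To make the idea of placing a SUM within a reference distribution concrete, here is a minimal sketch that computes a z-score and an empirical percentile for one target score against a list of reference scores. It is not the pgsc_calc implementation: the function name is hypothetical, and the use of the sample standard deviation and of an "at or below" percentile are assumptions. Per the caveat above, the reference scores passed in should come from a population that matches the individual.

```python
from statistics import mean, stdev


def z_and_percentile(target_sum, reference_sums):
    """Place one target SUM within a reference distribution of SUM values.

    Returns (z_score, percentile). Hypothetical helper for illustration,
    not the method used by pgsc_calc.
    """
    if len(reference_sums) < 2:
        raise ValueError("need at least two reference scores")
    mu = mean(reference_sums)
    sd = stdev(reference_sums)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("reference scores have zero variance")
    z_score = (target_sum - mu) / sd
    # Empirical percentile: share of reference scores at or below the target.
    at_or_below = sum(1 for score in reference_sums if score <= target_sum)
    percentile = 100.0 * at_or_below / len(reference_sums)
    return z_score, percentile
```

In practice `reference_sums` would be the SUM values of the `reference` rows belonging to one population, which requires population labels for the reference samples from elsewhere in the pipeline output.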