pgsc_calc Error: results missing for single sample

Description of the bug

In the current dev build, the report is generated, but every column except SUM is empty (NA):

   sampleset    IID           PGS           SUM Z_MostSimilarPop Z_norm1 Z_norm2
   <chr>        <chr>         <chr>       <dbl> <lgl>            <lgl>   <lgl>  
 1 testfile     testfile.txt  PGS00075… -0.220  NA               NA      NA     
 2 reference    HG00096       PGS00075…  0.258  NA               NA      NA     
 3 reference    HG00097       PGS00075…  0.0698 NA               NA      NA     
 4 reference    HG00099       PGS00075…  0.0588 NA               NA      NA     
 5 reference    HG00100       PGS00075…  0.382  NA               NA      NA     
 6 reference    HG00101       PGS00075…  0.525  NA               NA      NA     
 7 reference    HG00102       PGS00075… -0.327  NA               NA      NA     
 8 reference    HG00103       PGS00075… -0.285  NA               NA      NA     
 9 reference    HG00105       PGS00075…  0.214  NA               NA      NA     
10 reference    HG00106       PGS00075… -1.13   NA               NA      NA

This is in both the .html report as well as the raw testfile_pgs.txt.gz file.

In that file, only the SUM column is populated.
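
A quick way to confirm this from the raw file is to list which columns contain no usable values. The sketch below assumes the score file is tab-separated with a header row and that missing values are written as an empty string, `NA`, or `nan`. The helper name `empty_columns` is hypothetical and not part of pgsc_calc.

```python
import csv
import gzip


def empty_columns(path, delimiter="\t", missing=("", "NA", "NaN", "nan")):
    """Return the names of columns that hold only missing values."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        # Start by treating every column as empty, then drop a column
        # as soon as one non-missing value is seen in it.
        candidates = list(reader.fieldnames or [])
        for row in reader:
            candidates = [
                name for name in candidates
                if (row[name] or "").strip() in missing
            ]
            if not candidates:
                break  # every column has at least one real value
    return candidates
```

Calling `empty_columns("testfile_pgs.txt.gz")` (adjust the path to wherever the results were written) should return the Z-score and percentile column names if the problem described here is present, and an empty list otherwise.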

However, with the next most recent version (alpha 4), all columns are correctly populated, even though the run technically fails at the report-making step (https://github.com/PGScatalog/pgsc_calc/issues/242).

I know the build is dev and not released yet, but it might happen on alpha 5 too (I'm unable to test it because of the _vcf filename error).

Command used and terminal output

nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r v2.0.0-alpha.4e

Or

nextflow run pgscatalog/pgsc_calc -profile singularity --input samplesheet.csv --pgs_id PGS000758 --target_build GRCh37 --min_overlap 0.0 --run_ancestry pgsc_1000G_v1.tar.zst -c custom.config -r dev

Relevant files

No response

System information

Ubuntu, Docker, Singularity, current Nextflow

Apr 18 '24 23:04 Fiwx

Thanks for the bug report! Sorry, I can't reproduce on the dev branch. Here's what I tried:

$ cd /path/to/pgsc_calc
$ rm -r work results  # guarantee a fresh run
$ nextflow run main.nf -profile docker,arm \
    --run_ancestry ../pgsc_1000G_v1.tar.zst \
    --input ../hgdp/split/samplesheet.csv \
    --pgs_id PGS000758 \
    --target_build GRCh38
$ head <(gzcat results/hgdp/score/hgdp_pgs.txt.gz) | column -t
sampleset  IID        PGS                     SUM                  Z_MostSimilarPop      Z_norm1               Z_norm2              percentile_MostSimilarPop
hgdp       HGDP00001  PGS000758_hmPOS_GRCh38  -0.3588901000000001  1.664053980073892     0.8584426674256617    0.8134837998289269   95.49902152641879
hgdp       HGDP00003  PGS000758_hmPOS_GRCh38  -0.40938197          1.573172743793694     0.7932776666255601    0.7547724248426887   94.71624266144813
hgdp       HGDP00005  PGS000758_hmPOS_GRCh38  0.0158892699999999   2.338626192945269     1.6154600095825224    1.527703433039602    98.4344422700587
hgdp       HGDP00007  PGS000758_hmPOS_GRCh38  -0.9636511           0.5755336433533662    -0.2936306959068118   -0.2642672147215713  72.40704500978474

I noticed --min_overlap is 0 on your run. What kind of variant matching rates do you normally get?

@smlmbrt is the expert (and out of office 🌴 until next week), but perhaps low variant match rates could contribute to the NA values. In the most recent release, some changes were made to the ancestry normalisation steps to handle low-variance cases.

Apr 19 '24 11:04 nebfield

Thanks for testing it. I'll try again. The match rates were high (99.x%) in the alpha 4 run, and the genome is imputed (between 20 and 30 million variants).

Apr 19 '24 15:04 Fiwx

I tried again after creating a clean new setup, and got the same results.

$ nextflow run pgscatalog/pgsc_calc -profile singularity --input \
    /home/ubuntu/custom/dev/ca/samplesheet.csv --pgs_id PGS000758 \
    --target_build GRCh37 --min_overlap 0.0 --run_ancestry \
    /home/ubuntu/custom/data/pgsc_1000G_v1.tar.zst -c \
    /home/ubuntu/custom/references/custom.config -r dev

Version 2.0.0-alpha.5

reference	n target	N variants in panel	n (matched)	% matched
1	1000G	27904794	85277655	27904796	32.72

Sampleset	Scoring file	Number of variants	Passed matching	Match %	Total matched	Total unmatched
newautosomal	PGS000758_hmPOS_GRCh37	33938	TRUE	99.2	33668	270

   sampleset    IID           PGS           SUM Z_MostSimilarPop Z_norm1 Z_norm2
   <chr>        <chr>         <chr>       <dbl> <lgl>            <lgl>   <lgl>  
 1 newautosomal autosomal.txt PGS00075… -0.220  NA               NA      NA     
 2 reference    HG00096       PGS00075…  0.258  NA               NA      NA     
 3 reference    HG00097       PGS00075…  0.0698 NA               NA      NA     
 4 reference    HG00099       PGS00075…  0.0588 NA               NA      NA     
 5 reference    HG00100       PGS00075…  0.382  NA               NA      NA     
 6 reference    HG00101       PGS00075…  0.525  NA               NA      NA     
 7 reference    HG00102       PGS00075… -0.327  NA               NA      NA     
 8 reference    HG00103       PGS00075… -0.285  NA               NA      NA     
 9 reference    HG00105       PGS00075…  0.214  NA               NA      NA     
10 reference    HG00106       PGS00075… -1.13   NA               NA      NA   
# ℹ 2,481 more rows
# ℹ 1 more variable: percentile_MostSimilarPop <lgl>

Apr 20 '24 21:04 Fiwx

I think @nebfield is right: it has probably triggered this exception, which should only be applied to the target when there are more than 3 samples: https://github.com/PGScatalog/pgscatalog_utils/blob/b5962bf5f12bb2aba9d51a3c569a0d831072ecf0/pgscatalog_utils/ancestry/tools.py#L250-L253

Apr 22 '24 08:04 smlmbrt

@smlmbrt Great! I'll try a test with more samples sometime.

Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?

My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.

Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).

Apr 22 '24 10:04 Fiwx

> Also, is it possible to recover z-scores and percentiles which are not normalized for ancestry from only the values in SUM?

The percentiles and Z-scores for the most similar population are not normalised; they simply use that population as the reference distribution.

> My understanding is that SUM is calculated only on variants that are a subset of the variants in the submitted sample(s), the reference panel (e.g. 1000G), and the scorefile.

Correct.

> Since the variants are the same between the reference and submitted samples, I assume it would be a "fair" comparison (e.g., no normalization for matched variant number needed).

I think this will depend on your use case. The reason for using reference populations is that the mean and variance of the PGS distribution are determined by allele frequency and LD. If these are unmatched, then an individual's relative place in the distribution will be incorrect.
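
As a rough illustration of using a reference group as the distribution, the sketch below computes a z-score and an empirical percentile for one target SUM against a list of reference SUM values. It is not the pgsc_calc implementation: the function name, the use of the sample standard deviation, and the "share of reference scores at or below the target" percentile definition are all assumptions. The numbers are the nine reference rows visible in the report excerpt earlier in this thread, used only as toy input; a real comparison would use every reference sample from the chosen population.

```python
from statistics import mean, stdev


def z_and_percentile(target_sum, reference_sums):
    """Place one score within a reference distribution of scores."""
    mu = mean(reference_sums)
    sd = stdev(reference_sums)  # sample SD (assumption, not confirmed by the pipeline)
    if sd == 0:
        raise ValueError("reference scores have zero variance")
    z = (target_sum - mu) / sd
    # Empirical percentile: share of reference scores at or below the target.
    at_or_below = sum(1 for s in reference_sums if s <= target_sum)
    percentile = 100.0 * at_or_below / len(reference_sums)
    return z, percentile


# Toy input: only the first nine reference rows shown in the report excerpt.
reference = [0.258, 0.0698, 0.0588, 0.382, 0.525, -0.327, -0.285, 0.214, -1.13]
z, pct = z_and_percentile(-0.220, reference)
print(f"z = {z:.3f}, percentile = {pct:.1f}")
```

The caveat above still applies: if the allele frequencies and LD of the reference group do not match the individual's ancestry, the resulting z-score and percentile will be misleading.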

Apr 23 '24 09:04 smlmbrt
