nanopolish icon indicating copy to clipboard operation
nanopolish copied to clipboard

genotyping missing as ref

Open webbchen opened this issue 7 years ago • 7 comments

Dear Jared

I'm genotyping some fungi with nanopolish variants using a vcf file with known variants as --genotypes:

python /path/to/software/nanopolish-0.9.2/nanopolish/scripts/nanopolish_makerange.py --overlap-length 0 /path/to/reference.fasta | parallel --results sample_nanopolish.results -P 3 nanopolish variants -m 0.8 -d 10 --genotype /path/to/SNPs_nanopore_merged.vcf -o sample_nanopolish_final_{1}.vcf -w {1} -p 1 -r /path/to/reads/sample_albacore_trimmed.fastq -b /path/to/alignments/sample_albacore_to_ref_bwa-mem_sorted.bam -g /path/to/reference.fasta -t 6 and I'd expect to get GT being 0 for the site being equal to the reference, 1 if it's the alternative allele and . if it's missing due to no alignment, for example. The fungus in question frequently loses chromosomes or parts thereof and also some chromosomes are rearranged heavily and have bits added and missing, so there are lots of regions which won't have any alignments. The calls which should be missing do come out as 0 though, which is wrong, like in the example below which straddles a gap in the alignment:

`chr1.1 62654 . G A 0.0 PASS TotalReads=2;AlleleCount=0;SupportFraction=0.531566 GT 0 chr1.1 62657 . T C 0.0 PASS TotalReads=2;AlleleCount=0;SupportFraction=0.488378 GT 0 chr1.1 62724 . C G 0.0 PASS TotalReads=0;AlleleCount=0;SupportFraction=-nan GT 0 chr1.1 62758 . G A 0.0 PASS TotalReads=0;AlleleCount=0;SupportFraction=-nan GT 0 chr1.11 62790 . C T 0.0 PASS TotalReads=0;AlleleCount=0;SupportFraction=-nan GT 0

(...)

chr1.1 79986 . G A 0.0 PASS TotalReads=0;AlleleCount=0;SupportFraction=-nan GT 0 chr1.1 80030 . T C 0.0 PASS TotalReads=0;AlleleCount=0;SupportFraction=-nan GT 0 chr1.1 80131 . C T 0.0 PASS TotalReads=0;AlleleCount=0;SupportFraction=-nan GT 0 chr1.1 80328 . T C 0.0 PASS TotalReads=1;AlleleCount=0;SupportFraction=0.996983 GT 0 chr1.1 81011 . T C 20.5 PASS TotalReads=2;AlleleCount=1;SupportFraction=0.999939 GT 1 chr1.1 81138 . A G 0.0 PASS TotalReads=2;AlleleCount=0;SupportFraction=0.962576 GT 0 ` Is there any way in which I can amend that behaviour?

Kind regards,

A. Webb

webbchen avatar Aug 22 '18 10:08 webbchen

Hi,

Thanks for reporting this. I'm on vacation for the next 10 days but I'll fix this when I get back. I will probably just output a genotype of "." (missing) when TotalReads=0.

Jared

jts avatar Aug 22 '18 13:08 jts

Hi Jared

Thanks a lot for your fast reply! I've got two more suggestions for nanopolish variants:

  • adding the read group ID from the bam file (where present) rather than "sample" to the resulting vcf file would be very useful
  • throwing an error if the file after the --genotype parameter doesn't exist instead of printing empty output

Kind regards,

A. Webb

webbchen avatar Aug 23 '18 12:08 webbchen

Hi Jared

has that issue been addressed in your new release?

Kind regards A.Webb

webbchen avatar Sep 17 '18 11:09 webbchen

Hi,

Sorry but it hasn't. I'm travelling again now so it will take a little more time.

Jared

jts avatar Sep 17 '18 12:09 jts

Commit 5a4939d will emit . for the genotype when TotalReads is 0

jts avatar Sep 25 '18 17:09 jts

The --genotype argument appears to be handled properly:

variants: option '--genotype' requires an argument

jts avatar Sep 25 '18 17:09 jts

I'd like to add full support for read groups in the future but I can't do it right now. I'm going to leave the genotype column as "sample" for now.

jts avatar Sep 25 '18 17:09 jts