pandora
How to better represent dense regions with VCF
This question came out of https://github.com/rmcolq/pandora/issues/260. A VCF record produced by pandora from a very dense region can be downloaded here. The record has readability and interpretability problems (e.g. the record alone is 456k characters long), which makes it hard for both humans and other tools to consume. IIRC, pandora only outputs alleles that are present in at least one sample. In this case we have 20 samples and 450 alleles, which makes it really hard to understand what is being genotyped. My guess is that this is due to small variants that are close together and got merged into a single site. I think we need to debug what is actually happening here. Maybe the ML path does not go through these small variants, so they all had to be linearised and described as ALTs in this record. Or maybe the default nesting level for the PRG was too low to describe such a dense region, in which case increasing the nesting level might let us break the region up. There may be other causes too.
Another option for this case is simply to output only the genotyped alleles (though I am not sure this is good practice).
Happy to hear whether this is indeed an issue, and about any alternative solutions.
I think outputting only genotyped alleles by default would make a lot of sense. I am currently always normalising the pandora VCFs to get around this. We could then have a flag to output all alleles (as we do currently) in the cases where people want that extra info.
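To make "output only genotyped alleles" concrete, here is a minimal Python sketch of the idea applied to a single VCF data line. It is not pandora's code: the function name `prune_unused_alts` is hypothetical, and it assumes a tab-separated record with a `GT` key in the FORMAT column. It only rewrites ALT and GT, so any per-allele INFO or FORMAT values (for example per-allele coverage) would need the same subsetting and are left untouched here.

```python
import re


def prune_unused_alts(line: str) -> str:
    """Drop ALT alleles that no sample's GT refers to and renumber the GT indices.

    Hypothetical helper for illustration only; it handles ALT and GT and
    leaves every other field as-is.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return line.rstrip("\n")  # no FORMAT/sample columns, nothing to prune

    alts = fields[4].split(",")
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return line.rstrip("\n")
    gt_pos = format_keys.index("GT")

    # Collect every allele index that appears in at least one genotype.
    used = set()
    for sample in fields[9:]:
        gt = sample.split(":")[gt_pos]
        used.update(int(idx) for idx in re.findall(r"\d+", gt))

    # Map old allele indices to new ones; index 0 (REF) is always kept.
    kept = sorted(idx for idx in used if idx > 0)
    remap = {0: 0}
    for new_idx, old_idx in enumerate(kept, start=1):
        remap[old_idx] = new_idx

    # Rewrite ALT; "." is the VCF placeholder when no ALT allele remains.
    fields[4] = ",".join(alts[idx - 1] for idx in kept) or "."

    # Rewrite each sample's GT, preserving separators and missing calls.
    for col in range(9, len(fields)):
        values = fields[col].split(":")
        values[gt_pos] = re.sub(
            r"\d+", lambda m: str(remap[int(m.group())]), values[gt_pos]
        )
        fields[col] = ":".join(values)

    return "\t".join(fields)


record = "chr1\t10\t.\tA\tC,G,T\t.\t.\t.\tGT\t0\t3\t.\t3"
print(prune_unused_alts(record))
```

In this toy record only allele 3 (`T`) is ever called, so ALT shrinks from `C,G,T` to `T` and the calls are renumbered from `3` to `1`. As far as I know, `bcftools view --trim-alt-alleles` does a similar trimming on existing VCFs, which may be a useful reference for the expected behaviour.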