`vg call` strips path/contig info from VCF
1. What were you trying to do?
- Produce a VCF from a mapped GAM
vg call graph.gbz --pack graph.pack --snarls graph.snarls --genotype-snarls --all-snarls --gbz-translation --gbz > example.vcf
2. What did you want to happen?
- I expected the VCF #CHROM column to retain the full contig/path name (which is in PanSN format), matching the behaviour of vg deconstruct, e.g.:
#CHROM
simChimp#0#simChimp.chr6
3. What actually happened?
- It stripped the sample and haplotype fields from the contig name, leaving only the bare contig name (shown below)
- This means I can't pipe the VCF into bcftools for normalisation against reference.fasta, because the contig names no longer match
#CHROM
simChimp.chr6
5. What data and command can the vg dev team use to make the problem happen?
I did this using the simChimp example from Minigraph-Cactus, but I assume any GBZ with PanSN contig naming would reproduce it.
6. What does running vg version say?
v1.61.0 "Plodio"
I think this may be the same issue as #4442. I assumed I could run vg call without specifying a reference sample with -S (for a graph with only one reference sample), since according to the -p help text it should default to all reference paths:
-p, --ref-path NAME Reference path to call on (multipile allowed. defaults to all paths)
Cheers for any advice!
Yeah, it looks like vg call will only add the PanSN prefix if it thinks there can be ambiguity between different samples in the VCF. Probably a good idea to add an option (like deconstruct) to let the user force the issue, but in the meantime you'll have to use sed or something like that to add it yourself...
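A minimal sketch of the sed workaround mentioned above, not something vg provides: the function name `add_pansn_prefix` is hypothetical, and it assumes the `##contig` header lines carry the same stripped names as the #CHROM column. The prefix `simChimp#0#` is taken from the example in this issue; a prefix containing `/` or `&` would need escaping before being passed to sed.

```shell
# Hypothetical helper: prepend a PanSN prefix (e.g. 'simChimp#0#') to contig
# names in an uncompressed VCF read from stdin. Assumes the prefix contains
# no '/' or '&', which are special in a sed replacement.
add_pansn_prefix() {
    prefix="$1"
    # First expression: rewrite the ##contig=<ID=...> header lines.
    # Second expression: rewrite the first column of every non-header line.
    sed -E \
        -e "s/^(##contig=<ID=)/\1${prefix}/" \
        -e "/^[^#]/s/^/${prefix}/"
}

# Usage (file names are the ones from this issue, output name is made up):
#   add_pansn_prefix 'simChimp#0#' < example.vcf > example.pansn.vcf
```

The header rewrite matters because bcftools generally expects the `##contig` IDs and the #CHROM values to agree.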
Thanks Glenn! That sounds like it would be a pretty useful feature, as I'm not sure how to force sample ambiguity when only one mapped sample is handled per task.
In the meantime I will simplify the contig naming in the FASTA to only the contig name.
Cheers for your help
P.S. I realise my suggestion wouldn't work, as of course the PanSN notation is recreated in the pipeline. I'll run sed on the reference FASTA as you suggested
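For the FASTA side, a rough sketch of what that sed step might look like, assuming the headers follow the `sample#haplotype#contig` PanSN layout seen in this issue. The function name `strip_pansn_prefix` is made up for illustration, and headers that do not match the pattern are left unchanged.

```shell
# Hypothetical helper: drop the leading 'sample#haplotype#' fields from FASTA
# header lines read from stdin, so '>simChimp#0#simChimp.chr6' becomes
# '>simChimp.chr6'. Sequence lines are passed through untouched.
strip_pansn_prefix() {
    sed -E '/^>/s/^>[^#]+#[^#]+#/>/'
}

# Usage (output file name is made up):
#   strip_pansn_prefix < reference.fasta > reference.stripped.fasta
```

Any existing `.fai` index would need to be rebuilt for the renamed file.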