What is the best way to generate VCF for vg-toolkit given raw sequences

Open sumit-walia opened this issue 2 years ago • 3 comments

I am trying to generate a variation graph (vg) from raw SARS-CoV-2 sequences (~16k sequences). What would be the best way to generate a VCF from these raw sequences? And does the method still work as the data scales up?

sumit-walia avatar Jul 06 '22 00:07 sumit-walia

We used to support vg msga for making graphs from sequences, but it doesn't scale that high in sequence count.

You could try making a graph with Minigraph, or Minigraph and Cactus together, though I don't know if that would scale enough either. @glennhickey might be able to guess.

You could also throw @ekg's PGGB tool at the problem.

If you want to go sequences -> VCF -> graph... I don't know what tool you would use there. You might want your own tool based on individual pairwise alignments against whatever you are using for your reference.
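As a rough illustration of the "own tool based on pairwise alignments" idea, here is a minimal sketch that turns one gapped pairwise alignment (reference row and sample row, as produced by whatever aligner you choose) into VCF-style variant tuples. The function name `pairwise_to_variants` is hypothetical, and the choices to skip leading gaps and to report trailing gaps as deletions are assumptions of this sketch, not anything vg prescribes.

```python
def pairwise_to_variants(ref_aln, qry_aln):
    """Return (pos, ref, alt) tuples with 1-based, VCF-style positions.

    Both inputs are rows of the same pairwise alignment, with '-' as gap.
    Indels are anchored on the preceding reference base, as VCF expects.
    """
    ref_aln, qry_aln = ref_aln.upper(), qry_aln.upper()
    if len(ref_aln) != len(qry_aln):
        raise ValueError("alignment rows must have equal length")

    variants = []
    ref_pos = 0      # number of reference bases consumed so far
    prev_ref = None  # last reference base seen, used to anchor indels
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if r == "-" and q == "-":
            i += 1
        elif r == "-":
            # Insertion: collect the whole run of reference gaps.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            inserted = qry_aln[i:j].replace("-", "")
            # Assumption: insertions before the first reference base are skipped.
            if prev_ref is not None and inserted:
                variants.append((ref_pos, prev_ref, prev_ref + inserted))
            i = j
        elif q == "-":
            # Deletion: collect the run of sample gaps over reference bases.
            j = i
            while j < n and qry_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            deleted = ref_aln[i:j]
            # Assumption: leading gaps are treated as unsequenced, not deleted.
            if prev_ref is not None:
                variants.append((ref_pos, prev_ref + deleted, prev_ref))
            ref_pos += len(deleted)
            prev_ref = deleted[-1]
            i = j
        else:
            if r != q:
                variants.append((ref_pos + 1, r, q))
            ref_pos += 1
            prev_ref = r
            i += 1
    return variants
```

For example, `pairwise_to_variants("ACGT-ACGT", "ACTTGA--T")` gives `[(3, 'G', 'T'), (4, 'T', 'TG'), (5, 'ACG', 'A')]`. Real data would also need a policy for `N` and other ambiguity codes, which this sketch treats as ordinary mismatches.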

I've never seen a VCF with genotypes for 16,000 samples in it, let alone tried to run vg on it. But the GBWT should store haplotypes in sub-linear space, so it might actually work.

If you aren't working with the genotypes but just the variable sites, vg probably ought to handle the VCF just fine as input.
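To make the "just the variable sites" option concrete, here is a minimal sketch that merges per-sample variant lists into a sites-only VCF with no genotype columns. The function name `sites_only_vcf`, the default contig name `ref`, and the input shape (one list of `(pos, ref, alt)` tuples per sample) are assumptions for illustration.

```python
from collections import defaultdict


def sites_only_vcf(per_sample_variants, chrom="ref"):
    """Merge per-sample (pos, ref, alt) tuples into sites-only VCF text.

    Alternate alleles that share the same position and REF allele are
    combined into one multi-allelic record. No sample columns are written.
    """
    alts_by_site = defaultdict(set)
    for variants in per_sample_variants:
        for pos, ref, alt in variants:
            alts_by_site[(pos, ref)].add(alt)

    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by position, then by REF allele, so records come out in order.
    for (pos, ref), alts in sorted(alts_by_site.items()):
        fields = [chrom, str(pos), ".", ref, ",".join(sorted(alts)), ".", ".", "."]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"
```

Because the number of records depends on the number of distinct sites rather than the number of samples, the file stays small even with thousands of input sequences. The contig name has to match the reference FASTA used when building the graph.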

adamnovak avatar Jul 07 '22 20:07 adamnovak

If the sequences are mostly related by simple mutations (e.g. small indels, substitutions), you could also generate a multiple sequence alignment with an external tool and then use it to construct the graph with vg construct -M.
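Before handing an MSA to vg construct -M, it may be worth checking that the file really is an alignment, i.e. that every row has the same length. A minimal sketch, assuming the MSA is in gapped FASTA format; the helper names `read_fasta` and `is_valid_msa` are hypothetical:

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {record name: sequence}."""
    parts_by_name = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts_by_name[name] = []
        elif name is not None:
            parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


def is_valid_msa(records):
    """True if there is at least one row and all rows have equal length."""
    lengths = {len(seq) for seq in records.values()}
    return len(lengths) == 1


msa_text = """>ref
ACGT-ACGT
>sample1
ACTTGA--T
"""
records = read_fasta(msa_text)
print(is_valid_msa(records))
```

Unaligned input (raw sequences of different lengths) fails this check, which is a quick way to catch passing the wrong file.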

jeizenga avatar Jul 07 '22 21:07 jeizenga

> You could try making a graph with Minigraph, or Minigraph and Cactus together, though I don't know if that would scale enough either. @glennhickey might be able to guess.

minigraph-cactus will not scale to 16k sequences. If you have a guide tree, "regular" cactus should work fine, but then you wouldn't get a nice VCF.

glennhickey avatar Jul 28 '22 13:07 glennhickey