
Feature request: Accept jellyfish kmer counts for vg haplotypes

Open JosephLalli opened this issue 1 year ago • 2 comments

For more background see https://github.com/eblerjana/pangenie/issues/62.

Long story short, I'm trying to replicate and make use of the personalized pangenome pipeline described in your recent paper (https://www.biorxiv.org/content/10.1101/2023.12.13.571553v2.full).

When I use PanGenie to genotype the graph created by vg haplotypes from a human 30X Illumina FASTQ dataset, a representative run spends 1484 s of its 1910 s total runtime (about 78%) counting kmers in the reads. PanGenie can accept pre-counted kmer files, but only in Jellyfish2's format. Internally, PanGenie uses the Jellyfish API for kmer management.

KFF files seem difficult for PanGenie to use, since the format does not appear to allow random access. So maybe we could use Jellyfish to count kmers and provide those counts to vg haplotypes? That would avoid having two different tools count the same kmers.

Best, Joe

JosephLalli, Jan 25 '24 17:01

We chose KFF because we wanted to avoid adding yet another major dependency. VG already has too many of them, making the build system fragile.

As for random access, we also need it in vg haplotypes. We simply load the kmer counts into a hash map. On my laptop, that takes ~100 seconds for the counts from 30x reads: 25 seconds for prepopulating the hash map with the kmers we are interested in and 75 seconds for multithreaded reading.
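To make the two phases concrete, here is a minimal Python sketch of the general idea: prepopulate a hash map with the kmers of interest, then let several workers scan chunks of count records and fill in only the keys that already exist. This is not vg's actual implementation. The function name `load_counts`, the chunking scheme, and the toy records are all hypothetical, and the sketch assumes the records have already been decoded into `(kmer, count)` pairs and that each kmer appears at most once in the input.

```python
from concurrent.futures import ThreadPoolExecutor


def load_counts(kmers_of_interest, record_chunks, threads=4):
    """Return {kmer: count} restricted to kmers_of_interest (hypothetical helper)."""
    # Phase 1: prepopulate the table, so the reading phase never inserts new keys.
    counts = dict.fromkeys(kmers_of_interest, 0)

    def consume(chunk):
        # Phase 2: each worker scans its chunk and keeps only known kmers.
        # Assumes each kmer occurs at most once across all chunks.
        for kmer, count in chunk:
            if kmer in counts:
                counts[kmer] = count

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() waits for all workers and re-raises any worker exception.
        list(pool.map(consume, record_chunks))
    return counts


# Toy input standing in for decoded kmer count records (made-up data).
chunks = [
    [("ACGT", 12), ("TTTT", 3)],
    [("GGCA", 7), ("CCCC", 40)],
]
table = load_counts({"ACGT", "GGCA", "AAAA"}, chunks, threads=2)
```

Because the key set is fixed before reading starts, the workers only overwrite existing entries, and kmers that are absent from the reads simply keep a count of 0. Random access afterwards is an ordinary hash lookup such as `table["ACGT"]`.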

jltsiren, Jan 25 '24 18:01

Understood. I agree about the dependencies!

I'll copy your comment to the similar issue I created at PanGenie (https://github.com/eblerjana/pangenie/issues/62). Maybe you and Jana can help each other get behind one kmer ecosystem for pangenome analysis.

Best, Joe

JosephLalli, Jan 25 '24 18:01