Performance cost of kmer-indexing
This is perhaps an interesting datum. Building the PRG (by which I mean "gramtools build") for 1.3kb of the TB genome, combining calls from 50,000 TB, we find that the maximum alphabet number is 420, so about 210 sites, which means roughly 1 site per 6bp. Build output:
"maximum thread count: 1",
"Executing build command",
"Generating integer encoded PRG",
"Number of characters in integer encoded linear PRG: 4643131",
"Maximum alphabet character: 420",
"Generating FM-Index",
"Generating PRG masks",
"Building kmer index (kmer size: 7)",
"Getting all kmers",
"Getting kmer prefix diffs",
"Indexing kmers",
"Total number of unique kmers: 16384",
"",
"",
"Timer report:",
" seconds",
" Encoded PRG 0.2",
" Generate FM-Index 8.69",
"Generating PRG masks 7.04",
" Building kmer index 214.43",
"",
"Total elapsed time: 230.36"
I'm raising this because it takes 8.69s to build the FM-Index but 214.43s to build the kmer index (kmer size 7), which is about 93% of the total build time.
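For reference, the figures above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the variable names are illustrative, and the rule of two alphabet characters per site is the assumption made in the text above, not something verified against the gramtools encoding.

```python
# Figures copied from the issue text and the build log above.
max_alphabet_char = 420   # "Maximum alphabet character: 420"
region_bp = 1300          # the 1.3kb region of the TB genome
kmer_size = 7             # "Building kmer index (kmer size: 7)"
timings = {               # "Timer report", in seconds
    "encoded_prg": 0.2,
    "fm_index": 8.69,
    "prg_masks": 7.04,
    "kmer_index": 214.43,
}
total_elapsed = 230.36    # "Total elapsed time: 230.36"

# Assumption from the text: two alphabet characters per variant site.
n_sites = max_alphabet_char // 2
bp_per_site = region_bp / n_sites

# Number of distinct kmers over {A, C, G, T}; the log reports 16384
# unique kmers, i.e. every possible 7-mer was indexed.
possible_kmers = 4 ** kmer_size

kmer_vs_fm = timings["kmer_index"] / timings["fm_index"]
kmer_share = timings["kmer_index"] / total_elapsed

print(f"sites: {n_sites}, about 1 site per {bp_per_site:.1f}bp")
print(f"possible {kmer_size}-mers: {possible_kmers}")
print(f"kmer index vs FM-Index build time: {kmer_vs_fm:.1f}x")
print(f"kmer index share of total time: {kmer_share:.0%}")
```

The kmer index step is roughly 25 times slower than the FM-Index step, and the four timed steps add up to the reported total.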
VCF attached: perl_generated_vcf.txt
Some of these records are horrible. For example, see the record at position 523027, which starts like this:
NC_000962.3 523027 . GCAACACC ACAAAAA,ACAAAAC,ACAAAACA,ACAAAACC,ACAAAACG,ACAAAACT,ACAAAAG,ACAAAAT,ACAAACCA,ACAAACCC,ACAAACCG,ACAAACCT,ACAACAA,ACAACAC,ACAACACA,ACAACACC,ACAACACG,ACAACACT,ACAACAG,ACAACAT,ACAACCCA,ACAACCCC,ACAACCCG,ACAACCCT,ACAAGAA,ACAAGAC,ACAAGACA,ACAAGACC,ACAAGACG,ACAAGACT,ACAAGAG,ACAAGAT,ACAAGCCA,ACAAGCCC,ACAAGCCG,ACAAGCCT,ACAATAA,ACAATAC,ACAATACA,ACAATACC,ACAATACG,ACAATACT,ACAATAG,ACAATAT,ACAATCCA,ACAATCCC,ACAATCCG,ACAATCCT,ACAGAAA,ACAGAAAACA,ACAGAAAACC,ACAGAAAACG,ACAGAAAACT,ACAGAAACCA,ACAGAAACCC,ACAGAAACCG,ACAGAAACCT,ACAGAAC,ACAGAACA,ACAGAACACA,ACAGAACACC,ACAGAACACG,ACAGAACACT,ACAGAACC,ACAGAACCCA,ACAGAACCCC,ACAGAACCCG,ACAGAACCCT,ACAGAACG,ACAGAACT,ACAGAAG,ACAGAAGACA,ACAGAAGACC,ACAGAAGACG,ACAGAAGACT,ACAGAAGCCA,ACAGAAGCCC,ACAGAAGCCG,ACAGAAGCCT,ACAGAAT,ACAGAATACA,ACAGAATACC,ACAGAATACG,ACAGAATACT,ACAGAATCCA,ACAGAATCCC,ACAGAATCCG,ACAGAATCCT,ACAGACCA,ACAGACCC,ACAGACCG,ACAGACCT,ACAGAGAACA,ACAGAGAACC,ACAGAGAACG,ACAGAGAACT,ACAGAGACCA,ACAGAGACCC,ACAGAGACCG,ACAGAGACCT,ACAGAGCACA,ACAGAGCACC,ACAGAGCACG,ACAGAGCACT,ACAGAGCCCA,ACAGAGCCCC,ACAGAGCCCG,ACAGAGCCCT,ACAGAGGACA,ACAGAGGACC,ACAGAGGACG,ACAGAGGACT,ACAGAGGCCA,ACAGAGGCCC,ACAGAGGCCG,ACAGAGGCCT,ACAGAGTACA,ACAGAGTACC,ACAGAGTACG,ACAGAGTACT,ACAGAGTCCA,ACAGAGTCCC,ACAGAGTCCG,ACAGAGTCCT,ACAGCAA,ACAGCAC,ACAGCACA,ACAGCACC,ACAGCACG,ACAGCACT,ACAGCAG,ACAGCAT,ACAGCCCA,ACAGCCCC,ACAGCCCG,ACAGCCCT,ACAGGAA,ACAGGAC,ACAGGACA,ACAGGACC,ACAGGACG,ACAGGACT,ACAGGAG,ACAGGAT,ACAGGCCA,ACAGGCCC,ACAGGCCG,ACAGGCCT,ACAGTAA,ACAGTAC,ACAGTACA,ACAGTACC,ACAGTACG,ACAGTACT,ACAGTAG,ACAGTAT,ACAGTCCA,ACAGTCCC,ACAGTCCG,ACAGTCCT,AGAAAAA,AGAAAAC,AGAAAACA,AGAAAACC,AGAAAACG,AGAAAACT,AGAAAAG,AGAAAAT,AGAAACCA,AGAAACCC,AGAAACCG,AGAAACCT,AGAACAA,AGAACAC,AGAACACA,AGAACACC,AGAACACG,AGAACACT,AGAACAG,AGAACAT,AGAACCCA,AGAACCCC,AGAACCCG,AGAACCCT,AGAAGAA,AGAAGAC,AGAAGACA,AGAAGACC,AGAAGACG,AGAAGACT,AGAAGAG,AGAAGAT,AGAAGCCA,AGAAGCCC,AGAAGCCG,AGAAGCCT,AGAATAA,AGAATAC,AGAATACA,AGAATACC,AGAATACG,AGAATACT,AGAATAG,AGAATAT,AGAATCCA,AGAATCCC,AGAATCCG,AGAATCCT,AGAGAAA,AGAGAAAACA
,AGAGAAAACC,AGAGAAAACG,AGAGAAAACT,AGAGAAACCA,AGAGAAACCC,AGAGAAACCG,AGAGAAACCT,AGAGAAC,AGAGAACA,AGAGAACACA,AGAGAACACC,AGAGAACACG,AGAGAACACT,AGAGAACC,AGAGAACCCA,AGAGAACCCC,AGAGAACCCG,AGAGAACCCT,AGAGAACG,AGAGAACT,AGAGAAG,AGAGAAGACA,AGAGAAGACC,AGAGAAGACG,AGAGAAGACT,AGAGAAGCCA,AGAGAAGCCC,AGAGAAGCCG,AGAGAAGCCT,AGAGAAT,AGAGAATACA,AGAGAATACC,AGAGAATACG,AGAGAATACT,AGAGAATCCA,AGAGAATCCC,AGAGAATCCG,AGAGAATCCT,AGAGACCA,AGAGACCC,AGAGACCG,AGAGACCT,AGAGAGAACA,AGAGAGAACC,AGAGAGAACG,AGAGAGAACT,AGAGAGACCA,AGAGAGACCC,AGAGAGACCG,AGAGAGACCT,AGAGAGCACA,AGAGAGCACC,AGAGAGCACG,AGAGAGCACT,AGAGAGCCCA,AGAGAGCCCC,AGAGAGCCCG,AGAGAGCCCT,AGAGAGGACA,AGAGAGGACC,AGAGAGGACG,AGAGAGGACT,AGAGAGGCCA,AGAGAGGCCC,A