
Discrepant results when including viral neighbors references

Open luigra opened this issue 4 years ago • 0 comments

Dear program authors and users,

I ran krakenuniq on a database built only from the RefSeq viral sequences:

    krakenuniq-download --db ${outdir} -threads 26 --dust refseq/viral/Any
    krakenuniq-build --db ${outdir} --kmer-len 31 --threads 26 --taxids-for-genomes --taxids-for-sequences

In the results, I look for species above a k-mer or coverage threshold, and then I look at the assigned reads to identify the most plausible genome sequence for each identified species. Here is a specific example:

Picture1

Most of the alphapapillomavirus 7 reads are mapped onto NC_001357.1, so I would consider this a valid genome reference.
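For reference, the filtering step described above can be sketched roughly as follows. This is only a minimal illustration, assuming the tab-separated KrakenUniq report columns (`%`, `reads`, `taxReads`, `kmers`, `dup`, `cov`, `taxID`, `rank`, `taxName`). The function name `species_over_threshold`, the default thresholds, and every number and taxID in the sample text are made up for the example and are not taken from my actual results.

```python
import csv

# Illustrative report excerpt. All counts and taxIDs below are invented;
# only the column layout is assumed to match a KrakenUniq report.
SAMPLE_REPORT = """\
# hypothetical KrakenUniq report excerpt
%\treads\ttaxReads\tkmers\tdup\tcov\ttaxID\trank\ttaxName
100\t1000\t0\t5000\t1.2\t0.5\t1\tno rank\troot
60\t600\t10\t3000\t1.5\t0.4\t111\tspecies\t  Alphapapillomavirus 7
0.5\t5\t5\t20\t1.0\t0.001\t222\tspecies\t  Hypothetical virus X
"""


def species_over_threshold(report_text, min_kmers=1000, min_cov=0.1):
    """Return species rows whose unique k-mer count or coverage passes a threshold.

    The thresholds are placeholders; choose values suited to your data.
    """
    # Skip comment lines so the first remaining line is the column header.
    lines = [ln for ln in report_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    hits = []
    for row in reader:
        if row["rank"] != "species":
            continue
        kmers = int(row["kmers"])
        cov = float(row["cov"])
        if kmers >= min_kmers or cov >= min_cov:
            # taxName is indented in the report, so strip the padding.
            hits.append((row["taxName"].strip(), int(row["reads"]), kmers, cov))
    return hits


for name, reads, kmers, cov in species_over_threshold(SAMPLE_REPORT):
    print(name, reads, kmers, cov)
```

With a real report, the file contents would be passed in place of `SAMPLE_REPORT`.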

I noticed that when I analyse the same reads with a database that also includes the GenBank viral neighbors, built with

    krakenuniq-download --db ${outdir} -threads 26 --dust refseq/viral/Any viral-neighbors
    krakenuniq-build --db ${outdir} --kmer-len 31 --threads 26 --taxids-for-genomes --taxids-for-sequences

the results are quite different:

Picture2

Indeed, for the same species, alphapapillomavirus 7, a similar number of reads is identified, but there is no single sequence to which most of the reads are mapped, and the sequence NC_001357.1 has very few reads assigned. How can this be reconciled with the previous result?
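To make the comparison concrete, here is a small sketch of how the per-sequence read distribution could be summarised for each database. It is only an illustration: the helper `top_sequence_share` is hypothetical, the read counts are invented, and the `neighbor_*` and `other_refseq` names are placeholders rather than real accessions. It assumes the reads assigned to each sequence-level entry have already been extracted from the report.

```python
def top_sequence_share(reads_per_sequence):
    """Return (sequence, share) for the sequence holding the most reads.

    `reads_per_sequence` maps a sequence name to its assigned read count.
    """
    total = sum(reads_per_sequence.values())
    if total == 0:
        return None, 0.0
    name, count = max(reads_per_sequence.items(), key=lambda kv: kv[1])
    return name, count / total


# Invented counts that only mimic the pattern described above.
refseq_only = {"NC_001357.1": 540, "other_refseq": 60}
with_neighbors = {
    "NC_001357.1": 12,
    "neighbor_A": 200,  # placeholder name, not a real accession
    "neighbor_B": 190,
    "neighbor_C": 198,
}

for label, counts in (("RefSeq only", refseq_only), ("with neighbors", with_neighbors)):
    name, share = top_sequence_share(counts)
    print(f"{label}: top sequence {name} holds {share:.0%} of reads")
```

A high share points to one dominant reference, while a low share means the reads are spread over several sequences, which is the pattern I see with the viral-neighbors database.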

Do you have any suggestions? Am I misinterpreting the results?

Thanks in advance,
Luigi

luigra, Jun 18 '20 16:06