spacegraphcats

Estimate the percentage of k-mers and reads captured by a query?

Open taylorreiter opened this issue 6 years ago • 2 comments

As I've used spacegraphcats more, I keep returning to the idea that it would be useful to know the percent of k-mers and reads captured by a query or set of queries.

For example, with the Hu SB1 dataset, we ran a set of queries using genome bins generated by the original analysis of the data. I'm curious how many k-mers in the cDBG were not captured by those queries, and likewise the number of reads that are left over in the original data after subtracting those that are returned for the queries. In essence, I would like to know the metagenome coverage for a query without performing mapping.

taylorreiter, Jan 14 '20 17:01

I guess I could build a gather database containing all of the queries, then run the original sample against it to get the approximate fraction of the sample that is not in any query and the approximate fraction in each query.
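To make the idea concrete, here is a minimal sketch of that kind of decomposition on exact k-mer sets. It is a simplified stand-in for a gather-style greedy assignment, not sourmash's or spacegraphcats' actual API; the names `kmers`, `greedy_assign`, `bin_a`, and `bin_b` are hypothetical and only for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_assign(sample, queries):
    """Greedily assign sample k-mers to queries, largest remaining overlap first.

    sample  -- set of k-mers from the whole sample
    queries -- dict mapping a (hypothetical) query name to its k-mer set

    Returns (fractions, unassigned_fraction), where fractions maps each
    query name to the share of sample k-mers assigned to it.
    """
    total = len(sample)
    if total == 0:
        return {}, 0.0
    remaining = set(sample)
    candidates = dict(queries)
    fractions = {}
    while candidates and remaining:
        # Pick the query that covers the most still-unassigned k-mers.
        best = max(candidates, key=lambda name: len(candidates[name] & remaining))
        overlap = candidates.pop(best) & remaining
        if not overlap:
            break  # no remaining query shares any unassigned k-mer
        fractions[best] = len(overlap) / total
        remaining -= overlap
    return fractions, len(remaining) / total


# Toy data: a short "sample" and two overlapping "bins" (assumed, not real data).
sample = kmers("ACGTACGGTTCA", 4)
queries = {"bin_a": kmers("ACGTACG", 4), "bin_b": kmers("TACGGT", 4)}
fractions, unassigned = greedy_assign(sample, queries)
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of sample k-mers")
print(f"not in any query: {unassigned:.1%}")
```

Because each k-mer is assigned to only one query, the per-query fractions and the unassigned fraction sum to 1, which is what makes the "not in the queries" number meaningful. A real run would work on sketches rather than full k-mer sets, so the fractions would be estimates.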

taylorreiter, Jan 14 '20 18:01

It seems like we could provide this information by tracking the total number of reads in the sample (which can then be used to figure out how many reads were NOT retrieved for a query), and doing the same for k-mers. You could then use an aggregate query (all of your queries combined) to determine these numbers.
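A rough sketch of that bookkeeping, assuming the sample totals are recorded up front and each query returns a set of read IDs and a set of k-mer hashes. `CaptureStats` and `aggregate` are hypothetical names, not part of spacegraphcats.

```python
from dataclasses import dataclass


@dataclass
class CaptureStats:
    """Totals for a sample plus what an aggregate query retrieved."""
    total_reads: int
    total_kmers: int
    retrieved_reads: int
    retrieved_kmers: int

    @staticmethod
    def _percent(part, whole):
        # Guard against an empty sample rather than dividing by zero.
        return 100.0 * part / whole if whole else 0.0

    @property
    def reads_not_retrieved(self):
        return self.total_reads - self.retrieved_reads

    @property
    def kmers_not_retrieved(self):
        return self.total_kmers - self.retrieved_kmers

    @property
    def percent_reads_captured(self):
        return self._percent(self.retrieved_reads, self.total_reads)

    @property
    def percent_kmers_captured(self):
        return self._percent(self.retrieved_kmers, self.total_kmers)


def aggregate(total_reads, total_kmers, reads_per_query, kmers_per_query):
    """Combine all queries into one aggregate query.

    Taking the union means a read or k-mer returned by several queries
    is only counted once.
    """
    all_reads = set().union(*reads_per_query.values())
    all_kmers = set().union(*kmers_per_query.values())
    return CaptureStats(total_reads, total_kmers, len(all_reads), len(all_kmers))


# Toy numbers (assumed): 10 reads and 40 k-mers in the sample, two queries.
stats = aggregate(
    total_reads=10,
    total_kmers=40,
    reads_per_query={"bin_a": {"r1", "r2", "r3"}, "bin_b": {"r3", "r4"}},
    kmers_per_query={"bin_a": set(range(0, 12)), "bin_b": set(range(8, 18))},
)
print(f"reads captured: {stats.percent_reads_captured:.1f}%")
print(f"k-mers captured: {stats.percent_kmers_captured:.1f}%")
print(f"reads left over: {stats.reads_not_retrieved}")
```

The union step is the important design choice: summing per-query counts would double count anything shared between queries and could overstate how much of the sample is captured.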

ctb, Jan 16 '20 20:01