bcftools

Expanding dereplicated reads

Open tkchafin opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please specify.

Not necessarily a problem, more of a question. Is there currently any way to use the count information for dereplicated reads when calculating depths via mpileup? For context, I am working on a serverless pipeline for de novo assembly of a reduced-representation dataset, with steps running in AWS Lambda functions. To keep file sizes and runtimes as small as possible, I'm doing all of the assembly steps using dereplicated reads (with the read counts saved in the headers as ;size=X). Currently I re-expand these to create an intermediate file to send through mpileup before running bcftools call. However, I'm wondering whether there is a way within samtools to use this count information directly when calculating depths.
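To make the re-expansion step concrete, here is a minimal sketch of what it could look like on SAM text. It assumes the read name (QNAME) of each alignment ends in a USEARCH/VSEARCH-style `;size=N` annotation; the function name `expand_sam_lines` and the `_1`, `_2`, ... suffix scheme are hypothetical, chosen only for illustration, and are not part of samtools or bcftools.

```python
import re

# Assumption: the QNAME ends with ";size=N" (optionally followed by ";"),
# as written by USEARCH/VSEARCH dereplication.
SIZE_RE = re.compile(r";size=(\d+);?$")


def expand_sam_lines(lines):
    """Yield SAM lines, repeating each alignment record `size` times.

    Header lines (starting with "@") pass through unchanged. Records
    without a size annotation are treated as size 1. Each copy gets a
    numeric suffix so that read names stay unique.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        match = SIZE_RE.search(fields[0])
        count = int(match.group(1)) if match else 1
        base = SIZE_RE.sub("", fields[0])
        for i in range(1, count + 1):
            yield "\t".join([f"{base}_{i}"] + fields[1:])
```

In a pipeline, the output of `samtools view -h` could be passed through a function like this and written back to BAM before running mpileup, which is exactly the intermediate file this request is hoping to avoid.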

Describe the solution you would like.

It would be ideal if samtools could directly parse the count information, for example via a user-defined tag, or by parsing the USEARCH/VSEARCH style size= tags from the headers. If not currently possible, I would be happy to work on it and submit a pull request, but would appreciate some pointers on the best way to proceed.
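As an illustration of the requested behaviour (not of anything samtools or bcftools currently does), the sketch below parses a `;size=N` annotation from a read name and uses it as a weight when accumulating per-position depth. The names `parse_size` and `weighted_depth` are hypothetical, and alignments are simplified to ungapped `(name, start, end)` intervals with 0-based, half-open coordinates.

```python
import re
from collections import Counter

# Assumption: USEARCH/VSEARCH-style annotation at the end of the read name.
SIZE_RE = re.compile(r";size=(\d+);?$")


def parse_size(name):
    """Return (label, count) for a read name; count defaults to 1."""
    match = SIZE_RE.search(name)
    if match is None:
        return name, 1
    return name[:match.start()], int(match.group(1))


def weighted_depth(alignments):
    """Per-position depth where each read counts `size` times.

    `alignments` is an iterable of (name, start, end) tuples using
    0-based, half-open coordinates on a single reference sequence.
    """
    depth = Counter()
    for name, start, end in alignments:
        _, size = parse_size(name)
        for pos in range(start, end):
            depth[pos] += size
    return depth
```

For example, one read annotated `;size=3` covering positions 0 to 3 and one unannotated read covering positions 2 to 5 would give a depth of 4 at positions 2 and 3, the same result as expanding the first read into three copies.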

Thanks!

tkchafin, Sep 09 '23 10:09

Sorry, I just noticed that the VCF output had been moved to bcftools mpileup, so I should have posted this over there.

tkchafin, Sep 09 '23 20:09

Hmm, I don't understand the terminology you are using (for example, what do you mean by "de-replication" and "re-expansion", and what is the reference to USEARCH about?), and I don't know whether you are referring to a BAM file header or a VCF file header. Can you please provide more details?

pd3, Sep 13 '23 14:09