Suggestion to include leftover single reads in recover

Open Anna-MarieSeelen opened this issue 7 months ago • 2 comments

Hi Aviary developers,

We use Aviary a lot in our department, mainly for binning metagenomes. Right now, we handle quality control, trimming, and assembly outside of Aviary, and we usually include leftover single reads from trimming in the assembly. As far as I know, Aviary doesn’t currently let you include these single reads in the binning step. It would be great to have an option to add them along with the paired reads. This would help match the coverage calculations for the binners to our current assembly strategy.

I think this would be a great addition, but I’d love to hear your thoughts.

Best,

Anna

Anna-MarieSeelen, May 22 '25 12:05

Hi Anna,

You are correct about Aviary's features there. We typically use un-qc'd reads for this, so there are no single reads. Have you observed that method to be sub-optimal? Thanks.

wwood, May 23 '25 05:05

Hi Ben,

Thank you for your interest. I don’t have hard numbers yet (maybe in the future), but based on the kind of data we usually work with, I think for us it’s better to use trimmed reads both for the assembly and for calculating coverage during binning.

For the assembly, I think trimming is important because our raw reads often still contain leftover adapter sequences and polyG tails. If we leave them in, they can confuse the assembler. When the same artificial sequence shows up in lots of reads, the assembler might not be able to figure out how to correctly connect the real genomic parts, which could lead to errors in the assembly. We also include single reads (unpaired reads) in the assembly to make use of as much data as possible.

For coverage calculation during binning, I think it’s best to use the same trimmed reads we used for assembly. If we use untrimmed reads instead, some might not map correctly to the contigs, especially if they still contain adapters or polyG tails. This could make the coverage values inaccurate, although I don’t know how big this effect would be. Someone also opened an issue about something similar on the minimap2 GitHub.

But I think it’s safe to say the mapping will most likely be faster with trimmed reads because there’s less noise in the data. There was another issue opened on the minimap2 GitHub about that.

The number of badly mapped reads and the resulting effect on the coverage values might be small, but I can’t be sure, so I would rather map the trimmed reads.
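
To make the coverage point a bit more concrete, here is a minimal sketch with made-up numbers. The contig length, the read counts, and the `mean_coverage` helper are all hypothetical and not taken from Aviary or from our data. It only shows the arithmetic: if single reads went into the assembly but are left out of the mapping, the mean depth per contig comes out lower than what the assembler actually saw.

```python
def mean_coverage(aligned_read_lengths, contig_length):
    """Mean depth of a contig: total aligned bases divided by contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return sum(aligned_read_lengths) / contig_length


# Toy numbers, purely illustrative (assumptions, not measurements).
contig_length = 10_000
paired_reads = [150] * 400   # reads from pairs that survived trimming
single_reads = [150] * 50    # leftover reads whose mate was discarded

paired_only = mean_coverage(paired_reads, contig_length)
with_singles = mean_coverage(paired_reads + single_reads, contig_length)

# Fraction of depth that is lost when the single reads are not mapped.
underestimate = 1 - paired_only / with_singles

print(f"paired only:  {paired_only:.2f}x")
print(f"with singles: {with_singles:.2f}x")
print(f"depth missed: {underestimate:.1%}")
```

In real data the fraction of single reads will differ per sample, so the size of the shift would differ per sample too, which is the part that matters for differential-coverage binning.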

What would be your view on this?

Best,

Anna

Anna-MarieSeelen avatar May 28 '25 14:05 Anna-MarieSeelen