Filtering small, low-coverage contigs

Open chklopp opened this issue 2 years ago • 2 comments

When I extract the contig lengths and coverage from the p_ctg GFA file and plot them, a small portion of the contigs are long (Mb or larger) and have the expected coverage, while most of the contigs are small (less than 100 kb) with very low coverage. Where do these contigs originate from? Should I remove these small, low-coverage contigs before scaffolding?
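For readers who want to reproduce the length/coverage extraction, here is a minimal sketch. It assumes the segment (`S`) lines of the p_ctg GFA carry an `LN:i:` length tag and an `rd:i:` read-coverage tag, which should be checked against the actual file; the function name and the in-memory example records are hypothetical.

```python
import io


def parse_gfa_segments(handle):
    """Yield (name, length, coverage) for each segment (S) line of a GFA file.

    Assumption: length is stored in an LN:i: tag and read coverage in an
    rd:i: tag. Coverage is None when the rd tag is absent.
    """
    for line in handle:
        if not line.startswith("S\t"):
            continue  # skip non-segment records (A, L, ...)
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        tags = {}
        for tag in fields[3:]:
            parts = tag.split(":", 2)  # TAG:TYPE:VALUE
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        if "LN" in tags:
            length = int(tags["LN"])
        elif seq != "*":
            length = len(seq)  # fall back to the sequence itself
        else:
            length = None
        coverage = int(tags["rd"]) if "rd" in tags else None
        yield name, length, coverage


# Tiny made-up example standing in for a real p_ctg GFA file.
example = io.StringIO(
    "S\tptg000001l\tACGTACGT\tLN:i:8\trd:i:30\n"
    "A\tptg000001l\t0\t+\tread1\t0\t8\tid:i:1\n"
    "S\tptg000002l\tACG\tLN:i:3\trd:i:2\n"
)
contigs = list(parse_gfa_segments(example))
```

With a real file, `open("asm.p_ctg.gfa")` (a hypothetical path) would replace the `io.StringIO` object, and the resulting tuples can be passed to any plotting library.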

chklopp avatar May 05 '22 07:05 chklopp

Could you blast a few of them? I guess most of them might be cDNA and other highly repetitive regions. Assemblers tend to assemble them into multiple copies. If yes, it is ok to remove them.
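To pick out the contigs worth BLASTing, one possible sketch is to split the contig list on a length and a coverage cutoff. The 100 kb length cutoff comes from the question above; the coverage cutoff of 5 is purely an illustrative assumption and should be chosen from the actual coverage plot. The function name and example values are hypothetical.

```python
def split_contigs(contigs, max_length=100_000, max_coverage=5):
    """Split (name, length, coverage) tuples into kept and candidate contigs.

    A contig is a candidate for inspection (and possibly removal) only if it
    is BOTH shorter than max_length AND covered less than max_coverage.
    max_coverage=5 is an illustrative assumption, not a recommended value.
    """
    keep, candidates = [], []
    for name, length, coverage in contigs:
        if length < max_length and coverage < max_coverage:
            candidates.append(name)
        else:
            keep.append(name)
    return keep, candidates


# Made-up values for illustration only.
example_contigs = [
    ("ptg000001l", 25_000_000, 30),  # long, expected coverage
    ("ptg000002l", 40_000, 2),       # short, low coverage
    ("ptg000003l", 60_000, 28),      # short, but normally covered
]
keep, candidates = split_contigs(example_contigs)
print("keep:", keep)
print("candidates to BLAST:", candidates)
```

The candidate list is only a starting point: the names in it would be checked (for example by BLAST, as suggested above) before deciding to drop anything.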

chhylp123 avatar May 05 '22 11:05 chhylp123

I've aligned 559 contigs smaller than 100 kb against NR (diamond blastx), and 451 had protein hits, all of which come from species phylogenetically related to the assembled species.

chklopp avatar May 06 '22 06:05 chklopp