hifiasm
Filtering small, low-coverage contigs
When I extract the contig lengths and coverage from the p_ctg GFA file and plot them, a small portion of the contigs are long (1 Mb or larger) with the expected coverage, while most of the contigs are small (less than 100 kb) with very low coverage. Where do these small contigs originate from? Should I remove them before scaffolding?
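For anyone wanting to reproduce the length/coverage table, here is a minimal sketch. It assumes the segment (`S`) lines of the p_ctg GFA carry an `LN:i:` length tag and an `rd:i:` read-coverage tag, as hifiasm output normally does. The function names and the `max_cov=5` cutoff are made up for illustration; pick a cutoff from your own plot.

```python
def parse_gfa_segments(lines):
    """Yield (name, length, coverage) for each S line of a GFA file.

    Assumes hifiasm-style tags: LN:i:<length> and rd:i:<read coverage>.
    Coverage is None when the rd tag is absent.
    """
    for line in lines:
        if not line.startswith("S\t"):
            continue  # skip A, L and other record types
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        tags = {}
        for field in fields[3:]:
            parts = field.split(":", 2)  # TAG:TYPE:VALUE
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        # Fall back to the sequence length if LN is missing.
        length = int(tags["LN"]) if "LN" in tags else len(seq)
        coverage = int(tags["rd"]) if "rd" in tags else None
        yield name, length, coverage


def flag_small_low_coverage(segments, max_len=100_000, max_cov=5):
    """Return names of contigs shorter than max_len with coverage <= max_cov.

    Both thresholds are illustrative assumptions, not hifiasm defaults.
    """
    return [
        name
        for name, length, cov in segments
        if length < max_len and cov is not None and cov <= max_cov
    ]
```

Usage would look something like `with open("asm.bp.p_ctg.gfa") as fh: flagged = flag_small_low_coverage(parse_gfa_segments(fh))`, where the file name is only an example.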
Could you BLAST a few of them? I guess most of them are rDNA and other highly repetitive regions. Assemblers tend to assemble such regions into multiple copies. If so, it is OK to remove them.
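To get sequences to search, one option is to pull the short contigs out of the GFA into FASTA. This is a rough sketch, assuming the `S` lines hold the full sequence in the third column; the function name and the line width are arbitrary choices for illustration.

```python
def small_contigs_to_fasta(gfa_lines, max_len=100_000, width=60):
    """Return a FASTA string of all GFA segments shorter than max_len.

    Segments whose sequence is '*' (sequence not stored) are skipped.
    """
    records = []
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # only segment lines carry contig sequences
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*" or len(seq) >= max_len:
            continue
        # Wrap the sequence at a fixed width for readability.
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(f">{name}\n{wrapped}\n")
    return "".join(records)
```

The resulting string can be written to a file and used as the query for whichever search tool you prefer.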
I've aligned 559 contigs smaller than 100 kb against NR (DIAMOND blastx), and 451 had protein hits, all from species phylogenetically related to the assembled species.