apps-scripts
Multigene families: trimLowercaseContigs.py
Dear Sarah,
I apologize in advance for bothering you so much right now, but I am in the final stages of my phased diploid genome assembly.
I used your script to remove contigs in which more than 50% of the bases are unpolished. I ran the script on primary contigs and haplotigs separately (29 contigs / 1825 removed, 83 haplotigs / 2901 removed). Then I took the contigs and haplotigs that the script removed and ran blastx on them to see what they correspond to.
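To make the filtering step concrete, here is a minimal sketch of a filter of this kind. It is not the actual trimLowercaseContigs.py; it assumes that unpolished bases are written in lowercase in the FASTA file (as the script name suggests), and the function names and the `max_lower` parameter are hypothetical.

```python
def lowercase_fraction(seq):
    """Return the fraction of bases in seq that are lowercase."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_polish(records, max_lower=0.5):
    """Split records into (kept, removed).

    A record is removed when MORE than max_lower of its bases are
    lowercase (assumed here to mean unpolished).
    """
    kept, removed = [], []
    for header, seq in records:
        if lowercase_fraction(seq) > max_lower:
            removed.append((header, seq))
        else:
            kept.append((header, seq))
    return kept, removed
```

With a real file, something like `split_by_polish(read_fasta(open("haplotigs.fasta")))` would give the two lists; the file name is only an example. Keeping the removed records, rather than discarding them, is what makes a follow-up blastx check possible.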
For primary contigs, these are mostly repetitive regions, functions that do not interest me, or viruses/bacteria. But for haplotigs, I was quite surprised by the results. A significant portion of the removed haplotigs corresponds to multigene families. And of course, I don't want to delete them.
Could this be because the reads multimapped, so that during the polishing phase some haplotigs were not properly covered and FALCON did not polish them?
Do you think I have to manually pick out the haplotigs that correspond to a multigene family and put them back in the haplotig file I keep?
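The manual rescue I have in mind could look roughly like the sketch below. It assumes the kept and removed haplotigs are available as lists of (header, sequence) pairs and that the IDs to rescue come from reviewing the blastx hits by hand; the function names and the haplotig IDs in the usage line are made up for illustration.

```python
def record_id(header):
    """Return the first whitespace-delimited token of a FASTA header."""
    parts = header.split()
    return parts[0] if parts else ""


def rescue_records(kept, removed, rescue_ids):
    """Return kept records plus any removed records whose ID is in rescue_ids."""
    wanted = set(rescue_ids)
    rescued = [(h, s) for h, s in removed if record_id(h) in wanted]
    return kept + rescued


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs to an open text handle, wrapping lines."""
    for header, seq in records:
        handle.write(">" + header + "\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

For example, `rescue_records(kept, removed, ["hap_12", "hap_40"])` would add back only those two haplotigs, leaving the rest of the removed set out of the final file.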
Thank you very much for your help. Best, Amandine