pyScaf
Gaps are too large after long-read scaffolding
I get gaps as long as ~500 kb when using PacBio reads for scaffolding, although none of the long reads is longer than 80 kb. I wonder how such long gaps could be opened. There might be an error in the gap calculation in the script at line 802 (gap = dist - pos1 + pos2). Since pos1 and pos2 should be the overhangs of target2 and target1, shouldn't the expression be "gap = dist - (pos1 + pos2)", with "overhang = pos1 + pos2"? Additionally, there might be some over-scaffolding: many contigs that appear to overlap by large amounts were linked directly, without any check of whether the contigs actually overlap. Here are the first lines of the file scaffolds.longreads.1.fa.tsv:
name	size	no. of contigs	ordered contigs	contig orientations (0-forward; 1-reverse)	gap sizes (negative gap size = adjacent contigs are overlapping)
scaffold00001	37101958	21	tig00007288|arrow tig00000647|arrow tig00007807|arrow tig00007204|arrow tig00007202|arrow tig00007056|arrow tig00007054|arrow tig00007746|arrow tig00003676|arrow tig00007493|arrow tig00007487|arrow tig00000702|arrow tig00000363|arrow tig00007322|arrow tig00000346|arrow tig00000967|arrow tig00000110|arrow tig00001511|arrow tig00007211|arrow tig00000161|arrow tig00007841|arrow	0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1	-8365 -1532317 -1484 -774 -1085 -175 46 -166587 -571737 -1200624 1658 608 -102546 121 335 -571031 -1593587 -5191766 -1584315 645 0
scaffold00002	27485094	6	tig00000652|arrow tig00000682|arrow tig00006972|arrow tig00007248|arrow tig00006941|arrow tig00000050|arrow	1 1 0 0 0 0	-262 -1670 582026 -807298 -262598 0
scaffold00003	19217952	10	tig00007669|arrow tig00000631|arrow tig00007413|arrow tig00007205|arrow tig00007206|arrow tig00007557|arrow tig00000626|arrow tig00007674|arrow tig00007220|arrow tig00006956|arrow	1 1 0 0 0 1 0 1 1 0	-6695 -918253 1084 1152 -2125124 -649917 -444017 -75582 -695 0
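To make the suspicious joins in the rows above easier to spot, here is a minimal sketch that flags every gap or overlap whose magnitude exceeds the longest read, since a single read cannot span or confirm such a distance. It assumes the file is tab-separated with space-separated lists in the last three columns, that the i-th gap value sits between contig i and contig i+1, and that the final value in each row is a trailing 0. The function name `suspicious_gaps` and the 80 kb threshold are my own choices for illustration, not part of pyScaf.

```python
MAX_READ_LEN = 80_000  # longest read mentioned in this report (assumed threshold)

def suspicious_gaps(tsv_text, max_read_len=MAX_READ_LEN):
    """Return (scaffold, left contig, right contig, gap) where |gap| > max_read_len."""
    hits = []
    for line in tsv_text.splitlines():
        if not line or line.startswith("name"):
            continue  # skip blank lines and the header row
        name, _size, _n, contigs, _orients, gaps = line.split("\t")
        contigs = contigs.split()
        gaps = [int(g) for g in gaps.split()]
        # Assumption: the last value is a placeholder after the final contig.
        for i, gap in enumerate(gaps[:-1]):
            if abs(gap) > max_read_len:
                hits.append((name, contigs[i], contigs[i + 1], gap))
    return hits

# The scaffold00002 row from the report, rebuilt with tab separators.
row = "\t".join([
    "scaffold00002", "27485094", "6",
    "tig00000652|arrow tig00000682|arrow tig00006972|arrow "
    "tig00007248|arrow tig00006941|arrow tig00000050|arrow",
    "1 1 0 0 0 0",
    "-262 -1670 582026 -807298 -262598 0",
])
hits = suspicious_gaps(row)
for hit in hits:
    print(*hit)
```

For scaffold00002 this reports three of the five joins: the 582026 bp gap and the two overlaps of 807298 bp and 262598 bp.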
Here is the command:

redundans.py -f $SEQS -o nonredundant5 -l $LONGREADS -identity 0.9 --overlap 0.9 --minLength 200 -t 30 --resume -v

The '-f' input is all contigs assembled by Canu, without any gaps, and the '-l' input is the PacBio reads corrected by Canu. Before scaffolding, the proportion of complete BUSCOs was ~92%; after scaffolding it is ~90%.
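To illustrate the suspected precedence issue at line 802, here is a small sketch comparing the expression as written with the parenthesized version. The numbers are made up, and the sketch assumes (I have not verified this against the pyScaf source) that dist is the distance between the two alignments on the read and that pos1 and pos2 are both overhangs that should be subtracted. The function names are hypothetical.

```python
def gap_as_written(dist, pos1, pos2):
    # Python evaluates this left to right: (dist - pos1) + pos2
    return dist - pos1 + pos2

def gap_as_proposed(dist, pos1, pos2):
    # Both overhangs are subtracted from the distance.
    overhang = pos1 + pos2
    return dist - overhang

# Made-up values for illustration only.
dist, pos1, pos2 = 50_000, 20_000, 30_000

print(gap_as_written(dist, pos1, pos2))   # 60000
print(gap_as_proposed(dist, pos1, pos2))  # 0
```

The two results always differ by 2 * pos2. If pos2 really is an overhang, the expression as written can report a gap larger than dist itself, which would be consistent with gaps longer than any read.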