apps-scripts icon indicating copy to clipboard operation
apps-scripts copied to clipboard

losing lots of haplotigs

Open dcopetti opened this issue 6 years ago • 2 comments

Hello, I am using this script on an Unzipped assembly (1.3 Gb in total) with default settings (50% of bases to be out). The resulting fasta has many less sequences (5449 vs 7905) and only 65 Mb less (1295 Mb now). I would be fine with this so few data less, but I see that in some cases the sequences to be dropped are the majority of haplotigs of a primary contig and have very low coverage. This is an example:

000198F .       region  1       100000  0.00    +       .       cov=2,30,45;cov2=30.526,7.517;gaps=0,0
000198F .       region  100001  200000  0.00    +       .       cov=7,40,58;cov2=35.509,15.567;gaps=0,0
000198F .       region  200001  300000  0.00    +       .       cov=23,39,63;cov2=39.233,7.091;gaps=0,0
000198F .       region  300001  400000  0.00    +       .       cov=19,34,51;cov2=34.301,7.204;gaps=0,0
000198F .       region  400001  500000  0.00    +       .       cov=22,32,50;cov2=33.162,5.386;gaps=0,0
000198F .       region  500001  600000  0.00    +       .       cov=20,30,47;cov2=30.353,5.867;gaps=0,0
000198F .       region  600001  700000  0.00    +       .       cov=27,45,62;cov2=44.970,6.577;gaps=0,0
000198F .       region  700001  800000  0.00    +       .       cov=23,36,60;cov2=36.940,6.550;gaps=0,0
000198F .       region  800001  900000  0.00    +       .       cov=29,52,72;cov2=50.448,9.843;gaps=0,0
000198F .       region  900001  1000000 0.00    +       .       cov=15,30,57;cov2=31.219,8.639;gaps=0,0
000198F .       region  1000001 1100000 0.00    +       .       cov=24,40,56;cov2=38.645,7.794;gaps=0,0
000198F .       region  1100001 1200000 0.00    +       .       cov=9,39,60;cov2=36.081,14.970;gaps=0,0
000198F .       region  1200001 1305962 0.00    +       .       cov=0,71,114;cov2=65.461,29.555;gaps=1,1415

Haplotigs I would keep:

000198F_001     .       region  1       100000  0.00    +       .       cov=0,37,60;cov2=37.897,11.139;gaps=1,218
000198F_001     .       region  100001  200000  0.00    +       .       cov=25,45,58;cov2=43.812,6.803;gaps=0,0
000198F_001     .       region  200001  377557  0.00    +       .       cov=1,35,52;cov2=32.771,11.745;gaps=0,0
000198F_005     .       region  1       10000   0.00    +       .       cov=0,5,13;cov2=5.601,4.808;gaps=1,1747
000198F_005     .       region  10001   20000   0.00    +       .       cov=12,18,23;cov2=17.811,3.420;gaps=0,0
000198F_005     .       region  20001   30000   0.00    +       .       cov=10,23,27;cov2=20.386,5.018;gaps=0,0
000198F_005     .       region  30001   40000   0.00    +       .       cov=1,2,10;cov2=3.544,2.454;gaps=0,0
000198F_005     .       region  40001   54133   0.00    +       .       cov=0,4,7;cov2=4.076,1.795;gaps=1,736
000198F_007     .       region  1       10000   0.00    +       .       cov=0,16,25;cov2=15.050,6.948;gaps=1,3
000198F_007     .       region  10001   20000   0.00    +       .       cov=24,38,44;cov2=36.142,5.420;gaps=0,0
000198F_007     .       region  20001   30000   0.00    +       .       cov=26,36,40;cov2=33.973,4.735;gaps=0,0
000198F_007     .       region  30001   44674   0.00    +       .       cov=1,12,31;cov2=15.004,9.265;gaps=0,0

Haplotigs to remove:

000198F_002     .       region  1       5000    0.00    +       .       cov=0,5,6;cov2=4.346,1.405;gaps=1,245
000198F_002     .       region  5001    10000   0.00    +       .       cov=4,7,9;cov2=6.964,1.279;gaps=0,0
000198F_002     .       region  10001   15000   0.00    +       .       cov=2,3,4;cov2=2.851,0.626;gaps=0,0
000198F_002     .       region  15001   23832   0.00    +       .       cov=0,1,2;cov2=1.151,0.501;gaps=1,543
000198F_003     .       region  1       10000   0.00    +       .       cov=0,23,33;cov2=19.876,9.618;gaps=1,89
000198F_003     .       region  10001   20000   0.00    +       .       cov=21,32,39;cov2=31.287,4.271;gaps=0,0
000198F_003     .       region  20001   30000   0.00    +       .       cov=2,4,21;cov2=7.073,5.898;gaps=0,0
000198F_003     .       region  30001   40596   0.00    +       .       cov=1,3,6;cov2=3.109,1.130;gaps=0,0
000198F_004     .       region  1       5000    0.00    +       .       cov=0,1,2;cov2=0.667,0.668;gaps=3,2225
000198F_004     .       region  5001    10000   0.00    +       .       cov=2,4,6;cov2=4.150,1.239;gaps=0,0
000198F_004     .       region  10001   15000   0.00    +       .       cov=5,6,7;cov2=6.286,0.455;gaps=0,0
000198F_004     .       region  15001   22055   0.00    +       .       cov=0,5,7;cov2=4.119,2.193;gaps=1,818
000198F_006     .       region  1       5000    0.00    +       .       cov=0,0,1;cov2=0.459,0.498;gaps=1,2707
000198F_006     .       region  5001    10000   0.00    +       .       cov=1,1,2;cov2=1.060,0.237;gaps=0,0
000198F_006     .       region  10001   15000   0.00    +       .       cov=2,3,5;cov2=2.937,0.773;gaps=0,0
000198F_006     .       region  15001   20148   0.00    +       .       cov=0,2,3;cov2=1.702,1.286;gaps=1,1680
000198F_008     .       region  1       5000    0.00    +       .       cov=0,10,15;cov2=10.068,2.809;gaps=1,72
000198F_008     .       region  5001    10000   0.00    +       .       cov=15,18,21;cov2=18.249,1.234;gaps=0,0
000198F_008     .       region  10001   15000   0.00    +       .       cov=7,12,17;cov2=11.872,2.082;gaps=0,0
000198F_008     .       region  15001   20000   0.00    +       .       cov=2,3,7;cov2=4.031,2.002;gaps=0,0
000198F_008     .       region  20001   29084   0.00    +       .       cov=0,1,4;cov2=0.790,1.003;gaps=3,4217

Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?

At the same time, I would remove 171 primary contigs, that usually have high ctg ID number and low coverage again:

001665F .       region  1       10000   0.00    +       .       cov=0,14,16;cov2=11.141,4.683;gaps=1,232
001665F .       region  10001   20000   0.00    +       .       cov=7,12,15;cov2=11.800,2.034;gaps=0,0
001665F .       region  20001   30000   0.00    +       .       cov=4,6,8;cov2=5.690,1.085;gaps=0,0
001665F .       region  30001   44399   0.00    +       .       cov=1,7,11;cov2=7.110,2.019;gaps=0,0

or better this:

001527F .       region  1       10000   0.00    +       .       cov=1,13,45;cov2=18.162,15.221;gaps=0,0
001527F .       region  10001   20000   0.00    +       .       cov=12,27,34;cov2=25.351,5.830;gaps=0,0
001527F .       region  20001   30000   0.00    +       .       cov=2,8,12;cov2=6.611,3.566;gaps=0,0
001527F .       region  30001   40000   0.00    +       .       cov=1,3,5;cov2=3.303,1.101;gaps=0,0
001527F .       region  40001   53128   0.00    +       .       cov=0,1,4;cov2=1.102,1.374;gaps=2,6331

how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now) Thanks, Dario

dcopetti avatar Sep 05 '17 12:09 dcopetti