apps-scripts
apps-scripts copied to clipboard
losing lots of haplotigs
Hello, I am using this script on an Unzipped assembly (1.3 Gb in total) with default settings (50% of bases to be out). The resulting fasta has many less sequences (5449 vs 7905) and only 65 Mb less (1295 Mb now). I would be fine with this so few data less, but I see that in some cases the sequences to be dropped are the majority of haplotigs of a primary contig and have very low coverage. This is an example:
000198F . region 1 100000 0.00 + . cov=2,30,45;cov2=30.526,7.517;gaps=0,0
000198F . region 100001 200000 0.00 + . cov=7,40,58;cov2=35.509,15.567;gaps=0,0
000198F . region 200001 300000 0.00 + . cov=23,39,63;cov2=39.233,7.091;gaps=0,0
000198F . region 300001 400000 0.00 + . cov=19,34,51;cov2=34.301,7.204;gaps=0,0
000198F . region 400001 500000 0.00 + . cov=22,32,50;cov2=33.162,5.386;gaps=0,0
000198F . region 500001 600000 0.00 + . cov=20,30,47;cov2=30.353,5.867;gaps=0,0
000198F . region 600001 700000 0.00 + . cov=27,45,62;cov2=44.970,6.577;gaps=0,0
000198F . region 700001 800000 0.00 + . cov=23,36,60;cov2=36.940,6.550;gaps=0,0
000198F . region 800001 900000 0.00 + . cov=29,52,72;cov2=50.448,9.843;gaps=0,0
000198F . region 900001 1000000 0.00 + . cov=15,30,57;cov2=31.219,8.639;gaps=0,0
000198F . region 1000001 1100000 0.00 + . cov=24,40,56;cov2=38.645,7.794;gaps=0,0
000198F . region 1100001 1200000 0.00 + . cov=9,39,60;cov2=36.081,14.970;gaps=0,0
000198F . region 1200001 1305962 0.00 + . cov=0,71,114;cov2=65.461,29.555;gaps=1,1415
Haplotigs I would keep:
000198F_001 . region 1 100000 0.00 + . cov=0,37,60;cov2=37.897,11.139;gaps=1,218
000198F_001 . region 100001 200000 0.00 + . cov=25,45,58;cov2=43.812,6.803;gaps=0,0
000198F_001 . region 200001 377557 0.00 + . cov=1,35,52;cov2=32.771,11.745;gaps=0,0
000198F_005 . region 1 10000 0.00 + . cov=0,5,13;cov2=5.601,4.808;gaps=1,1747
000198F_005 . region 10001 20000 0.00 + . cov=12,18,23;cov2=17.811,3.420;gaps=0,0
000198F_005 . region 20001 30000 0.00 + . cov=10,23,27;cov2=20.386,5.018;gaps=0,0
000198F_005 . region 30001 40000 0.00 + . cov=1,2,10;cov2=3.544,2.454;gaps=0,0
000198F_005 . region 40001 54133 0.00 + . cov=0,4,7;cov2=4.076,1.795;gaps=1,736
000198F_007 . region 1 10000 0.00 + . cov=0,16,25;cov2=15.050,6.948;gaps=1,3
000198F_007 . region 10001 20000 0.00 + . cov=24,38,44;cov2=36.142,5.420;gaps=0,0
000198F_007 . region 20001 30000 0.00 + . cov=26,36,40;cov2=33.973,4.735;gaps=0,0
000198F_007 . region 30001 44674 0.00 + . cov=1,12,31;cov2=15.004,9.265;gaps=0,0
Haplotigs to remove:
000198F_002 . region 1 5000 0.00 + . cov=0,5,6;cov2=4.346,1.405;gaps=1,245
000198F_002 . region 5001 10000 0.00 + . cov=4,7,9;cov2=6.964,1.279;gaps=0,0
000198F_002 . region 10001 15000 0.00 + . cov=2,3,4;cov2=2.851,0.626;gaps=0,0
000198F_002 . region 15001 23832 0.00 + . cov=0,1,2;cov2=1.151,0.501;gaps=1,543
000198F_003 . region 1 10000 0.00 + . cov=0,23,33;cov2=19.876,9.618;gaps=1,89
000198F_003 . region 10001 20000 0.00 + . cov=21,32,39;cov2=31.287,4.271;gaps=0,0
000198F_003 . region 20001 30000 0.00 + . cov=2,4,21;cov2=7.073,5.898;gaps=0,0
000198F_003 . region 30001 40596 0.00 + . cov=1,3,6;cov2=3.109,1.130;gaps=0,0
000198F_004 . region 1 5000 0.00 + . cov=0,1,2;cov2=0.667,0.668;gaps=3,2225
000198F_004 . region 5001 10000 0.00 + . cov=2,4,6;cov2=4.150,1.239;gaps=0,0
000198F_004 . region 10001 15000 0.00 + . cov=5,6,7;cov2=6.286,0.455;gaps=0,0
000198F_004 . region 15001 22055 0.00 + . cov=0,5,7;cov2=4.119,2.193;gaps=1,818
000198F_006 . region 1 5000 0.00 + . cov=0,0,1;cov2=0.459,0.498;gaps=1,2707
000198F_006 . region 5001 10000 0.00 + . cov=1,1,2;cov2=1.060,0.237;gaps=0,0
000198F_006 . region 10001 15000 0.00 + . cov=2,3,5;cov2=2.937,0.773;gaps=0,0
000198F_006 . region 15001 20148 0.00 + . cov=0,2,3;cov2=1.702,1.286;gaps=1,1680
000198F_008 . region 1 5000 0.00 + . cov=0,10,15;cov2=10.068,2.809;gaps=1,72
000198F_008 . region 5001 10000 0.00 + . cov=15,18,21;cov2=18.249,1.234;gaps=0,0
000198F_008 . region 10001 15000 0.00 + . cov=7,12,17;cov2=11.872,2.082;gaps=0,0
000198F_008 . region 15001 20000 0.00 + . cov=2,3,7;cov2=4.031,2.002;gaps=0,0
000198F_008 . region 20001 29084 0.00 + . cov=0,1,4;cov2=0.790,1.003;gaps=3,4217
Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?
At the same time, I would remove 171 primary contigs, that usually have high ctg ID number and low coverage again:
001665F . region 1 10000 0.00 + . cov=0,14,16;cov2=11.141,4.683;gaps=1,232
001665F . region 10001 20000 0.00 + . cov=7,12,15;cov2=11.800,2.034;gaps=0,0
001665F . region 20001 30000 0.00 + . cov=4,6,8;cov2=5.690,1.085;gaps=0,0
001665F . region 30001 44399 0.00 + . cov=1,7,11;cov2=7.110,2.019;gaps=0,0
or better this:
001527F . region 1 10000 0.00 + . cov=1,13,45;cov2=18.162,15.221;gaps=0,0
001527F . region 10001 20000 0.00 + . cov=12,27,34;cov2=25.351,5.830;gaps=0,0
001527F . region 20001 30000 0.00 + . cov=2,8,12;cov2=6.611,3.566;gaps=0,0
001527F . region 30001 40000 0.00 + . cov=1,3,5;cov2=3.303,1.101;gaps=0,0
001527F . region 40001 53128 0.00 + . cov=0,1,4;cov2=1.102,1.374;gaps=2,6331
how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now) Thanks, Dario