apps-scripts
apps-scripts copied to clipboard
losing lots of haplotigs
Hello, I am using this script on an Unzipped assembly (1.3 Gb in total) with default settings (50% of bases to be out). The resulting fasta has many less sequences (5449 vs 7905) and only 65 Mb less (1295 Mb now). I would be fine with this so few data less, but I see that in some cases the sequences to be dropped are the majority of haplotigs of a primary contig and have very low coverage. This is an example:
000198F . region 1 100000 0.00 + . cov=2,30,45;cov2=30.526,7.517;gaps=0,0
000198F . region 100001 200000 0.00 + . cov=7,40,58;cov2=35.509,15.567;gaps=0,0
000198F . region 200001 300000 0.00 + . cov=23,39,63;cov2=39.233,7.091;gaps=0,0
000198F . region 300001 400000 0.00 + . cov=19,34,51;cov2=34.301,7.204;gaps=0,0
000198F . region 400001 500000 0.00 + . cov=22,32,50;cov2=33.162,5.386;gaps=0,0
000198F . region 500001 600000 0.00 + . cov=20,30,47;cov2=30.353,5.867;gaps=0,0
000198F . region 600001 700000 0.00 + . cov=27,45,62;cov2=44.970,6.577;gaps=0,0
000198F . region 700001 800000 0.00 + . cov=23,36,60;cov2=36.940,6.550;gaps=0,0
000198F . region 800001 900000 0.00 + . cov=29,52,72;cov2=50.448,9.843;gaps=0,0
000198F . region 900001 1000000 0.00 + . cov=15,30,57;cov2=31.219,8.639;gaps=0,0
000198F . region 1000001 1100000 0.00 + . cov=24,40,56;cov2=38.645,7.794;gaps=0,0
000198F . region 1100001 1200000 0.00 + . cov=9,39,60;cov2=36.081,14.970;gaps=0,0
000198F . region 1200001 1305962 0.00 + . cov=0,71,114;cov2=65.461,29.555;gaps=1,1415
Haplotigs I would keep:
000198F_001 . region 1 100000 0.00 + . cov=0,37,60;cov2=37.897,11.139;gaps=1,218
000198F_001 . region 100001 200000 0.00 + . cov=25,45,58;cov2=43.812,6.803;gaps=0,0
000198F_001 . region 200001 377557 0.00 + . cov=1,35,52;cov2=32.771,11.745;gaps=0,0
000198F_005 . region 1 10000 0.00 + . cov=0,5,13;cov2=5.601,4.808;gaps=1,1747
000198F_005 . region 10001 20000 0.00 + . cov=12,18,23;cov2=17.811,3.420;gaps=0,0
000198F_005 . region 20001 30000 0.00 + . cov=10,23,27;cov2=20.386,5.018;gaps=0,0
000198F_005 . region 30001 40000 0.00 + . cov=1,2,10;cov2=3.544,2.454;gaps=0,0
000198F_005 . region 40001 54133 0.00 + . cov=0,4,7;cov2=4.076,1.795;gaps=1,736
000198F_007 . region 1 10000 0.00 + . cov=0,16,25;cov2=15.050,6.948;gaps=1,3
000198F_007 . region 10001 20000 0.00 + . cov=24,38,44;cov2=36.142,5.420;gaps=0,0
000198F_007 . region 20001 30000 0.00 + . cov=26,36,40;cov2=33.973,4.735;gaps=0,0
000198F_007 . region 30001 44674 0.00 + . cov=1,12,31;cov2=15.004,9.265;gaps=0,0
Haplotigs to remove:
000198F_002 . region 1 5000 0.00 + . cov=0,5,6;cov2=4.346,1.405;gaps=1,245
000198F_002 . region 5001 10000 0.00 + . cov=4,7,9;cov2=6.964,1.279;gaps=0,0
000198F_002 . region 10001 15000 0.00 + . cov=2,3,4;cov2=2.851,0.626;gaps=0,0
000198F_002 . region 15001 23832 0.00 + . cov=0,1,2;cov2=1.151,0.501;gaps=1,543
000198F_003 . region 1 10000 0.00 + . cov=0,23,33;cov2=19.876,9.618;gaps=1,89
000198F_003 . region 10001 20000 0.00 + . cov=21,32,39;cov2=31.287,4.271;gaps=0,0
000198F_003 . region 20001 30000 0.00 + . cov=2,4,21;cov2=7.073,5.898;gaps=0,0
000198F_003 . region 30001 40596 0.00 + . cov=1,3,6;cov2=3.109,1.130;gaps=0,0
000198F_004 . region 1 5000 0.00 + . cov=0,1,2;cov2=0.667,0.668;gaps=3,2225
000198F_004 . region 5001 10000 0.00 + . cov=2,4,6;cov2=4.150,1.239;gaps=0,0
000198F_004 . region 10001 15000 0.00 + . cov=5,6,7;cov2=6.286,0.455;gaps=0,0
000198F_004 . region 15001 22055 0.00 + . cov=0,5,7;cov2=4.119,2.193;gaps=1,818
000198F_006 . region 1 5000 0.00 + . cov=0,0,1;cov2=0.459,0.498;gaps=1,2707
000198F_006 . region 5001 10000 0.00 + . cov=1,1,2;cov2=1.060,0.237;gaps=0,0
000198F_006 . region 10001 15000 0.00 + . cov=2,3,5;cov2=2.937,0.773;gaps=0,0
000198F_006 . region 15001 20148 0.00 + . cov=0,2,3;cov2=1.702,1.286;gaps=1,1680
000198F_008 . region 1 5000 0.00 + . cov=0,10,15;cov2=10.068,2.809;gaps=1,72
000198F_008 . region 5001 10000 0.00 + . cov=15,18,21;cov2=18.249,1.234;gaps=0,0
000198F_008 . region 10001 15000 0.00 + . cov=7,12,17;cov2=11.872,2.082;gaps=0,0
000198F_008 . region 15001 20000 0.00 + . cov=2,3,7;cov2=4.031,2.002;gaps=0,0
000198F_008 . region 20001 29084 0.00 + . cov=0,1,4;cov2=0.790,1.003;gaps=3,4217
Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?
At the same time, I would remove 171 primary contigs, that usually have high ctg ID number and low coverage again:
001665F . region 1 10000 0.00 + . cov=0,14,16;cov2=11.141,4.683;gaps=1,232
001665F . region 10001 20000 0.00 + . cov=7,12,15;cov2=11.800,2.034;gaps=0,0
001665F . region 20001 30000 0.00 + . cov=4,6,8;cov2=5.690,1.085;gaps=0,0
001665F . region 30001 44399 0.00 + . cov=1,7,11;cov2=7.110,2.019;gaps=0,0
or better this:
001527F . region 1 10000 0.00 + . cov=1,13,45;cov2=18.162,15.221;gaps=0,0
001527F . region 10001 20000 0.00 + . cov=12,27,34;cov2=25.351,5.830;gaps=0,0
001527F . region 20001 30000 0.00 + . cov=2,8,12;cov2=6.611,3.566;gaps=0,0
001527F . region 30001 40000 0.00 + . cov=1,3,5;cov2=3.303,1.101;gaps=0,0
001527F . region 40001 53128 0.00 + . cov=0,1,4;cov2=1.102,1.374;gaps=2,6331
how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now) Thanks, Dario
Hi Dario, "Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?"
It's possible that these haplotigs represent artifacts or misassemblies, possibly driven by repeats, but its not something we have looked into in detail at this point. For the example you gave, the 5 removed haplotigs for 000198F are shorter than the retained 3. It would be interesting to see how all of the haplotigs align to the primary. You could try my bash script which uses samtools and mummer to generate alignments of all haplotigs to their primary contig and then plot in assemblytics. My hunch is some of the shorter haplotigs are nested within 000198F_001.
Regarding min_cov, this can refer to the preassembly process (falcon_sense_option) which determines the depth of raw read coverage on seed reads required to call consensus versus split the seed read. But "min_cov" is also used in layout filtering and refers to the number of overlaps between preads. So neither of these parameters really capture raw read depth on contigs. Its also worth noting that the raw read coverage you are looking at is from the polishing process which is totally distinct from the process used by FALCON-Unzip, but is still useful for assessing raw support in the assembly.
"how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now)"
I would be concerned about trimming LC bases in the middle of contigs. Also, it is normal for raw read coverage to drop at the end of contigs so if removing LC sequence at contig ends, I would worry you would lose a lot of sequence. If you polish with Pilon, be sure to use a "random best" mapping strategy to avoid multiply mapped reads. Would be interested in your results.
If you are concerned about overly aggressive filtering, you could map transcripts to the contigs and only remove those that both have low coverage AND no genes.
Sarah
Thanks Sarah, I run contig 198 and I got a mixed result: like you thought, haplotigs 2, 3 and 6 are contained in 1. But 4 and 8 - that will be discarded for the lowercase/coverage issue - are on a different region of the primary contig. Trying to save them, we could edit the lowercase fraction to keep, but that will save only haplotig 8, #4 is all its length around coverage 4-5. So unless we manually inspect all the contigs, I guess there is not a straightforward way to split clearly real haplotigs from artifacts.
I agree on not trimming lowercase regions inside a contig, and after Pilon I see that the fraction of lowercase bases in all contigs is very low (mostly below 1%, sometimes up to 9%). But I don't know how Pilon assigns ATGC vs atgc.
Lastly, keeping a contig just because it has a gene does not seem a good criterion for me, it may create "fake paralogs" with deep biological implications. Unless that gene is unique in the assembly, probably.
If biology was perfect, it would be so boring....