Fragmented back-mapping in a 2-sample graph
Hello Heng, After building a graph from two high quality assemblies I am in the process of tracing the paths through the graph. When mapping back the original sequences the first one, which is my reference, maps back quite nicely, giving me a single path retracing the entire chromosome. The second assembly, however, ends up mapping in a very fragmented way. When looking at the gaf file I notice cases where a much longer alignment is apparent but the program breaks it up.
Below is an example. The first mapping terminates at segment s485 and the second mapping begins with it at almost the exact position on the query. The start position on the second path is not 1 but 388910, which suggests that perhaps there is a large deletion on the query sequence. Is this the reason for the split? Following it there are two additional short and weak hits which are also labeled as Primary, but I'm not sure they are relevant to the split.
l used the following parameters: -x asm --vc --min-cov-blen=10000. Could you please help me better understand this behavior?
smpl_Per1.pseudomolecule_1 64599618 19670355 21354424 + >s447>s448>s449>s2259>s451>s2260>s453>s2261>s455>s2262>s457>s2263>s458>s460>s2264>s461>s463>s2265>s465>s2266>s466>s468>s470>s472>s474>s2267>s476>s478>s479>s480>s481>s482>s2270>s484>s485 2405973 184122 2015775 1176009 1998796 60 tp:A:P cm:i:200540 s1:i:765412 s2:i:21952 dv:f:0.0216
smpl_Per1.pseudomolecule_1 64599618 21354444 22017736 + >s485>s487>s2877>s489>s490>s2879>s492>s2880>s494>s2881>s495>s497>s499>s501>s503>s2882>s505>s2883>s506>s2884>s508>s2885>s510>s512>s2886>s514>s2887>s515>s2888>s517>s519>s2889>s521>s2890>s523 1434328 388910 1181586 599124 795822 60 tp:A:P cm:i:101971 s1:i:483101 s2:i:1766 dv:f:0.0084
Sorry, I clipped off the third and forth alignments. Here is the full set. Thanks!
smpl_Per1.pseudomolecule_1 64599618 19670355 21354424 + >s447>s448>s449>s2259>s451>s2260>s453>s2261>s455>s2262>s457>s2263>s458>s460>s2264>s461>s463>s2265>s465>s2266>s466>s468>s470>s472>s474>s2267>s476>s478>s479>s480>s481>s482>s2270>s484>s485 2405973 184122 2015775 1176009 1998796 60 tp:A:P cm:i:200540 s1:i:765412 s2:i:21952 dv:f:0.0216
smpl_Per1.pseudomolecule_1 64599618 21354444 22017736 + >s485>s487>s2877>s489>s490>s2879>s492>s2880>s494>s2881>s495>s497>s499>s501>s503>s2882>s505>s2883>s506>s2884>s508>s2885>s510>s512>s2886>s514>s2887>s515>s2888>s517>s519>s2889>s521>s2890>s523 1434328 388910 1181586 599124 795822 60 tp:A:P cm:i:101971 s1:i:483101 s2:i:1766 dv:f:0.0084
smpl_Per1.pseudomolecule_1 64599618 31820127 31823848 + <s485 528519 384850 388540 1497 3739 14 tp:A:P cm:i:157 s1:i:1285 s2:i:1190 dv:f:0.0739
smpl_Per1.pseudomolecule_1 64599618 31836689 31839332 + >s485 528519 460127 462753 1150 2646 13 tp:A:P cm:i:110 s1:i:1053 s2:i:0 dv:f:0.0775
In the asm/ggs mode, minigraph by default only aligns through gaps shorter than 100kb. Is the gap longer than 100kb?
If there is a large gap/deletion, shouldn't it have resulted a split segment during the graph construction? Here we have the first mapping ending about 138kb into s485, and the second mapping starting 389kb into s485. So that's a gap of about 251kb. There are no other long mappings in this region (not counting the 3rth and 4th hits as they're short). Can you suggest reasons for why this segment didn't split during graph construction?
Can you suggest reasons for why this segment didn't split during graph construction?
How do you know this? Probably this region is not used for graph construction.
I see. And what would be the possible reasons for a large region like this to not be used in the construction? Just to clarify, the graph I'm mapping to was constructed using two samples: the reference and this smpl_Per1.pseudomolecule_1. So I would expect this region to be split further because of this large gap/difference. Here's how this segment appears in the GFA (I'm skipping the sequence in the S line:
S s485 LN:i:528519 SN:Z:Ref_CHR1 SO:i:23602459 SR:i:0
L s484 + s485 + 0M SR:i:0 L1:i:10965 L2:i:528519
L s485 + s487 + 0M SR:i:1 L1:i:528519 L2:i:8428
L s485 + s486 + 0M SR:i:0 L1:i:528519 L2:i:1369
L s2271 + s485 + 0M SR:i:1 L1:i:124 L2:i:528519
These L-lines don't show whether that 251kb region is used in graph construction.
what would be the possible reasons for a large region like this to not be used in the construction?
The default cutoff is 100kb. A larger threshold increases running time and might (I am not sure) affect alignment quality. You may try to increase the threshold with -l300k.
Thanks, I'll try that. So the take-home message is that if a large (>100kb) region cannot be "chained" during the construction stage, it will not affect the graph and not induce any segements. Would this be an accurate statement?