minigraph Fragmented back-mapping in a 2-sample graph

Hello Heng, After building a graph from two high quality assemblies I am in the process of tracing the paths through the graph. When mapping back the original sequences the first one, which is my reference, maps back quite nicely, giving me a single path retracing the entire chromosome. The second assembly, however, ends up mapping in a very fragmented way. When looking at the gaf file I notice cases where a much longer alignment is apparent but the program breaks it up.

Below is an example. The first mapping terminates at segment s485 and the second mapping begins with it at almost the exact position on the query. The start position on the second path is not 1 but 388910, which suggests that perhaps there is a large deletion on the query sequence. Is this the reason for the split? Following it there are two additional short and weak hits which are also labeled as Primary, but I'm not sure they are relevant to the split.

l used the following parameters: -x asm --vc --min-cov-blen=10000. Could you please help me better understand this behavior?


smpl_Per1.pseudomolecule_1      64599618        19670355        21354424        +       >s447>s448>s449>s2259>s451>s2260>s453>s2261>s455>s2262>s457>s2263>s458>s460>s2264>s461>s463>s2265>s465>s2266>s466>s468>s470>s472>s474>s2267>s476>s478>s479>s480>s481>s482>s2270>s484>s485       2405973   184122  2015775 1176009 1998796 60      tp:A:P  cm:i:200540     s1:i:765412     s2:i:21952      dv:f:0.0216
smpl_Per1.pseudomolecule_1      64599618        21354444        22017736        +       >s485>s487>s2877>s489>s490>s2879>s492>s2880>s494>s2881>s495>s497>s499>s501>s503>s2882>s505>s2883>s506>s2884>s508>s2885>s510>s512>s2886>s514>s2887>s515>s2888>s517>s519>s2889>s521>s2890>s523    1434328   388910  1181586 599124  795822  60      tp:A:P  cm:i:101971     s1:i:483101     s2:i:1766       dv:f:0.0084

Dec 01 '20 20:12 egoltsman

Sorry, I clipped off the third and forth alignments. Here is the full set. Thanks!

smpl_Per1.pseudomolecule_1      64599618        19670355        21354424        +       >s447>s448>s449>s2259>s451>s2260>s453>s2261>s455>s2262>s457>s2263>s458>s460>s2264>s461>s463>s2265>s465>s2266>s466>s468>s470>s472>s474>s2267>s476>s478>s479>s480>s481>s482>s2270>s484>s485       2405973   184122  2015775 1176009 1998796 60      tp:A:P  cm:i:200540     s1:i:765412     s2:i:21952      dv:f:0.0216
smpl_Per1.pseudomolecule_1      64599618        21354444        22017736        +       >s485>s487>s2877>s489>s490>s2879>s492>s2880>s494>s2881>s495>s497>s499>s501>s503>s2882>s505>s2883>s506>s2884>s508>s2885>s510>s512>s2886>s514>s2887>s515>s2888>s517>s519>s2889>s521>s2890>s523    1434328   388910  1181586 599124  795822  60      tp:A:P  cm:i:101971     s1:i:483101     s2:i:1766       dv:f:0.0084
smpl_Per1.pseudomolecule_1      64599618        31820127        31823848        +       <s485   528519  384850  388540  1497    3739    14      tp:A:P  cm:i:157        s1:i:1285       s2:i:1190       dv:f:0.0739
smpl_Per1.pseudomolecule_1      64599618        31836689        31839332        +       >s485   528519  460127  462753  1150    2646    13      tp:A:P  cm:i:110        s1:i:1053       s2:i:0  dv:f:0.0775

Dec 01 '20 20:12 egoltsman

In the asm/ggs mode, minigraph by default only aligns through gaps shorter than 100kb. Is the gap longer than 100kb?

Dec 02 '20 05:12 lh3

If there is a large gap/deletion, shouldn't it have resulted a split segment during the graph construction? Here we have the first mapping ending about 138kb into s485, and the second mapping starting 389kb into s485. So that's a gap of about 251kb. There are no other long mappings in this region (not counting the 3rth and 4th hits as they're short). Can you suggest reasons for why this segment didn't split during graph construction?

Dec 02 '20 19:12 egoltsman

Can you suggest reasons for why this segment didn't split during graph construction?

How do you know this? Probably this region is not used for graph construction.

Dec 02 '20 20:12 lh3

I see. And what would be the possible reasons for a large region like this to not be used in the construction? Just to clarify, the graph I'm mapping to was constructed using two samples: the reference and this smpl_Per1.pseudomolecule_1. So I would expect this region to be split further because of this large gap/difference. Here's how this segment appears in the GFA (I'm skipping the sequence in the S line:

S       s485    LN:i:528519     SN:Z:Ref_CHR1   SO:i:23602459   SR:i:0
L       s484    +       s485    +       0M      SR:i:0  L1:i:10965      L2:i:528519
L       s485    +       s487    +       0M      SR:i:1  L1:i:528519     L2:i:8428
L       s485    +       s486    +       0M      SR:i:0  L1:i:528519     L2:i:1369
L       s2271   +       s485    +       0M      SR:i:1  L1:i:124        L2:i:528519

Dec 02 '20 20:12 egoltsman

These L-lines don't show whether that 251kb region is used in graph construction.

what would be the possible reasons for a large region like this to not be used in the construction?

The default cutoff is 100kb. A larger threshold increases running time and might (I am not sure) affect alignment quality. You may try to increase the threshold with -l300k.

Dec 02 '20 20:12 lh3

Thanks, I'll try that. So the take-home message is that if a large (>100kb) region cannot be "chained" during the construction stage, it will not affect the graph and not induce any segements. Would this be an accurate statement?

Dec 03 '20 01:12 egoltsman