minimap2 icon indicating copy to clipboard operation
minimap2 copied to clipboard

biased distribution of short alignments between ref and alt chromosomes

Open xtan1221 opened this issue 1 year ago • 0 comments

Dear Heng and team,

Recently I ran minimap2 to align HG002 (alt) to GRCh37 (ref) with default settings to facilitate SV detection. The alignment result showed biased distribution of short alignments between ref and alt chromosomes as shown in the below figure.

image all chromosomes are concatenated together. X axis shows the ref chromosomes (chr1, chr2, ... chrX, chrY) and Y axis shows the alt chromosomes (for maternal (left): chr1, chr2, ... chrX, for paternal (right): chr1, chr2, ... chrY). Each alignment is visualized as a rectangle on the 2D space. As you can see within the two chr subsets chr1-chr12 and chr13-chrX on ref and alt, the number of alignments are quite small (a few to dozens for each non-homolog chr pair), while much more alignments (mostly are very short) were found between those two subsets (thousands to more than 10 thousands for each non-homolog chr pair).

I believe this should not be something real? but rather artifact of the program? To check that I also run minimap2 between HG002 maternal and paternal with each of them as ref: image the left figure shows maternal as ref, while the right figure shows paternal as ref. If these short alignments are real, we may expect to see somehow same pattern on those two figures, but you can see it is not the case based on the alignments between maternal chrX vs paternal chrs as highlighted by the red colored rectangles.

I am not sure if I made some stupid mistake while running the program or this is something known. I really appreciate it if you can put some thoughts on this. Thanks!

Here is the sample scripts showing the parameters I used to run minimap2:

sudo time minimap2-2.24_x64-linux/minimap2 --eqx -c -I 2G -t 8 -K100M $ref $qry > $out_paf

here is the log message for the run between HG002 maternal vs GRCh37:

[M::mm_idx_gen::37.525*1.36] collected minimizers
[M::mm_idx_gen::41.032*1.87] sorted minimizers
[M::main::41.032*1.87] loaded/built the index for 12 target sequence(s)
[M::mm_mapopt_update::42.767*1.83] mid_occ = 472
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 12
[M::mm_idx_stat::43.663*1.82] distinct minimizers: 85926560 (42.42% are singletons); average occurrences: 4.390; average spacing: 5.527; total length: 2084766301
[M::worker_pipeline::5752.738*1.01] mapped 1 sequences
[M::worker_pipeline::9167.236*1.00] mapped 1 sequences
[M::worker_pipeline::11638.852*1.00] mapped 1 sequences
[M::worker_pipeline::14189.550*1.00] mapped 1 sequences
[M::worker_pipeline::16587.553*1.00] mapped 1 sequences
[M::worker_pipeline::18497.698*1.00] mapped 1 sequences
[M::worker_pipeline::20891.452*1.00] mapped 1 sequences
[M::worker_pipeline::22254.003*1.00] mapped 1 sequences
[M::worker_pipeline::25949.570*1.00] mapped 1 sequences
[M::worker_pipeline::27231.094*1.00] mapped 1 sequences
[M::worker_pipeline::28570.392*1.00] mapped 1 sequences
[M::worker_pipeline::29814.028*1.00] mapped 1 sequences
[M::worker_pipeline::30943.097*1.00] mapped 1 sequences
[M::worker_pipeline::32759.878*1.04] mapped 2 sequences
[M::worker_pipeline::34449.020*1.07] mapped 2 sequences
[M::worker_pipeline::35366.766*1.09] mapped 2 sequences
[M::worker_pipeline::36187.403*1.10] mapped 2 sequences
[M::worker_pipeline::41524.617*1.10] mapped 2 sequences
[M::mm_idx_gen::41541.365*1.10] collected minimizers
[M::mm_idx_gen::41542.747*1.10] sorted minimizers
[M::main::41542.747*1.10] loaded/built the index for 12 target sequence(s)
[M::mm_mapopt_update::41542.747*1.10] mid_occ = 472
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 12
[M::mm_idx_stat::41543.613*1.10] distinct minimizers: 57169946 (54.08% are singletons); average occurrences: 2.833; average spacing: 6.241; total length: 1010911111
[M::worker_pipeline::49643.151*1.08] mapped 1 sequences
[M::worker_pipeline::56024.184*1.07] mapped 1 sequences
[M::worker_pipeline::60867.719*1.07] mapped 1 sequences
[M::worker_pipeline::66034.042*1.06] mapped 1 sequences
[M::worker_pipeline::70501.731*1.06] mapped 1 sequences
[M::worker_pipeline::74339.415*1.06] mapped 1 sequences
[M::worker_pipeline::77800.803*1.05] mapped 1 sequences
[M::worker_pipeline::80296.826*1.05] mapped 1 sequences
[M::worker_pipeline::84942.481*1.05] mapped 1 sequences
[M::worker_pipeline::87055.314*1.05] mapped 1 sequences
[M::worker_pipeline::89403.354*1.05] mapped 1 sequences
[M::worker_pipeline::91653.667*1.05] mapped 1 sequences
[M::worker_pipeline::92441.662*1.05] mapped 1 sequences
[M::worker_pipeline::94135.373*1.06] mapped 2 sequences
[M::worker_pipeline::95684.181*1.06] mapped 2 sequences
[M::worker_pipeline::96362.825*1.07] mapped 2 sequences
[M::worker_pipeline::96961.474*1.07] mapped 2 sequences
[M::worker_pipeline::100219.373*1.08] mapped 2 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: /home/ubuntu/installedProgram/minimap2-2.24_x64-linux/minimap2 --eqx -c -I 2G -t 8 -K100M /data/genome/GRCh37/GRCh37_base_all.rename.fasta /data/genome/HG002/GCA_021951015.1_HG002.mat.cur.20211005_genomic.chrom.only.renamed.fna
[M::main] Real time: 100219.536 sec; CPU: 107796.930 sec; Peak RSS: 49.227 GB
107553.57user 243.72system 27:50:19elapsed 107%CPU (0avgtext+0avgdata 51618108maxresident)k
12066504inputs+1077728outputs (1major+228417083minor)pagefaults 0swaps

xtan1221 avatar Apr 07 '23 16:04 xtan1221