Minimap2 --split-prefix option

Open YasirKusay opened this issue 2 years ago • 5 comments

Hi, I would like to know more about the --split-prefix option.

I have a very large (1200 GB) index that I want to align against, and of course I don't want to load the entire index into memory. To see how it works, I tested minimap2 --split-prefix on an assembly of 17000 contigs with relatively small resources (48 GB of RAM, 16 threads and 150 GB of disk space). I expected the command to fail, as it would run out of disk space to store the index partitions, but it ran to completion in 134 minutes. I am confused: did alignment happen against the entire index, or did something else happen?

If this helps, I did inspect the index partitions during execution and they were about 20 MB each.

YasirKusay avatar Mar 13 '22 04:03 YasirKusay

Hi, this is just an extension to the above. I would like to be able to run minimap2 incrementally (e.g. load 4 GB of the index into RAM, align, move on to the next 4 GB of the index, and so on). I tried running the program with default settings and 5 GB and 10 GB of RAM just to see what would happen, but both runs were killed: 24781 Killed and 17044 Killed respectively.

I don't know why minimap2 failed. I thought that it loaded 4 GB of the index, aligned against it, and then moved on to the next index batch.

YasirKusay avatar Mar 14 '22 12:03 YasirKusay

Here is some information from the first implementation of --split-prefix:

It loads the index part by part, while iteratively mapping the queries to each index partition. The intermediate results will be saved as temporary files. Finally, it will go through all the temporary files and merge the results. Detailed methodology is available at https://www.nature.com/articles/s41598-019-40739-8 and technical information in https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-019-40739-8/MediaObjects/41598_2019_40739_MOESM1_ESM.pdf.
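The part-by-part mapping and final merge described above can be illustrated with a toy Python sketch. This is not minimap2's code: the scoring function, the data, the JSON temporary files and their naming are all hypothetical stand-ins, and the merge here simply keeps the highest-scoring hit per query, which is an assumption made for illustration.

```python
import json
import os
import tempfile


def score(query, ref):
    """Toy similarity: number of positions where the two strings agree."""
    return sum(a == b for a, b in zip(query, ref))


def map_with_split_prefix(queries, index_parts, prefix):
    """Map all queries to one index part at a time, write each part's hits
    to a temporary file, then merge the files into one best hit per query."""
    tmp_paths = []
    for part_no, part in enumerate(index_parts):
        # Only this one part needs to be held in memory during this pass.
        hits = {}
        for qname, qseq in queries.items():
            rname, rseq = max(part.items(), key=lambda kv: score(qseq, kv[1]))
            hits[qname] = [rname, score(qseq, rseq)]
        path = f"{prefix}.{part_no:04d}.tmp"  # hypothetical naming scheme
        with open(path, "w") as fh:
            json.dump(hits, fh)
        tmp_paths.append(path)

    # Merge step: read every temporary file, keep the best hit per query.
    merged = {}
    for path in tmp_paths:
        with open(path) as fh:
            for qname, (rname, s) in json.load(fh).items():
                if qname not in merged or s > merged[qname][1]:
                    merged[qname] = (rname, s)
        os.remove(path)
    return merged


queries = {"read1": "ACGTACGT", "read2": "TTTTGGGG"}
index_parts = [
    {"contigA": "ACGTACGA", "contigB": "CCCCCCCC"},
    {"contigC": "ACGTACGT", "contigD": "TTTTGGGA"},
]
with tempfile.TemporaryDirectory() as tmpdir:
    result = map_with_split_prefix(
        queries, index_parts, os.path.join(tmpdir, "mm2tmp")
    )
print(result)
```

In this toy, the temporary files hold per-query results rather than pieces of the index, so their total size grows with the number of queries and not with the size of the reference.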

Here are some example commands we originally used for testing on the human genome: https://github.com/hasindu2008/minimap2-arm/tree/master/misc/idxtools

  1. Is the 1200GB index a fasta file or a minimap2 index?
  2. When you say 4GB of the index, are you referring to the -I 4G option? As far as I am aware, it means 4 gigabases of the reference, which can take around 10-12 GB of RAM depending on the reference characteristics. Minimap2 also loads a batch of queries by default (1G as I remember), and that batch and the intermediate data structures need extra RAM.
  3. The disk usage for temporary files will depend on the size of your query reads rather than the reference. What is the size of your fastq?
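To make the arithmetic in point 2 concrete, here is a rough Python estimate. It only rescales the figure quoted above (4 gigabases taking about 10-12 GB), so the 2.5 and 3.0 GB-per-gigabase factors are assumptions derived from that remark, not measured values, and the function name is hypothetical. The result covers the index part alone; the query batch and intermediate data structures come on top.

```python
def index_ram_range_gb(batch_gbases, low_gb_per_gbase=2.5, high_gb_per_gbase=3.0):
    """Rough RAM range (GB) for one index part of `batch_gbases` gigabases.

    The per-gigabase factors are assumptions taken from the "4 gigabases can
    become 10-12 GB" remark; real usage depends on the reference.
    """
    return (batch_gbases * low_gb_per_gbase, batch_gbases * high_gb_per_gbase)


for batch in (1, 2, 4):
    low, high = index_ram_range_gb(batch)
    print(f"-I {batch}G -> roughly {low:.1f} to {high:.1f} GB for the index part alone")
```

Under these assumptions, a 5 GB or 10 GB memory limit is already below or near the low end of the range for a 4-gigabase part, before any query data is counted. Note that -I is an indexing option, so for a prebuilt index the part size was fixed when the index was created.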

hasindu2008 avatar Mar 14 '22 13:03 hasindu2008

Hi @hasindu2008, Thank you for your reply.

I was actually confused by the --split-prefix option initially, as I assumed that the temporary files were partitioned indexes rather than the results, so thank you for clearing that up.

  1. The 1200GB index is the actual minimap2 index.
  2. I am not referring to the -I 4G index as I already have the index. Does it still load 4 gigabases of the index while doing the alignment with the default settings?

My primary concern is to be able to use the entire 1200 GB index, with as little RAM as possible. Based on what you have told me, I think that I can achieve this using the --split-prefix option. Does that mean the default settings for alignments will load the entire index at once? If not, what is the difference between the default settings and --split-prefix option?

YasirKusay avatar Mar 14 '22 14:03 YasirKusay

If you created the minimap2 index with default options, it will have created an index with multiple parts, each with 4 Gbases. So yes, it will load only 4 gigabases at a time with default options.

The default settings will not load the whole index; it will still iteratively map part by part and output the mappings for each part. In summary, if --split-prefix is used, the mappings will be more accurate than without it. https://www.nature.com/articles/s41598-019-40739-8 contains all the information.
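As a small sketch of how the two modes differ on the command line, the following Python helper builds the argument list for each. The helper name, the file names ref.mmi and reads.fq, the temporary prefix tmp/mm2 and the map-ont preset are placeholders chosen for illustration; only the flags themselves (-a, -x, -t, --split-prefix) are minimap2 options. The commands are printed rather than executed.

```python
import shlex


def minimap2_cmd(index, reads, preset="map-ont", threads=16, split_prefix=None):
    """Build a minimap2 argument list; add --split-prefix only when requested."""
    cmd = ["minimap2", "-a", "-x", preset, "-t", str(threads)]
    if split_prefix is not None:
        # Per-part results go to temporary files under this prefix and are
        # merged once every index part has been processed.
        cmd += ["--split-prefix", split_prefix]
    cmd += [index, reads]
    return cmd


default_cmd = minimap2_cmd("ref.mmi", "reads.fq")
split_cmd = minimap2_cmd("ref.mmi", "reads.fq", split_prefix="tmp/mm2")
print(shlex.join(default_cmd))
print(shlex.join(split_cmd))
```

Both commands load a multi-part index one part at a time; the only difference is whether the per-part mappings are written out directly or merged at the end.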

hasindu2008 avatar Mar 15 '22 01:03 hasindu2008

Thanks for your reply!

I will now use minimap2 with --split-prefix as the option (I did notice that both --split-prefix and the default settings actually took similar times).

On a side note, has this tool ever been benchmarked using the NCBI NT index?

YasirKusay avatar Mar 15 '22 01:03 YasirKusay