How to interpret the mmi index file?

Open ChenDepp opened this issue 4 years ago • 3 comments

Hi everyone! I have a question about the reference genome index file constructed by minimap2. Is there a detailed document describing it, similar to the SAM file format document? Looking forward to your reply, thanks.

ChenDepp avatar Sep 26 '21 06:09 ChenDepp

Hi ChenDepp, as far as I can tell, there is little exact documentation of the file format. However, you can dissect the loading and storing routines to understand how the data is to be interpreted. The main magic happens in the mm_idx_load and mm_idx_dump functions. If you are interested, I have reimplemented the main index functions in Python to make analysis easier, and I can share this with you.

The main parts of the file format are as follows. First, a magic string ("MMI\x02", MM_IDX_MAGIC in the source) is stored in the first four bytes. This is followed by five integers: the minimizer window size w and k-mer length k, a value of 14 that I think is the number of bucket bits (giving 2**14 bins), the number of sequences, and some flags in the last value. After this short header, the sequence information is stored: the number of characters in the name, the name in ASCII, and the sequence length as an integer. This is repeated for each sequence. It is then followed by the bulk of the data, which corresponds to iterating over all buckets in the hashmap.
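For illustration, the header and sequence table described above could be read roughly like this in Python. This is a minimal sketch and not code from minimap2 or from the gist: the function names `read_mmi_header` and `build_example` are made up, and the field widths (five little-endian unsigned 32-bit integers, a one-byte name length, a 32-bit sequence length) are assumptions based on my reading of mm_idx_dump, so check them against the C source before relying on them. It stops before the bucket data.

```python
import io
import struct

MAGIC = b"MMI\x02"  # MM_IDX_MAGIC in the minimap2 source


def read_mmi_header(fp):
    """Read the magic string, the five header integers and the sequence table."""
    magic = fp.read(4)
    if magic != MAGIC:
        raise ValueError(f"not an mmi index: magic={magic!r}")
    # Assumption: five little-endian unsigned 32-bit integers.
    w, k, b, n_seq, flags = struct.unpack("<5I", fp.read(20))
    seqs = []
    for _ in range(n_seq):
        # Assumption: the name length is one byte, the sequence length a uint32.
        (name_len,) = struct.unpack("<B", fp.read(1))
        name = fp.read(name_len).decode("ascii")
        (seq_len,) = struct.unpack("<I", fp.read(4))
        seqs.append((name, seq_len))
    return {"w": w, "k": k, "b": b, "n_seq": n_seq, "flags": flags, "seqs": seqs}


def build_example():
    """Build a tiny synthetic header (not a real index) to exercise the parser."""
    buf = MAGIC + struct.pack("<5I", 10, 15, 14, 2, 0)
    for name, length in (("chr1", 1000), ("chr2", 500)):
        raw = name.encode("ascii")
        buf += struct.pack("<B", len(raw)) + raw + struct.pack("<I", length)
    return buf


if __name__ == "__main__":
    print(read_mmi_header(io.BytesIO(build_example())))
```

To try it on a real index, open the file in binary mode, for example `with open("ref.mmi", "rb") as fp: read_mmi_header(fp)`, where `ref.mmi` is a placeholder file name.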

I hope this goes in the direction you were expecting.

christian-lanius avatar Sep 27 '21 11:09 christian-lanius

@christian-lanius
Thank you, I will dissect the mm_idx_load and mm_idx_dump functions. However, I am not good at C, so if you can provide the corresponding Python code, it will save me a lot of trouble.

ChenDepp avatar Sep 27 '21 15:09 ChenDepp

I have shared the respective parts in this gist: https://gist.github.com/christian-lanius/8b8d6a38e35b93783ff7d6236211ff5e

I have not used this code in a while, but I think it used to work back in the day. You can ignore the commented-out parts. The code is very unpythonic, mainly because I byte-cast my way around, and some of it is a bit ugly for the sake of performance.

christian-lanius avatar Sep 28 '21 07:09 christian-lanius