medaka icon indicating copy to clipboard operation
medaka copied to clipboard

indexing large genome bam file

Open aoladzad-soundag opened this issue 2 years ago • 4 comments

I'm working on a very large genome (4.5 G), when we run medaka, we gets the following error which seems to be related to bam indexing process because of the large genome, I know it would work with samtools -c which generates bam index csi and tried to modified the Wrappers.py and add -c option to the samtools command, yet getting same error, so I think I need to know where should I make this modification?

[M::worker_pipeline::14674.2757.95] mapped 126569 sequences [M::worker_pipeline::15235.0637.95] mapped 127219 sequences [M::worker_pipeline::15788.7047.95] mapped 127124 sequences [M::worker_pipeline::16348.9997.95] mapped 129707 sequences [M::worker_pipeline::16904.9507.96] mapped 128537 sequences [M::worker_pipeline::17465.0347.96] mapped 127442 sequences [M::worker_pipeline::18035.1987.96] mapped 127657 sequences [M::worker_pipeline::18569.9777.96] mapped 130177 sequences [M::worker_pipeline::18903.816*7.96] mapped 77228 sequences [M::main] Version: 2.22-r1101 [M::main] CMD: minimap2 -x map-ont --MD -t 8 -a -A 2 -B 4 -O 4,24 -E 2,1 /home/dnalinux/efs/g34xl/Pisum_sativum_v1a.fa.mmi /home/dnalinux/efs/medaka/fastq/allfastq.gz [M::main] Real time: 18904.149 sec; CPU: 150472.292 sec; Peak RSS: 14.238 GB [bam_sort_core] merging from 40 files and 8 in-memory blocks... [E::hts_idx_check_range] Region 536869540..536872741 cannot be stored in a bai index. Try using a csi index [E::sam_index] Read '81baeba4-d17c-466a-ae19-73eeef0a548a' with ref_name='chr5LG3', ref_length=579269071, flags=0, pos=536869541 cannot be indexed samtools index: failed to create index for "calls_to_draft.bam": Numerical result out of range Failed to index alignment file. Failed to run alignment of reads to draft.

Thanks

aoladzad-soundag avatar Oct 15 '21 20:10 aoladzad-soundag

We have a similar problem when working with long chromosomes, for example with wheat

The offending line is

https://github.com/nanoporetech/medaka/blob/9dffdee0de4c76a9bf382b601261db43bd030c26/medaka/wrappers.py#L79

As the previous poster says, this problem could be alleviated using a csi not the default bai index.

That is, instead of using

index_cmd = ['samtools', 'index', out_bam, '-@', threads]

this should be sufficient (added -c)

index_cmd = ['samtools', 'index', '-c', out_bam, '-@', threads]

The csi index should otherwise be completely compatible with the bai, except for it's name of course.

colindaven avatar Jan 03 '24 16:01 colindaven

The wrappers.py is no longer used (in fact I don't know if it was ever used). In is the script mini_align where some change may be required in order to produce a .csi index. Even with that change there would need to be other changes within medaka to allow it to read from the resultant .csi as if I recall it assumes the existence of a .bai index file.

cjw85 avatar Jan 03 '24 16:01 cjw85

I quick look at the code suggests I might be wrong on that account and the core code qill happily accept a .csi.

cjw85 avatar Jan 03 '24 16:01 cjw85

Ok, thanks. I submitted a PR.

For others, if ONT can't or won't accept the PR due to other breakages being caused, you can modify the mini_align script from the pomoxis repo yourself.

  • The following is for a Dockerfile. Strip out the RUN if you want to modify your conda/mamba installed files directly.
  • You'll need to find and replace the path to your mini_align file on your system (use which mini_align when your medaka conda env is activated, and modify the example below)
# Modify ONT pomoxis mini_align script to use a csi not bai index
#samtools index -c -@ "${THREADS}" "${PREFIX}.bam" "${PREFIX}.bam.csi" \
RUN sed -i 's/samtools index/samtools index -c /g' /opt/micromamba/bin/mini_align
RUN sed -i 's/bai/csi/g' /opt/micromamba/bin/mini_align
RUN chmod a+x /opt/micromamba/bin/mini_align

colindaven avatar Jan 04 '24 09:01 colindaven