BioAlignments.jl icon indicating copy to clipboard operation
BioAlignments.jl copied to clipboard

`ref2seq` returns a position outside of the sequence length

Open MillironX opened this issue 2 years ago • 0 comments

Expected Behavior

The ref2seq function should give a valid index that can be used to find the corresponding base in a BAM.sequence.

Current Behavior

ref2seq returns an invalid index position that is greater than the total length of the sample sequence.

Possible Solution / Implementation

My hunch is that ref2seq isn't smart enough to deal with the 'Hard Clip' (H) operation when counting anchors, as I've only encountered this bug on BAM alignments that start with H. I've looked at the source, but can't determine without attaching a debugger if that is indeed the case.

Steps to Reproduce (for bugs)

  1. Download the offending reference.fasta and sample.bam
  2. Run the following
using BioAlignments
using FASTX
using XAM

# Get the reference sequence
reference_reader   = open(FASTA.Reader, "reference.fasta")
reference_record   = read(reference_reader)
reference_sequence = sequence(reference_record)
close(reference_reader)

# Read the problematic alignment
bam_reader = open(BAM.Reader, "sample.bam")
bam_record = read(bam_reader)
aligned_sequence = AlignedSequence(reference_sequence, BAM.alignment(bam_record))

# Let's look at postion 50 in the reference
reference_sequence[50]

# Is reference position 50 within the BAM sequence?
(50 in BAM.position(bam_record):BAM.rightposition(bam_record)) && println("it is")

# Find where reference position 50 is relative to the sequence
pos = ref2seq(aligned_sequence, 50)[1]

# Look at position 50 within the sequence
BAM.sequence(bam_record)[pos]

close(bam_reader)

Output

DNA_G
it is
ERROR: BoundsError: attempt to access BioSequences.LongDNASeq at index [284]
Stacktrace:
 [1] checkbounds
   @ ~/.julia/packages/BioSequences/k4j4J/src/biosequence/indexing.jl:117 [inlined]
 [2] getindex(seq::BioSequences.LongDNASeq, i::Int64)
   @ BioSequences ~/.julia/packages/BioSequences/k4j4J/src/biosequence/indexing.jl:142
 [3] top-level scope
   @ REPL[37]:1

Context

I'm trying to identify and compare variants at specific locations along the reference genome. Basically, I'd like to call a function with the reference genome position and the BAM alignment, and get the base that was called at that position. The ultimate goal is using this data in variant calling-type applications.

Your Environment

  • Package Version used: 2.0.0
  • Julia Version used: 1.6.1
  • Operating System and version (desktop or mobile): Fedora 34 (x86_64)
  • Link to your project: N/A
  • Installed packages
    • BioAlignments
    • BioSequences
    • BioSymbols
    • CSV
    • Combinatorics
    • DataFrames
    • FASTX
    • HypothesisTests
    • XAM

MillironX avatar Aug 16 '21 22:08 MillironX