htslib icon indicating copy to clipboard operation
htslib copied to clipboard

Breaking change with release 1.6 for malformed reads

Open ghost opened this issue 6 years ago • 2 comments

Problem

Release 1.6 introduced checks on alignments, such as inconsistent CIGARs. This is great, but the current implementation causes iterators such as sam_itr_next to subsequently flag a premature EOF which is not correct. This leads to code prematurely exiting from a contig when all the code should do is flag the existence of a malformed CIGAR (etc.) and carry on.

Release 1.5 and earlier ignores such errors in the read data, which means that release 1.6 leads to a relatively silent and breaking change in behavior, when it really should be restricted to improved data error reporting.

Also, it would be nice to have the error codes (-1, -4 etc.) written out as constants in a centralized header somewhere so it is easier for users to know how to interpret the return values.

As always, many thanks for creating and maintaining this beautiful library!

Steps to reproduce

1. Run this python3 script to generate a test BAM. pysam required.

create-bam-with-bad-read.py

"""
The purpose of this script is to create a minimal test BAM that has, for one
contig, a read in the middle with a bad CIGAR. 
"""
import pysam


# Code lifted from pysam example in docs
def create_read(n, tid, pos, bad_cigar=False):
  a = pysam.AlignedSegment()
  a.query_name = "read-{}".format(n)
  a.query_sequence = "A" * 20
  a.flag = 0b1000010 | (0b0 if tid > -1 else 0b100)
  a.reference_id = tid
  a.reference_start = pos
  a.mapping_quality = 20 if tid > -1 else 0
  a.cigar = ((0, 20),) if not bad_cigar else ((0, 21),)
  a.next_reference_id = 0
  a.next_reference_start = 199
  # a.template_length = 167
  a.query_qualities = pysam.qualitystring_to_array("<" * 20)
  return a

header = {'HD': {'VN': '1.0'},
          'SQ': [{'LN': 1575, 'SN': 'chr1'},
                 {'LN': 1584, 'SN': 'chr2'}]}

fout = 'small_test.bam'

print(pysam.__version__)

with pysam.AlignmentFile(fout, "wb", header=header) as outf:
  for n in range(10):
    outf.write(create_read(n, 0, 100 + n))

  for n in range(5):
    outf.write(create_read(n, 1, 100 + n))

  outf.write(create_read(5, 1, 100 + 5, bad_cigar=True))

  for n in range(6, 10):
    outf.write(create_read(n, 1, 100 + n))

outf.close()
pysam.index(fout)

2. Compile this minimal C++ program to read contigs

Compile with

g++ --std=c++14 simple-reader.cpp -o srdr /usr/local/lib/libhts.dylib /usr/lib/libz.dylib

simple-reader.cpp

/*
  Bare minimum program to open a BAM and print out the reads from given
  region.

  compile with

  g++ --std=c++14 simple-reader.cpp -o srdr /usr/local/lib/libhts.dylib /usr/lib/libz.dylib

  **Modify for your local install of htslib and zlib**
*/
#include <iostream>
#include <stdexcept>
#include <htslib/sam.h>


int main( int argc, char* argv[] )
{
  samFile*       fp;
  bam_hdr_t* header;
  hts_idx_t*    idx;
  hts_itr_t*   iter;

  char *fname = argv[ 1 ];
  char *contig = argv[ 2 ];

  fp = sam_open( fname , "rb" );
  if( fp == nullptr) { std::runtime_error("Error opening BAM file"); }

  header = sam_hdr_read( fp );
  if( header == nullptr) { std::runtime_error("Error opening header"); }
  
  idx = sam_index_load( fp, fname );
  if( idx == nullptr) { std::runtime_error("Error opening index"); }
  
  iter = sam_itr_querys( idx, header, contig );    
  if( iter == nullptr) { std::runtime_error("Error creating iterator"); }

  std::cout << "htslib " << hts_version() << std::endl;

  bam1_t* this_read;
  size_t read_counts = 0;
  for(;;)
  {
    this_read = bam_init1();
    int ret_val = sam_itr_next( fp, iter, this_read );
    if( ret_val < 0 )
    {
      if( ret_val == -1 )
      {
        std::cout << "EOF" << std::endl;
        bam_destroy1( this_read );
        break; 
      }
      else
      {
        std::cerr << "Corrupt read found with return value " << ret_val << std::endl;
      }
    }

    bam_destroy1( this_read );
    read_counts++;
  }

  std::cout << "Found " << read_counts << " reads" << std::endl;

  if( fp != nullptr )
  {
    hts_itr_destroy( iter );
    hts_idx_destroy( idx );
    bam_hdr_destroy( header );  
    sam_close( fp );                
  }
}

3. Verify that the code fetches 10 alignments from both contigs when compiled with htslib 1.5 or earlier

./srdr small_test.bam chr1
htslib 1.5
EOF
Found 10 reads

./srdr small_test.bam chr2
htslib 1.5
EOF
Found 10 reads

4. Verify that the code exits early from contig 2 when compiled with htslib 1.6 due to malformed read

./srdr small_test.bam chr1
htslib 1.6
EOF
Found 10 reads

./srdr small_test.bam chr2
htslib 1.6
[E::bam_read1] CIGAR and query sequence lengths differ for read-5
Corrupt read found with return value -4
EOF
Found 6 reads

ghost avatar Nov 17 '17 18:11 ghost

Thanks for the comprehensive report.

The CIGAR length check was put in to avoid problems in code that assumes the sequence length derived from the alignment is the same as that of the sequence in the alignment record. An example of this was mpileup (see samtools/samtools#718). The check was already there in the SAM reader, so this just made the BAM one do the same thing.

The documentation for sam_itr_next is unfortunately absent, but the related sam_read1 does at least include this comment:

/*!
     *  @return >= 0 on successfully reading a new record, -1 on end of stream, 
< -1 on error
     **/

It's very important when using both these functions to check the return value. If it is -1, you've reached a normal end of file. If it's anything less than -1, an error has occurred and you must stop reading. So the correct way to write the read loop is something like this:

bam1_t *this_read = bam_init1();
if (!this_read) return -1; // Report failure
int ret_val;
while ((ret_val = sam_itr_next(fp, iter, this_read)) >= 0) {
    read_counts++;
}
if (ret_val < -1) return -1;  // Report failure
bam_destroy1(this_read);

Finally, whatever is making the broken CIGAR record should be fixed. As already noted, some tools make assumptions about having well-formed CIGAR data, so it is not a good idea to propagate them.

daviesrob avatar Nov 20 '17 16:11 daviesrob

@daviesrob , thank you for your response. Htslib now forces me to come to a complete stop when it finds one read with a data inconsistency. I may not want this behavior. I may want to keep processing, discarding that bad read.

Currently I would have to write hacky workaround code that won't work in all cases. It would be much nicer of htslib to tell me about the bad read and let me keep going if I wanted to, leaving that decision to user space.

ghost avatar Nov 20 '17 17:11 ghost