
Add support for long-read bams (genome)

Open oneillkza opened this issue 4 years ago • 2 comments

Reading in VCFs from variant callers that run on long-read BAMs is only part of the problem. MAVIS still needs BAM files for most operations. Long-read BAMs differ from short-read ("NGS") BAMs in a few key ways:

  • Single end rather than paired-end
  • Variable (and long) read length
  • Relatively high error rate (5-10%), especially for homopolymers

This makes them very good for detecting large structural variants, especially since they can map through low-complexity regions, but less good for smaller variants.
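As a rough illustration of the first two differences, here is a minimal sketch of a heuristic that guesses whether a sample of reads came from a long-read BAM. It is not part of MAVIS: the function name `looks_like_long_read`, the `(flag, length)` input shape, and the 1000 bp threshold are all assumptions made for the example. The only fixed fact it relies on is that SAM flag bit 0x1 marks a read as paired.

```python
from statistics import median

PAIRED_FLAG = 0x1  # SAM flag bit: template has multiple segments (paired-end)


def looks_like_long_read(reads, min_median_length=1000):
    """Guess whether sampled reads are long-read (hypothetical helper).

    reads: iterable of (sam_flag, read_length) tuples sampled from a BAM.
    min_median_length: assumed cut-off in bp, chosen only for illustration.
    """
    reads = list(reads)
    if not reads:
        raise ValueError("no reads sampled")
    # Long-read data is single-end, so no read should carry the paired flag.
    any_paired = any(flag & PAIRED_FLAG for flag, _ in reads)
    # Read length is variable, so use the median rather than a fixed value.
    median_length = median(length for _, length in reads)
    return (not any_paired) and median_length >= min_median_length
```

In practice the tuples would come from a BAM reader; they are kept as plain tuples here so the sketch has no dependencies.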

This ticket is to track work on reading in long-read genome bams.

oneillkza, May 27 '20 15:05

So, the first major design decision is to create a new file type, genome_longread, for long-read genomic bams. This is distinct from genome, which is for short-read paired-end genomic bams. I'm probably going to be copying a lot of the code that handles the genome bam type, but I think that'll be cleaner than having if statements everywhere.

e.g. in stats I've created compute_genome_longread_bam_stats, which is a modified copy of compute_genome_bam_stats
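The real compute_genome_longread_bam_stats is not shown in this thread. The following is only a sketch of the kind of change involved, assuming that the long-read variant drops fragment-size (insert-size) estimates, which need read pairs, and summarises read length as a distribution instead. `LongReadStats` and `sketch_longread_stats` are hypothetical names, not MAVIS code.

```python
from dataclasses import dataclass
from statistics import median, pstdev


@dataclass
class LongReadStats:
    """Hypothetical container; the fields are assumptions for illustration."""
    median_read_length: float
    stdev_read_length: float
    read_count: int


def sketch_longread_stats(read_lengths):
    """Summarise read lengths without using any paired-end information."""
    # Skip zero-length records (e.g. reads stored without sequence).
    lengths = [n for n in read_lengths if n > 0]
    if not lengths:
        raise ValueError("no usable reads to compute stats from")
    return LongReadStats(
        median_read_length=median(lengths),
        stdev_read_length=pstdev(lengths),  # population standard deviation
        read_count=len(lengths),
    )
```

Keeping this in its own function, rather than branching inside the short-read one, matches the copy-and-modify approach described above.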

oneillkza, Jun 10 '20 19:06

OK, got it as far as being able to do config and setup. Clustering works, but it fails on validate.

ValueError: ('protocol error', 'genome_longread')

This is somewhat unsurprising. Looks like the next step is to create a class in validate/evidence.py for the genome_longread protocol, and add a case in validate/main.py that maps the protocol to that class.
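A minimal sketch of that protocol-to-class mapping, assuming a simple lookup table. The class names below are stand-ins invented for the example, not the actual classes in validate/evidence.py, and the real dispatch in validate/main.py may be structured differently. The error raised for an unknown protocol mirrors the shape of the one above.

```python
class GenomeEvidenceSketch:
    """Stand-in for the existing short-read genome evidence class."""
    protocol = "genome"


class GenomeLongReadEvidenceSketch(GenomeEvidenceSketch):
    """Stand-in for the new class; long-read overrides would go here."""
    protocol = "genome_longread"


# Hypothetical registry: one entry per supported protocol.
EVIDENCE_BY_PROTOCOL = {
    cls.protocol: cls
    for cls in (GenomeEvidenceSketch, GenomeLongReadEvidenceSketch)
}


def evidence_class_for(protocol):
    """Return the evidence class registered for a protocol name."""
    try:
        return EVIDENCE_BY_PROTOCOL[protocol]
    except KeyError:
        # Same argument shape as the error reported in this comment.
        raise ValueError("protocol error", protocol) from None
```

With a table like this, supporting a new protocol means adding one class and one registry entry rather than another branch in the caller.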

oneillkza, Jun 17 '20 19:06