aldy icon indicating copy to clipboard operation
aldy copied to clipboard

KeyError('unknown reference MT') on GRCh38.d1.vd1(UCSC-style-name)

Open Anastasiabum opened this issue 3 months ago • 4 comments

Hi! When running aldy==4.8.2 on a BAM aligned to GRCh38.d1.vd1 (UCSC-style-name), the program crashes with:

Gene MT-RNR1
ERROR: gene= all, file= /inputs/xxxxxx_dnaseq-alignment_v1.5.0_bwa2.bam
KeyError('unknown reference MT')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/aldy/__main__.py", line 126, in main
    _genotype(args.gene, output, args)
  File "/usr/local/lib/python3.10/dist-packages/aldy/__main__.py", line 447, in _genotype
    run(None)
  File "/usr/local/lib/python3.10/dist-packages/aldy/__main__.py", line 397, in run
    _ = genotype(
  File "/usr/local/lib/python3.10/dist-packages/aldy/genotype.py", line 120, in genotype
    **genotype(
  File "/usr/local/lib/python3.10/dist-packages/aldy/genotype.py", line 180, in genotype
    sample = sam.Sample(gene, profile, sam_path, reference, debug)
  File "/usr/local/lib/python3.10/dist-packages/aldy/sam.py", line 117, in __init__
    norm, muts = self._load_sam(path, reference, debug)
  File "/usr/local/lib/python3.10/dist-packages/aldy/sam.py", line 171, in _load_sam
    self._realign_indels(tmp, sam, reference)
  File "/usr/local/lib/python3.10/dist-packages/aldy/sam.py", line 463, in _realign_indels
    sz = sam.get_reference_length(rname)
  File "pysam/libcalignmentfile.pyx", line 1916, in pysam.libcalignmentfile.AlignmentFile.get_reference_length
  File "pysam/libcalignmentfile.pyx", line 511, in pysam.libcalignmentfile.AlignmentHeader.get_reference_length
KeyError: 'unknown reference MT'

Cause: Aldy expects MT or chrMT, GRCh38.d1.vd1 (UCSC-style) uses chrM.

Possibly, the chr_prefix function is the cause. It correctly adds the chr prefix for chromosomes (1,2,…,X,Y), but it does not handle the special case of the mitochondrial chromosome: in hg19 it is usually called MT, while in GRCh38.d1.vd1 it is chrM. When chr is added to MT, it becomes chrMT, which does not exist in the BAM file, leading to the error.

It seems that Aldy’s test BAM files don't contain the mitochondrial chromosome at all, so this problem doesn't show up in your tests. For example:

samtools view -h /inputs/HG03166.pb.bam | grep "@SQ"

@SQ     SN:chr1 LN:248956422
...
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415

Could you please add support for chrM? That would make Aldy work smoothly with GRCh38.d1.vd1 (UCSC-style) -aligned BAMs

Thanks a lot for your work on this project!

Anastasiabum avatar Sep 01 '25 15:09 Anastasiabum

Hi @Anastasiabum ,

Thank you for the bug report!

Can you please try #100 and see if that fixes the problem? If so, I will merge this.

inumanag avatar Oct 20 '25 17:10 inumanag

Hello @inumanag, we are expiriencing same problem and your fix is not working for us.

According to my understanding of code you are checking for prefix for each gene separately, not globally: https://github.com/0xTCG/aldy/blob/fix-chrm/aldy/sam.py#L562-L566

            self._prefix = chr_prefix(
                self.gene.chr, [x["SN"] for x in sam.header["SQ"]]
            )
            if self.gene.chr == "MT" and self._prefix == "chr":
                self.gene.chr = "M"  # UCSC fix

And we are checking by adding prefix chr to chrs list: https://github.com/0xTCG/aldy/blob/fix-chrm/aldy/common.py#L236-L243

def chr_prefix(ch: str, chrs: List[str]) -> str:
    """
    Check if a chromosome needs "chr" prefix given the available chromosomes.
    :returns: Chromosome prefix if the chromosome does not have it.
    """
    if ch not in chrs and "chr" + ch in chrs:
        return "chr"
    return ""

But if ch is MT - there are no chrMT in BAM-header and prefix chr is empty for MT - so your fix is not changing self.gene.chr value.

Dirty way to fix is to add chr prefix if chrM is present and after change self.gene.chr to M as you implemented. But I think it will be better to remove prefix logic and return correct chr name instead (I understand that it will need some time)

Stikus avatar Oct 29 '25 07:10 Stikus

@Stikus MT without the prefix should work (but to does not :) ).

Could you please send me a SAM header with a single read from chrM[T] so that I can do a quick test? Thanks!

inumanag avatar Oct 29 '25 15:10 inumanag

SAM/BAM header for chrM:

@SQ     SN:chrM LN:16569        AS:GRCh38.d1.vd1        SP:Homo sapiens

Read from chrM:

A00921:12:HNF33DSXX:1:1103:15275:28635  163     chrM    1       60      20S131M =       174     321     TTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF      MC:Z:136M3I12M  MD:Z:131        RG:Z:10_1_S19_L001_R    NM:i:0  AS:i:131        XS:i:86

Stikus avatar Oct 29 '25 15:10 Stikus

Dear @inumanag, any news?

serge2016 avatar Nov 24 '25 06:11 serge2016