Same-as-reference Alleles should have a ReferenceLengthExpression state

Open theferrit32 opened this issue 2 months ago • 4 comments

2. Compare the two Allele sequences, if:

   a. both are empty, the input Allele is a reference Allele. Return a new
      Allele with:

      1. the `location` from the input Allele.

      2. a `ReferenceLengthExpression` for the `state` with `length` and
         `repeatSubunitLength` both set to the length of the input `location`.

https://vrs.ga4gh.org/en/latest/conventions/normalization.html#allele-normalization
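As a rough illustration of the quoted rule, here is a minimal sketch on plain dictionaries. It is not the vrs-python implementation: `trim_common` and `normalize_reference_allele` are hypothetical helpers, and the trimming step (shared suffix, then shared prefix) is a simplified assumption about what happens before step 2.

```python
from typing import Optional, Tuple


def trim_common(ref: str, alt: str) -> Tuple[str, str]:
    """Strip the shared suffix, then the shared prefix (simplified assumption)."""
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return ref, alt


def normalize_reference_allele(location: dict, ref: str, alt: str) -> Optional[dict]:
    """Return an RLE-state Allele dict when ref == alt, else None.

    Hypothetical helper: only the "both sequences are empty" branch
    of the quoted rule is covered here.
    """
    trimmed_ref, trimmed_alt = trim_common(ref, alt)
    if trimmed_ref or trimmed_alt:
        # Not a same-as-reference allele; other branches are out of scope.
        return None
    # Interbase coordinates: the location length is end minus start.
    length = location["end"] - location["start"]
    return {
        "type": "Allele",
        "location": location,
        "state": {
            "type": "ReferenceLengthExpression",
            "length": length,
            "repeatSubunitLength": length,
        },
    }
```

For a location with `start` 100210777 and `end` 100210779, this yields `length` and `repeatSubunitLength` of 2. The sketch omits the optional `sequence` field that the test in this issue expects.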

Intended implementation marked in code here: https://github.com/ga4gh/vrs-python/blob/f01f6484010433229354ec2abd17f74989a13d92/src/ga4gh/vrs/normalize.py#L137-L140

Adding a test to exercise this:

@pytest.mark.vcr
def test_reference_allele_rle(tlr):
    """Test that reference alleles (REF==ALT) are normalized to ReferenceLengthExpression."""
    # Test with gnomad format
    gnomad_ref_allele = "1-100210778-AA-AA"
    allele = tlr._from_gnomad(gnomad_ref_allele)

    expected = {
        "type": "Allele",
        "location": {
            "type": "SequenceLocation",
            "sequenceReference": {
                "type": "SequenceReference",
                "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
            },
            "start": 100210777,
            "end": 100210779,
        },
        "state": {
            "type": "ReferenceLengthExpression",
            "length": 2,
            "repeatSubunitLength": 2,
            "sequence": "AA",
        },
    }

    assert allele.model_dump(exclude_none=True) == expected

Fails with:

FAILED tests/extras/test_allele_translator.py::test_reference_allele_rle - AssertionError: assert {'type': 'Allele', 'location': {'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO'}, 'start': 100210777, 'end': 100210779}, 'state': {'type': 'LiteralSequenceExpression', 'sequence': 'AA'}} == {'type': 'Allele', 'location': {'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO'}, 'start': 100210777, 'end': 100210779}, 'state': {'type': 'ReferenceLengthExpression', 'length': 2, 'repeatSubunitLength': 2, 'sequence': 'AA'}}

  Common items:
  {'location': {'end': 100210779,
                'sequenceReference': {'refgetAccession': 'SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
                                      'type': 'SequenceReference'},
                'start': 100210777,
                'type': 'SequenceLocation'},
   'type': 'Allele'}
  Differing items:
  {'state': {'sequence': 'AA', 'type': 'LiteralSequenceExpression'}} != {'state': {'length': 2, 'repeatSubunitLength': 2, 'sequence': 'AA', 'type': 'ReferenceLengthExpression'}}

  Full diff:
    {
        'location': {
            'end': 100210779,
            'sequenceReference': {
                'refgetAccession': 'SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
                'type': 'SequenceReference',
            },
            'start': 100210777,
            'type': 'SequenceLocation',
        },
        'state': {
  -         'length': 2,
  -         'repeatSubunitLength': 2,
            'sequence': 'AA',
  -         'type': 'ReferenceLengthExpression',
  ?                  ^^^      ------
  +         'type': 'LiteralSequenceExpression',
  ?                  ^^^  ++++++
        },
        'type': 'Allele',
    }

theferrit32 avatar Oct 16 '25 17:10 theferrit32

I think this condition may also be just a defensive gate for the ref==alt case, but I could be wrong. If we return an RLE Allele from the prior ValueError exception I think we cannot hit this condition.

https://github.com/ga4gh/vrs-python/blob/f01f6484010433229354ec2abd17f74989a13d92/src/ga4gh/vrs/normalize.py#L147-L149

theferrit32 avatar Oct 16 '25 17:10 theferrit32

FWIW, the ref==alt situation exists within my dataset! These are instances in which previous versions of the MANE RefSeq transcript differed from the reference genome at a couple of bases, but later versions of the RefSeq transcripts changed these bases to match the reference genome. In the meantime, there were "variants", consisting of the reference genome alleles, which had been interpreted against the RefSeq transcripts by the expert panel. So when the RefSeq transcripts were updated, we ended up with expert-classified "variants" for which ref==alt. So, thank you for addressing this situation!

melissacline avatar Oct 16 '25 23:10 melissacline

@melissacline great to hear that this will be useful.

The one I was testing with earlier was just something I picked and made up randomly. Would it be possible for you to provide 1 or 2 real example ref==alt "variant" expressions from your dataset? We could use those in our tests.

theferrit32 avatar Oct 20 '25 16:10 theferrit32

Hi Kyle - my apologies, but the variants that Johan Den Dunnen had alerted me to seem not to have made the switch to ref==alt. I'll still keep an eye out, and will note on the ticket if I find anything.

melissacline avatar Nov 03 '25 17:11 melissacline