gnomad_methods
gnomad_methods copied to clipboard
Fix ref_allele_mismatch in liftover_expr for indel variants
Hi, it seems the current ref_allele_mismatch doesn't work when the original reference allele has more than one bp. This PR fixes the issue.
Additionally, I found there are ref_allele_mismatch cases where ref/alt flips between the builds. For example, chr3:195513874:T:C (GRCh38) corresponds to 3:195240670:C:T (GRCh37).
It is particularly confusing since it looks like new_alleles == original_alleles but ref_allele_mismatch is true. Does it make sense to add a field e.g. ref_alt_flip? If so, I'd be happy to make a separate PR.
+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+
| locus | alleles | new_locus | new_alleles | original_locus | original_alleles | locus_fail_liftover | ref_allele_mismatch |
+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+
| locus<GRCh37> | array<str> | locus<GRCh37> | array<str> | locus<GRCh38> | array<str> | bool | bool |
+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+
| 3:195236460 | ["T","C"] | 3:195236460 | ["T","C"] | chr3:195509651 | ["T","C"] | false | true |
| 3:195236756 | ["G","C"] | 3:195236756 | ["G","C"] | chr3:195509947 | ["G","C"] | false | true |
| 3:195237817 | ["C","A"] | 3:195237817 | ["C","A"] | chr3:195511010 | ["C","A"] | false | true |
| 3:195240670 | ["T","C"] | 3:195240670 | ["T","C"] | chr3:195513874 | ["T","C"] | false | true |
| 3:195242526 | ["T","C"] | 3:195242526 | ["T","C"] | chr3:195515729 | ["T","C"] | false | true |
| 3:195244526 | ["T","C"] | 3:195244526 | ["T","C"] | chr3:195517728 | ["T","C"] | false | true |
| 3:195245027 | ["G","A"] | 3:195245027 | ["G","A"] | chr3:195518229 | ["G","A"] | false | true |
| 3:195245225 | ["G","A"] | 3:195245225 | ["G","A"] | chr3:195518427 | ["G","A"] | false | true |
| 3:195246299 | ["G","A"] | 3:195246299 | ["G","A"] | chr3:195519502 | ["G","A"] | false | true |
| 3:195251227 | ["G","C"] | 3:195251227 | ["G","C"] | chr3:195524429 | ["G","C"] | false | true |
+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+
Sorry, I'm not sure I'm following. The examples you have are all SNPs, so I don't see how this affects indels. For the SNPs, yes, grabbing the ref context and checking for a flip is the ideal scenario, but note that it comes at the computational cost of the context lookup, so I'd advocate for a flag where you can turn it off. Edit to add: I see the lookup already exists, which I guess is fine
Oh! I just realized this was two different suggestions (separated by "Additionally") - sorry about that! Yes, this seems sensible, though it could be worthwhile to have a documented example just so we can see what happens.