gnomad_methods icon indicating copy to clipboard operation
gnomad_methods copied to clipboard

Fix ref_allele_mismatch in liftover_expr for indel variants

Open mkanai opened this issue 5 years ago • 2 comments

Hi, it seems the current ref_allele_mismatch doesn't work when the original reference allele has more than one bp. This PR fixes the issue.

Additionally, I found there are ref_allele_mismatch cases where ref/alt flips between the builds. For example, chr3:195513874:T:C (GRCh38) corresponds to 3:195240670:C:T (GRCh37).

It is particularly confusing since it looks like new_alleles == original_alleles but ref_allele_mismatch is true. Does it make sense to add a field e.g. ref_alt_flip? If so, I'd be happy to make a separate PR.

+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+
| locus         | alleles    | new_locus     | new_alleles | original_locus | original_alleles | locus_fail_liftover | ref_allele_mismatch |
+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+
| locus<GRCh37> | array<str> | locus<GRCh37> | array<str>  | locus<GRCh38>  | array<str>       |                bool |                bool |
+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+
| 3:195236460   | ["T","C"]  | 3:195236460   | ["T","C"]   | chr3:195509651 | ["T","C"]        |               false |                true |
| 3:195236756   | ["G","C"]  | 3:195236756   | ["G","C"]   | chr3:195509947 | ["G","C"]        |               false |                true |
| 3:195237817   | ["C","A"]  | 3:195237817   | ["C","A"]   | chr3:195511010 | ["C","A"]        |               false |                true |
| 3:195240670   | ["T","C"]  | 3:195240670   | ["T","C"]   | chr3:195513874 | ["T","C"]        |               false |                true |
| 3:195242526   | ["T","C"]  | 3:195242526   | ["T","C"]   | chr3:195515729 | ["T","C"]        |               false |                true |
| 3:195244526   | ["T","C"]  | 3:195244526   | ["T","C"]   | chr3:195517728 | ["T","C"]        |               false |                true |
| 3:195245027   | ["G","A"]  | 3:195245027   | ["G","A"]   | chr3:195518229 | ["G","A"]        |               false |                true |
| 3:195245225   | ["G","A"]  | 3:195245225   | ["G","A"]   | chr3:195518427 | ["G","A"]        |               false |                true |
| 3:195246299   | ["G","A"]  | 3:195246299   | ["G","A"]   | chr3:195519502 | ["G","A"]        |               false |                true |
| 3:195251227   | ["G","C"]  | 3:195251227   | ["G","C"]   | chr3:195524429 | ["G","C"]        |               false |                true |
+---------------+------------+---------------+-------------+----------------+------------------+---------------------+---------------------+

mkanai avatar Nov 18 '20 22:11 mkanai

Sorry, I'm not sure I'm following. The examples you have are all SNPs, so I don't see how this affects indels. For the SNPs, yes, grabbing the ref context and checking for a flip is the ideal scenario, but note that it comes at the computational cost of the context lookup, so I'd advocate for a flag where you can turn it off. Edit to add: I see the lookup already exists, which I guess is fine

konradjk avatar Mar 19 '21 12:03 konradjk

Oh! I just realized this was two different suggestions (separated by "Additionally") - sorry about that! Yes, this seems sensible, though it could be worthwhile to have a documented example just so we can see what happens.

konradjk avatar Mar 19 '21 12:03 konradjk