Projecting indels across gap causes length change and non-reversibility

Open davmlaw opened this issue 10 months ago • 2 comments

I understand a variant growing when the destination reference has an insertion, but shouldn't it shrink back to its original size when projected the other way?

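# Assumption: the report does not show its imports; these helpers are taken to come from hgvs.easy
from hgvs.easy import parse, normalize, c_to_g, g_to_c

original_hgvs = "NM_015120.4(ALMS1):c.36_38dupGGA"

def print_hgvs(sv):
    # end - start, so one less than the number of bases spanned
    length = sv.posedit.pos.end - sv.posedit.pos.start
    print(f"hgvs='{sv}' - {length=}")

# Round trip: c. -> g. -> c. on the same transcript
var_c = parse(original_hgvs)
print_hgvs(var_c)
var_g = c_to_g(var_c)
print_hgvs(var_g)
var_c2 = g_to_c(var_g, var_c.ac)
print_hgvs(var_c2)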

Output:

hgvs='NM_015120.4(ALMS1):c.36_38dup' - length=2
hgvs='NC_000002.12:g.73385937_73385942dup' - length=5
hgvs='NM_015120.4:c.72_77dup' - length=5

Normalization?

I noticed that if you normalize the variant first, the problem goes away.

I think this is because normalization shifts the variant away from the gap, but that shouldn't matter. If variants do need to be normalized before projection, perhaps we should do that automatically, or raise a warning or error when the input is not normalized?

var_c_orig = parse(original_hgvs)
var_c = normalize(var_c_orig)
print(f"Normalized: {var_c_orig} => {var_c}")
print_hgvs(var_c)
var_g = c_to_g(var_c)
print_hgvs(var_g)
var_c2 = g_to_c(var_g, var_c.ac)
print_hgvs(var_c2)

Output:

Normalized: NM_015120.4(ALMS1):c.36_38dup => NM_015120.4:c.75_77dup
hgvs='NM_015120.4:c.75_77dup' - length=2
hgvs='NC_000002.12:g.73385940_73385942dup' - length=2
hgvs='NM_015120.4:c.75_77dup' - length=2
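
The 3' shift seen above (c.36_38dup becoming c.75_77dup) can be illustrated on a toy sequence. This is only a minimal sketch, not the hgvs normalizer: `shift_dup_3prime` and `apply_dup` are hypothetical helpers, the sequence is made up, and coordinates are 1-based inclusive. It shows that a duplication inside a GGA repeat can be slid to the 3' end of the repeat without changing the resulting sequence.

```python
def apply_dup(seq: str, start: int, end: int) -> str:
    """Return seq with the bases start..end (1-based, inclusive) duplicated in place."""
    return seq[:end] + seq[start - 1:end] + seq[end:]


def shift_dup_3prime(seq: str, start: int, end: int) -> tuple[int, int]:
    """Slide a duplication as far 3' as possible without changing the result.

    The window can move one base to the right whenever the base just after
    the window equals the first base of the window.
    """
    while end < len(seq) and seq[end] == seq[start - 1]:
        start += 1
        end += 1
    return start, end


# Toy reference: a GGA repeat flanked by unrelated bases (made up for illustration)
toy = "AT" + "GGA" * 4 + "CT"
shifted = shift_dup_3prime(toy, 3, 5)
print(shifted)
print(apply_dup(toy, 3, 5) == apply_dup(toy, *shifted))
```

Both placements describe the same edited sequence, which is why normalizing first can move the variant away from an alignment gap before it is projected.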

Note: while searching the issues I found a discussion about alignment gaps (on this transcript!) in #514.

davmlaw, Feb 19 '25 07:02

To take normalization out of the picture, I made the duplication so large that it can't shift, and the round trip then turned the dup into an ins:

original_hgvs = "NM_015120.4(ALMS1):c.36_77dup"
var_c_orig = parse(original_hgvs)
var_c = normalize(var_c_orig)
print(f"Normalized: {var_c_orig} => {var_c}")
print_hgvs(var_c)
var_g = c_to_g(var_c)
print_hgvs(var_g)
var_c2 = g_to_c(var_g, var_c.ac)
print_hgvs(var_c2)

Output:

Normalized: NM_015120.4(ALMS1):c.36_77dup => NM_015120.4:c.36_77dup
hgvs='NM_015120.4:c.36_77dup' - length=41
hgvs='NC_000002.12:g.73385942_73385943insGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGA'
hgvs='NM_015120.4:c.77_78insGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGAGGA'

So yeah I think normalization just hid it before.

I get the change going one way, but I'm wondering whether the conversion back is wrong, or whether there should definitely be a warning here.
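
As a rough idea of what such a warning could look like from the caller's side, here is a minimal sketch. `check_round_trip` is a hypothetical helper, not part of hgvs; it takes the two projection functions as arguments so it makes no assumptions about the library's internals, and it compares variants with a caller-supplied `key` (plain `str` by default).

```python
import warnings


def check_round_trip(var_c, to_g, to_c, key=str):
    """Project var_c with to_g, project back with to_c, and warn on a mismatch.

    key maps a variant to the value used for comparison (default: its string form).
    Returns the intermediate and round-tripped variants.
    """
    var_g = to_g(var_c)
    var_c2 = to_c(var_g)
    if key(var_c2) != key(var_c):
        warnings.warn(
            f"Projection is not reversible: {var_c} -> {var_g} -> {var_c2}",
            stacklevel=2,
        )
    return var_g, var_c2
```

With the functions used in this issue, a call might look like `check_round_trip(var_c, c_to_g, lambda g: g_to_c(g, var_c.ac))`. Since the parsed input prints with the gene symbol and the projected result does not, a key such as `lambda v: (v.ac, str(v.posedit))` would probably be needed to avoid false alarms.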

davmlaw, Feb 19 '25 07:02

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot], May 21 '25 02:05