
ID/DIV script can't recognize certain V IDs from the germDB files

Open ressy opened this issue 1 year ago • 2 comments

The BU_DD germline V files have a handful of entries with an "ORF_" prefix. When the ID/DIV script parses a V call that starts with one of those names, using the pattern (v_call|V_gene)=((IG[HKL]V[^*]+)[^,\s]+), the match fails because the pattern requires "IG" immediately after the "=". The script then finds no valid V call for the sequence and puts "unknown" and "NA" in the output table for the V call and identity.

I wouldn't expect this to matter much for sequences assigned to the ORFs themselves, except that in practice those calls are usually followed by one or more regular matches (e.g. v_call=ORF_IGHV3-AHH-X*01,IGHV3-AFR*01). The effect is to exclude those sequences even though they often do have a regular V call available. Would it work to either allow characters before the "IG" in the pattern, or split on the comma and select from the resulting list? I can propose something if one of those sounds preferable.
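
For illustration, here is a rough Python sketch of the split-on-comma option. It is not taken from the SONAR code: the names FIELD, GENE, and first_regular_v_call are hypothetical, and it assumes the v_call/V_gene field contains no whitespace and that "regular" means a name starting with IGHV, IGKV, or IGLV.

```python
import re

# Pattern quoted in this issue: the V name must begin right after the "=".
ORIGINAL = re.compile(r"(v_call|V_gene)=((IG[HKL]V[^*]+)[^,\s]+)")

# Hypothetical alternative: grab the whole comma-separated list first,
# then pick the first entry that looks like a regular V call.
FIELD = re.compile(r"(?:v_call|V_gene)=(\S+)")
GENE = re.compile(r"^(IG[HKL]V[^*]+)\S*$")


def first_regular_v_call(header):
    """Return (allele, gene) for the first call starting with IG[HKL]V, else None."""
    field = FIELD.search(header)
    if field is None:
        return None
    for call in field.group(1).split(","):
        m = GENE.match(call)
        if m:
            # call is the full allele name; group(1) is the gene without "*NN"
            return call, m.group(1)
    return None


header = ">seq1 v_call=ORF_IGHV3-AHH-X*01,IGHV3-AFR*01"
print(ORIGINAL.search(header))       # the quoted pattern finds no match here
print(first_regular_v_call(header))  # the ORF entry is skipped
```

With this approach a sequence whose only calls are ORF entries still comes back as None, so it would keep getting "unknown" and "NA" as it does now.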

ressy avatar Jan 06 '24 19:01 ressy

Sorry, this fell through the cracks. I think the best answer is to remove the ORFs from the default database, which will happen when better databases are released (I think expected this year). But the regex is a bit brittle regardless. I think I have a more complicated one elsewhere (3.2 and/or 4.4) that can be copied over...

scharch avatar Feb 12 '24 23:02 scharch

No problem, I wasn't even going to bother making an issue at first until I realized the non-ORF matches got missed if there's an ORF one in front. I also don't think it comes up at all often for us (and even less so now that I'm generally using KIMDB for rhesus heavy chain, though a more recent database for all loci would be great).

ressy avatar Feb 14 '24 14:02 ressy