SONAR
ID/DIV script can't recognize certain V IDs from the germDB files
The BU_DD germline V files have a handful of entries with an "ORF_" prefix. When the ID/DIV script tries to parse a V call that starts with one of those names, using the pattern (v_call|V_gene)=((IG[HKL]V[^*]+)[^,\s]+), it doesn't recognize any valid V calls for the sequence, so it puts "unknown" and "NA" in the output table for the V call and identity.
I wouldn't think it would much matter for sequences assigned to the ORFs anyway, except that I notice in practice those are usually followed by one or more regular matches (e.g. v_call=ORF_IGHV3-AHH-X*01,IGHV3-AFR*01), so the effect is to exclude those sequences even though they often do have a regular V call available. Would it work to either allow characters before the "IG" in the pattern, or split on the comma and select from the resulting list? I can propose something if one of those sounds preferable.
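To make the two options concrete, here is a minimal Python sketch. It is not taken from the SONAR code: the header string, the PREFIXED, FIELD, and ALLELE patterns, and the first_regular_v_call helper are all hypothetical, and only ORIGINAL is the pattern quoted above. Note that option 1 would report the ORF gene itself, while option 2 skips it and reports the first regular call.

```python
import re

# Pattern quoted above: "IG" must come directly after the "=".
ORIGINAL = re.compile(r"(v_call|V_gene)=((IG[HKL]V[^*]+)[^,\s]+)")

# Option 1 (hypothetical): tolerate a prefix such as "ORF_" before "IG".
# This matches the ORF entry itself rather than the regular call after it.
PREFIXED = re.compile(r"(v_call|V_gene)=(?:[A-Za-z]+_)?((IG[HKL]V[^*]+)[^,\s]+)")

# Option 2 (hypothetical): grab the whole field, split on commas, and take
# the first call that starts with a regular IG[HKL]V name.
FIELD = re.compile(r"(?:v_call|V_gene)=(\S+)")
ALLELE = re.compile(r"^(IG[HKL]V[^*]+)\*\S+$")


def first_regular_v_call(header):
    """Return (allele, gene) for the first non-prefixed V call, or None."""
    field = FIELD.search(header)
    if field is None:
        return None
    for call in field.group(1).split(","):
        match = ALLELE.match(call)
        if match:
            return call, match.group(1)
    return None


# Made-up header for illustration; the real header layout may differ.
header = ">seq1 v_call=ORF_IGHV3-AHH-X*01,IGHV3-AFR*01 j_call=IGHJ4*01"

original_hit = ORIGINAL.search(header)       # no match, as described above
prefixed_hit = PREFIXED.search(header)       # matches the ORF entry
split_hit = first_regular_v_call(header)     # skips the ORF entry
```

Which option is preferable depends on whether the ORF assignment should ever be reported; if not, the split-and-select approach seems like the safer choice.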
Sorry, this fell through the cracks. I think the best answer is to remove the ORFs from the default database, which will happen when better databases are released (expected this year, I think). But the regex is a bit brittle regardless. I think I have a more complicated one elsewhere (3.2 and/or 4.4) that can be copied over...
No problem, I wasn't even going to bother making an issue at first until I realized the non-ORF matches got missed if there's an ORF one in front. I also don't think it comes up at all often for us (and even less so now that I'm generally using KIMDB for rhesus heavy chain, though a more recent database for all loci would be great).