tcrdist3
Gene names should allow alternate delimiters
I recently ran into a UserWarning when trying to run tcrdist on several gene names in my dataset:
tcrdist/repertoire.py:504: UserWarning: TRAV16D-DV11*01 gene was not recognized in reference db no cdr seq could be inferred
This is because the gene names expected by tcrdist use "/" delimiters in certain gene names (e.g. TRAV13-4/DV7*01), but my dataset uses gene names with "-" characters as this delimiter (in this case, TRAV13-4-DV7*01).
Is there an easy way to modify tcrdist to support either one? At the moment, I am working around this issue by mapping all my gene IDs to the "/" version that tcrdist expects, as follows:
from pathlib import Path

import pandas as pd

# Get db file that will be used in tcrdist `TCRep` constructor
tcr_db_path = Path("~/path/to/tcrdist") / "db" / "alphabeta_gammadelta_db.tsv"
tcr_db = pd.read_table(tcr_db_path)
# Build a lookup from the all-dashes form of each reference ID (every `/` replaced by `-`, as in my dataset)
# back to the original ID that `tcrdist` expects
original_id_list = tcr_db.id
gene_id_list = original_id_list.apply(lambda x: x.replace("/", "-"))
gene_id_dict = dict(zip(gene_id_list, original_id_list))
# Use dict to replace gene IDs in dataset
dff["v_a_gene"] = dff["v_a_gene"].apply(lambda x: gene_id_dict.get(x, x))
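
For anyone hitting the same warning, here is a slightly more defensive sketch of the same workaround in plain Python. It is not part of tcrdist; the helper names (`build_gene_id_map`, `to_reference_name`, `find_unrecognized`) are hypothetical, and the short `reference_ids` list is a toy stand-in for the `id` column of the db file, not its real contents. It assumes the only difference between the two naming schemes is "/" versus "-", and it raises an error if replacing "/" with "-" would make two reference names indistinguishable.

```python
def build_gene_id_map(reference_ids):
    """Map the all-dashes form of each '/'-containing reference ID to the original ID."""
    reference = set(reference_ids)
    mapping = {}
    for ref_id in sorted(reference):
        if "/" not in ref_id:
            continue  # nothing to translate for this name
        dashed = ref_id.replace("/", "-")
        # Refuse to guess if the dashed form is itself a reference name,
        # or if two different reference names collapse to the same dashed form.
        if dashed in reference or dashed in mapping:
            raise ValueError(f"ambiguous dash-delimited name: {dashed}")
        mapping[dashed] = ref_id
    return mapping


def to_reference_name(gene, mapping):
    """Return the reference spelling of a gene name, or the name unchanged if unmapped."""
    return mapping.get(gene, gene)


def find_unrecognized(genes, reference_ids, mapping):
    """List gene names that are still absent from the reference after translation."""
    reference = set(reference_ids)
    return sorted({g for g in genes if to_reference_name(g, mapping) not in reference})


# Toy stand-in for the reference IDs (assumption: illustrative only, not read from the db file)
reference_ids = ["TRAV13-4/DV7*01", "TRAV16D/DV11*01", "TRAV1*01"]
# Toy dataset values; "TRAVX*01" is a made-up name that should stay unrecognized
dataset_genes = ["TRAV13-4-DV7*01", "TRAV16D-DV11*01", "TRAV1*01", "TRAVX*01"]

mapping = build_gene_id_map(reference_ids)
normalized = [to_reference_name(g, mapping) for g in dataset_genes]
unrecognized = find_unrecognized(dataset_genes, reference_ids, mapping)
```

With a DataFrame like the one above, the same lookup could be applied per column, for example `dff["v_a_gene"].map(lambda g: to_reference_name(g, mapping))`, and `find_unrecognized` could be run first to see which names would still trigger the warning.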