tcrdist3

Wrong/redundant entries in db file?

Open nh3 opened this issue 2 years ago • 2 comments

Hello,

There seem to be wrong or redundant entries in alphabeta_gammadelta_db.tsv that list TRAV genes under the "B" chain, e.g. https://github.com/kmayerb/tcrdist3/blob/master/tcrdist/db/alphabeta_gammadelta_db.tsv#L1053. Is this expected?

nh3 avatar Mar 13 '23 15:03 nh3

That is wrong. For human, it's luckily limited to duplicated alpha chain entries wrongly classified as beta, so at least it's easy to tell which entries are wrong:

In [1]: import pandas as pd
In [2]: df = pd.read_csv("miniforge3/envs/tcrdist3-0.2.2/lib/python3.10/site-packages/tcrdist/db/alphabeta_gammadelta_db.tsv", sep="\t")
In [3]: df = df[df['organism'] == 'human']
In [4]: m = df['id'].duplicated(keep=False)
In [5]: sum(m)
Out[5]: 206
In [6]: df[m]['chain'].value_counts()
Out[6]:
chain
A    103
B    103
Name: count, dtype: int64
In [7]: sum(df[m]['id'].str.startswith('TRAV'))
Out[7]: 206
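
Based on the counts above, one possible workaround is to drop the human rows whose id starts with TRAV but whose chain is "B" before using the file. The following is only a minimal sketch under that assumption: the helper name drop_mislabeled_trav and the sample rows are hypothetical, it relies only on the organism, chain and id columns seen in the session, and it uses the standard-library csv module instead of pandas so it runs without any environment set up. It has not been checked against the real file or against other organisms.

```python
import csv
import io


def drop_mislabeled_trav(rows, organism="human"):
    """Drop rows of the given organism whose id starts with 'TRAV' but whose chain is 'B'.

    Hypothetical helper: rows of other organisms are returned untouched.
    """
    return [
        r
        for r in rows
        if not (
            r["organism"] == organism
            and r["chain"] == "B"
            and r["id"].startswith("TRAV")
        )
    ]


# Made-up sample rows that mimic the duplication pattern described above;
# the real file has more columns, only these three are assumed here.
sample_tsv = (
    "organism\tchain\tid\n"
    "human\tA\tTRAV1-1*01\n"
    "human\tB\tTRAV1-1*01\n"
    "human\tB\tTRBV5-1*01\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
cleaned = drop_mislabeled_trav(rows)
print([(r["chain"], r["id"]) for r in cleaned])
```

With pandas, the same condition can be written as a boolean mask over the organism, chain and id columns.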

andreas-wilm avatar Sep 02 '24 08:09 andreas-wilm

With the latest version, we've changed the reference DB file.

The default is now combo_xcr_2024-03-05.tsv:

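from tcrdist.repertoire import TCRrep
import pandas as pd
# Minimal one-row beta-chain input
data = pd.DataFrame({'v_b_gene':['TRBV5-1*01'], 'cdr3_b_aa':['CASSSSSF']})
tr = TCRrep(cell_df = data, organism = "human", chains = ['beta'])
# Show which reference DB file is in use
print(tr.db_file)
# Show the top-level keys of the loaded genes, then the keys under 'human'
print(tr.all_genes.keys())
print(tr.all_genes['human'].keys())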

kmayerb avatar Sep 03 '24 16:09 kmayerb