tcrdist3
[Request] Improve Documentation for tabulate
Just a tiny issue. I was going through your meta-clonotype discovery and tabulation script, modifying it to work with my mouse data, but I kept having issues with the tabulation step. I got an error that tabulate requires a column named 'productive_frequency' in the bulk data, clone_df2, yet this is not specified anywhere in the documentation. The other required columns (cdr3, v, j, and count) are already in the dataframe after using TCRrep. After adding a frequency column, it worked as expected!
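For anyone hitting the same error, here is a minimal sketch of adding the missing column with pandas. It assumes the frequency is simply each clone's count divided by the total count, which the thread does not confirm. The toy dataframe, its column names, and the helper name add_productive_frequency are illustrative stand-ins for the real bulk clone_df, not part of tcrdist3.

```python
import pandas as pd

# Toy stand-in for the bulk clone dataframe (tr_bulk.clone_df in the thread).
# Column names other than "count" are illustrative assumptions.
bulk_df = pd.DataFrame({
    "cdr3_b_aa": ["CASSLGQAYEQYF", "CASSPGTGGYEQYF", "CASSIRSSYEQYF"],
    "v_b_gene": ["TRBV13-1*01", "TRBV13-3*01", "TRBV19*01"],
    "j_b_gene": ["TRBJ2-7*01", "TRBJ2-7*01", "TRBJ2-7*01"],
    "count": [6, 3, 1],
})


def add_productive_frequency(df, count_col="count", freq_col="productive_frequency"):
    """Return a copy of df with a frequency column computed as count / total count.

    Hypothetical helper: the normalization is an assumption, not tcrdist3's
    documented definition of productive_frequency.
    """
    out = df.copy()
    total = out[count_col].sum()
    if total <= 0:
        raise ValueError(f"Column {count_col!r} must sum to a positive value")
    out[freq_col] = out[count_col] / total
    return out


bulk_df = add_productive_frequency(bulk_df)
print(bulk_df[["count", "productive_frequency"]])
```

If the bulk data already comes with a frequency column under a different name, renaming that column to productive_frequency may be the better choice than recomputing it.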
Thank you so much for the very useful package!
Hi gcohenJH,
Thanks for sharing your experience and area for improvement in the docs.
Out of curiosity, did you happen to see the TCRjoin feature for tabulation? I think it will allow tabulation without a frequency column:
https://tcrdist3.readthedocs.io/en/latest/join.html
Could you add a snippet of where you got the error?
Thanks, k
Sure. Here's the chunk where I'm getting the error. It's exactly the same as the code at https://tcrdist3.readthedocs.io/en/latest/metaclonotypes.html. Preceding this is just the code where I load my bulk data and rename columns.
tr_search.cpus = 4
tic = time.perf_counter()
tr_search.compute_sparse_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df, chunk_size = 50, radius = 50)
results = tabulate(clone_df1 = tr_search.clone_df, clone_df2 = tr_bulk.clone_df, pwmat = tr_search.rw_beta)
toc = time.perf_counter()
print(f"TABULATED IN {toc - tic:0.4f} seconds")
Here's the error I was getting when I didn't include the productive_frequency column.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'productive_frequency'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13016/817900842.py in <module>
2 tic = time.perf_counter()
3 tr_search.compute_sparse_rect_distances(df = tr_search.clone_df, df2 = tr_bulk.clone_df, chunk_size = 50, radius = 50)
----> 4 results = tabulate(clone_df1 = tr_search.clone_df, clone_df2 = tr_bulk.clone_df, pwmat = tr_search.rw_beta)
5 toc = time.perf_counter()
6 print(f"TABULATED IN {toc - tic:0.4f} seconds")
~\Miniconda3\envs\tcrdist3\lib\site-packages\tcrdist\tabulate.py in tabulate(clone_df1, clone_df2, pwmat, cdr3_name, v_gene_name, j_gene_name)
85 # Retrieve abundances from the bulk clone df
86 icounts = [clone_df2['count'].iloc[x].to_list() for x in icol]
---> 87 ifreqs = [clone_df2['productive_frequency'].iloc[x].to_list() for x in icol]
88
89 isumcounts = [np.sum(x) for x in icounts]
~\Miniconda3\envs\tcrdist3\lib\site-packages\tcrdist\tabulate.py in <listcomp>(.0)
85 # Retrieve abundances from the bulk clone df
86 icounts = [clone_df2['count'].iloc[x].to_list() for x in icol]
---> 87 ifreqs = [clone_df2['productive_frequency'].iloc[x].to_list() for x in icol]
88
89 isumcounts = [np.sum(x) for x in icounts]
~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~\Miniconda3\envs\tcrdist3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'productive_frequency'
As far as I can tell, tabulate is asking for a column named productive_frequency in clone_df2, and pandas can't find that column in the dataframe, so it's raising a KeyError.
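A small pre-flight check can turn that deep pandas KeyError into a clearer message before calling tabulate. This is only a sketch: the required-column list covers just the two columns visible in the traceback above (count and productive_frequency), so the function may need others, and check_bulk_columns is a hypothetical helper, not part of tcrdist3.

```python
import pandas as pd

# Columns the traceback shows being read from clone_df2.
# Assumption: this list may be incomplete for the real function.
REQUIRED_BULK_COLUMNS = ("count", "productive_frequency")


def check_bulk_columns(df, required=REQUIRED_BULK_COLUMNS):
    """Raise a descriptive ValueError if any required column is absent (hypothetical helper)."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(
            f"Bulk dataframe is missing required column(s): {missing}. "
            f"Available columns: {list(df.columns)}"
        )


# Toy example: a bulk dataframe that has counts but no frequency column.
toy_bulk = pd.DataFrame({"count": [6, 3, 1]})

try:
    check_bulk_columns(toy_bulk)
except ValueError as err:
    print(err)
```

In the workflow above, the check would run on tr_bulk.clone_df just before the tabulate call.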
join_by_dist seems more like what I would want for tabulation though. Thank you for the help.