scGPT 74% of the tokens in adata.var[feature_name] are not in vocab. Please check if using the correct vocab and token

Hello!

First, I would like to express my appreciation for your impressive work on this project. While working with the pretraining data, I ran into an issue similar to the one reported in #139, which does not appear to have received a response yet.

I am encountering a ValueError indicating that a large share of the tokens in adata.var[feature_name] are not present in the vocabulary. Others seem to hit this as well. Reviewing the scg.scbank.databank code, I noticed a validation step where tokens are checked against the vocabulary:

# validate matching between tokens and vocab
tokens = adata.var[token_col].tolist()
match_ratio = sum([1 for t in tokens if t in self.gene_vocab]) / len(tokens)
if match_ratio < 0.9:
    raise ValueError(
        f"{match_ratio*100:.0f}% of the tokens in adata.var[{token_col}] are not in vocab. Please check if using the correct vocab and token_col."
    )

According to this, if match_ratio is less than 0.9, the code raises the error "{match_ratio*100:.0f}% of the tokens in adata.var[{token_col}] are not in vocab" and seems to skip processing those files.
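To see which tokens fail the check, the same comparison can be run by hand before conversion. The following is only a minimal sketch: match_report is a hypothetical helper (not part of scGPT), and the toy vocabulary and token list are invented for illustration.

```python
from typing import Iterable, List, Set, Tuple


def match_report(tokens: Iterable[str], vocab: Set[str]) -> Tuple[float, List[str]]:
    """Return (fraction of tokens found in vocab, list of tokens not found).

    Hypothetical helper mirroring the membership test quoted above;
    it is not an scGPT function.
    """
    tokens = list(tokens)
    if not tokens:
        raise ValueError("no tokens to check")
    missing = [t for t in tokens if t not in vocab]
    ratio = 1 - len(missing) / len(tokens)
    return ratio, missing


# Toy data (assumption): a vocab keyed by gene symbols, tokens partly given as Ensembl IDs.
toy_vocab = {"TP53", "BRCA1", "EGFR"}
toy_tokens = ["TP53", "ENSG00000012048", "EGFR", "ENSG00000146648"]

ratio, missing = match_report(toy_tokens, toy_vocab)
print(f"matched {ratio:.0%}; first missing tokens: {missing[:5]}")
```

With real data, tokens would come from adata.var[token_col].tolist() and vocab from the gene vocabulary in use. If the missing tokens turn out to use a different naming scheme than the vocabulary (for example IDs versus symbols), a different token_col may be the cause rather than the vocabulary itself.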

Could you please advise on how to resolve this issue? Is there an updated vocabulary that I should be using, or perhaps a different token_col setting that aligns better with the available data?

Thank you very much for your time and assistance. I look forward to your guidance on resolving this challenge.

Jul 17 '24 19:07 Liwer-S

I am not the author, but I encountered a similar issue and resolved it by updating the vocabulary. I think you can use expand_gene_list.py to update the vocabulary to your version (the same as specified in data_config.py). After the update, there shouldn't be many genes missing from the vocabulary when you convert .h5ad to .scb. There also seems to be a typo in the error message: it should read "{(1-match_ratio)*100:.0f}%" of the tokens ...
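A small sketch of what the corrected message would report, assuming the suggested fix is right. mismatch_message is a hypothetical function written only for illustration, and 0.74 is taken from the "74%" in the reported error, read here as the matched fraction.

```python
def mismatch_message(match_ratio: float, token_col: str) -> str:
    """Build the message using the missing fraction (1 - match_ratio).

    Hypothetical illustration of the suggested fix, not scGPT code.
    """
    missing_pct = (1 - match_ratio) * 100
    return (
        f"{missing_pct:.0f}% of the tokens in adata.var[{token_col}] are not in vocab. "
        "Please check if using the correct vocab and token_col."
    )


# If 74% of tokens matched, the corrected message reports the remaining share as missing.
print(mismatch_message(0.74, "feature_name"))
```

Under this reading, the error in the original report would mean that roughly a quarter of the tokens are missing, not three quarters; the check still fails because the match ratio is below 0.9.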

Aug 01 '24 04:08 q225yang

@q225yang How long did it take to fine-tune the model with the updated vocabulary until it reached a sufficiently low loss for you? Isn't it almost equivalent to pretraining if you add too many new vocabulary entries?

Jul 12 '25 06:07 Khachdallak02
