
Immune_All_Low - model training question

Open NikicaJEa opened this issue 2 years ago • 4 comments

Hi, I was trying to replicate your Immune_All_Low model using the same training dataset that you kindly provided (CellTypist_Immune_Reference_v2_count). I tested the two models (the original and the replicate) on an independent dataset. The final annotations differ quite a bit; in particular, the prediction scores from the original model are substantially higher (spanning 0-1) than those of the replicate (mostly around 0). Here is the model training code:

import scanpy as sc
import celltypist
from celltypist import models

# load the built-in model to extract its feature genes
original_immune_all_low_classifier = models.Model.load("Immune_All_Low.pkl")

train_data = sc.read("./CellTypist_Immune_Reference_v2_count.h5ad")
gene_list = original_immune_all_low_classifier.features

# include only genes that are in train_data.var_names
valid_genes = [gene for gene in gene_list if gene in train_data.var_names]

# CellTypist expects log1p-transformed expression normalized to 10,000 counts per cell
sc.pp.normalize_total(train_data, target_sum=1e4)
sc.pp.log1p(train_data)

train_data = train_data[:, valid_genes]  # subset to the markers used by the original model

classifier = celltypist.train(train_data, labels='label', n_jobs=16,
                              feature_selection=False, use_SGD=True, mini_batch=False,
                              check_expression=False, balance_cell_type=False)

I also tried normalizing after the gene subsetting to see if it would make a difference, but nothing changed much. Thanks for the help!
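For reference, here is a minimal sketch of how I compare the two models on the independent dataset, assuming a hypothetical test file (the path and variable names are placeholders; probability_matrix in CellTypist's AnnotationResult holds the per-cell prediction scores):

# a minimal sketch; "./independent_dataset.h5ad" is a hypothetical path
test_data = sc.read("./independent_dataset.h5ad")
sc.pp.normalize_total(test_data, target_sum=1e4)
sc.pp.log1p(test_data)

# annotate with the built-in model and with the retrained one
res_original = celltypist.annotate(test_data, model="Immune_All_Low.pkl")
res_replicate = celltypist.annotate(test_data, model=classifier)

# compare the distribution of per-cell maximum prediction scores
print(res_original.probability_matrix.max(axis=1).describe())
print(res_replicate.probability_matrix.max(axis=1).describe())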

NikicaJEa avatar Jan 10 '24 14:01 NikicaJEa

I am also interested in this

KatarinaLalatovic avatar Jan 22 '24 15:01 KatarinaLalatovic

@NikicaJEa, if you only select a subset of genes for training, SGD is not necessary; you can safely turn it off to use a canonical logistic regression.
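A minimal sketch of what that call might look like, assuming the same train_data and labels as above (this is a suggestion, not the exact command used to build the original model):

# turn off SGD to fit a canonical (full-batch) logistic regression;
# mini_batch is dropped since it only applies when use_SGD=True
classifier = celltypist.train(train_data, labels='label', n_jobs=16,
                              feature_selection=False, use_SGD=False,
                              check_expression=False)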

ChuanXu1 avatar Jan 22 '24 21:01 ChuanXu1

Thanks for your reply @ChuanXu1. I experimented with all the combinations I could think of: with/without SGD, mini-batching, balance_cell_type, without subsetting genes, and with the feature-selection step. Unfortunately, none of these combinations yielded results comparable to the original Immune_All_Low model. I understand some degree of randomness is to be expected, but this is more than that. It would help to know the specific parameters under which the original model was trained.

NikicaJEa avatar Jan 23 '24 08:01 NikicaJEa

@NikicaJEa, to produce a model with performance comparable to the built-in models, use the same set of genes (which you have already done, plus check_expression = False) and increase the number of iterations (for example, max_iter = 1000), with all other parameters left at their defaults.
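Putting that advice together, the training call might look like the sketch below, assuming the gene-subset train_data from earlier in the thread (the exact parameters used for the built-in model are not stated here):

# canonical logistic regression on the original model's gene set,
# with more iterations; all other parameters left at their defaults
# (use_SGD and feature_selection default to False)
classifier = celltypist.train(train_data, labels='label', n_jobs=16,
                              check_expression=False, max_iter=1000)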

ChuanXu1 avatar Jan 26 '24 20:01 ChuanXu1