
Max sequence length differs in training and test sets

Open tjeng opened this issue 1 year ago • 2 comments

Hi,

I noticed in the scGPT paper that you set a maximum context length of 1,200 for pre-training. However, in the fine-tuning notebook for annotation, the maximum sequence length is 3,000. Is it correct that the model can still run inference when the test set's context length differs from the training set's, because there is no positional embedding? And does performance differ when the maximum sequence length differs between training and inference, compared to keeping the context length the same for both?

tjeng avatar Jul 15 '24 16:07 tjeng
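A minimal sketch of what the two different caps could mean in practice. This is illustrative only, assuming each cell's gene sequence is truncated to the most highly expressed genes up to `max_len`; the function name and selection rule are hypothetical, not scGPT's actual preprocessing.

```python
# Hypothetical sketch: capping a per-cell gene sequence at different
# max lengths for pre-training (1,200) vs. fine-tuning (3,000).
# build_input and the expression-based selection rule are illustrative,
# not scGPT's actual API.
import numpy as np

def build_input(expr, max_len):
    """Keep the max_len most highly expressed genes of one cell."""
    order = np.argsort(expr)[::-1]   # gene indices, descending expression
    keep = order[:max_len]
    return keep, expr[keep]

rng = np.random.default_rng(1)
cell = rng.poisson(2.0, size=5000).astype(float)  # toy expression vector

genes_train, vals_train = build_input(cell, max_len=1200)  # pre-training cap
genes_test, vals_test = build_input(cell, max_len=3000)    # fine-tuning cap
print(len(genes_train), len(genes_test))  # 1200 3000
```

Under this selection rule, the shorter sequence is always a subset of the longer one, so a larger cap at fine-tuning time simply exposes the model to more genes per cell rather than changing which tokens the top genes map to.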

Hello, I also encountered the same question. Do you have any insight now? I changed it to 3000 and the code still worked, but I am not sure whether the pre-trained model still behaves correctly. When I change it, the results are slightly different.

doulijun777 avatar Oct 23 '25 22:10 doulijun777

I guess that in this model, gene expression data are converted into tokens and then pre-trained. So no matter how many genes we have, the fine-tuning process still works, because the model has pre-trained token embeddings.

doulijun777 avatar Oct 24 '25 02:10 doulijun777
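The point raised in the question, that a model without positional embeddings is length-agnostic, can be sketched with a toy attention layer. All names here are illustrative; this is not scGPT's code, just a demonstration that the learned weight matrices have shapes independent of sequence length, so the same parameters process 1,200 or 3,000 tokens.

```python
# Sketch (assumption): a toy single-head attention layer with no positional
# embedding. Its weights depend only on the embedding dimension, so the same
# trained parameters apply to any sequence length at inference time.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension

# Weight matrices are (d, d): their shapes do not involve the sequence length.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attention(x):
    """x: (seq_len, d) token embeddings; works for any seq_len."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# The same weights process a 1,200-token and a 3,000-token input.
out_short = attention(rng.standard_normal((1200, d)))
out_long = attention(rng.standard_normal((3000, d)))
print(out_short.shape, out_long.shape)  # (1200, 8) (3000, 8)
```

This is also why results can still shift when the cap changes, as noted above: the architecture accepts the longer input, but the extra tokens alter the attention distribution, so outputs for the original tokens are not guaranteed to be identical.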