[help needed] complete loss of performance after fine-tuning for integration
Hi! I'm trying to fine-tune scGPT to integrate multiple libraries. After fine-tuning for 5 epochs (the recommended setting is 15 epochs, but each epoch takes ~8 h to run), I embedded the cells with the saved model, and the result looks nothing like the zero-shot embedding, which was honestly quite good!
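For reference, this is roughly how I computed the zero-shot embedding (a minimal sketch; I'm assuming the `scgpt.tasks.embed_data` helper here, and the model directory is a placeholder for my local checkpoint):

```python
import scanpy as sc
import scgpt

adata = sc.read_h5ad("../data/combined/adata.h5ad")

# zero-shot embedding with the pretrained checkpoint (placeholder path);
# "Symbol" is the adata.var column holding gene symbols in my dataset
adata = scgpt.tasks.embed_data(
    adata,
    model_dir="../models/scGPT_human",
    gene_col="Symbol",
    batch_size=64,
)

# the cell embeddings end up in adata.obsm["X_scGPT"] in my runs
sc.pp.neighbors(adata, use_rep="X_scGPT")
sc.tl.umap(adata)
sc.pl.umap(adata, color=["celltype", "library"])
```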
First, pictures of what I mean (top: zero-shot; bottom: fine-tuned for 5 epochs):
1. the time variable, and the same variable treated as categorical
2. annotation
3. an example of a gene with biological signal
Second, here is the code I used to load my dataset:
```python
import scanpy as sc

adata = sc.read_h5ad('../data/combined/adata.h5ad')

ori_batch_col = "library"                                             # batch key used for integration
adata.obs["celltype"] = adata.obs["celltype_v1"].astype("category")   # cell-type labels as categorical
adata.var = adata.var.set_index("Symbol")                             # index genes by symbol
data_is_raw = True                                                    # adata.X holds raw counts
```
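Since `data_is_raw` controls the preprocessing in the tutorial notebook, here is the quick check I ran to make sure `adata.X` really holds raw counts (just a sketch):

```python
import numpy as np
from scipy.sparse import issparse

X = adata.X[:1000]  # spot-check a slice of cells; the full matrix is slow
vals = X.data if issparse(X) else np.asarray(X).ravel()

# raw counts should be non-negative integers
assert vals.min() >= 0
assert np.allclose(vals, np.round(vals)), "adata.X does not look like raw counts"
```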
The rest of the parameters in the notebook are unchanged from the proposed tutorial. To embed the data, I had to provide vocab and args JSON files, which I simply copied over from the pretrained model that I fine-tuned.
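Concretely, the copying step looked like this (directory names are placeholders for my local paths):

```python
import shutil
from pathlib import Path

pretrained_dir = Path("../models/scGPT_human")        # pretrained checkpoint (placeholder)
finetuned_dir = Path("../save/finetune_integration")  # where fine-tuning wrote the model (placeholder)

# the embedding step expects vocab.json and args.json next to the model weights
for fname in ("vocab.json", "args.json"):
    shutil.copy(pretrained_dir / fname, finetuned_dir / fname)
```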
Specific questions:
- My batch variable is the library, and there are >100 different libraries in the dataset. Is that many batches a problem for integration?
- Do the celltype and batch variables in the metadata have to be categorical, or does it not matter? (My current handling is in the sketch after this list.)
- Which model should we fine-tune: the one pretrained on whole-body data, or the whole-body model with continued pretraining?
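For context on the categorical question, this is how I currently set up the columns (a sketch; the `batch_id` derivation follows what I understand the tutorial does with category codes):

```python
# cast both metadata columns to pandas categoricals
adata.obs["celltype"] = adata.obs["celltype_v1"].astype("category")
adata.obs["library"] = adata.obs["library"].astype("category")

# integer batch ids derived from the category codes, as in the tutorial notebook
adata.obs["batch_id"] = adata.obs["library"].cat.codes.values
```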
Any clue as to what's going on with my fine-tuning? There are too many parameters in the notebook for me to debug them all one by one, so any advice from the authors or the community is deeply appreciated.