[help needed] complete loss of performance after fine-tuning for integration
Hi! I'm trying to fine-tune scGPT to integrate multiple libraries. After fine-tuning for 5 epochs (the recommended setting is 15 epochs, but each epoch takes ~8 h to run), I embedded the cells with the saved model, and the result looks nothing like the zero-shot embedding, which was honestly quite good!
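For reference, this is roughly how I computed the zero-shot embedding (a minimal sketch; I'm assuming the `scgpt.tasks.embed_data` helper here, and the model directory is a placeholder for my local checkpoint):

```python
import scanpy as sc
import scgpt

adata = sc.read_h5ad("../data/combined/adata.h5ad")

# zero-shot embedding with the pretrained checkpoint (placeholder path);
# "Symbol" is the adata.var column holding gene symbols in my dataset
adata = scgpt.tasks.embed_data(
    adata,
    model_dir="../models/scGPT_human",
    gene_col="Symbol",
    batch_size=64,
)

# the cell embeddings end up in adata.obsm["X_scGPT"] in my runs
sc.pp.neighbors(adata, use_rep="X_scGPT")
sc.tl.umap(adata)
sc.pl.umap(adata, color=["celltype", "library"])
```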
First, pictures of what I mean (top: zero-shot; bottom: fine-tuned for 5 epochs):
1. the time variable, and the same variable treated as categorical
2. annotation
3. an example of a gene with biological signal
Second, here is the code I used to load my dataset:
```python
import scanpy as sc

adata = sc.read_h5ad('../data/combined/adata.h5ad')

ori_batch_col = "library"                                             # batch key used for integration
adata.obs["celltype"] = adata.obs["celltype_v1"].astype("category")   # cell-type labels as categorical
adata.var = adata.var.set_index("Symbol")                             # index genes by symbol
data_is_raw = True                                                    # adata.X holds raw counts
```
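Since `data_is_raw` controls the preprocessing in the tutorial notebook, here is the quick check I ran to make sure `adata.X` really holds raw counts (just a sketch):

```python
import numpy as np
from scipy.sparse import issparse

X = adata.X[:1000]  # spot-check a slice of cells; the full matrix is slow
vals = X.data if issparse(X) else np.asarray(X).ravel()

# raw counts should be non-negative integers
assert vals.min() >= 0
assert np.allclose(vals, np.round(vals)), "adata.X does not look like raw counts"
```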
The rest of the parameters in the notebook are unchanged from the proposed tutorial. To embed the data, I had to provide vocab and args JSON files, which I simply copied over from the pretrained model that I fine-tuned.
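Concretely, the copying step looked like this (directory names are placeholders for my local paths):

```python
import shutil
from pathlib import Path

pretrained_dir = Path("../models/scGPT_human")        # pretrained checkpoint (placeholder)
finetuned_dir = Path("../save/finetune_integration")  # where fine-tuning wrote the model (placeholder)

# the embedding step expects vocab.json and args.json next to the model weights
for fname in ("vocab.json", "args.json"):
    shutil.copy(pretrained_dir / fname, finetuned_dir / fname)
```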
Specific questions:
- My batch variable is the library, and there are >100 different libraries in the dataset. Is that many batches a problem for integration?
- Do the celltype and batch variables in the metadata have to be categorical, or does it not matter? (My current handling is in the sketch after this list.)
- Which model should we fine-tune: the one pretrained on whole-body data, or the whole-body model with continued pretraining?
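For context on the categorical question, this is how I currently set up the columns (a sketch; the `batch_id` derivation follows what I understand the tutorial does with category codes):

```python
# cast both metadata columns to pandas categoricals
adata.obs["celltype"] = adata.obs["celltype_v1"].astype("category")
adata.obs["library"] = adata.obs["library"].astype("category")

# integer batch ids derived from the category codes, as in the tutorial notebook
adata.obs["batch_id"] = adata.obs["library"].cat.codes.values
```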
Any clue as to what's going on with my fine-tuning? There are too many parameters in the notebook for me to debug them all one by one, so any advice from the authors or the community is deeply appreciated.