Which is a more sensible SCTransform-integration approach to account for donor variability and between-dataset sequencing depth?

Open denvercal1234GitHub opened this issue 1 year ago • 3 comments

Related to #4161, I have 6 scRNA-seq datasets of B cells as in the picture below.

From merging analysis, CD39 status did not impact the clustering, but I saw a high donor variability (donor-specific clusters). So I aim to integrate by donor_ID.

My goal is to investigate the heterogeneity of these B cells with a particular interest in CD39 expressing B cells. I want to correct for donor variability and sequencing depth between datasets.

Question 1. If I should do SCTransform() each of 6 datasets individually (or within each CD39+ or CD39- dataset?) (given high sample-effect #2826 #1580), how can I account for sequencing depth BETWEEN these 6 datasets?

I thought SCTransform is a per-cell normalization, so it would not account for sequencing depth between datasets. Or do we not need to care about sequencing depth differences between datasets, and only need to account for depth within a dataset?

Question 2. Which approach below would be better for my goal above?

Question 3. If I integrate by donor_ID, do I need to also set latent.vars="donor_ID" in FindMarkers()?

Thank you so much for your advice!

Approach 1:

  1. Skip cellranger aggr step
  2. SCTransform, setting vars.to.regress = c("each_dataset_ID", "percent.mt") on each of 6 datasets individually
  3. Integrate by donor_ID the 3 donors within the CD39+ dataset.
  4. Integrate by donor_ID the 3 donors within the CD39- dataset.
  5. Merge CD39+ and CD39- dataset
  6. Cluster, then FindMarkers(latent.vars="donor_ID")
[Screenshot, 2022-09-01 9:52:04 AM]

Approach 2: [Screenshot, 2022-09-01 9:50:29 AM; shown only as an image]

Approach 3:

  1. Skip cellranger aggr step
  2. Merge all 6 objects
  3. SCTransform, setting vars.to.regress = c("percent.mt", "each_dataset_ID") on the merged object to adjust for percent.mt and differences between datasets, including sequencing depth
  4. Split back into 6 individual objects
  5. Integrate by donor_ID the 3 donors within the CD39+ dataset
  6. Integrate by donor_ID the 3 donors within the CD39- dataset
  7. Merge CD39+ and CD39- dataset
  8. Cluster, then FindMarkers(latent.vars=c("donor_ID"))

denvercal1234GitHub, Aug 30 '22 12:08

> I saw a high donor variability (donor-specific clusters)

Is that completely unexpected in your dataset? If there was no biological variability, you wouldn't have biological replicates, would you?

Anyway, what I would do is stop overthinking it. For example, I don't see why you would consider "Approach 3" or splitting CD39- and CD39+. The first question is: what's the batch structure? I couldn't tell from the diagrams. Were all the samples sequenced on the same day, or on different days?

Try running the data integration according to your batch structure and visualise the samples on the UMAP. You can always plot the number of genes and UMIs on top of the integrated UMAP to see whether sequencing depth results in "technical" clusters. But it's counter-productive to reason about a specific approach without seeing any of the plots.
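A minimal sketch of that check, assuming a Seurat v4 object that already has a UMAP reduction named "umap", the default nCount_RNA (UMIs) and nFeature_RNA (genes) metadata columns, and a donor_ID metadata column. The function name plot_depth_on_umap is hypothetical, not part of Seurat:

```r
# Hypothetical helper: overlay donor identity and sequencing depth on a UMAP.
# Assumes `obj` has a "umap" reduction plus nCount_RNA / nFeature_RNA metadata.
plot_depth_on_umap <- function(obj, donor_col = "donor_ID") {
  list(
    # Are clusters dominated by single donors?
    by_donor = Seurat::DimPlot(obj, reduction = "umap", group.by = donor_col),
    # Do UMI or gene counts line up with particular clusters?
    by_depth = Seurat::FeaturePlot(
      obj,
      reduction = "umap",
      features = c("nCount_RNA", "nFeature_RNA")
    )
  )
}
```

If clusters track nCount_RNA or nFeature_RNA more closely than any known biology, depth is a plausible technical driver and worth addressing before interpreting the clusters.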

f6v, Sep 01 '22 11:09

Thank you so much @f6v for your input. The 3 donor datasets for CD39neg were sequenced together. The 3 donor datasets for CD39pos were then later sequenced.

(1) When I simply merged 3 donor datasets within the CD39neg, I saw donor-specific clusters. (2) Same observations when I simply merged 3 donor datasets within the CD39pos. (3) When I merged the 2 CD39neg and CD39pos datasets, I also saw donor-specific clusters, whereas CD39 pos-or-neg status did not drive the clustering when visualized on UMAP.

From these (1-3) observations, I decided to integrate the datasets by donor_ID.

Whether the donor variability is biological or technical is interesting, but I am unsure. What I want to ensure is that the clusters I obtain are found regardless of which donor is used (i.e. each cluster should contain cells from all or most donors). In other words, I do not want clusters that appear only because we happened to pick a donor with something unusual about them.

I agree Approach 1 sounds simpler and better?

denvercal1234GitHub, Sep 01 '22 17:09

@denvercal1234GitHub in my experience, between-donor effects can be more or less pronounced depending on the experiment and on chance. I've seen some batches that are well-aligned, and others that aren't. Feel free to hit me up at ihor.filippov at ut dot ee if you want to discuss this some time.

f6v, Sep 02 '22 08:09

Hi,

If I understood correctly, you have 6 samples, so go with Approach 1:

  • Merge all samples together (so that all objects have the same number of rows, i.e. genes), then split them by donor_id:
    Seurat.combined <- merge(x = ID_1, y = c(ID_2, ID_3, ID_4, ID_5, ID_6))
    Seurat.split <- SplitObject(Seurat.combined, split.by = "donor_id")
  • Run SCTransform on each dataset with vars.to.regress = "percent.mt". There is no need to include "donor_id", because SCTransform is applied to each dataset separately:
    Seurat.split <- lapply(X = Seurat.split, FUN = SCTransform, assay = "RNA", return.only.var.genes = FALSE, vars.to.regress = c("percent.mt"))
  • Perform the integration as described in the vignette (do not run ScaleData on the integrated object in an SCTransform-based workflow).
  • Perform the differential expression analysis, but it is not advised to use the SCT assay for this; it is better to use the RNA assay.
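The bullets above can be sketched end to end as follows. This is an illustration rather than the commenter's own code: it assumes Seurat v4 (the anchor-based SCT integration workflow), a merged object with "percent.mt" and a donor column already in its metadata, and arbitrary illustrative values for the number of features, dimensions, and clustering resolution. The function name integrate_by_donor is hypothetical:

```r
# Hypothetical helper sketching per-donor SCTransform followed by
# anchor-based integration (Seurat v4 style). Parameter values are
# illustrative assumptions, not recommendations from the thread.
integrate_by_donor <- function(merged, donor_col = "donor_id",
                               n_features = 3000, dims = 1:30) {
  # One object per donor, each normalized separately with SCTransform
  obj_list <- Seurat::SplitObject(merged, split.by = donor_col)
  obj_list <- lapply(
    obj_list, Seurat::SCTransform,
    assay = "RNA", vars.to.regress = "percent.mt", verbose = FALSE
  )

  # Select shared features and prepare the SCT residuals for integration
  features <- Seurat::SelectIntegrationFeatures(
    object.list = obj_list, nfeatures = n_features
  )
  obj_list <- Seurat::PrepSCTIntegration(
    object.list = obj_list, anchor.features = features
  )

  # Find anchors between donors and integrate
  anchors <- Seurat::FindIntegrationAnchors(
    object.list = obj_list,
    normalization.method = "SCT",
    anchor.features = features
  )
  integrated <- Seurat::IntegrateData(
    anchorset = anchors, normalization.method = "SCT"
  )

  # No ScaleData here: the integrated assay already holds SCT residuals
  integrated <- Seurat::RunPCA(integrated, verbose = FALSE)
  integrated <- Seurat::RunUMAP(integrated, reduction = "pca", dims = dims)
  integrated <- Seurat::FindNeighbors(integrated, reduction = "pca", dims = dims)
  Seurat::FindClusters(integrated, resolution = 0.5)
}
```

Usage would look like `integrated <- integrate_by_donor(Seurat.combined)`, where Seurat.combined is the merged object from the first bullet.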

Best,

WesDe, Oct 05 '22 12:10

> Question 1. If I should do SCTransform() each of 6 datasets individually (or within each CD39+ or CD39- dataset?) (given high sample-effect https://github.com/satijalab/seurat/issues/2826 https://github.com/satijalab/seurat/issues/1580), how can I account for sequencing depth BETWEEN these 6 datasets?

Ideally yes. PrepSCTFindMarkers() adjusts for sequencing depth variation between datasets. The integration workflow works on the residuals directly, so a readjustment is not necessary there (it is done implicitly within the workflow).

> Question 3. If I integrate by donor_ID, do I need to also set latent.vars="donor_ID" in FindMarkers()?

Yes, it might often be helpful to adjust for other covariates when running FindMarkers(). You could use the LR test here.

To clarify, you can perform DE on the SCT assay using the data slot, as we show in the v2 vignette.
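As a rough sketch of the two answers above combined, assuming a Seurat version that provides PrepSCTFindMarkers() (v4.1 or later), an integrated object whose "SCT" assay holds the per-dataset SCT models, and a donor_ID metadata column. The function name find_markers_adjusted is hypothetical:

```r
# Hypothetical helper: depth-corrected DE on the SCT assay with the
# donor as a latent variable, using the logistic regression (LR) test.
find_markers_adjusted <- function(integrated, ident_1, ident_2 = NULL,
                                  donor_col = "donor_ID") {
  # Re-correct counts across the SCT models so that datasets with
  # different sequencing depths are comparable
  integrated <- Seurat::PrepSCTFindMarkers(integrated, assay = "SCT")

  # LR test supports latent.vars; FindMarkers uses the data slot by default
  Seurat::FindMarkers(
    integrated,
    assay = "SCT",
    ident.1 = ident_1,
    ident.2 = ident_2,
    test.use = "LR",
    latent.vars = donor_col
  )
}
```

Leaving ident_2 as NULL compares ident_1 against all other cells, which is the default behaviour of FindMarkers().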

saketkc, Oct 07 '22 15:10