seurat Performing integration on datasets normalized with SCTransform or not?

Hi,

I'm analyzing a dataset that has two conditions (Normal and Tumor), following the integration vignette. In this case, is it better to normalize with SCTransform or not? I tried SCTransform, but when rescuing all the genes (not just the anchors) it returns the error: vector memory exhausted.

What is the best normalization option in this case, for 1 dataset with 2 conditions?

Thanks!

Aug 29 '22 19:08 PilanEli

Hi,

I am thankful for the feedback!

The dataset has 57,530 cells from 24 primary tumors and 11 controls. My Mac has 16 GB of RAM. Can you please recommend an article?

Best regards,

Eliane

On Tue, Aug 30, 2022 at 5:36 AM f6v @.***> wrote:

I think you're conflating two things.

First, your error might be related to the dataset size and/or hardware configuration. How many cells and how much RAM do you have?

Second, I can tell you with 100% certainty that nobody is going to tell you there is one "right way" for your dataset. There are some cases where data integration might be appropriate, but there are many considerations depending on the number of samples in each condition, sample type, downstream applications, etc. To develop an intuition, I'd suggest going through several dozen recent scRNA-seq papers. Some of these use some kind of "batch effect correction", and some don't.

View it on GitHub: https://github.com/satijalab/seurat/issues/6358#issuecomment-1231341179

--

*Eliane Graciela Pilan - * CRBio: 116752/01-D MSc, PhD Candidate - Biological Sciences (Genetics) Department of Structural and Functional Biology, Botucatu São Paulo State University (Unesp), Institute of Biosciences, Botucatu https://orcid.org/0000-0003-1846-3380

Aug 30 '22 13:08 PilanEli

@PilanEli sorry I removed the post since I thought it was less relevant.

I can say with confidence that 16 GB isn't enough for this dataset. You're looking at the 32-64 GB range, even when using RPCA for the integration. Feel free to message me at ihor.filippov at ut dot ee. I'd be glad to give some hints.
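
As a rough illustration of why integrating all genes can exhaust memory on a 16 GB machine, here is a back-of-the-envelope sketch in Python. It assumes the corrected expression matrix is held densely as 8-byte doubles and uses a hypothetical 20,000 genes (the thread does not state the gene count). The function name is made up for illustration.

```python
def dense_matrix_gib(n_genes: int, n_cells: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of one dense genes x cells matrix of fixed-width values."""
    return n_genes * n_cells * bytes_per_value / 2**30


n_cells = 57_530      # cell count reported in the thread
all_genes = 20_000    # ASSUMPTION: hypothetical gene count, not given in the thread
hvg_genes = 3_000     # upper end of the usual 2000-3000 variable-feature range

full_gib = dense_matrix_gib(all_genes, n_cells)
hvg_gib = dense_matrix_gib(hvg_genes, n_cells)

print(f"all genes:     {full_gib:.1f} GiB per dense copy")
print(f"variable only: {hvg_gib:.1f} GiB per dense copy")
```

This counts a single copy only. Intermediate copies made during integration would likely push peak usage to some multiple of that figure, which is in line with the 32-64 GB estimate above.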

Aug 31 '22 14:08 f6v

Hi

I am thankful for the feedback!

The dataset has 57,530 cells from 24 primary tumors and 11 controls. I integrated the groups (tumor and normal). However, my supervisor says that we must rescue all genes and not just the anchors. By rescuing only the anchors, can we lose important genes? For example, genes that are expressed only in tumor cells?

Could you recommend an article on when to use the integration and when not to?

Best regards,

Eli

Sep 06 '22 03:09 PilanEli

"The dataset had 57,530 cells from 24 primary tumors and 11 controls. I integrated the groups (tumor and normal)"

Why? Were tumors and controls sequenced separately? If so, why? That would confound the findings. If tumors and controls were sequenced in the same batches, why did you choose to integrate by condition and not by batch? Differences between tumor and normal tissues are very likely driven by biology (provided your batch structure is statistically sound). I mean, there are many things you need to take into account.

You are going to "lose" genes in any case, since the typical workflow selects HVGs (highly variable genes) and performs PCA, UMAP, etc. on that feature subset, usually in the range of 2000-3000 genes. That's the case regardless of whether you choose to adjust for batch effects or not. I believe that integration features are selected by looking for HVGs in the samples. That doesn't affect differential gene expression though, since you can specify all the genes in the dataset.
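
To make the distinction concrete, here is a toy sketch in Python with NumPy, using simulated data. It is not Seurat's procedure: the plain variance ranking below is a simplified stand-in for real HVG selection, and the gene index, group sizes, and Poisson rates are all made up. It only shows that the feature subset feeds the embedding, while a tumor-versus-normal comparison can still be computed over every gene.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_hvg = 200, 50, 10

# Simulated counts; the first 100 cells play the role of "tumor" cells.
counts = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
is_tumor = np.arange(n_cells) < 100

# Hypothetical gene 7 is expressed only in tumor cells.
counts[:, 7] = 0.0
counts[is_tumor, 7] = rng.poisson(5.0, size=is_tumor.sum())

log_norm = np.log1p(counts)

# Simplified stand-in for HVG selection: keep the n_hvg highest-variance genes.
hvg_idx = np.argsort(log_norm.var(axis=0))[-n_hvg:]
embedding_input = log_norm[:, hvg_idx]  # only this subset would go to PCA/UMAP

# A DE-style comparison is not restricted to the subset: it uses all genes.
mean_diff = log_norm[is_tumor].mean(axis=0) - log_norm[~is_tumor].mean(axis=0)

print("embedding input shape:", embedding_input.shape)
print("genes compared:", mean_diff.shape[0])
print("tumor-only gene difference:", round(float(mean_diff[7]), 2))
```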

As for the article, I'd suggest looking for papers citing Seurat integration, Harmony, BBKNN, etc. You'll see that the choice of whether to use batch correction or not is very subjective. Look for figures in supplements, like Extended Data Fig. 4 in https://doi.org/10.1038/s41591-019-0522-3. You can notice that some of the clusters are driven by one patient, but the authors were fine with that.

Sep 06 '22 14:09 f6v

Hi Satijalab/Seurat!

Thanks very much! I'm analyzing a dataset in which the sequencing was done by sample. Each sample corresponds to a different patient. The dataset is available at: GSA: CRA001160. In this case, should the integration be by sample/patient?

Best Regards,

Eliane

Sep 08 '22 01:09 PilanEli

@PilanEli I think you're missing the point. There isn't going to be an answer to that question without processing the data and looking at some plots. You need to run a workflow without integration and determine if there's a batch structure in your data that's distinct from biological differences.
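
Besides looking at plots, one simple numeric check after an unintegrated run is to tabulate how much of each cluster comes from a single sample. The sketch below is plain Python with made-up cluster and patient labels. The function name is hypothetical, and how large a fraction counts as "driven by one patient" is a judgment call, not a rule.

```python
from collections import Counter, defaultdict


def dominant_sample_fraction(clusters, samples):
    """For each cluster, return (most common sample, its share of the cluster's cells)."""
    per_cluster = defaultdict(Counter)
    for cluster, sample in zip(clusters, samples):
        per_cluster[cluster][sample] += 1
    result = {}
    for cluster, counts in per_cluster.items():
        sample, n = counts.most_common(1)[0]
        result[cluster] = (sample, n / sum(counts.values()))
    return result


# Made-up labels: one entry per cell.
clusters = ["0", "0", "0", "0", "1", "1", "1", "1", "2", "2", "2", "2"]
samples = ["P1", "P2", "P3", "P1", "P2", "P2", "P2", "P2", "P1", "P3", "P2", "P3"]

result = dominant_sample_fraction(clusters, samples)
for cluster, (sample, fraction) in sorted(result.items()):
    print(f"cluster {cluster}: {fraction:.0%} of cells from {sample}")
```

In this toy input, cluster 1 consists entirely of cells from one patient. That pattern may reflect a batch effect or genuine patient-specific biology, so it is a prompt for closer inspection, not an answer.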

Sep 09 '22 08:09 f6v

It is difficult to answer this question without a bit more context. If there are strong batch effects when analyzing the datasets together (no integration), you would need integration.

Oct 07 '22 15:10 saketkc

Hi, thank you for the guidance and clarification. I believe that we were not correctly understanding the purpose of integration. Control versus tumor samples from different patients do not necessarily fit the "different conditions" integration scenario. It would fall under "different batches" (e.g. when experimental conditions make batch processing of samples necessary), am I right? I will process the data and check it according to your guidelines, and I will read more about these questions.

Best regards,

Oct 11 '22 08:10 PilanEli
