cellrouter Too many subpopulations?

Open MingBit opened this issue 6 years ago • 10 comments

Hi,

Thanks again for the great work. :) I'm testing cellrouter with our own data (two conditions at day 3). About 20 cell sub-populations were identified with K = 12, the value suggested by the findK function. I tried other K values as well.

SC3 (https://github.com/hemberg-lab/SC3) identifies relatively few clusters, which is close to our expectation. So I'm wondering whether cellrouter tends to give many sub-populations, even when the input data are collected from two conditions at a single time point.

Looking forward to your response.

May 02 '18 13:05 MingBit

Hi,

Thank you for using our software! The clustering algorithm that cellrouter currently implements aims to identify more clusters, to allow reconstruction of trajectories between specific locations in the dimension reduction plot. I have also noticed that this might not be ideal for some datasets. So, you can increase K to obtain fewer populations, until the number of clusters is similar to the one you obtain with SC3. I am working on optimizing this step in CellRouter, and I am also including an option to use previously identified clusters as input, so you could use your SC3 clusters as input to CellRouter. Unfortunately, this will take me about 2 weeks to finish, so the quickest solution would be to increase K.

Please let me know if that helps! I am working to improve cellrouter, and comments/suggestions are very welcome!
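The effect of K on the number of groups can be illustrated with a toy example. The sketch below is not CellRouter code: it builds a symmetric kNN graph on hypothetical 1-D points in plain Python and counts connected components as a crude stand-in for Louvain communities. The function names (`knn_graph`, `count_components`) and the data are assumptions made for illustration only.

```python
from collections import deque


def knn_graph(points, k):
    """Return an undirected kNN graph as {node: set(neighbors)}."""
    n = len(points)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        # Rank the other points by distance to point i (ties broken by index).
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (abs(points[i] - points[j]), j))
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)  # symmetrize the edge
    return graph


def count_components(graph):
    """Count connected components with a breadth-first search."""
    seen = set()
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return components


# Hypothetical "cells": three well-separated groups of 3, 5 and 7 points.
points = [0, 1, 2, 10, 11, 12, 13, 14, 50, 51, 52, 53, 54, 55, 56]
groups_per_k = {k: count_components(knn_graph(points, k)) for k in (2, 3, 7)}
print(groups_per_k)
```

In this toy case the number of groups can only shrink as K grows, because the graph for a larger K contains every edge of the graph for a smaller K. Louvain communities do not carry that guarantee, so this only illustrates the general trend described above.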

Thanks a lot!

-- Edroaldo

May 02 '18 13:05 edroaldo

Thanks for your explanation. :) In terms of K values, as I understand it, a KNN graph is generated and cell sub-populations are then detected in it by the Louvain community detection method. The sub-populations are then used for trajectory analysis. In your tutorial example (https://github.com/edroaldo/cellrouter/blob/master/stemid/StemID_BM_CellRouter.md), K=5 was used for cell clustering and K=10 was used for trajectory analysis. So I'm wondering: should the K values be identical in the two parts of the analysis? Otherwise, the cell sub-populations would be different. Thanks.

May 03 '18 10:05 MingBit

You are exactly right. It is fine to use different values of K for clustering and trajectory identification. For example, the clusters identified with K=5 will be the ones used for trajectory analysis between subpopulations. You can choose another value of K for trajectory analysis when your kNN graph is not fully connected, or when the subpopulations in the transition that you want to study are not connected in the kNN graph. Regardless of the value of K that you choose for the trajectory analysis step, the subpopulations used will be the ones identified in the first step. In the first tutorial on GitHub, in the section "Starting trajectory analysis", the first figure shows the connections between the subpopulations, that is, the edges of the kNN graph visualized in the tSNE space. In that figure, if you want to study the transition from 24 to 2, you will need to increase K such that clusters 3 or 4 become connected to cluster 2.
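A minimal sketch of the idea, under stated assumptions: the cluster labels are fixed after the clustering step, and only the K used to build the graph for trajectory analysis changes. This is plain Python on hypothetical 1-D data, not the CellRouter implementation; `cluster_links` and `reachable` are made-up helper names.

```python
def cluster_links(points, labels, k):
    """Pairs of distinct clusters joined by at least one kNN edge."""
    links = set()
    n = len(points)
    for i in range(n):
        # The k nearest neighbors of cell i (ties broken by index).
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: (abs(points[i] - points[j]), j))[:k]
        for j in neighbors:
            if labels[i] != labels[j]:
                links.add(frozenset((labels[i], labels[j])))
    return links


def reachable(links, source, target):
    """Depth-first search over the cluster-level graph."""
    stack, seen = [source], {source}
    while stack:
        current = stack.pop()
        if current == target:
            return True
        for link in links:
            if current in link:
                (other,) = link - {current}
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return False


# Hypothetical cells; the labels come from the clustering step and never change.
points = [0, 1, 2, 3, 20, 21, 22, 23, 60, 61, 62, 63, 64, 65]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 6

# Only the K of the trajectory step varies.
path_exists = {k: reachable(cluster_links(points, labels, k), "A", "C")
               for k in (3, 4, 6)}
print(path_exists)
```

With small K the clusters form isolated islands and no path from A to C exists; once K is large enough for cells in C to pick up a neighbor in B, the transition A to C becomes traversable while the cluster labels stay the same.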

Please, let me know if it is clear...

Thanks!

-- Edroaldo

May 03 '18 12:05 edroaldo

Yesss... It's clear enough for me now how to choose K values. :D Since I currently only have scRNA-seq datasets from a limited number of time points, I'm more interested in the cell clustering part. ^_^ As I noticed, some trajectory analysis tools, e.g. Wanderlust, Monocle2 (https://github.com/cole-trapnell-lab/monocle-release) and Scanpy (https://github.com/theislab/scanpy), tend to use graph-based methods (e.g. KNN + Louvain clustering) for sub-population identification. Other tools, including SC3 (https://github.com/hemberg-lab/SC3), CIDR (https://github.com/VCCRI/CIDR/blob/master/R/CIDR.R) and RaceID (https://github.com/dgrun/RaceID/blob/master/RaceID_class.R), use either K-means or hierarchical clustering. I've played with the above packages a little and found that, for datasets with two or three conditions at one time point, graph-based methods tend to give more clusters than K-means/hierarchical clustering.

So for these single-time-point datasets, I'm wondering whether it might be better not to use graph-based methods for clustering if I just want to do cell sub-population identification and differential gene expression analysis. Please correct me if I'm mistaken. :)

Looking forward to your suggestion.

May 04 '18 12:05 MingBit

Ah! .. one more question about gene markers. I increased K a bit to get fewer sub-populations. I learned that differential gene expression analysis in CellRouter is based on mean expression values. I tried to create feature plots with plotDRExpression() for the top differentially expressed genes of each sub-population. Two normalisation methods (log, z-score) were used; unfortunately, they look relatively gradual and discrete. So it seems that the identified clusters are not optimal and that the DEG analysis is strongly affected by drop-out zeros. I'm wondering whether there is any other way to estimate K if findK() cannot give an optimal value. Or perhaps I have to go back to the preprocessing (e.g. feature selection, gene imputation).
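To make the concern about drop-out zeros concrete, here is a small sketch in plain Python. It is not how CellRouter or plotDRExpression() computes anything: the log(1 + x) transform, the population standard deviation in the z-score, and the counts themselves are all assumptions chosen for illustration.

```python
import math
import statistics


def log_normalize(counts):
    """log(1 + count) transform; the pseudocount of 1 is an assumption."""
    return [math.log1p(c) for c in counts]


def z_score(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical counts for one marker gene. Cells 0-4 belong to the cluster,
# cells 5-9 do not; the zeros inside the cluster mimic drop-out events.
counts = [8, 0, 6, 0, 7, 0, 1, 0, 0, 2]
in_cluster = [True] * 5 + [False] * 5

logged = log_normalize(counts)
scaled = z_score(logged)

cluster_values = [v for v, inside in zip(logged, in_cluster) if inside]
detected = [v for v in cluster_values if v > 0]

mean_all_cells = statistics.fmean(cluster_values)  # includes drop-out zeros
mean_detected = statistics.fmean(detected)         # only cells with signal
zero_fraction = 1 - len(detected) / len(cluster_values)

print(round(mean_all_cells, 3), round(mean_detected, 3), zero_fraction)
```

The cluster mean computed over all cells is noticeably lower than the mean over cells where the gene was detected, which is one way a mean-based comparison between clusters can be blurred by drop-out.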

Sry for my frequent posting... >_<...

CellRouter is very interesting for us and I'll be presenting this method in our group.. :D

May 04 '18 13:05 MingBit

Hi, I think you are correct in your observations. It usually requires some iterations to identify clusters and signatures that make sense biologically.

I am now actively working to improve the clustering part and also the differential expression component of cellrouter. I hope to update the GitHub page at some point late next week. I will be glad to hear your feedback on it! I am trying to extend CellRouter further to be a more complete tool...

Thanks a lot!

May 04 '18 15:05 edroaldo

The current findK function should not be used. I am also including another clustering algorithm in CellRouter, based on model-based clustering, and I will also make available an option to provide clusters identified by other tools as input.

I hope you can wait for the next release of CellRouter next week to try that out.

I also noticed that the analysis looks way better when data imputation methods are used, such as MAGIC or scImpute.

Hope it helps! I am actively working to release the new version of CellRouter next week.

Thank you very much for your interest in our work!

May 04 '18 16:05 edroaldo

Hey edroaldo,

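I'm really looking forward to the next version of CellRouter. 👍 Hmm.. Concerning the GRN score, I'm a little confused by m_t,j and m_i,j. In this formula: [image: screenshot of the GRN score formula, https://user-images.githubusercontent.com/22442392/39672625-32c3be28-512e-11e8-8a41-fc21577bbca6.png]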

m_t,j is the mean correlation of predicted targets of gene i regulated along trajectory j

And it was mentioned again here:

Moreover, if its predicted target genes are also well correlated with the differentiation trajectory, it is more likely that the regulator is important (parameter m_i,j)

I'm wondering whether they are actually the same. If not, what is t in m_t,j? Time series? :D Thank you, and looking forward to your reply.

May 06 '18 11:05 MingBit

That is a typo. It should be m_i,j. I will check with the journal how we could publish a correction for this.
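For readers who want a concrete picture of m_i,j as quoted above (the mean correlation of the predicted targets of gene i along trajectory j), here is a hedged sketch. The thread does not say which correlation measure is used or what the expression is correlated against, so Pearson correlation with the position along the trajectory is an assumption, as are the function names and the toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_target_correlation(target_profiles, positions):
    """Average, over the targets of one regulator, of each target's
    correlation with the position along one trajectory."""
    correlations = [pearson(profile, positions)
                    for profile in target_profiles.values()]
    return sum(correlations) / len(correlations)


# Hypothetical data: positions along trajectory j and the expression of two
# predicted targets of regulator i at those positions.
positions = [0, 1, 2, 3, 4]
targets_of_gene_i = {
    "target_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "target_b": [2.0, 2.0, 3.0, 4.0, 4.0],
}

m_ij = mean_target_correlation(targets_of_gene_i, positions)
print(round(m_ij, 4))
```

In this reading, the index i refers to the regulator whose targets are averaged and j to the trajectory, which is consistent with the clarification that m_t,j was a typo for m_i,j.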

Thanks!

-- Edroaldo

May 08 '18 16:05 edroaldo
