
Some questions

Open davemcg opened this issue 4 years ago • 6 comments

  1. Do you intend the PCA to be run on the counts data or transformed data?
  2. I assume the PCA is run on the entire dataset (and not per batch)?
  3. Is there any way of using the supervised mode if you only have a subset of the cells labelled?
  4. How would you extract the lower dimensional corrected space? Or should you just set the embedding_dims to something like 30?

davemcg avatar May 19 '20 16:05 davemcg

Thanks for your questions!

> 1. Do you intend the PCA to be run on the counts data or transformed data?

In our analysis we ran PCA on the transformed data, following the standard (scanpy) pipeline. Of course, other dimension reduction techniques that work on the raw counts (scVI, DCA) could be used instead. INSCT can be applied to 'any' reduced-dimension space derived from gene expression. In principle you could even run it on all genes and the raw counts; however, in our preliminary analyses, PCA on log-transformed data appeared to be the best choice.
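For reference, the log-transform + PCA step can be sketched with plain numpy on toy data (in a real pipeline you would run the equivalent scanpy steps, e.g. `sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.pca`, on the AnnData object):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy counts matrix standing in for adata.X: 200 cells x 50 genes
# (hypothetical data; in practice this comes from your AnnData object).
counts = rng.poisson(2.0, size=(200, 50)).astype(float)

# Library-size normalize and log-transform, mirroring the standard
# scanpy preprocessing steps.
lib_size = counts.sum(axis=1, keepdims=True)
logged = np.log1p(counts / lib_size * 1e4)

# PCA via SVD on the centered, log-transformed matrix.
centered = logged - logged.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
n_pcs = 30
pcs = U[:, :n_pcs] * S[:n_pcs]   # cell x PC coordinates fed into INSCT

print(pcs.shape)  # (200, 30)
```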

> 2. I assume the PCA is run on the entire dataset (and not per batch)?

Yes, the PCA is run on the entire (merged) dataset. This is important, as PCs from different analyses cannot be compared easily. This step is still resource intensive. If resources are limited, we suggest running PCA on a smaller subset of the data and then projecting the remaining cells into this space to get PCA coordinates for all cells.
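The subset-then-project idea can be sketched like this (numpy only, random data standing in for the real log-transformed matrix; sklearn users could equally fit `sklearn.decomposition.PCA` on the subset and call `.transform` on everything):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 50))   # all cells x genes (toy log-transformed values)
subset = rng.choice(X.shape[0], size=500, replace=False)

# Fit the PCA model (mean and loadings) on the subset only.
mean_ = X[subset].mean(axis=0)
_, _, Vt = np.linalg.svd(X[subset] - mean_, full_matrices=False)
loadings = Vt[:30].T              # gene x PC loadings

# Project *all* cells into the subset-derived space, so every cell
# gets coordinates in the same shared PC space.
pcs_all = (X - mean_) @ loadings
print(pcs_all.shape)  # (5000, 30)
```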

> 3. Is there any way of using the supervised mode if you only have a subset of the cells labelled?

Yes, absolutely. We call this the semi-supervised mode. Here, you need three inputs: 1) the adata object, 2) the column name of the anndata object containing the cell type labels, and 3) a 'masking' vector which tells the algorithm to ignore the cell type labels for a selection of cells. Please see the Pancreas example: `semi_supervised_model.fit(X=adata_semi_supervised, batch_name='batch', celltype_name='Celltypes', mask_batch=batch)`

> 4. How would you extract the lower dimensional corrected space? Or should you just set the embedding_dims to something like 30?

By transforming the data with a trained model you can derive the embedding, like this: `embedding = model.transform(adata)`. By default, the `embedding_dims` parameter of the `TNN()` function is set to 2.

Best,

Lukas

lkmklsmn avatar May 19 '20 17:05 lkmklsmn

Thanks - a preliminary run looks promising so I'm trying to optimize now.

  1. I saw this, but it appears that the blocking factor is the batch? Or am I reading this wrong? I have a celltype vector with the 12 cell types (e.g. "cones", "rods")... the missing ones (which are scattered across the batches) are called 'missing'.

Would I run it like this?

`semi_supervised_model.fit(X=adata_semi_supervised, batch_name='batch', celltype_name='celltype', mask_batch='missing')`

Or would I put all the missing cells into their own batch?

  2. While it is nice that TNN spits out the 2D space for plotting, I find having the "reduced" space crucial for downstream uses like cluster assignment and pseudotime.

Would you recommend running embedding_dims with, say, 30 for use in UMAP, clustering, pseudotime?

davemcg avatar May 19 '20 17:05 davemcg

I see your point. If the missing cells are scattered across batches, you would have to assign them to a specific batch called 'missing'. We realize this may not be the most straightforward implementation and will consider changes.

To be honest, we have not systematically evaluated using the multi-dimensional embedding in downstream analysis, as you suggest. I like the idea! We will definitely explore this direction, and I would be very curious to hear more about your experiences. You can certainly set embedding_dims to 30 (I would tend to go lower than the number of input PCs, since they also capture batch variation, which should be removed at this step).
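As a toy illustration of feeding a higher-dimensional embedding into downstream analysis, here is a minimal Lloyd's k-means (standing in for the usual neighbors/Leiden step) run on a made-up 30-dimensional embedding; `emb` is a placeholder for the output of `model.transform(adata)`:

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up 30-dimensional embedding standing in for model.transform(adata):
# two well-separated groups of cells, 100 each.
emb = np.vstack([rng.normal(0.0, 0.5, size=(100, 30)),
                 rng.normal(5.0, 0.5, size=(100, 30))])

# Tiny k-means, a stand-in for the usual graph-based clustering,
# run on the full embedding rather than on 2D plot coordinates.
k = 2
centers = emb[rng.choice(len(emb), size=k, replace=False)]
for _ in range(25):
    dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)           # nearest center per cell
    centers = np.vstack([emb[labels == i].mean(axis=0) if np.any(labels == i)
                         else centers[i] for i in range(k)])

print(labels.shape)  # one cluster label per cell: (200,)
```

The same `labels` (or the embedding itself) could then go into pseudotime or marker analysis in place of the 2D coordinates.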

lkmklsmn avatar May 19 '20 17:05 lkmklsmn

> I see your point. If the missing cells are scattered across batches, you would have to assign them to a specific batch called 'missing'. We realize this may not be the most straightforward implementation and will consider changes.

Eh, I can work with that. If I have missing cells across, say, 4 of the 12 batches, would it be better to put the missing cell types into one batch (called 'missing') or four batches ('missing1', 'missing2', etc.)?

> To be honest, we have not systematically evaluated using the multi-dimensional embedding in downstream analysis, as you suggest. I like the idea! We will definitely explore this direction, and I would be very curious to hear more about your experiences. You can certainly set embedding_dims to 30 (I would tend to go lower than the number of input PCs, since they also capture batch variation, which should be removed at this step).

(In my poorly informed opinion) you really should work on positioning TNN/INSCT (which I'm calling "insect" in my head, sorry) as a method to create the corrected, reduced-dimensional space (e.g. the PCA/CCA/MNN equivalent) that is fed into UMAP, clustering, etc.

Otherwise it's a bit janky to have a method that makes the corrected 2D space but doesn't help with the clustering, pseudotime, etc., which are also crucial.

davemcg avatar May 19 '20 17:05 davemcg

One 'missing' batch will do.
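A minimal sketch of that relabelling, with hypothetical column values (the arrays stand in for `adata.obs` columns): every unlabelled cell is moved into the single 'missing' batch, whichever batch it originally came from:

```python
import numpy as np

# Hypothetical per-cell metadata standing in for adata.obs columns:
# six cells across four original batches, some without a cell type label.
batch = np.array(['b1', 'b1', 'b2', 'b3', 'b4', 'b4'], dtype=object)
celltype = np.array(['cones', 'missing', 'rods', 'missing', 'missing', 'cones'],
                    dtype=object)

# All unlabelled cells go into one batch called 'missing', regardless of
# which original batch they came from; that batch name is then passed as
# mask_batch so their labels are ignored during training.
unlabelled = celltype == 'missing'
batch[unlabelled] = 'missing'

print(list(batch))  # ['b1', 'missing', 'b2', 'missing', 'missing', 'b4']
```

With these columns stored back in `adata.obs`, the fit call from above would use `mask_batch='missing'` (argument names as in the Pancreas example quoted earlier).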

Valid point, and I appreciate your advice. I do want to point out that you could use the integrated two-dimensional embedding for downstream analysis (clustering, etc.).

lkmklsmn avatar May 19 '20 18:05 lkmklsmn

You could (and I've seen some prominent methods do that...), but in my experience there's quite a bit of information that exists in the n>2 dimensions.

Thanks for your help!

davemcg avatar May 19 '20 18:05 davemcg