seurat
VlnPlot removes violins below the threshold from the graphical output
Dear Seurat team,
While exploring genes that look quite specific in VlnPlot output, we noticed, when cross-checking against ridge plots, that in some cases violins are dropped from the graphical output if they fall below a threshold. Here is an example violin plot produced with the standard VlnPlot function:
Here is the output from plotting the same gene with ggplot2's geom_violin. As you can see, the violins in groups 1 and 4 look the same, but those in groups 2 and 3 now appear.
Why does VlnPlot cut off the two groups in the middle? What do you think about this potentially misleading visualization?
Hi @vkavaka
Could you post a reproducible example for this VlnPlot issue? You may use pbmc_small, any dataset in SeuratData, or any public data. Thanks.
Dear @yuhanH, thank you for your prompt reply. We created the reproducible example using the pbmc3k dataset. Here is the code:
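library(Seurat)
library(ggplot2)
# pbmc is assumed to be the pbmc3k Seurat object, e.g. pbmc <- SeuratData::LoadData("pbmc3k")
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")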
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
VlnPlot(pbmc, "NKG7", pt.size=0)
vln_df = data.frame(NKG7 = pbmc[["RNA"]]@data["NKG7",], cluster = pbmc$seurat_clusters)
ggplot(vln_df, aes(x = cluster, y = NKG7)) + geom_violin(aes(fill = cluster), trim=TRUE, scale = "width")
Here is the violin plot from VlnPlot:
And the same gene with ggplot2 (as you can see, the violins below the cutoff now appear):
Session info:
R version 4.1.2 (2021-11-01) ggplot2_3.3.5 SeuratData_0.2.1 SeuratObject_4.0.4 Seurat_4.0.6
@yuhanH as a possible reason: we suspect that the noise built into the VlnPlot function leads to the violins being removed from the graphical output. We would be very happy to hear your opinion on this.
Dear @yuhanH, do you have any updates on this? In our opinion, the issue is quite important and could lead to misinterpretation of "specific-looking" results.
Hi @vkavaka, thanks for the reproducible example. I agree that the change in the violin plots is related to the noise.
vln_df = data.frame(NKG7 = pbmc[["RNA"]]@data["NKG7",], cluster = pbmc$seurat_clusters)
noise <- rnorm(n = length(x = vln_df$NKG7)) / 100000
vln_df$NKG7.noise <- vln_df$NKG7 + noise
ggplot(vln_df, aes(x = cluster, y = NKG7)) + geom_violin(aes(fill = cluster), trim=TRUE, scale = "width")
ggplot(vln_df, aes(x = cluster, y = NKG7.noise)) + geom_violin(aes(fill = cluster), trim=TRUE, scale = "width")
You can also see that the noise is very small and only introduces very slight variation in the data.
I am not sure why it has such an effect on the violin shapes. It seems to be an issue related to geom_violin.
But I also agree that it may lead to misinterpretation of specific-looking results. For that reason, it is better to keep showing the data points in the violin plot.
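One possible explanation for why such small noise changes the violins, offered as a sketch rather than a confirmed diagnosis: geom_violin's density stat defaults to the "nrd0" bandwidth rule, and in base R bw.nrd0() is based on min(sd, IQR / 1.34), falling back to sd when the IQR is exactly 0. In a cluster where most cells have zero expression, the raw IQR is 0, but after adding noise of order 1e-5 the IQR becomes tiny and non-zero, so the bandwidth collapses. The data below are simulated (hypothetical, not taken from pbmc3k) and only base R is used.

```r
# Simulated zero-inflated expression for one low-expressing cluster
# (hypothetical data): 85% zeros, 15% positive values.
set.seed(42)
n <- 1000
expr <- c(rep(0, 850), runif(150, min = 1, max = 3))

# Noise on the same scale as in the snippet above: rnorm(n) / 100000
noise <- rnorm(n) / 100000
expr_noise <- expr + noise

# Default rule-of-thumb bandwidth, with and without the noise.
# With exact zeros the IQR is 0 and bw.nrd0() falls back to sd;
# with noise the IQR is tiny but non-zero and drives the bandwidth.
bw_raw <- bw.nrd0(expr)
bw_noise <- bw.nrd0(expr_noise)

cat("IQR raw:", IQR(expr), " IQR with noise:", IQR(expr_noise), "\n")
cat("bandwidth raw:", bw_raw, " bandwidth with noise:", bw_noise, "\n")
```

If this is what happens in the real data, the kernel density for the noisy values is concentrated almost entirely in a very narrow spike near zero, which would make the rest of the violin too thin to see. Setting an explicit bandwidth (the bw argument of ggplot2's stat_ydensity) might sidestep the rule of thumb; whether that restores the violins for this dataset is untested here.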
@yuhanH thank you for your reply and suggestion. Would you still consider keeping the noise in the VlnPlot function? The only clusters affected seem to be the ones with lower expression; the higher ones are completely unchanged.
We are not sure that showing the cell points is the best way to overcome this bias, especially when the object contains a lot of cells. We noticed that below a certain value, the pt.size argument of VlnPlot no longer makes the dots any smaller. Any ideas on how to overcome this limitation and draw the dots even smaller?
Right. When the number of cells is big, you may consider changing the alpha value for the points. For example:
p0 <- VlnPlot(pbmc, "NKG7")
p1 <- VlnPlot(pbmc, "NKG7")
p1$layers[[2]]$aes_params$alpha <- 0.1
p0+p1
We will add this alpha value parameter to VlnPlot soon.
@yuhanH thank you for the hint about the alpha values. And what do you think about the noise? I understand that the developers would not have added it if it were not necessary. But as you can see in this example, it may affect the data visualization. Is there an explanation for why the noise should be kept?
Hi @vkavaka, the distribution of low expression values in the original data appears to fit the dots in the plot less well. For now, we are retaining this noise. However, we remain open to reconsidering and possibly removing it if clear biases emerge as a consequence of this noise.