BISCUIT_SingleCell_IMM_ICML_2016
Error related to .check_tsne_params
Hello!
I am trying BISCUIT on a dataset, but I got the following error:
[1] "Loading Data"
[1] "Calculating the Fiedler vector of the data"
[1] "Ensuring entire data is numeric and then log transforming it"
[1] "numcells is 74"
[1] "numgenes is 9377"
[1] "Number of gene batches is 62"
[1] "Number of gene subbatches is 2"
[1] "Ensuring user-specified data is numeric"
[1] "Computing t-sne projection of the data"
Error in .check_tsne_params(nrow(X), dims = dims, perplexity = perplexity, :
perplexity is too large for the number of samples
In addition: There were 19 warnings (use warnings() to see them)
The data come from Grun et al. 2014.
The parameter settings are as follows:
## 21st Dec 2016
## BISCUIT R implementation
## Start_file with user inputs
##
## Code author SP
###
###
############## packages required ##############
library(MCMCpack)
library(mvtnorm)
library(ellipse)
library(coda)
library(Matrix)
library(Rtsne)
library(gtools)
library(foreach)
library(doParallel)
library(doSNOW)
library(snow)
library(lattice)
library(MASS)
library(bayesm)
library(robustbase)
library(chron)
library(mnormt)
library(schoolmath)
library(RColorBrewer)
#############################################
#input_file_name <- "expression_mRNA_17-Aug-2014.txt";
input_data_tab_delimited <- TRUE; #set to TRUE if the input data is tab-delimited
is_format_genes_cells <- TRUE; #set to TRUE if input data has rows as genes and columns as cells
#choose_cells <- 3000; #comment if you want all the cells to be considered
#choose_genes <- 150; #comment if you want all the genes to be considered
gene_batch <- 150; #number of genes per batch, therefore num_batches = choose_genes (or numgenes)/gene_batch. Max value is 150
num_iter <- 5; #number of iterations, choose based on data size.
num_cores <- detectCores() - 4; #number of cores for parallel processing. Ensure that detectCores() > 1 for parallel processing to work, else set num_cores to 1.
z_true_labels_avl <- FALSE; #set this to TRUE if the true labels of cells are available, else set it to FALSE. If TRUE, ensure to populate 'z_true' with the true labels in 'BISCUIT_process_data.R'
num_cells_batch <- 1000; #set this to 1000 if input number of cells is in the 1000s, else set it to 100.
alpha <- 0.1; #DPMM dispersion parameter. A higher value spins more clusters whereas a lower value spins lesser clusters.
#output_folder_name <- "output"; #give a name for your output folder.
## call BISCUIT
source("BISCUIT_main.R")
The data I used can be found here: https://github.com/WT215/Raw_data (Grun_2i.txt)
Thank you for your help!
Best wishes, Wenhao
Hi Wenhao,
You only have 74 cells, and Rtsne() defaults to a perplexity of 30, which is meant for larger numbers of cells. The error is thrown by Rtsne(), so you would need to reduce the perplexity to 10 or 1. Also set num_cells_batch <- 100; in the start file. I could not access the GitHub link to the data you used.
Let me know if this helps.
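To see why 74 cells trip the check, here is a minimal sketch. It assumes that Rtsne's sample-size check is `nsamples - 1 >= 3 * perplexity`, which is my reading of the Rtsne source and may differ between package versions. The helper name `max_valid_perplexity` is made up for this illustration.

```r
# Largest perplexity that passes the sample-size check, ASSUMING the check
# in Rtsne is: nsamples - 1 >= 3 * perplexity.
# max_valid_perplexity is a hypothetical helper, not part of Rtsne or BISCUIT.
max_valid_perplexity <- function(n_samples) {
  floor((n_samples - 1) / 3)
}

n_cells <- 74             # "numcells is 74" in the log above
default_perplexity <- 30  # Rtsne() default

# Does the default perplexity pass the assumed check for 74 cells?
passes_default <- (n_cells - 1) >= 3 * default_perplexity

# Largest value that would pass for this dataset
largest_ok <- max_valid_perplexity(n_cells)

print(passes_default)
print(largest_ok)
```

Under that assumption, the suggested values of 10 or 1 are both comfortably inside the allowed range for 74 cells.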
Hi,
Thank you for your reply!
I set num_cells_batch <- 100 and then reran the code, but I still get the same error.
Do I also need to modify code in other R files?
I have updated the link to the data.
Thank you very much!
Best wishes, Wenhao
In https://github.com/sandhya212/BISCUIT_SingleCell_IMM_ICML_2016/blob/master/BISCUIT_process_data.R, lines 214 and 225, add perplexity = 10 (or 1) as an argument to the Rtsne() calls. See the Rtsne options here: https://www.rdocumentation.org/packages/Rtsne/versions/0.15/topics/Rtsne
My bigger concern is statistical: you are clustering a highly sparse matrix in which #cells <<< #genes. Any clustering method will give you an answer; the question is how much you can trust the learnt pattern given such a skewed dataset.
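For reference, a call with an explicit perplexity might look like the sketch below. The matrix `X_demo` is random stand-in data, not the BISCUIT input, and the other arguments on lines 214 and 225 of BISCUIT_process_data.R are not reproduced here, so treat this only as an illustration of where the `perplexity` argument goes.

```r
set.seed(1)

# Stand-in data: 74 rows (cells) x 50 columns. In BISCUIT the input would be
# whatever matrix BISCUIT_process_data.R already passes to Rtsne().
X_demo <- matrix(rnorm(74 * 50), nrow = 74)

# Value suggested in this thread for the 74-cell dataset
tsne_perplexity <- 10

# Only run the projection if the Rtsne package is installed
if (requireNamespace("Rtsne", quietly = TRUE)) {
  fit <- Rtsne::Rtsne(X_demo, dims = 2, perplexity = tsne_perplexity,
                      check_duplicates = FALSE)
  # fit$Y holds the low-dimensional coordinates, one row per cell
  print(dim(fit$Y))
}
```

The only change needed in the existing calls should be the added `perplexity = ...` argument; everything else can stay as it is.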
Hi,
So I tried a larger dataset, Tung et al. 2017, which is also stored in https://github.com/WT215/Raw_data (Tung.txt).
There are around 500 cells, so I set num_cells_batch <- 1000.
I got the following error:
[1] "Loading Data"
[1] "Calculating the Fiedler vector of the data"
[1] "Ensuring entire data is numeric and then log transforming it"
[1] "numcells is 564"
[1] "numgenes is 13058"
[1] "Number of gene batches is 261"
[1] "Number of gene subbatches is 9"
[1] "Ensuring user-specified data is numeric"
[1] "Computing t-sne projection of the data"
[1] "Monitor log.txt and outputs/plots/ folder for outputs"
[1] "floor(num_gene_batches/num_gene_sub_batches): 29"
[1] "MCMC begins"
[1] "Begin parallel processing of gene splits"
[1] "Beginning of batch 1"
[1] "End of batch 1"
[1] "Beginning of batch 2"
[1] "End of batch 2"
[1] "Beginning of batch 3"
[1] "End of batch 3"
[1] "Beginning of batch 4"
[1] "End of batch 4"
[1] "Beginning of batch 5"
[1] "End of batch 5"
[1] "Beginning of batch 6"
[1] "End of batch 6"
[1] "Beginning of batch 7"
[1] "End of batch 7"
[1] "Beginning of batch 8"
[1] "End of batch 8"
[1] "Beginning of batch 9"
[1] "End of batch 9"
[1] "Beginning of batch 10"
[1] "End of batch 10"
[1] "Beginning of batch 11"
[1] "End of batch 11"
[1] "Beginning of batch 12"
[1] "End of batch 12"
[1] "Beginning of batch 13"
[1] "End of batch 13"
[1] "Beginning of batch 14"
[1] "End of batch 14"
[1] "Beginning of batch 15"
[1] "End of batch 15"
[1] "Beginning of batch 16"
[1] "End of batch 16"
[1] "Beginning of batch 17"
[1] "End of batch 17"
[1] "Beginning of batch 18"
[1] "End of batch 18"
[1] "Beginning of batch 19"
[1] "End of batch 19"
[1] "Beginning of batch 20"
[1] "End of batch 20"
[1] "Beginning of batch 21"
[1] "End of batch 21"
[1] "Beginning of batch 22"
[1] "End of batch 22"
[1] "Beginning of batch 23"
[1] "End of batch 23"
[1] "Beginning of batch 24"
[1] "End of batch 24"
[1] "Beginning of batch 25"
[1] "End of batch 25"
[1] "Beginning of batch 26"
[1] "End of batch 26"
[1] "Beginning of batch 27"
[1] "End of batch 27"
[1] "Beginning of batch 28"
[1] "End of batch 28"
[1] "Beginning of batch 29"
[1] "End of batch 29"
[1] "End of parallel runs"
Time difference of 8.34391 mins
[1] "Merging gene splits"
[1] "Computing the global confusion matrix"
[1] "Monitor log_CM.txt in outputs folder and debug_CM.txt"
Error in { : task 2 failed - "下标出界" (in English: "subscript out of bounds")
Thank you for your help!
- Set num_cells_batch <- 100, since you still have fewer than 1000 cells.
- What does the message after 'task 2 failed -' say?
- Can you delete the debug files and run again as a fresh instance, preferably with a smaller number of genes (like 2000), just to check that the code runs to completion?
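One way to try a smaller number of genes is to subset the input matrix before running BISCUIT. The sketch below keeps the most variable genes; that selection rule is an assumption made for illustration, not BISCUIT's own method, and `subset_top_variable_genes` and the output file name are hypothetical. The start file also has a commented-out `choose_genes` setting that may serve the same purpose.

```r
# Hypothetical helper: keep the n_keep genes with the highest variance.
# expr is assumed to be a numeric genes x cells matrix with gene row names.
subset_top_variable_genes <- function(expr, n_keep = 2000) {
  n_keep <- min(n_keep, nrow(expr))
  gene_var <- apply(expr, 1, var)  # per-gene variance across cells
  keep <- order(gene_var, decreasing = TRUE)[seq_len(n_keep)]
  expr[sort(keep), , drop = FALSE]  # preserve the original gene order
}

# Stand-in count data: 5000 genes x 20 cells
set.seed(42)
expr_demo <- matrix(rpois(5000 * 20, lambda = 2), nrow = 5000,
                    dimnames = list(paste0("gene", 1:5000),
                                    paste0("cell", 1:20)))

expr_small <- subset_top_variable_genes(expr_demo, n_keep = 2000)
print(dim(expr_small))

# To feed the subset to BISCUIT, it could be written as tab-delimited text:
# write.table(expr_small, "subset_2000_genes.txt", sep = "\t", quote = FALSE)
```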
Hi, for the Tung dataset I set num_cells_batch <- 100 but got the error:
Error in { : task 3 failed - "无法分配大小为1.3 Gb的矢量"
which means "cannot allocate vector of size 1.3 Gb".
When I tried it on a smaller subset of the data, with 2000 genes, I got the error:
Error in { : task 1 failed - "无法分配大小为30.5 Mb的矢量"
which means "cannot allocate vector of size 30.5 Mb".
Then I reduced the dataset to 1000 genes and it ran fine. How can I apply BISCUIT to the whole Tung dataset (13058 genes × 564 cells)?
Do I have to run it on a cluster?
Thanks a lot!
Yes, we have run Biscuit on AWS clusters.
Then how can I estimate in advance how much memory to allocate for a dataset like Tung et al.?
Thanks a lot!
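A rough back-of-the-envelope check is possible from the error messages themselves. A dense numeric matrix in R takes 8 bytes per entry, and the two failed allocations (1.3 Gb for 13058 genes, 30.5 Mb for 2000 genes) are consistent with a single genes × genes matrix of doubles. That is only an inference from the reported sizes, not a statement about what BISCUIT allocates internally, and the helper names below are made up for this sketch.

```r
# Bytes needed for one dense numeric (double) matrix: 8 bytes per entry.
# dense_matrix_bytes, to_mib and to_gib are hypothetical helpers.
dense_matrix_bytes <- function(n_rows, n_cols) {
  as.numeric(n_rows) * as.numeric(n_cols) * 8
}
to_mib <- function(bytes) bytes / 1024^2
to_gib <- function(bytes) bytes / 1024^3

n_genes <- 13058  # "numgenes is 13058" in the Tung log
n_cells <- 564    # "numcells is 564"

# The expression matrix itself (genes x cells) is small
expr_mib <- to_mib(dense_matrix_bytes(n_genes, n_cells))

# A genes x genes matrix is what grows quickly
gene_sq_full_gib <- to_gib(dense_matrix_bytes(n_genes, n_genes))
gene_sq_2000_mib <- to_mib(dense_matrix_bytes(2000, 2000))

print(round(expr_mib, 1))
print(round(gene_sq_full_gib, 1))  # compare with the "1.3 Gb" error
print(round(gene_sq_2000_mib, 1))  # compare with the "30.5 Mb" error
```

If that reading is right, memory grows with the square of the number of genes. Each parallel worker is a separate R process that may hold its own copies of such objects, so the total would scale roughly with num_cores as well; lowering num_cores trades run time for a smaller memory footprint.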