monocle3 icon indicating copy to clipboard operation
monocle3 copied to clipboard

all FAIL when running fit_model

Open SBata opened this issue 3 years ago • 2 comments

hi all

I ran monocle3 passing a Seurat object following the instructions here (https://satijalab.org/signac/articles/monocle.html).

then I wanted to run the DEG analysis but the returned data.frame looks as follows:

> gene_fits
# A tibble: 17,541 x 10
   vst.mean vst.variance vst.variance.expect… vst.variance.standard… vst.variable num_cells_express… gene_id     model     model_summary status
      <dbl>        <dbl>                <dbl>                  <dbl> <lgl>                     <int> <chr>       <named l> <named list>  <chr> 
 1 0.586        0.704                0.890                     0.791 FALSE                      1929 MRPL15      <speedgl… <smmry.sp>    FAIL  
 2 0.321        0.349                0.423                     0.824 FALSE                      1230 LYPLA1      <speedgl… <smmry.sp>    FAIL  
 3 0.000186     0.000186             0.000186                  1.00  FALSE                         1 GM37988     <speedgl… <smmry.sp>    FAIL  
 4 0.581        0.741                0.880                     0.842 FALSE                      1848 TCEA1       <speedgl… <smmry.sp>    FAIL  
 5 0.461        0.581                0.658                     0.882 FALSE                      1569 ATP6V1H     <speedgl… <smmry.sp>    FAIL  
 6 0.320        0.408                0.422                     0.968 FALSE                      1080 RB1CC1      <speedgl… <smmry.sp>    FAIL  
 7 0.0359       0.0387               0.0452                    0.856 FALSE                       142 4732440D04… <speedgl… <smmry.sp>    FAIL  
 8 0            0                    0                         0     FALSE                         0 ALKAL1      <speedgl… <smmry.sp>    FAIL  
 9 0.0102       0.0109               0.0130                    0.838 FALSE                        44 ST18        <speedgl… <smmry.sp>    FAIL  
10 0.274        0.363                0.352                     1.03  FALSE                       926 PCMTD1      <speedgl… <smmry.sp>    FAIL 

where all status are FAIL

the code I used is the following:

sc_time <- subset(sc_recluster, cells = which([email protected]$seurat_clusters %in% c(0,1,2,4,6)))
DefaultAssay(sc_time) <- "RNA"
sc_time <- as.cell_data_set(sc_time)

sc_time <- preprocess_cds(sc_time)
plot_pc_variance_explained(sc_time)
sc_time <- reduce_dimension(sc_time,reduction_method = "UMAP")
sc_time <- cluster_cells(sc_time)
sc_time <- learn_graph(sc_time)
sc_time <- order_cells(sc_time)
# sc_time@principal_graph_aux@listData$UMAP$pseudotime
gene_fits <- fit_models(sc_time, model_formula_str = "~ pseudotime", cores=8)

also, I don't seem to have any infinite or NA values

> which(is.infinite(as.numeric(sc_time@principal_graph_aux@listData$UMAP$pseudotime)))
integer(0)
> which(is.na(as.numeric(sc_time@principal_graph_aux@listData$UMAP$pseudotime)))
integer(0)

when I look at the p.value for all models:

pvs <- lapply(gene_fits$model, function(xx){
  pv <- summary(xx)$coefficient[["Pr(>|t|)"]][[2]]
  return(pv)
})
table(unlist(pvs))
0.999999999825233 
            17541 

looks like they are all ~1

if I look at the model for the first gene:

> gene_fits$model$MRPL15
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = model_formula, family = stats::quasipoisson(),      model = FALSE, y = FALSE, acc = 0.001) 

Coefficients:
(Intercept)   pseudotime  
 -2.730e+01    1.962e-12  

and its summary:

> summary(gene_fits$model$MRPL15)
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = model_formula, family = stats::quasipoisson(),      model = FALSE, y = FALSE, acc = 0.001) 

Coefficients:
 ------------------------------------------------------------------ 
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.730e+01   0.039007  -699.9    0e+00 ***
pseudotime   1.962e-12   0.008958     0.0    1e+00    

------------------------------------------------------------------- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

---
null df: 4446; null deviance: 0;
residuals df: 4445; residuals deviance: 0;
# obs.: 4447; # non-zero weighted obs.: 4447;
AIC: NA; log Likelihood: NA;
RSS: 0; dispersion: 3.776833e-12; iterations: 25;
rank: 2; max tolerance: 2.12e-07; convergence: FALSE.

it did run although it was not significant.

I tried to run it separately as:

> glm_time <- speedglm::speedglm(to_plot$MRPL15 ~ to_plot$Pseudotime, 
+                 family = stats::quasipoisson(), 
+                 model = FALSE, 
+                 y = FALSE, 
+                 acc = 0.001)
> summary(glm_time)
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = to_plot$MRPL15 ~ to_plot$Pseudotime,      family = stats::quasipoisson(), model = FALSE, y = FALSE,      acc = 0.001) 

Coefficients:
 ------------------------------------------------------------------ 
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.91603   0.010297  -88.96  0.0e+00 ***
to_plot$Pseudotime -0.04808   0.002431  -19.78  1.7e-83 ***

------------------------------------------------------------------- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

---
null df: 4446; null deviance: 131.98;
residuals df: 4445; residuals deviance: 122.33;
# obs.: 4447; # non-zero weighted obs.: 4447;
AIC: NA; log Likelihood: NA;
RSS: 110.3; dispersion: 0.02482102; iterations: 4;
rank: 2; max tolerance: 1.81e-11; convergence: TRUE.
> glm_time
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = to_plot$MRPL15 ~ to_plot$Pseudotime,      family = stats::quasipoisson(), model = FALSE, y = FALSE,      acc = 0.001) 

Coefficients:
       (Intercept)  to_plot$Pseudotime  
          -0.91603            -0.04808 

here it is marked as significant...not sure why thou.

any suggestion would be great, thank you

SBata avatar Apr 20 '21 13:04 SBata

Hey I just ran into a very similar problem. I used the as.cell_data_set function from the SeuratWrappers package to convert my Seurat object to a cell data set for pseudotime analysis. When I ran monocle3::fit_models(my_cds, model_formula_str = "~pseudotime"), all the p-values were either 0 or 1, and all the q-values were 1.

I had a hunch that as.cell_data_set maybe doesn't put the gene expression matrix in the slot where fit_models was looking for expression data. I don't know what the problem is exactly, but I found a workaround that seems to work. So instead of as.cell_data_set, I converted the Seruat object with the following code snippet:


# normalized expression matrix from Seurat object
expr_matrix  <- Seurat::GetAssayData(my_seurat, assay='RNA', slot='data')

# create cell dataset, don't worry about the warning
my_cds <- monocle3::new_cell_data_set(
  expr_matrix,
  [email protected]
)

# process dataset 
my_cds <- cluster_cells(cds=my_cds, reduction_method='UMAP')
my_cds <- learn_graph(my_cds, use_partition=TRUE)

# use the helper function in the monocle3 documentation
# https://cole-trapnell-lab.github.io/monocle3/docs/trajectories/
principal_node <- get_earliest_principal_node(my_cds,partition = 1)

# order cells in pseudotime 
my_cds <- order_cells(my_cds,root_pr_nodes = principal_node)

# finally run fit_models 
gene_fits <- monocle3::fit_models(my_cds, model_formula_str = "~pseudotime")
fit_coefs <- coefficient_table(gene_fits)
fit_coefs$term
fit_coefs %>% filter(term == "pseudotime" & q_value <= 0.05) %>%
      select(gene_id, term, q_value, estimate) 

I hope that this workaround helps you or somebody else! I still don't really know what is wrong with using as.cell_data_set.

smorabit avatar Jun 03 '21 00:06 smorabit

hi all

I ran monocle3 passing a Seurat object following the instructions here (https://satijalab.org/signac/articles/monocle.html).

then I wanted to run the DEG analysis but the returned data.frame looks as follows:

> gene_fits
# A tibble: 17,541 x 10
   vst.mean vst.variance vst.variance.expect… vst.variance.standard… vst.variable num_cells_express… gene_id     model     model_summary status
      <dbl>        <dbl>                <dbl>                  <dbl> <lgl>                     <int> <chr>       <named l> <named list>  <chr> 
 1 0.586        0.704                0.890                     0.791 FALSE                      1929 MRPL15      <speedgl… <smmry.sp>    FAIL  
 2 0.321        0.349                0.423                     0.824 FALSE                      1230 LYPLA1      <speedgl… <smmry.sp>    FAIL  
 3 0.000186     0.000186             0.000186                  1.00  FALSE                         1 GM37988     <speedgl… <smmry.sp>    FAIL  
 4 0.581        0.741                0.880                     0.842 FALSE                      1848 TCEA1       <speedgl… <smmry.sp>    FAIL  
 5 0.461        0.581                0.658                     0.882 FALSE                      1569 ATP6V1H     <speedgl… <smmry.sp>    FAIL  
 6 0.320        0.408                0.422                     0.968 FALSE                      1080 RB1CC1      <speedgl… <smmry.sp>    FAIL  
 7 0.0359       0.0387               0.0452                    0.856 FALSE                       142 4732440D04… <speedgl… <smmry.sp>    FAIL  
 8 0            0                    0                         0     FALSE                         0 ALKAL1      <speedgl… <smmry.sp>    FAIL  
 9 0.0102       0.0109               0.0130                    0.838 FALSE                        44 ST18        <speedgl… <smmry.sp>    FAIL  
10 0.274        0.363                0.352                     1.03  FALSE                       926 PCMTD1      <speedgl… <smmry.sp>    FAIL 

where all status are FAIL

the code I used is the following:

sc_time <- subset(sc_recluster, cells = which([email protected]$seurat_clusters %in% c(0,1,2,4,6)))
DefaultAssay(sc_time) <- "RNA"
sc_time <- as.cell_data_set(sc_time)

sc_time <- preprocess_cds(sc_time)
plot_pc_variance_explained(sc_time)
sc_time <- reduce_dimension(sc_time,reduction_method = "UMAP")
sc_time <- cluster_cells(sc_time)
sc_time <- learn_graph(sc_time)
sc_time <- order_cells(sc_time)
# sc_time@principal_graph_aux@listData$UMAP$pseudotime
gene_fits <- fit_models(sc_time, model_formula_str = "~ pseudotime", cores=8)

also, I don't seem to have any infinite or NA values

> which(is.infinite(as.numeric(sc_time@principal_graph_aux@listData$UMAP$pseudotime)))
integer(0)
> which(is.na(as.numeric(sc_time@principal_graph_aux@listData$UMAP$pseudotime)))
integer(0)

when I look at the p.value for all models:

pvs <- lapply(gene_fits$model, function(xx){
  pv <- summary(xx)$coefficient[["Pr(>|t|)"]][[2]]
  return(pv)
})
table(unlist(pvs))
0.999999999825233 
            17541 

looks like they are all ~1

if I look at the model for the first gene:

> gene_fits$model$MRPL15
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = model_formula, family = stats::quasipoisson(),      model = FALSE, y = FALSE, acc = 0.001) 

Coefficients:
(Intercept)   pseudotime  
 -2.730e+01    1.962e-12  

and its summary:

> summary(gene_fits$model$MRPL15)
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = model_formula, family = stats::quasipoisson(),      model = FALSE, y = FALSE, acc = 0.001) 

Coefficients:
 ------------------------------------------------------------------ 
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.730e+01   0.039007  -699.9    0e+00 ***
pseudotime   1.962e-12   0.008958     0.0    1e+00    

------------------------------------------------------------------- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

---
null df: 4446; null deviance: 0;
residuals df: 4445; residuals deviance: 0;
# obs.: 4447; # non-zero weighted obs.: 4447;
AIC: NA; log Likelihood: NA;
RSS: 0; dispersion: 3.776833e-12; iterations: 25;
rank: 2; max tolerance: 2.12e-07; convergence: FALSE.

it did run although it was not significant.

I tried to run it separately as:

> glm_time <- speedglm::speedglm(to_plot$MRPL15 ~ to_plot$Pseudotime, 
+                 family = stats::quasipoisson(), 
+                 model = FALSE, 
+                 y = FALSE, 
+                 acc = 0.001)
> summary(glm_time)
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = to_plot$MRPL15 ~ to_plot$Pseudotime,      family = stats::quasipoisson(), model = FALSE, y = FALSE,      acc = 0.001) 

Coefficients:
 ------------------------------------------------------------------ 
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.91603   0.010297  -88.96  0.0e+00 ***
to_plot$Pseudotime -0.04808   0.002431  -19.78  1.7e-83 ***

------------------------------------------------------------------- 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

---
null df: 4446; null deviance: 131.98;
residuals df: 4445; residuals deviance: 122.33;
# obs.: 4447; # non-zero weighted obs.: 4447;
AIC: NA; log Likelihood: NA;
RSS: 110.3; dispersion: 0.02482102; iterations: 4;
rank: 2; max tolerance: 1.81e-11; convergence: TRUE.
> glm_time
Generalized Linear Model of class 'speedglm':

Call:  speedglm::speedglm(formula = to_plot$MRPL15 ~ to_plot$Pseudotime,      family = stats::quasipoisson(), model = FALSE, y = FALSE,      acc = 0.001) 

Coefficients:
       (Intercept)  to_plot$Pseudotime  
          -0.91603            -0.04808 

here it is marked as significant...not sure why thou.

any suggestion would be great, thank you

same problem here, have you ever figure it out?

vicscott avatar Mar 30 '22 13:03 vicscott