“All models failed” with spatial_block_cv() + tune_grid() — “arguments imply differing number of rows: 0, 1” error

Open andrestyle16 opened this issue 1 year ago • 1 comments

Brief description of the problem

I am getting an error when using spatial_block_cv() from {spatialsample} together with tune_grid() from {tidymodels} to perform spatial cross-validation on my dataset. The same dataset and modeling approach work fine with a standard vfold_cv(), but with spatial_block_cv() every fold fails with this error message:

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 0, 1
Error in estimate_tune_results():
! All models failed. Run show_notes(.Last.tune.result) for more information.

I have verified the following:

No empty folds: I checked analysis() and assessment() sets in each fold; they all have >0 rows and include both classes (0 and 1).
No recipe issues: I removed all recipe steps (including step_corr() and step_zv()), or even removed the recipe entirely, and the error persists.
Simple model: I tested a non-tunable rand_forest model with fit_resamples() (i.e., no hyperparameter grid) and still see the same failure.
vfold_cv() works: If I switch from spatial_block_cv() to vfold_cv(), the model + data run successfully through tune_grid() or fit_resamples() with no errors.
Indices look correct: The number of rows in analysis() and assessment() is consistent across folds, and each set includes both humedal == 0 and humedal == 1, so there are no single-class or 0-row subsets.
Tried reducing folds / radius: For instance, v=3 or radius=50 instead of v=5 and radius=100. The same error arises.
Tried removing geometry: I used a typical approach of reassigning splits with make_splits() to remove geometry from each fold’s analysis/assessment sets, and forced class(...) <- class(folds_spatial). The error persists.
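
The fold checks listed above could be scripted roughly as follows. This is a minimal sketch on simulated data, not the code actually used for this report: `sim` is a hypothetical stand-in for the real dataset, `fold_summary()` is a hypothetical helper, and vfold_cv() is used only to keep the example free of spatial dependencies. The helper relies only on analysis() and assessment(), so it should also accept the folds returned by spatial_block_cv().

```r
library(rsample)

# Simulated stand-in for the real data (hypothetical, not the dataset from this issue)
set.seed(1996)
sim <- data.frame(
  ndvi    = rnorm(200),
  humedal = factor(rep(c(0, 1), each = 100))
)
folds <- vfold_cv(sim, v = 5, strata = humedal)

# Hypothetical helper: one row per fold with the size of each set and the
# number of distinct outcome classes present in it
fold_summary <- function(folds, outcome = "humedal") {
  rows <- lapply(seq_along(folds$splits), function(i) {
    ana <- analysis(folds$splits[[i]])
    ass <- assessment(folds$splits[[i]])
    data.frame(
      fold               = i,
      n_analysis         = nrow(ana),
      n_assessment       = nrow(ass),
      classes_analysis   = length(unique(ana[[outcome]])),
      classes_assessment = length(unique(ass[[outcome]]))
    )
  })
  do.call(rbind, rows)
}

fold_summary(folds)
```

A fold with 0 rows or with only one class in either set would show up directly in this table.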

Repro steps and partial code

Below is a simplified version of my workflow:

library(tidymodels)
library(spatialsample)
library(sf)
library(terra)
library(dplyr)
library(purrr)

# Example: I have ~827 data points (presence/absence)
# plus 4 raster-based predictors: ndvi, mndwi, pendiente, ti
# My dataset is an sf object with geometry.

# 1) I create folds:
set.seed(1996)
folds_spatial <- spatial_block_cv(
  data   = my_data_sf,   # ~827 points
  v      = 5,
  radius = 100
)

# 2) Drop geometry in each split:
my_data_nogeo <- st_drop_geometry(my_data_sf)

folds_spatial_nogeo <- folds_spatial %>%
  mutate(
    splits = map(splits, function(s) {
      i_ana <- s$in_id
      i_ass <- s$out_id
      
      rsample::make_splits(
        x     = list(analysis = i_ana, assessment = i_ass),
        data  = my_data_nogeo,
        class = "spatial_block_split"
      )
    })
  )
# Restore class
class(folds_spatial_nogeo) <- class(folds_spatial)

# 3) Model specification
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger", probability = TRUE)

my_wf <- workflow() %>%
  add_model(rf_spec)
  # (Sometimes I add a recipe, or none.)

set.seed(1996)
res <- fit_resamples(
  my_wf,
  resamples = folds_spatial_nogeo
)
# -> Fails with:
# Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 1
# All models failed.

I also tried tune_grid() with a small grid of mtry and min_n and got the same result (All models failed).
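
One way to narrow down where a failure like this occurs is to fit and predict on each split by hand and capture the error message, instead of going through fit_resamples(). The following is a minimal sketch under stated assumptions: `sim` is simulated data rather than my dataset, `check_split()` is a hypothetical helper, vfold_cv() stands in for the spatial folds, and logistic_reg() replaces the ranger model to keep dependencies small.

```r
library(tidymodels)

# Simulated stand-in for the real data (hypothetical, not the dataset from this issue)
set.seed(1996)
n <- 200
sim <- tibble(
  ndvi    = rnorm(n),
  mndwi   = rnorm(n),
  humedal = factor(rbinom(n, 1, plogis(ndvi)), levels = c(0, 1))
)
folds <- vfold_cv(sim, v = 3, strata = humedal)

wf <- workflow() %>%
  add_formula(humedal ~ ndvi + mndwi) %>%
  add_model(logistic_reg())

# Hypothetical helper: fit on the analysis set, predict on the assessment set,
# and return either a short success message or the error text
check_split <- function(split, wf) {
  tryCatch({
    wf_fit <- fit(wf, data = analysis(split))
    preds  <- predict(wf_fit, new_data = assessment(split), type = "prob")
    paste("ok:", nrow(preds), "predictions")
  }, error = function(e) conditionMessage(e))
}

results <- vapply(folds$splits, check_split, character(1), wf = wf)
print(results)
```

Swapping `folds` for the spatial folds and `wf` for the real workflow would show whether the failure comes from fitting or from predicting. The error message itself also points to show_notes(.Last.tune.result) for the per-model notes.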

Observations / Diagnostics

If I switch to vfold_cv(my_data_nogeo, v=5, strata=humedal), everything works.
The dataset is not huge, but I do have enough rows in each fold (I double-checked with a for loop printing nrow(analysis(...)), nrow(assessment(...)) and the distribution of humedal).
Reducing to v=2 or v=3, or changing radius from 100 to 50, did not help.
Removing any recipe steps or hyperparameter tuning also did not help.
My sessionInfo() is below.

Session Info

# Please see below:
sessionInfo()
# or sessioninfo::session_info()

R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Santiago
tzcode source: internal

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mapview_2.11.2      spatialsample_0.6.0 vip_0.4.1           DALEX_2.4.3         GGally_2.2.1       
 [6] corrplot_0.95       doParallel_1.0.17   iterators_1.0.14    foreach_1.5.2       ranger_0.17.0      
[11] yardstick_1.3.1     workflowsets_1.1.0  workflows_1.1.4     tune_1.2.1          rsample_1.2.1      
[16] recipes_1.1.0       parsnip_1.2.1       modeldata_1.4.0     infer_1.0.7         dials_1.3.0        
[21] scales_1.3.0        broom_1.0.7         tidymodels_1.2.0    janitor_2.2.1       here_1.0.1         
[26] terra_1.8-5         sf_1.0-19           lubridate_1.9.4     forcats_1.0.0       stringr_1.5.1      
[31] dplyr_1.1.4         purrr_1.0.2         readr_2.1.5         tidyr_1.3.1         tibble_3.2.1       
[36] ggplot2_3.5.1       tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] DBI_1.2.3           rlang_1.1.4         magrittr_2.0.3      snakecase_0.11.1    furrr_0.3.1        
 [6] e1071_1.7-16        compiler_4.4.1      png_0.1-8           vctrs_0.6.5         lhs_1.2.0          
[11] fastmap_1.2.0       pkgconfig_2.0.3     backports_1.5.0     leafem_0.2.3        utf8_1.2.4         
[16] prodlim_2024.06.25  tzdb_0.4.0          satellite_1.0.5     xfun_0.49           R6_2.5.1           
[21] stringi_1.8.4       RColorBrewer_1.1-3  parallelly_1.41.0   rpart_4.1.23        Rcpp_1.0.13-1      
[26] knitr_1.49          future.apply_1.11.3 base64enc_0.1-3     Matrix_1.7-0        splines_4.4.1      
[31] nnet_7.3-19         timechange_0.3.0    tidyselect_1.2.1    rstudioapi_0.17.1   timeDate_4041.110  
[36] codetools_0.2-20    listenv_0.9.1       lattice_0.22-6      plyr_1.8.9          withr_3.0.2        
[41] evaluate_1.0.1      future_1.34.0       survival_3.6-4      ggstats_0.7.0       units_0.8-5        
[46] proxy_0.4-27        pillar_1.10.0       rsconnect_1.3.3     KernSmooth_2.23-24  stats4_4.4.1       
[51] generics_0.1.3      sp_2.1-4            rprojroot_2.0.4     hms_1.1.3           munsell_0.5.1      
[56] globals_0.16.3      class_7.3-22        glue_1.8.0          tools_4.4.1         data.table_1.16.4  
[61] gower_1.0.2         grid_4.4.1          crosstalk_1.2.1     ipred_0.9-15        colorspace_2.1-1   
[66] raster_3.6-30       cli_3.6.3           DiceDesign_1.10     lava_1.8.0          gtable_0.3.6       
[71] GPfit_1.0-8         digest_0.6.37       classInt_0.4-10     farver_2.1.2        htmlwidgets_1.6.4  
[76] htmltools_0.5.8.1   leaflet_2.2.2       lifecycle_1.0.4     hardhat_1.4.0       MASS_7.3-60.2

Any guidance would be greatly appreciated! I suspect either:

A subtle bug or mismatch in how spatial_block_cv() interacts with analysis()/assessment() inside tidymodels, or
Some unknown configuration in my environment that leads to arguments imply differing number of rows: 0, 1.

Thank you for looking into this!

Dec 26 '24 14:12 andrestyle16

Thank you for the issue! I don't have access to your my_data_sf object. Are you able to reproduce this issue with a reprex (reproducible example), using publicly available or simulated data? A reprex will help me troubleshoot and fix your issue more quickly. 🙂
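
A reprex along these lines might look like the following minimal sketch. Everything in it is an assumption rather than the original setup: the points and predictors are simulated, the CRS (EPSG:32719) is an arbitrary projected choice, and logistic_reg() stands in for the ranger forest to keep dependencies small. It passes the sf-based folds from spatial_block_cv() straight to fit_resamples(), without any manual make_splits() step.

```r
library(tidymodels)
library(spatialsample)
library(sf)

# Simulated presence/absence points (hypothetical, not the poster's data)
set.seed(1996)
n <- 300
sim <- data.frame(
  x         = runif(n, 0, 10000),
  y         = runif(n, 0, 10000),
  ndvi      = rnorm(n),
  mndwi     = rnorm(n),
  pendiente = runif(n, 0, 45),
  ti        = rnorm(n)
)
sim$humedal <- factor(
  rbinom(n, 1, plogis(sim$ndvi + sim$mndwi)),
  levels = c(0, 1)
)

# Assumed projected CRS, chosen only so that distances are in metres
sim_sf <- st_as_sf(sim, coords = c("x", "y"), crs = 32719)

folds <- spatial_block_cv(sim_sf, v = 5)

wf <- workflow() %>%
  add_formula(humedal ~ ndvi + mndwi + pendiente + ti) %>%
  add_model(logistic_reg())

res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)
```

If a version like this runs cleanly while the original workflow fails, adding the original pieces back one at a time (the radius argument, the ranger model, the geometry-dropping step) should help isolate the trigger.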

That said, this looks like it may live more cozily in spatialsample rather than rules, so I will transfer this issue to that repository. The issues you're seeing may be due to the manual transformations you've labeled step 2).

Jan 04 '25 20:01 simonpcouch