
Figure 20.8 not working

Open ale-ch opened this issue 3 years ago • 2 comments

The following code:

set.seed(123)

fviz_nbclust(
  ames_1hot_scaled, 
  kmeans, 
  method = "wss", 
  k.max = 25, 
  verbose = FALSE
)

Returns:

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

My environment is: R version 4.0.5 (2021-03-31), factoextra 1.0.7, AmesHousing 0.0.4, caret 6.0.86, dplyr 1.0.5

ale-ch avatar May 16 '21 12:05 ale-ch

I think the reason is that the scaling step produced NaN values (I don't know why it does this). Before running this code, run the following:

ames_1hot_scaled[, "Neighborhood.Hayden_Lake"] <- 0

Then it will run without the error.
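If you would rather not hard-code the column name, the affected columns can be located first. A minimal sketch on a toy matrix, where `x` and its column names are hypothetical stand-ins for `ames_1hot_scaled`:

```r
# Toy stand-in for ames_1hot_scaled (hypothetical data): column "b" is all NaN
x <- cbind(a = c(-1, 0, 1), b = c(NaN, NaN, NaN), c = c(1, -1, 0))

# Names of the columns that contain at least one NA/NaN value
bad_cols <- colnames(x)[colSums(is.na(x)) > 0]
bad_cols

# Same workaround as above, applied to every affected column
x[, bad_cols] <- 0
anyNA(x)
```

Setting the columns to 0 keeps the matrix dimensions unchanged; dropping them would probably work just as well for clustering, since a constant column carries no information.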

panjunchang avatar May 10 '22 07:05 panjunchang

Apparently the AmesHousing package was updated from v0.0.3 to v0.0.4, and the Neighborhood variable now contains a "Hayden_Lake" factor level, but there are no observations for that neighborhood when using AmesHousing::make_ames() (see last bullet in this NEWS.md file).

# Hayden_Lake shows up as a level
levels(ames_full[["Neighborhood"]])
##  [1] "North_Ames"                              "College_Creek"                          
##  [3] "Old_Town"                                "Edwards"                                
##  [5] "Somerset"                                "Northridge_Heights"                     
##  [7] "Gilbert"                                 "Sawyer"                                 
##  [9] "Northwest_Ames"                          "Sawyer_West"                            
## [11] "Mitchell"                                "Brookside"                              
## [13] "Crawford"                                "Iowa_DOT_and_Rail_Road"                 
## [15] "Timberland"                              "Northridge"                             
## [17] "Stone_Brook"                             "South_and_West_of_Iowa_State_University"
## [19] "Clear_Creek"                             "Meadow_Village"                         
## [21] "Briardale"                               "Bloomington_Heights"                    
## [23] "Veenker"                                 "Northpark_Villa"                        
## [25] "Blueste"                                 "Greens"                                 
## [27] "Green_Hills"                             "Landmark"                               
## [29] "Hayden_Lake"

# But there are no observations for that level
as_tibble(ames_1hot) %>% 
  select(Neighborhood.Hayden_Lake) %>% 
  distinct()
## # A tibble: 1 × 1
## Neighborhood.Hayden_Lake
##                    <dbl>
## 1                      0
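To look for columns like this in general rather than checking one name at a time, something along these lines should work. It is only a sketch on a toy matrix: `m` and its column names are made up and stand in for `ames_1hot`.

```r
# Toy stand-in for ames_1hot (hypothetical data): "empty_level" is constant
m <- cbind(x1 = c(3, 1, 2), empty_level = c(0, 0, 0), x2 = c(0, 1, 0))

# A column is constant when it has exactly one distinct value
is_constant <- apply(m, 2, function(col) length(unique(col)) == 1)
names(which(is_constant))

# Optionally drop the constant columns before scaling
m_reduced <- m[, !is_constant, drop = FALSE]
colnames(m_reduced)
```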

Consequently, one-hot encoding produces a Neighborhood.Hayden_Lake column filled entirely with zeros. Scaling that column divides by a standard deviation of zero (0 / 0), which yields NaNs:

as_tibble(ames_1hot_scaled) %>% select(Neighborhood.Hayden_Lake)
## # A tibble: 2,930 × 1
##    Neighborhood.Hayden_Lake
##                       <dbl>
##  1                      NaN
##  2                      NaN
##  3                      NaN
##  4                      NaN
##  5                      NaN
##  6                      NaN
##  7                      NaN
##  8                      NaN
##  9                      NaN
## 10                      NaN
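The same behaviour can be reproduced with base R alone. In this small sketch the matrix `m` is a made-up example, not the Ames data:

```r
# Made-up two-column matrix: one column varies, the other is all zeros
m <- cbind(varies = c(1, 2, 3), all_zero = c(0, 0, 0))

# The all-zero column has a standard deviation of 0
apply(m, 2, sd)

# scale() centers each column and divides by its standard deviation,
# so the all-zero column becomes 0 / 0
scaled <- scale(m)
scaled[, "all_zero"]
```

Any constant column is affected in the same way, not just an all-zero one, because centering always reduces it to zeros first.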

If we coerce the factor columns to character prior to one-hot encoding, the unused level is no longer carried along and the code works as illustrated in the book:

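library(dplyr)    # %>% and mutate_if()
library(stringr)  # str_detect()

# Convert factors to character so unused levels are not one-hot encoded
ames_full <- AmesHousing::make_ames() %>%
  mutate_if(str_detect(names(.), 'Qual|Cond|QC|Qu'), as.numeric) %>%
  mutate_if(is.factor, as.character)

full_rank <- caret::dummyVars(Sale_Price ~ ., data = ames_full, fullRank = TRUE)
ames_1hot <- predict(full_rank, ames_full)

# Scale the one-hot encoded features before checking the dimensions
ames_1hot_scaled <- scale(ames_1hot)
dim(ames_1hot_scaled)
## [1] 2930  240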
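Another option that might work is to keep the columns as factors and drop the unused levels with base R's `droplevels()`. This is not what the book does and I have not checked it against the `caret::dummyVars()` pipeline, so treat it as a sketch. The toy data frame `df` is hypothetical, and base R's `model.matrix()` is used here only as a stand-in for the encoding step:

```r
# Hypothetical toy data: the level "unused" has no observations
df <- data.frame(
  y = c(1, 2, 3),
  f = factor(c("a", "b", "a"), levels = c("a", "b", "unused"))
)

# The unused level still produces an (all-zero) dummy column
mm <- model.matrix(y ~ f, data = df)
colnames(mm)

# droplevels() removes factor levels that have no observations
df_clean <- droplevels(df)
levels(df_clean$f)

# After dropping, the empty dummy column is no longer created
mm_clean <- model.matrix(y ~ f, data = df_clean)
colnames(mm_clean)
```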

bradleyboehmke avatar May 10 '22 14:05 bradleyboehmke