PatientLevelPrediction

NAN value in python RandomForest Input

Open aaalgo opened this issue 3 years ago • 3 comments

Describe the bug

runPlp failed with the following exception:

finished MapCovariates
toSparseM non temporal used
finishing toSparseM
Sourcing python code
Converting to python data
Applying Variable Importance
Using Random Forest to select features
population loaded- 1088 rows and 3 columns
Error in value[[3L]](cond) : 
  Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

Detailed traceback:
  File "<string>", line 136, in rf_var_imp
  File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/ensemble/_forest.py", line 309, in fit
    accept_sparse="csc", dtype=DTYPE)
  File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/base.py", line 433, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/utils/validation.py", line 878, in check_X_y
    estimator=estimator)
  File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packa
Calls: execute ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>

I can remove the error with the following change:

@@ -119,6 +119,8 @@ fitRandomForest <- function(population, plpData, param, search='grid', quiet=F,
   #x <- toSparsePython2(plpData,population, map=NULL)
   prediction <- population
   x <- toSparseM(plpData,population,map=NULL, temporal = F)
+  x$data[is.nan(x$data)] <- 0  # FIX: replace NaN entries with 0
 
   ParallelLogger::logInfo('Sourcing python code')
   reticulate::source_python(system.file(package='PatientLevelPrediction','python','randomForestFunctions.py'), envir = e)

The above patch fixes the problem.
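For illustration, the same kind of guard could also be applied on the Python side, before the matrix reaches scikit-learn. This is only a minimal sketch in plain Python, not the package's code: the function name `replace_non_finite` and the `(row, col, value)` triplet representation are assumptions made for the example. Unlike the R one-liner, it also replaces infinite values, since the scikit-learn check rejects those as well.

```python
import math


def replace_non_finite(triplets, fill=0.0):
    """Return a cleaned copy of (row, col, value) triplets.

    Hypothetical helper: NaN and +/-inf values are replaced by `fill`.
    Also returns how many values were replaced.
    """
    cleaned = []
    n_replaced = 0
    for row, col, value in triplets:
        if not math.isfinite(value):
            value = fill
            n_replaced += 1
        cleaned.append((row, col, value))
    return cleaned, n_replaced


# Made-up covariate entries: one NaN and one infinity among valid values.
entries = [(0, 0, 1.0), (0, 2, float("nan")), (1, 1, float("inf")), (2, 0, 3.5)]
cleaned, n_replaced = replace_non_finite(entries)
```

Like the patch above, this hides the symptom and does not explain where the NaN values come from.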

The covariate data were generated by injecting external data into the database and loading them as described in "Creating covariates using cohort attributes" (https://raw.githubusercontent.com/OHDSI/FeatureExtraction/master/inst/doc/CreatingCovariatesUsingCohortAttributes.pdf).

Set up (please run in R "sessionInfo()" and copy the output here):

To Reproduce

PLP Log File: plplog.txt

Full output directory: http://www.wdong.org/~wdong/output.tar

Additional context

I made two other edits to Plp commit 14cf29ed:

  • Remove the minimum outcome count check (threshold changed from 25 to 0).
  • Remove the exception handling around the runPlp call so that exceptions are not silently swallowed.

Full diff:
diff --git a/R/RandomForest.R b/R/RandomForest.R
index 7ebd65d..b554f61 100644
--- a/R/RandomForest.R
+++ b/R/RandomForest.R
@@ -119,6 +119,8 @@ fitRandomForest <- function(population, plpData, param, search='grid', quiet=F,
   #x <- toSparsePython2(plpData,population, map=NULL)
   prediction <- population
   x <- toSparseM(plpData,population,map=NULL, temporal = F)
+  saveRDS(x, "not_used.rds")
+  # x$data[is.nan(x$data)] <- 0
 
   ParallelLogger::logInfo('Sourcing python code')
   reticulate::source_python(system.file(package='PatientLevelPrediction','python','randomForestFunctions.py'), envir = e)
diff --git a/R/RunMultiplePlp.R b/R/RunMultiplePlp.R
index 565d8cc..62b2140 100644
--- a/R/RunMultiplePlp.R
+++ b/R/RunMultiplePlp.R
@@ -268,9 +268,10 @@ runPlpAnalyses <- function(connectionDetails,
       runPlpSettings$savePlpResult <- T
       runPlpSettings$savePlpPlots <- F
       runPlpSettings$saveEvaluation <- F
-      result <- tryCatch(do.call(runPlp, runPlpSettings),
-                             finally= ParallelLogger::logTrace('Done runPlp.'), 
-                             error= function(cond){ParallelLogger::logTrace(paste0('Error with runPlp:',cond));return(NULL)})
+      result <- do.call(runPlp, runPlpSettings)
+      #result <- tryCatch(do.call(runPlp, runPlpSettings),
+      #                       finally= ParallelLogger::logTrace('Done runPlp.'), 
+      #                       error= function(cond){ParallelLogger::logTrace(paste0('Error with runPlp:',cond));return(NULL)})
     }
     
   }
diff --git a/R/RunPlp.R b/R/RunPlp.R
index 34d0f76..8bbec2a 100644
--- a/R/RunPlp.R
+++ b/R/RunPlp.R
@@ -215,7 +215,7 @@ runPlp <- function(population, plpData,  minCovariateFraction = 0.001, normalize
   ParallelLogger::logDebug(paste0('testSplit: ', testSplit))
   checkInStringVector(testSplit, c('person','time', 'stratified','subject'))
   ParallelLogger::logDebug(paste0('outcomeCount: ', sum(population[,'outcomeCount']>0)))
-  checkHigherEqual(sum(population[,'outcomeCount']>0), 25)
+  checkHigherEqual(sum(population[,'outcomeCount']>0), 0)
   ParallelLogger::logDebug(paste0('plpData class: ', class(plpData)))
   checkIsClass(plpData, c('plpData'))
   ParallelLogger::logDebug(paste0('testfraction: ', testFraction))

aaalgo, Jun 25 '21 14:06

Which features were you using? The data shouldn't contain NA values.

jreps, Sep 30 '21 12:09
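One way to answer the question of which features are affected is to count non-finite values per covariate before fitting. The snippet below is a rough sketch under assumptions: it takes `(covariateId, covariateValue)` pairs that have already been exported from the covariate data, and the function name `non_finite_by_covariate` and the sample values are invented for illustration.

```python
import math
from collections import Counter


def non_finite_by_covariate(rows):
    """Count NaN/inf values per covariate id.

    `rows` is an iterable of (covariate_id, covariate_value) pairs;
    this layout is an assumption made for the example.
    """
    counts = Counter()
    for covariate_id, value in rows:
        if not math.isfinite(value):
            counts[covariate_id] += 1
    return dict(counts)


# Made-up rows: covariate 2002 has two NaN values, 3003 has one -inf.
rows = [
    (1001, 0.5),
    (2002, float("nan")),
    (2002, float("nan")),
    (3003, float("-inf")),
]
bad = non_finite_by_covariate(rows)
```

An empty result would suggest the values only become NaN later, for example during conversion to the sparse matrix or to Python.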

I ran into a similar problem. I also injected external covariates into the covariateData file. When the covariateValue column of covariateData contained 0 values, I got the NaN error when running random forest (the error did not occur with lasso or gbm). After I removed the zero values from the external covariates, the error went away. It seems the zero values may be converted to NaN in the r_to_py step.

ChungsooKim, Nov 04 '21 05:11

Interesting. I'll test removing entries with a value of 0 to prevent this, since the sparse matrix already defaults to 0.

jreps, Oct 27 '22 19:10
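The idea of dropping explicit zero entries can be sketched as follows. This is a hedged illustration in plain Python rather than the package's implementation: the helper name `drop_zero_entries` and the `(row, col, value)` triplet layout are assumptions. Whether explicit zeros are really what turns into NaN is still the untested hypothesis from this thread.

```python
def drop_zero_entries(triplets):
    """Keep only (row, col, value) triplets whose value is not zero.

    A sparse matrix treats missing cells as 0 already, so explicit
    zero entries carry no extra information. Hypothetical helper.
    """
    return [(row, col, value) for row, col, value in triplets if value != 0]


# Made-up entries: two explicit zeros (a float and an int) among real values.
entries = [(0, 0, 0.0), (0, 1, 2.0), (1, 0, 0), (1, 1, -1.5)]
kept = drop_zero_entries(entries)
```

Note that a NaN value compares unequal to 0, so this filter keeps NaN entries; it only removes explicit zeros.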