PatientLevelPrediction
NaN value in Python RandomForest input
Describe the bug
runPlp failed with the following exception:
finished MapCovariates
toSparseM non temporal used
finishing toSparseM
Sourcing python code
Converting to python data
Applying Variable Importance
Using Random Forest to select features
population loaded- 1088 rows and 3 columns
Error in value[[3L]](cond) :
Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
Detailed traceback:
File "<string>", line 136, in rf_var_imp
File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/ensemble/_forest.py", line 309, in fit
accept_sparse="csc", dtype=DTYPE)
File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/base.py", line 433, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/sklearn/utils/validation.py", line 878, in check_X_y
estimator=estimator)
File "/home/wdong/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packa
Calls: execute ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
I can remove the error with the following change:
@@ -119,6 +119,8 @@ fitRandomForest <- function(population, plpData, param, search='grid', quiet=F,
#x <- toSparsePython2(plpData,population, map=NULL)
prediction <- population
x <- toSparseM(plpData,population,map=NULL, temporal = F)
+ # x$data[is.nan(x$data)] <- 0 <---------------------------------------------------------- FIX
ParallelLogger::logInfo('Sourcing python code')
reticulate::source_python(system.file(package='PatientLevelPrediction','python','randomForestFunctions.py'), envir = e)
The above patch fixes the problem.
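For readers who want to see the idea outside of R, here is a minimal sketch in plain Python of what the patch does: find the values that the scikit-learn check would reject and replace the NaN entries with 0. The function names `find_non_finite` and `replace_nan` are hypothetical and are not part of PatientLevelPrediction or scikit-learn. Note that the error message also mentions infinity, which the patch above does not handle.

```python
import math


def find_non_finite(values):
    """Return the positions of NaN or +/-infinity entries (hypothetical helper)."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def replace_nan(values, fill=0.0):
    """Return (cleaned, n_replaced), with NaN entries swapped for `fill`.

    Mirrors the R one-liner x$data[is.nan(x$data)] <- 0 from the patch.
    Infinite values are left untouched, as in the patch.
    """
    cleaned = []
    n_replaced = 0
    for v in values:
        if math.isnan(v):
            cleaned.append(fill)
            n_replaced += 1
        else:
            cleaned.append(v)
    return cleaned, n_replaced


# Example covariate values (made up for illustration).
data = [1.0, float("nan"), 0.5]
bad_positions = find_non_finite(data)
cleaned, n_replaced = replace_nan(data)
```

Replacing NaN with 0 silences the error but may hide the real cause, so it is probably worth logging `bad_positions` before cleaning.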
The covariate data were generated by injecting external data into the database and loading them as described in "Creating covariates using cohort attributes" (https://raw.githubusercontent.com/OHDSI/FeatureExtraction/master/inst/doc/CreatingCovariatesUsingCohortAttributes.pdf).
Set up (please run in R "sessionInfo()" and copy the output here):
To Reproduce
PLP Log File plplog.txt
Full output directory http://www.wdong.org/~wdong/output.tar.
Additional context
I made two other edits to Plp commit 14cf29ed:
- Remove the minimal outcome constraint.
- Remove exception handling when calling runPlp so exceptions are not silently ignored.
diff --git a/R/RandomForest.R b/R/RandomForest.R
index 7ebd65d..b554f61 100644
--- a/R/RandomForest.R
+++ b/R/RandomForest.R
@@ -119,6 +119,8 @@ fitRandomForest <- function(population, plpData, param, search='grid', quiet=F,
#x <- toSparsePython2(plpData,population, map=NULL)
prediction <- population
x <- toSparseM(plpData,population,map=NULL, temporal = F)
+ saveRDS(x, "not_used.rds")
+ # x$data[is.nan(x$data)] <- 0
ParallelLogger::logInfo('Sourcing python code')
reticulate::source_python(system.file(package='PatientLevelPrediction','python','randomForestFunctions.py'), envir = e)
diff --git a/R/RunMultiplePlp.R b/R/RunMultiplePlp.R
index 565d8cc..62b2140 100644
--- a/R/RunMultiplePlp.R
+++ b/R/RunMultiplePlp.R
@@ -268,9 +268,10 @@ runPlpAnalyses <- function(connectionDetails,
runPlpSettings$savePlpResult <- T
runPlpSettings$savePlpPlots <- F
runPlpSettings$saveEvaluation <- F
- result <- tryCatch(do.call(runPlp, runPlpSettings),
- finally= ParallelLogger::logTrace('Done runPlp.'),
- error= function(cond){ParallelLogger::logTrace(paste0('Error with runPlp:',cond));return(NULL)})
+ result <- do.call(runPlp, runPlpSettings)
+ #result <- tryCatch(do.call(runPlp, runPlpSettings),
+ # finally= ParallelLogger::logTrace('Done runPlp.'),
+ # error= function(cond){ParallelLogger::logTrace(paste0('Error with runPlp:',cond));return(NULL)})
}
}
diff --git a/R/RunPlp.R b/R/RunPlp.R
index 34d0f76..8bbec2a 100644
--- a/R/RunPlp.R
+++ b/R/RunPlp.R
@@ -215,7 +215,7 @@ runPlp <- function(population, plpData, minCovariateFraction = 0.001, normalize
ParallelLogger::logDebug(paste0('testSplit: ', testSplit))
checkInStringVector(testSplit, c('person','time', 'stratified','subject'))
ParallelLogger::logDebug(paste0('outcomeCount: ', sum(population[,'outcomeCount']>0)))
- checkHigherEqual(sum(population[,'outcomeCount']>0), 25)
+ checkHigherEqual(sum(population[,'outcomeCount']>0), 0)
ParallelLogger::logDebug(paste0('plpData class: ', class(plpData)))
checkIsClass(plpData, c('plpData'))
ParallelLogger::logDebug(paste0('testfraction: ', testFraction))
What features were you using? The data shouldn't have NA values.
I had a similar experience. I also injected external covariates into the covariateData file. When my data had a 0 in the covariateValue column of covariateData, I got a NaN error when running random forest (but not when running lasso or GBM). After I removed the zero values from the external covariates, the error was resolved. It seems the zero values may be converted to NaN in the r_to_py step.
Interesting - I'll test removing entries with a 0 value to prevent this as the sparse matrix defaults to a value of 0.
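The workaround discussed above (dropping entries whose value is 0 before building the sparse matrix) could look roughly like the sketch below. It is plain Python for illustration only and is not PatientLevelPrediction code. It assumes the covariates are held as records with `rowId`, `covariateId` and `covariateValue` fields, and the function name `drop_zero_covariates` is hypothetical.

```python
def drop_zero_covariates(rows):
    """Remove records whose covariateValue is exactly 0 (hypothetical helper).

    A sparse matrix already treats missing entries as 0, so explicit zeros
    carry no extra information and can be dropped before conversion.
    Returns (kept_rows, n_dropped).
    """
    kept = [r for r in rows if r["covariateValue"] != 0]
    return kept, len(rows) - len(kept)


# Made-up example records, assumed layout: one record per (row, covariate).
covariates = [
    {"rowId": 1, "covariateId": 1001, "covariateValue": 2.5},
    {"rowId": 2, "covariateId": 1001, "covariateValue": 0},
    {"rowId": 3, "covariateId": 1002, "covariateValue": 0.0},
    {"rowId": 4, "covariateId": 1002, "covariateValue": 1.0},
]
kept, n_dropped = drop_zero_covariates(covariates)
```

This only removes exact zeros. NaN values compare unequal to 0 and are kept, so they would still need separate handling.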