
h2o4gpu: Genetic algorithm with Random Forest Regression produces error: terminate called after throwing an instance of 'thrust::system::system_error' what(): parallel_for failed: out of memory

Open Geerthy11 opened this issue 6 years ago • 1 comment

I am working on feature selection using a Genetic Algorithm (GA) with a Random Forest regression model (h2o4gpu.RandomForestRegressor). The number of estimators is 100; the rest of the parameters are left at their defaults. The fitness function for the GA is the RF model's MAE. My dataset is 1.51 MB with dimensions 4000 x 44. The following are the errors I get after a certain number of iterations (around 30-40) whenever I run the program:

terminate called after throwing an instance of 'thrust::system::system_error'
  what(): parallel_for failed: out of memory
Aborted (core dumped)

terminate called after throwing an instance of 'dmlc::Error'
  what(): [08:58:38] /workspace/include/xgboost/./../../src/common/common.h:41: /workspace/src/tree/../common/device_helpers.cuh: 422: out of memory
Stack trace:
  [bt] (0) /conda/envs/rapids/xgboost/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x24) [0x7f3f0b07fcb4]
  [bt] (1) /conda/envs/rapids/xgboost/libxgboost.so(+0x3267e2) [0x7f3f0b2a57e2]
  [bt] (2) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::DeviceShard<xgboost::detail::GradientPairInternal >::EvaluateSplits(std::vector<int, std::allocator >, xgboost::RegTree const&, unsigned long)+0x1041) [0x7f3f0b2b48b1]
  [bt] (3) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::DeviceShard<xgboost::detail::GradientPairInternal >::UpdateTree(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, xgboost::RegTree*, dh::AllReducer*)+0x131e) [0x7f3f0b2c7dfe]
  [bt] (4) /conda/envs/rapids/xgboost/libxgboost.so(+0x34a201) [0x7f3f0b2c9201]
  [bt] (5) /conda/envs/rapids/bin/../lib/libgomp.so.1(GOMP_parallel+0x42) [0x7f3f1c5bee92]
  [bt] (6) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x918) [0x7f3f0b2bae98]
  [bt] (7) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0xa81) [0x7f3f0b105791]
  [bt] (8) /conda/envs/rapids/xgboost/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0xd65) [0x7f3f0b106c95]

Aborted (core dumped)

The following are the specifications:
Ubuntu 16.04.6 LTS
Python 3.6.8
CUDA 10.2 / cuDNN 7.4.1
GPU model: Quadro GV100
Nvidia docker version: 18.09.6
RAM: 125 GB
h2o4gpu installed using the pip wheel for CUDA 10.0 (https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.3-cuda10/h2o4gpu-0.3.2-cp36-cp36m-linux_x86_64.whl)
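For reference, here is a minimal sketch of the setup described above. It is not my exact script: the GA loop is reduced to evaluating random feature subsets, and X/y are random placeholders for the 4000 x 44 dataset, but the RandomForestRegressor call (n_estimators=100, other parameters at defaults) and the MAE fitness match what I use.

```python
# Simplified stand-in for the real GA feature-selection run (illustrative only).
import numpy as np
import h2o4gpu
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = np.random.rand(4000, 44)   # placeholder for the real 4000 x 44 data
y = np.random.rand(4000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def fitness(feature_mask):
    """Fitness of one GA individual: MAE of an RF model trained on the selected columns."""
    cols = np.flatnonzero(feature_mask)
    model = h2o4gpu.RandomForestRegressor(n_estimators=100)  # other parameters left at defaults
    model.fit(X_train[:, cols], y_train)
    return mean_absolute_error(y_test, model.predict(X_test[:, cols]))

# Stand-in for the GA loop: evaluate many candidate feature subsets.
# In my runs, GPU memory usage grows over iterations until the out-of-memory error appears.
for generation in range(50):
    mask = np.random.randint(0, 2, size=X.shape[1]).astype(bool)
    if mask.any():
        print(generation, fitness(mask))
```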

Kindly provide your suggestions on this issue.

Geerthy11 avatar Jul 24 '19 14:07 Geerthy11

Could you provide a code snippet to reproduce it?

sh1ng avatar Sep 09 '19 12:09 sh1ng