LightGBM

Extremely slow data loading on high dimensional dataset.

Open yinzheng-zhong opened this issue 7 months ago • 0 comments

I am working with 4^11 features, and loading a 30 GB dataset has been stuck on a single thread for 15 hours. I can see the time is being spent waiting in

# basic.py line 2146
_LIB.LGBM_DatasetCreateFromFile(). 

I haven't looked into all the C++ code yet, but with 4^10-dimensional data loading takes around an hour, so the problem seems directly linked to the dimensionality of the dataset.

In addition, I am using Python and have tried loading the data both from LibSVM format and as a dense NumPy array; both show the same result. I will eventually need to work with 4^12-dimensional data, but at this rate that would be impossible. I have tried XGBoost, which only takes a few minutes to load the data and start training. It would be great if I could use LightGBM, as it uses less RAM.

I saw other issues that might be relevant but are not exactly the same, e.g. #4037. Any suggestion is appreciated. Thank you.

yinzheng-zhong · Jul 10 '24 19:07