LightGBM
Extremely slow data loading on high dimensional dataset.
I am working with 4^11 features, and loading a 30 GB dataset has been stuck on a single thread for 15 hours. I can see the time is being spent waiting for
# basic.py line 2146
_LIB.LGBM_DatasetCreateFromFile().
I haven't looked into the C++ code yet, but with 4^10-dimensional data it takes around an hour to load, so I think the problem is directly linked to the dimensionality of the dataset.
In addition, I am using Python and have tried loading the data both from libsvm format and from a dense NumPy array; both show the same result. I suppose it would eventually finish on 4^12-dimensional data as well, but at this speed it is impossible to work with. I have tried xgboost, which takes only a few minutes to load the data and start training. It would be great if I could use LightGBM instead, as it uses less RAM.
I saw other issues that might be relevant but are not exactly the same, e.g. #4037. Any suggestion is appreciated. Thank you.