LightGBM
Very slow data loading for sparse datasets with a large number of features
Summary
Data loading can be very slow for sparse datasets with a large number of features, due to the following code snippet: https://github.com/microsoft/LightGBM/blob/6de9bafaeb4de46b22c81e7199bb5de8b28e6174/src/io/dataset.cpp#L469-L484
Note that the inner for loop enumerates all features, regardless of whether the feature has a non-empty value for data point i. For a dataset like KDD 2010 (bridge to algebra version) from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, this costs about 3 hours, and to users the data loading process appears to be stuck forever.
Motivation
An efficiency improvement is needed for datasets with a large number of sparse features.
References
In the LightGBM paper, datasets with a large number of sparse features were tested. But in v3.0.0, row-wise histogram construction was introduced, along with the PushDataToMultiValBin call shown above, which makes running such datasets difficult in the current version.
https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
I have a preliminary solution for this and will fix the problem soon.
Thank you very much for writing this up!
Hi, did this get fixed? I am using the GitHub version cloned as of 9/2/2022. I have a really sparse matrix with millions of features. Loading the dataset is also very slow for me, but I don't know whether it is due to the particular characteristics of my dataset.
Hi, I am facing the same problem. Any update on this issue?
Any update on this issue?
Most of the time is spent in data loading.
I have a preliminary solution for this and will fix the problem soon.
Dear shiyu,
Have you fixed it?
Thank you in advance.