
Support a directory of multiple libsvm files as training data in the CLI version

zwqjoy opened this issue on Aug 12 '22 · 3 comments

I am aware of the MMLSpark solution you referenced.

I use distributed XGBoost (without Spark), and it supports a directory of multiple files as input. But I want to use LightGBM to train my model. To overcome memory bottlenecks with pandas, I use the CLI version, which is super efficient. The problem is that the CLI version wants ONE training file as input. If the training data is split across many libsvm files, which it usually is, concatenating them is a huge pain in the neck. If it is all going to end up inside LightGBM anyway, why not read the individual files and concatenate the data inside LightGBM?

Many thanks!

zwqjoy · Aug 12 '22

Thanks for using LightGBM!

Since you mentioned pandas, I'm assuming you are comfortable working in Python.

Option 1 - use Dask

Would you consider the Dask interface in lightgbm.dask? You could construct a dask.Array from a directory of libsvm files, then pass that into LightGBM training.

If you're open to that, I'd be happy to provide a reproducible example showing how to do that.
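
To sketch the general idea (untested, and just one way to do it): the snippet below assumes files matching data/part-*.svm, a known feature count, and scikit-learn's load_svmlight_file for parsing the libsvm format; NUM_FEATURES and load_part are placeholders to adapt to your setup.

```python
# Untested sketch, not an official example. Assumes: files matching
# "data/part-*.svm", a known feature count, each file small enough to
# fit in a worker's memory, and scikit-learn available for parsing.
import glob

import dask.array as da
import numpy as np
from dask import delayed
from dask.distributed import Client, LocalCluster
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

NUM_FEATURES = 100  # assumed; set this to your real feature count


def load_part(path):
    X, y = load_svmlight_file(path, n_features=NUM_FEATURES)
    # densified for simplicity; for very wide sparse data, keep it sparse
    return X.toarray(), y


paths = sorted(glob.glob("data/part-*.svm"))
parts = [delayed(load_part)(p) for p in paths]

# row counts are unknown until files are read, hence the NaN chunk sizes
X = da.concatenate(
    [
        da.from_delayed(p[0], shape=(np.nan, NUM_FEATURES), dtype=np.float64)
        for p in parts
    ]
)
y = da.concatenate(
    [da.from_delayed(p[1], shape=(np.nan,), dtype=np.float64) for p in parts]
)

# if a later step complains about unknown chunk sizes, uncomment these
# (they read each file once to materialize the row counts):
# X = X.compute_chunk_sizes()
# y = y.compute_chunk_sizes()

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=4))
    model = lgb.DaskLGBMRegressor(n_estimators=100, client=client)
    model.fit(X, y)
```

With this setup, Dask handles partitioning the data across workers, and lightgbm.dask handles wiring up the distributed training processes, much like the CLI's machine-list configuration does.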

Option 2 - use lightgbm.Sequence

Alternatively, you could try the lightgbm.Sequence interface in the Python package, which allows creating a Dataset from batches of data.

See https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/dataset_from_multi_hdf5.py for an example of how to do this with a directory of hdf5 files. You could try modifying that code to work with libsvm files.
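
To sketch what that modification might look like (untested; the LibsvmSequence class, NUM_FEATURES, and the data/part-*.svm layout are placeholders, and scikit-learn is assumed for libsvm parsing):

```python
# Untested sketch, not an official example. Each file must fit in
# memory, but the full dataset never has to: a one-slot cache keeps
# only the file LightGBM is currently reading.
import glob

import numpy as np
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

NUM_FEATURES = 100  # assumed; set this to your real feature count

_cache = {"path": None, "data": None}


def load_part(path):
    # one-slot cache: only the most recently used file stays in memory
    if _cache["path"] != path:
        X, y = load_svmlight_file(path, n_features=NUM_FEATURES)
        _cache["path"], _cache["data"] = path, (X.toarray(), y)
    return _cache["data"]


class LibsvmSequence(lgb.Sequence):
    def __init__(self, path, batch_size=4096):
        self.path = path
        self.batch_size = batch_size  # rows LightGBM pulls per read

    def __getitem__(self, idx):
        # idx may be an int (one row) or a slice (a batch of rows)
        return load_part(self.path)[0][idx]

    def __len__(self):
        return load_part(self.path)[0].shape[0]


paths = sorted(glob.glob("data/part-*.svm"))
labels = np.concatenate([load_part(p)[1] for p in paths])

train_set = lgb.Dataset([LibsvmSequence(p) for p in paths], label=labels)
booster = lgb.train({"objective": "regression"}, train_set)
```

The one-slot cache is the key design choice here: LightGBM reads the sequences in order while building the Dataset, so only the file currently being read has to be held in memory at any one time.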

jameslamb · Aug 12 '22


@jameslamb The training data may be 100 GB or larger, so loading it with pandas may cause OOM. That is why the CLI seems like the best way to train distributed LightGBM, but the training data is split across many parts.

zwqjoy · Aug 25 '22

> loading it with pandas may cause OOM. That is why the CLI seems like the best way to train distributed LightGBM

To be clear, I didn't recommend using pandas. The Dask interface performs distributed training the same way the CLI does. It's true that reading files into Python data structures like numpy arrays might require more memory than the CLI uses when reading files, but I recommend trying it before assuming it definitely won't work with the amount of data you have.

The Sequence interface I recommended also does not require pandas. It allows you to construct a Dataset from a directory of files by reading one file at a time and incrementally updating the Dataset. With that interface, the entire raw training set never needs to be held in memory at once.


There is already a feature request in this project's backlog for supporting a directory of files as input to CLI training (#2031). Linking a few other related conversations:

  • #5055
  • #5094

Just to set the right expectation...I doubt the maintainers will implement that feature soon. There is significant other work to be done in the project to get to its 4.0.0 release (see the conversation in #5153).

So if the Python options I've provided above don't work for your use case, and neither does Spark (as @StrikerRUS recommended to you in the discussion in #2031), then you will either need to watch those issues and wait for them to be implemented, or attempt to implement this support yourself and open a pull request adding it.

jameslamb · Aug 25 '22

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

github-actions[bot] · Sep 24 '22

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.

github-actions[bot] · Aug 15 '23