LightGBM
Support training data from a directory [multiple libsvm files] in the CLI version
I am aware of the MMLSpark solution you referenced.
I use distributed XGBoost (not via Spark), and it supports a directory [multiple files] as input. But I want to use LightGBM to train my model. To overcome memory bottlenecks with pandas, I use the CLI version, which is super efficient. The problem is that the CLI version wants ONE training file as input. If the training data is in many files (libsvm), which it usually is, concatenating them is a huge pain in the neck. If it is all going to end up inside LightGBM anyway, why not read the individual files and concatenate the data inside LightGBM?
Many thanks!
Thanks for using LightGBM!
Since you mentioned `pandas`, I'm assuming you are comfortable working in Python.
Option 1 - use Dask
Would you consider the Dask interface in `lightgbm.dask`? You could construct a `dask.Array` from a directory of libsvm files, then pass that into LightGBM training.
If you're open to that, I'd be happy to provide a reproducible example showing how to do that.
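Something along these lines might work (a rough sketch, not a tested recipe: the `data/train-*.libsvm` file pattern, `N_FEATURES`, worker count, and model parameters below are all assumptions you would need to adapt):

```python
# Rough sketch only: file pattern, feature count, and parameters are assumptions.
import glob

import dask.array as da
import numpy as np
from dask import delayed
from dask.distributed import Client, LocalCluster
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

N_FEATURES = 100  # assumed: every libsvm file uses the same feature space


def load_one(path):
    # load_svmlight_file returns a sparse matrix; densify one file at a time
    X, y = load_svmlight_file(path, n_features=N_FEATURES)
    return X.toarray(), y


if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2))

    paths = sorted(glob.glob("data/train-*.libsvm"))  # assumed file layout
    parts = [delayed(load_one)(p) for p in paths]

    # one chunk per file; row counts are unknown until the files are read
    X = da.concatenate([
        da.from_delayed(p[0], shape=(np.nan, N_FEATURES), dtype=np.float64)
        for p in parts
    ])
    y = da.concatenate([
        da.from_delayed(p[1], shape=(np.nan,), dtype=np.float64)
        for p in parts
    ])

    # resolve the unknown (nan) chunk sizes before training; this reads each file once
    X = X.compute_chunk_sizes()
    y = y.compute_chunk_sizes()

    model = lgb.DaskLGBMRegressor(client=client, n_estimators=100)
    model.fit(X, y)
```

Each worker only needs to hold its own chunks in memory, which is the main advantage over loading everything into a single pandas DataFrame.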
Option 2 - use lightgbm.Sequence
Alternatively, you could try the `lightgbm.Sequence` interface in the Python package, which allows creating a `Dataset` from batches of data.
See https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/dataset_from_multi_hdf5.py for an example of how to do this with a directory of hdf5 files. You could try modifying that code to work with libsvm files.
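For libsvm specifically, a minimal sketch along the lines of that example might look like the following (assumptions: the `data/train-*.libsvm` layout, a shared `N_FEATURES` feature space, and that each file fits in memory as a scipy sparse matrix; rows are only densified batch-by-batch as LightGBM reads them):

```python
# Rough sketch only: file layout, feature count, and parameters are assumptions.
import glob

import numpy as np
from sklearn.datasets import load_svmlight_file

import lightgbm as lgb

N_FEATURES = 100  # assumed: every libsvm file uses the same feature space


class LibsvmSequence(lgb.Sequence):
    """Wrap one libsvm file; rows are densified only when LightGBM reads them."""

    def __init__(self, path, batch_size=4096):
        # keep the CSR matrix sparse so memory stays close to the on-disk size
        self._X, self.labels = load_svmlight_file(path, n_features=N_FEATURES)
        self.batch_size = batch_size  # rows LightGBM fetches per read

    def __getitem__(self, idx):
        # LightGBM accesses rows both by integer index and by slice
        rows = self._X[idx].toarray()
        return rows[0] if isinstance(idx, int) else rows

    def __len__(self):
        return self._X.shape[0]


paths = sorted(glob.glob("data/train-*.libsvm"))  # assumed file layout
seqs = [LibsvmSequence(p) for p in paths]
label = np.concatenate([s.labels for s in seqs])

train_set = lgb.Dataset(seqs, label=label)
booster = lgb.train({"objective": "regression"}, train_set)
```

Note that unlike the hdf5 example, libsvm has no random row access on disk, so this sketch keeps each file's sparse matrix in memory; if even the sparse form is too large, converting the files to a random-access format such as HDF5 first would be closer to a true out-of-core setup.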
@jameslamb The training data may be 100 GB or larger, so loading it with pandas may cause OOM. That is why the CLI is the best way to train distributed LightGBM, but the training data consists of many parts.
> loading it with pandas may cause OOM. That is why the CLI is the best way to train distributed LightGBM
To be clear, I didn't recommend using `pandas`. The Dask interface performs distributed training the same way that the CLI does. It's true that reading files into Python data structures like `numpy` arrays might require more memory than the CLI uses when reading files, but I recommend trying it before assuming that it definitely won't work with the amount of data you have.
The `Sequence` interface I recommended also does not require `pandas`, and it allows you to construct a `Dataset` from a directory of files by reading one file at a time and incrementally updating the `Dataset`. With that interface, the entire raw training set never needs to be held in memory at one time.
There is already a feature request in this project's backlog for supporting a directory of files as input to training in the CLI (#2031). Here are links to a few other related conversations:
- #5055
- #5094
Just to set the right expectation...I doubt that that feature will be implemented by maintainers soon. There is significant other work that needs to be done in the project to get to its 4.0.0 release (see the conversation in #5153).
So if the Python options I've provided above don't work for your use case, and neither does Spark (as @StrikerRUS recommended to you in the discussion in #2031), then you will either need to watch those issues and wait for them to be implemented, or attempt to implement this support yourself and open a pull request adding it.
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.