[python-package] Adding support for polars for input data
Summary
I think the polars library is on the path to replacing the majority of pandas use-cases. It is already being adopted by the community. We use it internally at my company for new projects, and we try not to use pandas at all.
Motivation
Polars is blazingly fast and has a memory footprint that is several times lower. There is no need to spend extra memory converting data into numpy or pandas just to train LightGBM.
Description
I would like the following to work, where data_train and data_test are instances of pl.DataFrame:
y_train = data_train[col_target]
y_test = data_test[col_target]
X_train = data_train.select(cols_pred)
X_test = data_test.select(cols_pred)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": {"l2", "l1"},
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "verbose": 0,
    "num_leaves": 42,
    "max_depth": 5,
    "num_iterations": 5000,
    "min_data_in_leaf": 500,
    "reg_alpha": 2,
    "reg_lambda": 5,
}
gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_eval,
    callbacks=[lgb.early_stopping(stopping_rounds=500)],
)
As of now, I have to convert them into numpy arrays:
y_train = data_train[col_target].to_numpy()
y_test = data_test[col_target].to_numpy()
X_train = data_train.select(cols_pred).to_numpy()
X_test = data_test.select(cols_pred).to_numpy()
Thanks for using LightGBM and for taking the time to write this up.
I support lightgbm taking on direct polars integration... that project has reached a level of stability and popularity that warrants it.
Are you interested in contributing this?
I believe this will be greatly simplified with the great work from @borchero (#6034, #6163), and it will come down to adding something like:
if isinstance(data, pl_DataFrame):
    __init_from_pyarrow_table(data.to_arrow(), ...)
so we should probably wait for his other PRs to be merged.
As a side note, passing data from polars without copying any data is the entire reason my PRs exist. polars has a to_arrow() method on data frames and series which is zero-copy.
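For illustration only (the column names here are made up), both conversions are a single call:
import polars as pl

df = pl.DataFrame({"feature": [1.0, 2.0, 3.0], "label": [0, 1, 0]})

table = df.to_arrow()           # the whole frame as a pyarrow.Table
label = df["label"].to_arrow()  # a single series as an Arrow array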
is the entire reason my PRs exist
amazing!!!
In the future, please share that context with us when proposing changes here. It helps us to make informed choices about how much time and energy to put into reviewing, and how to weight the costs of the new addition against the added maintenance burden.
@borchero are you interested in adding polars support once the Arrow PRs are done? We'd love the help and we've really appreciated your high-quality contributions so far. I agree with @jmoralez that that's the right way to sequence this work.
In the future, please share that context with us when proposing changes here. It helps us to make informed choices about how much time and energy to put into reviewing, and how to weight the costs of the new addition against the added maintenance burden.
Sure, will do next time :smile: since there already was an open Arrow PR, I didn't think about going into "motivation" too much :eyes:
are you interested in adding polars support once the Arrow PRs are done?
The plan to pass polars to Arrow in our company's application code would have simply been to call .to_arrow() in appropriate places. Do you think that there's much value in adding direct support for pl.DataFrame and pl.Series at the expense of an additional dependency and, hence, higher maintenance complexity?
we've really appreciated your high-quality contributions so far
🙏🏼🙏🏼
The plan to pass polars to Arrow in our company's application code would have simply been to call .to_arrow() in appropriate places. Do you think that there's much value in adding direct support for pl.DataFrame and pl.Series at the expense of an additional dependency and, hence, higher maintenance complexity?
The way I see it, 2024 will be the year of polars adoption by major Python ML packages. The easier you make it for users to adopt, the better the overall user experience will be.
I am glad to hear that this is being considered and that my issue wasn't rejected at first glance.
On a different note, I tried to use LightGBM directly in Rust (https://github.com/detrin/lightgbm-rust) and I will perhaps use it as a test use case. The pyarrow option is interesting, I will try it as well. Thanks @borchero, could you link the PR here?
Do you think that there's much value in adding direct support for pl.DataFrame and pl.Series at the expense of an additional dependency and, hence, higher maintenance complexity?
This is a great point @borchero.
Taking on pyarrow support directly definitely was worth it, as there were details like data types and handling contiguous vs. chunked arrays that added complexity, and therefore significant user benefit to having LightGBM handle that complexity internally in a way that best fits with how LightGBM works.
I'm not that familiar with polars, so I don't know if there are similar complexities that'd be worth pulling inside of LightGBM to make things easier for users.
If not and it's literally just .to_arrow(), then maybe just documentation + an informative error message suggesting the use of .to_arrow() would be enough?
I guess as a first test, I'd want to understand how .to_arrow() works... does it return a copy of the data, but in Arrow format? Or does polars use the Arrow format internally, and does .to_arrow() just return a pointer to that data that bypasses any other polars-specific data structures?
Because if it's a copy... then having lightgbm do .to_arrow() internally would result in an unavoidable copy that could be avoided externally.
Consider something like this (pseudo-code, this won't run):
import polars as pl
import lightgbm as lgb
df = pl.read_csv("data.csv")
dtrain = lgb.Dataset(
    df[["feature_1", "feature_2"]],
    label=df["label"]
)
lgb.train(train_set=dtrain, ...)
If lightgbm does a .to_arrow() on that passed-in polars DataFrame internally, then throughout training you'll be holding df in memory plus a copy in Arrow format created by .to_arrow().
I think that'd result in higher peak memory usage than instead doing something like the following and passing in Arrow data directly:
import polars as pl
import lightgbm as lgb
df = pl.read_csv("data.csv").to_arrow()
dtrain = lgb.Dataset(
    df[["feature_1", "feature_2"]],
    label=df["label"]
)
lgb.train(train_set=dtrain, ...)
Other things that might justify adding support for directly passing polars DataFrames and series:
- does polars have its own data types that are significantly different from pyarrow? e.g. does it have concepts from pandas like nullable dtypes or categoricals? (see the sketch after this list)
- if .to_arrow() returns a copy... is there some other API in polars that provides a pointer to the start of the underlying raw data? So that LightGBM might be able to construct a lightgbm.Dataset from it without any copying on the Python side?
lightgbm's Python package doesn't really do any DataFrame aggregations, joins, filtering, etc. ... the types of operations that benefit from a polars backend. So I think the main benefit would be something like "lots of users want to use polars, but it's difficult to know how to efficiently create a lightgbm.Dataset from a polars DataFrame".
I guess as a first test, I'd want to understand how .to_arrow() works... does it return a copy of the data, but in Arrow format? Or does polars use the Arrow format internally, and does .to_arrow() just return a pointer to that data?
Polars internally keeps memory according to the Arrow memory format. When you call to_arrow we give a pointer according to that format to pyarrow, and you can continue via pyarrow to move to pandas, pyarrow, or any other tool that consumes Arrow.
Moving data in and out of polars via arrow is zero-copy.
Moving data in and out of polars via numpy can be zero-copy (it depends on the data type, null data and dimensions)
Moving data in and out of polars via numpy can be zero-copy (it depends on the data type, null data and dimensions)
Does this imply that potentially LightGBM could use it even in my snippet above without allocating any new memory on the heap?
@detrin not for your training data, i.e. not for polars data frames. Polars uses column-wise storage, i.e. each of your columns is represented by a contiguous chunk of memory (but data for each column is potentially in different locations of the heap).
The only interface that is currently available to pass data to LightGBM from Python is via NumPy (disregarding the option to pass data as files), which uses a single array (=single chunk of memory) to store your data and uses row-wise storage. This means that each row is represented by a contiguous chunk of memory and rows are furthermore concatenated such that you end up with a single array.
As you can see, Polars' data representation is quite different to the NumPy data representation and, thus, data needs to be copied.
Side note: to avoid requiring two copies, you should call .to_numpy(order="c") on your Polars data frame; otherwise you will end up with a single array (= single chunk of memory) with column-wise ordering, as that is more efficient to generate. LightGBM will, however, not like this and will copy the data yet again.
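To make the two layouts concrete, here is a small sketch (column names are made up; the order values follow the polars to_numpy(order=...) parameter mentioned above):
import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Column-major ("Fortran") layout: each column is contiguous within the array.
col_major = df.to_numpy(order="fortran")
# Row-major ("C") layout: each row is contiguous, which is what LightGBM expects.
row_major = df.to_numpy(order="c")

print(col_major.flags["F_CONTIGUOUS"])  # True
print(row_major.flags["C_CONTIGUOUS"])  # True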
The way to resolve this issue is to extend LightGBM's API, i.e. to allow other data formats to be passed from the Python API. Arrow is a natural choice since it is being used ever more and is the backing memory format for pandas. In fact, it allows you to pass data from any tool that provides data as Arrow to LightGBM without copying any data.
The only interface that is currently available to pass data to LightGBM from Python is via NumPy (disregarding the option to pass data as files),
This is not true.
The Python package supports all of these formats for raw data:
- numpy arrays
- pandas DataFrames
- datatable DataFrames (h2o's DataFrame format)
- scipy CSC and CSR sparse matrices
- CSV, TSV, and LibSVM files
- Python lists of lists
Start reading from here and you can see all those options:
https://github.com/microsoft/LightGBM/blob/516bde95015b05e57ff41b19d9bec19b0c48d7e6/python-package/lightgbm/basic.py#L2010
Also for reference: https://numpy.org/doc/1.21/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray. It says:
Data in new ndarrays is in the row-major (C) order, unless otherwise specified
"stored in a contiguous block of memory in row-major order" is not exactly the same as "row-wise", just wanted to add that link as it's a great reference for thinking about these concepts.
The Python package supports all of these formats for raw data:
Ah, sorry about the misunderstanding! I think I phrased this a little too freely. I meant data formats that are useful for passing a Polars dataframe. Pandas is essentially treated the same as NumPy but adds a bit more metadata. The other options are largely irrelevant for this particular case.
"stored in a contiguous block of memory in row-major order" is not exactly the same as "row-wise", just wanted to add that link as it's a great reference for thinking about these concepts.
Yep, thanks! Once one has read through the NumPy docs, one also understands the statement that "polars' to_numpy() method outputs Fortran-contiguous NumPy arrays by default" :smile:
So, if I understand it correctly, as of now there is no way to pass data from polars to LightGBM without copying the data in memory.
For the project I am working on I might use CLI as a workaround.
So, if I understand it correctly, as of now there is no way to pass data from polars to LightGBM without copying the data in memory.
Yes, you will have (at least) one copy of your data in memory along with the LightGBM-internal representation of your data that is optimized for building the tree.
For the project I am working on I might use CLI as a workaround.
Potentially, a viable temporary alternative might also be to pass data via files (?)
Potentially, a viable temporary alternative might also be to pass data via files (?)
Is it possible directly in Python? I could then output the data into a temp file and have LightGBM load it in Python.
See @jameslamb's comment above for LightGBM's "API":
CSV, TSV, and LibSVM files
You could e.g. write your data to CSV. Obviously, this introduces some performance hit.
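As a rough sketch of that workaround (the file paths, the "label" column name, and the parameter values are assumptions about your data, not anything prescribed by this thread):
import lightgbm as lgb
import polars as pl

df = pl.read_csv("data.csv")  # hypothetical input file

# Dump the polars frame to CSV and let LightGBM parse the file itself.
df.write_csv("train.csv")
dtrain = lgb.Dataset(
    "train.csv",
    params={"header": True, "label_column": "name:label"},
)
booster = lgb.train({"objective": "regression"}, dtrain)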
Shameless plug: PerpetualBooster supports Polars input: https://github.com/perpetual-ml/perpetual
@jameslamb I just thought again about adding documentation about how to pass polars data to LightGBM. Where do you think is the most appropriate place for this? I wouldn't want to add a note on polars support for a bunch of Dataset.__init__ parameters as well as all the set_{label,group,...} methods. The same applies to the informative error message that you suggest above.
I thought about adding a note to the Dataset class docs but it's very minimal so far... wdyt?
Hey, thanks for reviving this @borchero.
A lot has changed since the initial discussion. There's now a 1.0 release of polars (and the corresponding commitment to more stability) and you've since added direct Arrow support to lightgbm's Python package.
I wonder... could we add direct, transparent support for polars inputs in lightgbm without adding a dependency on polars by just doing something like this?
def _is_polars(arr) -> bool:
    return "polars." in str(arr.__class__) and callable(getattr(arr, "to_arrow", None))

# ... in any method where LightGBM accepts raw data ...
if _is_polars(data):
    data = data.to_arrow()
Because if we did that, then we wouldn't need to document specific methods that have to be called on polars inputs. From users' perspective, lightgbm just supports polars. If, in the future, this little trick proves insufficient, we could then consider taking on an optional dependency on polars and handle it similarly to how the pandas and datatable dependencies are handled.
This related discussion is also worth cross-linking: https://github.com/dmlc/xgboost/issues/10554
Hi @jameslamb,
Hope it's ok for me to jump in here - I contribute to pandas and Polars, and have fixed up several issues related to the interchange protocol mentioned in https://github.com/dmlc/xgboost/issues/10452
The interchange protocol provides a standardised way of converting between dataframe libraries, but has several limitations which may affect you, so I recommend not using it:
- no support for Series input
- unsupported datatypes (e.g. Date, nested datatypes)
- unreliable implementations: using it to convert to pandas is not recommended for pandas<2.0.2, and accessing the column buffers directly isn't recommended for pandas<3.0. My biggest criticism of the project is that implementations are tied to the dataframe libraries themselves, making updates to historical versions impossible
If all you need to do is convert to pyarrow, then I'd suggest you just do
import sys

if (pl := sys.modules.get('polars')) is not None and isinstance(data, pl.DataFrame):
    data = data.to_arrow()
If instead you need to perform dataframe operations in a library-agnostic manner, then Narwhals, an extremely lightweight compatibility layer between dataframe libraries which has zero dependencies, may be of interest to you (Altair recently adopted it for this purpose, see https://github.com/vega/altair/pull/3452, as did scikit-lego)
I'd be interested to see how I could be of help here, as I'd love to see Polars support in LightGBM happen! If it would be useful to have a longer chat about how it would work, feel free to book some time at https://calendly.com/marcogorelli
I wonder... could we add direct, transparent support for polars inputs in lightgbm without adding a dependency on polars by just doing something like this?
@ritchie46 pointed out this discussion to me, and I wanted to highlight recent work around the Arrow PyCapsule Interface. It's a way for Python packages to exchange Arrow data safely without prior knowledge of each other. If the input object has an __arrow_c_stream__
method, then you can call it to get a PyCapsule containing an Arrow C Stream pointer. Recent versions of pyarrow have implemented this in their constructors. I recently added this protocol to Polars in https://github.com/pola-rs/polars/pull/17676 and there's progress on getting wider ecosystem support.
You can use:
import pyarrow as pa
assert hasattr(input_obj, "__arrow_c_stream__")
table = pa.table(input_obj)
# pass table into existing API
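For the polars case discussed in this thread, and assuming the Arrow ingestion added by the PRs referenced above (a pyarrow Table for data and a pyarrow array for the label), that could look roughly like this (column names are made up):
import lightgbm as lgb
import polars as pl
import pyarrow as pa

df = pl.read_csv("data.csv")  # hypothetical input file

# polars DataFrames expose __arrow_c_stream__, so pa.table() can consume them directly.
table = pa.table(df)
dtrain = lgb.Dataset(
    table.select(["feature_1", "feature_2"]),
    label=table["label"],
)
booster = lgb.train({"objective": "regression"}, dtrain)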
Alternatively, this also presents an opportunity to access a stream of Arrow data without materializing it in memory all at once. You can use the following to only materialize a single Arrow batch at a time:
import pyarrow as pa
assert hasattr(input_obj, "__arrow_c_stream__")
reader = pa.RecordBatchReader.from_stream(input_obj)
for arrow_chunk in reader:
    ...  # do something with each batch
If the source is itself a stream (as opposed to a "Table" construct where multiple batches are already materialized in memory), then you can import very large Arrow data while only materializing a single batch in memory at a time.
The PyCapsule Interface could also let you remove the pyarrow dependency if you so desired.