
Converting LightGBM Regressor to ONNX seems impossible with mixed float and string initial_types.

Open flaussy opened this issue 1 year ago • 1 comments

Hi,

I'm trying to convert a LightGBM regressor (LGBMRegressor) using the convert_lightgbm function. My features are a mix of categorical (string) and float values.

However, when I try to specify a different initial_type for each column, I get this error:

RuntimeError: For operator LgbmRegressor (type: LgbmRegressor), at most 1 input(s) is(are) supported but we got 15 input(s) which are ['postal_code_mission', 'do_code', 'dz_code', 'agency_code', 'adecco_code', 'siret', 'cod_prs_prc', 'cod_zep_ctr', 'cod_sgm_tt_con', 'depenses_client', 'month', 'hourly_rate', 'contract_duration', 'tension', 'difficulty']

I think this means I can only specify one input, so it has to be either float or string, but not both?

I tried doing :

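import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

# Declare a single float input covering all feature columns
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]

# Convert the LightGBM model to ONNX format
onnx_model = onnxmltools.convert_lightgbm(model, initial_types=initial_type)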

The conversion worked, but at inference time I got this error:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(string)) , expected: (tensor(float))

Which makes sense, since I'm passing strings to something that expects floats.

Am I doing something wrong, or is there no way to convert an LGBMRegressor to ONNX format with both string and float tensors as inputs?

flaussy avatar Jun 20 '24 13:06 flaussy

Any updates here?

ogencoglu avatar Sep 05 '24 20:09 ogencoglu

Any updates?

nil-andreu avatar Nov 26 '24 10:11 nil-andreu

Can you share more information about how you trained the model?

xadupre avatar Nov 27 '24 18:11 xadupre

Closing the issue. Feel free to reopen it.

xadupre avatar Dec 23 '24 13:12 xadupre

We're having the same issue: _parse_lightgbm_simple_model doesn't handle the case where the inputs are derived from a pandas DataFrame (i.e. each column is a separate input variable and the dtypes are heterogeneous). So it just says "Okay, we have N inputs, one per column, LGTM," and then when the process gets to shape calculation, it dies because the shape calculator expects only a single input variable.

I'm not sure there's a good solution, other than the parser injecting some conversion logic that turns non-numeric categoricals into numerics, doing a concat, and then finally the tree ensemble. It would need to attach this categorical -> numeric mapping to the ensemble operator too, so the actual converter can use it to rewrite the trees.

khoover avatar May 02 '25 17:05 khoover
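
The categorical -> numeric mapping followed by a concat, as described in the last comment, can also be approximated outside the converter as a preprocessing step. The following is only a minimal sketch under stated assumptions, not something onnxmltools provides: the helper names build_category_maps and encode_rows, the sample values, and the -1.0 code for unseen categories are all hypothetical, and the model is assumed to have been trained on the same encoded matrix so that the codes agree at inference time.

```python
def build_category_maps(rows, categorical_columns):
    """Assign each distinct string in a categorical column an integer code."""
    maps = {}
    for col in categorical_columns:
        # Sort the distinct values so the codes are deterministic across runs.
        distinct = sorted({row[col] for row in rows})
        maps[col] = {value: code for code, value in enumerate(distinct)}
    return maps


def encode_rows(rows, column_order, category_maps, unknown_code=-1.0):
    """Turn heterogeneous rows into one all-float matrix (the 'concat' step)."""
    matrix = []
    for row in rows:
        encoded = []
        for col in column_order:
            if col in category_maps:
                # Categorical column: look up the code; unseen values get
                # unknown_code (an arbitrary choice made for this sketch).
                encoded.append(float(category_maps[col].get(row[col], unknown_code)))
            else:
                # Numeric column: pass the value through as a float.
                encoded.append(float(row[col]))
        matrix.append(encoded)
    return matrix


# Hypothetical sample data reusing three column names from the error message.
rows = [
    {"agency_code": "A12", "month": 3, "hourly_rate": 14.5},
    {"agency_code": "B07", "month": 11, "hourly_rate": 12.0},
    {"agency_code": "A12", "month": 7, "hourly_rate": 15.25},
]
columns = ["agency_code", "month", "hourly_rate"]

maps = build_category_maps(rows, ["agency_code"])
X = encode_rows(rows, columns, maps)
```

The resulting matrix has one row per sample and one column per feature, which matches the single FloatTensorType([None, n_features]) input used in the original post. It would still need to be converted to a float32 array before being passed to ONNX Runtime, and the same maps would have to be stored alongside the model and reused for every prediction.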