
[R-package] lgb.importance(): Error: R character strings are limited to 2^31-1 bytes

Open Chuang1128 opened this issue 1 year ago • 7 comments

Hello!

I built a LightGBM model and then called lgb.importance(model). R then shows: Error: R character strings are limited to 2^31-1 bytes.

How do I solve this error? Thank you!

Chuang1128 avatar Jan 27 '24 04:01 Chuang1128

Thanks for using LightGBM.

An error message alone is not enough information for us to help you. Please provide the following:

  • version of R
  • version of {lightgbm}
  • how you installed LightGBM
  • operating system
  • output of running sessionInfo() in your R session (if possible)
  • a minimal, reproducible example that generates this error (docs with some guidance on that)

Here's an example of how to create a reproducible example for the R package: https://github.com/microsoft/LightGBM/issues/4721#issue-1036595701
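For reference, a minimal sketch of what such a reproducible example might look like for lgb.importance() (synthetic data and parameter values here are illustrative placeholders, not taken from this issue):

```r
library(lightgbm)

# small synthetic regression dataset
set.seed(708)
X <- matrix(rnorm(1000L * 10L), ncol = 10L)
y <- rnorm(1000L)

dtrain <- lgb.Dataset(data = X, label = y)

model <- lgb.train(
  params = list(
    objective = "regression",
    metric = "l2",
    min_data_in_leaf = 1L
  ),
  data = dtrain,
  nrounds = 10L
)

# the call that reportedly fails on the large real dataset
lgb.importance(model)
```

Running something like this end to end (substituting data and parameters that actually trigger the error) lets maintainers reproduce the problem locally.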

jameslamb avatar Jan 27 '24 04:01 jameslamb

version of R: 4.3.2

version of {lightgbm}: 3.3.5

how you installed LightGBM: Tool>Install package

operating system: windows

output of running sessionInfo() in your R session:
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default

Chuang1128 avatar Jan 27 '24 05:01 Chuang1128

Thank you for that!

Tool>Install package

What does this mean?

version of {lightgbm}: 3.3.5

Can you please update to the latest version (v4.3.0) from CRAN and try again?

install.packages("lightgbm", repos = "https://cran.r-project.org")

And after that...we still won't be able to help much without a minimal, reproducible example.

jameslamb avatar Jan 27 '24 05:01 jameslamb

My lightgbm package version is 3.3.5 and it still shows the error. I will try to put together a reproducible example.

Chuang1128 avatar Jan 27 '24 05:01 Chuang1128

Train the model:

model <- lgb.train(params = list(objective = "regression", num_iterations = 100, metric = "l2", min_data = 1L, min_data_in_bin = 100, min_gain_to_split = 10), data = train, nrounds = 100)

lgb_imp <- lgb.importance(model)

However, when I decrease num_iterations, the error does not appear.

Chuang1128 avatar Jan 27 '24 05:01 Chuang1128

my lightgbm package version is 3.3.5

Sorry if my placement was confusing. I'm asking what "Tool>Install package" means. Are those buttons you're clicking in an application? If so, what application?

train model

Thanks for this! But it is not a reproducible example.

Crucially... what does train contain? Much of LightGBM's behavior (like any machine learning framework) is dependent on the size, shape, and distribution of the input data.

For example, based only on the error message you've provided, I can think of a few possibilities:

  • your data has features with huge feature names
  • your data has a very large number of rows
  • you have a very large number of features

If you can't provide a reproducible example, can you please at least show the code you used to construct train? Including any code for reading in data from files, databases, etc.

And report the size of the dataset (number of rows, number of columns, exact feature names if there are any).

jameslamb avatar Jan 27 '24 05:01 jameslamb

Yes, I click those buttons in RStudio to install the lightgbm package.

data <- readRDS("D:/data.rds")
dtrain <- lgb.Dataset(data = as.matrix(data[, -1]), label = data[, 1])
model <- lgb.train(params = list(objective = "regression", num_iterations = 100, metric = "l2", min_data = 1L, min_data_in_bin = 100, min_gain_to_split = 10), data = dtrain, nrounds = 100)

lgb_imp <- lgb.importance(model)

My dataset: 4,610,000 rows, 21 columns.

Chuang1128 avatar Jan 27 '24 06:01 Chuang1128

Hi @jameslamb, I'm having the same issue when running lightgbm::lgb.model.dt.tree(lgb_model), and I believe the culprit is this call: lgb.dump(booster = model, num_iteration = num_iteration). My dataset is also very large, with a large number of rows and columns, and is fit with a complex model (i.e., not easy to share).

It looks like lgb.dump is trying to return a single long character string when num_iteration = NULL. I wonder if instead it could be built up as a large list or something, e.g. lapply(1:booster$current_iter(), booster$dump_model)?

Edit: Never mind, the above won't work because dump_model returns everything up to the selected iteration, not just that single iteration. For context, I'm trying to run this through treeshap::unify().

p-schaefer avatar Mar 07 '24 14:03 p-schaefer

I'm closing this in favor of #6380, which describes the same problem thoroughly with a reproducible example. Let's please focus there.

jameslamb avatar Mar 22 '24 14:03 jameslamb