LightGBM
Feature Requests & Voting Hub
This issue maintains all feature requests on one page.
Note to contributors: If you want to work on a requested feature, re-open the linked issue. Everyone is welcome to work on any of the issues below.
Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and add new entries here, with links to the issues. The one exception is issues marked `good first issue`; these should be left open so they are discoverable by new contributors.
Call for Voting
We would like to call a vote here to prioritize these requests. If a feature request is important to you, you can vote for it with the following process:
- Get the issue (feature request) number.
- Search for that number in this issue to check whether a vote for it already exists.
- If the vote exists, add a 👍 reaction to it.
- If the vote doesn't exist, create a new vote by replying to this thread and including the number in your reply.
Discussions
- Efficiency improvements (#2791)
- Accuracy improvements (#2790)
Efficiency related
- [ ] Lock-free Inference (#4290)
- [ ] Faster Split (data partition) (#2782)
- [ ] Numa-aware (#1441)
- [ ] Enable MM_PREFETCH and MM_MALLOC on aarch64 (#4124)
- [ ] Optimisations for Apple Silicon (#3606)
- [ ] Continue to accelerate `ConstructHistogram` (#2786)
- [ ] Accelerate the data loading from file (#2788)
- [ ] Accelerate the data loading from Python/R object (#2789)
- [ ] Fast feature grouping when the number of features is large (#4037)
- [ ] Allow training without loading full dataset into memory (#5094)
- [ ] Random number generation on CUDA (#5471)
- [x] Faster LambdaRank (#2701)
- [x] Improve efficiency of tree output renew for L1 regression with new CUDA version (#5459)
Effectiveness related
- [ ] Better Regularization for Categorical features (#1934)
- [ ] Raise a warning when missing values are found in `label` (#4483)
- [ ] Support monotone constraints with quantile objective (#3371)
- [ ] Pairwise Ranking/Scoring in LambdaMART (#6147)
Distributed platform and GPU (OpenCL-based and CUDA)
- [ ] YARN support (#790)
- [ ] Multiple GPU support (OpenCL version) (#620)
- [ ] GPU performance improvement (#768)
- [ ] Implement workaround for required folder permissions to compile kernels (#2955)
- [ ] Support for single precision float in CUDA version (#3836)
- [ ] Support Windows in CUDA version (#3837)
- [ ] Support LightGBM on macOS with real (possibly external) GPU device (#4333)
- [ ] multi-node, multi-GPU training with CUDA version (#5993)
- [x] Build Python wheels that support both GPU and CPU versions out of the box for non-Windows (#4684)
- [x] GPU binaries release (#2263)
- [x] Support GPU with the conda-forge package (#5419)
Maintenance
- [ ] Run tests with ClangCL and Visual Studio on our CI services (#5280)
- [ ] Missing values during prediction should throw an Exception if missing data wasn't present during training (#4040)
- [ ] Better document missing values behavior during prediction (#2921)
- [ ] Code refactoring (#2341)
- [ ] Support int64_t data_size_t (#2818)
- [ ] Unify output of `LGBM_BoosterDumpModel` and `LGBM_BoosterSaveModel` (#2604)
- [ ] More tests (#261, #3841)
- [ ] Publish `lib_lightgbm.dll` symbols to Microsoft Symbols Server (#1725)
- [ ] Enhance parameter tuning guide with more params and scenarios (suggested ranges) for different tasks/datasets (#2617)
- [ ] Better documentation for loss functions (#4790)
- [ ] Add a page for listing related projects and links (#4576, original discussion: #4529)
- [ ] Add BoostFromAverage value as an attribute to LightGBM models (#4313)
- [ ] Regression tests on Dataset format (#4406)
- [ ] Regression tests on model files (#4407)
- [ ] Better warning information when the splitting of a tree is stopped early (#4649)
- [ ] Adding checks for nullptr in the code (#5085)
- [ ] Add support for CRLF line endings or improve documentation and error message (#5508)
- [ ] Support more build customizations in Conan recipe (#5770)
- [ ] Support C++20 (#6033)
- [ ] Add ability to predict on `Dataset` (#2666, #6613, #6285)
- [ ] Add support for deploying to Android / iOS (#6592)
- [x] Refactor `CMakeLists.txt` so that it is possible to build cpp tests with different options, e.g. with OpenMP support (#4125)
- [x] Ensure consistent behavior when multiple parameter aliases are given (#5304)
- [x] Remove unused-command-line-argument warning with Apple Clang (#1805)
- [x] CI via GitHub actions (#2353)
- [x] Debug flag in CMake configuration (#1588)
- [x] Fix cpp lint problems (#1990)
Python package:
- [ ] Refine pandas support (#960)
- [ ] Refine categorical feature support (#1021)
- [ ] Auto early stopping in Sklearn API (#3313)
- [ ] Refactor sklearn wrapper after stabilizing upstream API, public API compatibility tests and official documentation (also after `HistGradientBoosting` matures) (#2966, #2628)
- [ ] Keep constants in sync with C++ library (#4321)
- [ ] Allowing custom objective / metric function with "objective" and "metric" parameters (#3244; see the sketch after this list)
- [ ] Replace calls of POINTER() by byref() in Python interface to pass data arrays (#4298)
- [ ] `staged_predict()` in the scikit-learn API (#5031)
- [ ] Make `Dataset` pickleable (#5098)
- [ ] Accept `polars` input (#6204)
- [x] Add `feature_names_in_` and related APIs to `scikit-learn` estimators (#6279)
- [x] Load back saved parameters with save_model to Booster object (#2613)
- [x] Check input for prediction (#812, #3626)
- [x] Migrate to `parametrize_with_checks` for scikit-learn integration tests (#2947)
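For context on #3244 above: the scikit-learn wrapper already accepts a callable for `objective` (a function returning the gradient and hessian), while the request is to also allow callables for "objective" and "metric" inside the plain params dict. A minimal sketch of the existing callable-objective support, for comparison (toy data and function names are illustrative only):

```python
import numpy as np
from lightgbm import LGBMRegressor

# Toy data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

def l2_objective(y_true, y_pred):
    # Gradient and hessian of 0.5 * (y_pred - y_true)^2.
    return y_pred - y_true, np.ones_like(y_true)

model = LGBMRegressor(objective=l2_objective, n_estimators=20)
model.fit(X, y)
```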
R package:
- [ ] Rewrite R demos (#1944)
- [ ] `lgb.convert_with_rules()` should validate rules (#2682)
- [ ] Reduce duplication in Makevars.in, Makevars.win (#3249)
- [ ] Add an R GPU job in CI (#3780)
- [ ] Improve portability of OpenMP checks in R-package configure on macOS (#4537)
- [x] Add CI job testing R package on Windows with UCRT toolchain (#4881)
- [x] Load back saved parameters with `save_model` to Booster object (#2613)
- [x] Use macOS 11.x in R 4.x CI jobs (#4990)
- [x] Add a CI job running `rchk` (#4400)
- [x] Factor out custom R interface to lib_lightgbm (#3016)
- [x] Use `commandArgs` instead of hardcoded stuff in the installation script (#2441)
- [x] `lgb.convert` functions should convert columns of type 'logical' (#2678)
- [x] `lgb.convert` functions should warn on unconverted columns of unsupported types (#2681)
- [x] `lgb.prepare()` and `lgb.prepare2()` should be simplified (#2683)
- [x] `lgb.prepare_rules()` and `lgb.prepare_rules2()` should be simplified (#2684)
- [x] Remove `lgb.prepare()` and `lgb.prepare_rules()` (#3075)
- [x] CRAN-compliant installation configuration (#2960)
- [x] Add tests on R 4.0 (#3024)
- [x] Add pkgdown documentation support (#1143)
- [x] Cover 100% of R-to-C++ calls in R unit tests (#2944)
- [x] Bump version of pkgdown (#3036)
- [x] Run R CI in Windows environment (#2335)
- [x] Add unit tests for best metric iteration/value (#2525)
- [x] Standardize R code on comma-first (#2373)
- [x] Add additional linters to CI (#2477)
- [x] Support roxygen 7.0.0+ (#2569)
- [x] Run R CI in Linux and Mac environments (#2335)
New features
- [ ] More platforms support (#1129, #4736)
- [ ] CoreML support (#1074)
- [ ] Object importance (#1460)
- [ ] Include init_score in predict function (#1978)
- [ ] Hyper-parameter per feature/column (#1938)
- [ ] Extracting decision path (#2187)
- [ ] Support for extremely large model (#2265, #3858)
- [ ] Allow LightGBM to be easily used in external projects via modern CMake style with `find_package` and `target_link_libraries` (#4067, #3925)
- [ ] Recalculate feature importance during the update process of a tree model (#2413)
- [ ] Merge Dataset objects on condition that they hold same binmapper (#2579)
- [ ] Spike and slab feature sampling priors (feature weighted sampling) (#2542)
- [ ] Customizable early stopping tolerance (#2526)
- [ ] Stop training branch of tree once a specific feature is used (#2518)
- [ ] Subsampling rows with replacement (#1038)
- [ ] Arbitrary base learner (#3180)
- [ ] Different quantization techniques (#3707)
- [ ] SHAP feature contribution for linear trees (#4002)
- [ ] [SWIG] Add support for int64_t ChunkedArray (#4091)
- [ ] Monotonicity in quantile regression (#3447, #4201)
- [ ] Add approx_contrib option for feature contributions (#4219)
- [ ] Support forced splits with data and voting parallel versions of LightGBM (#4260)
- [ ] Support ignoring some features during training on constructed dataset (#4317)
- [ ] Using random uniform sentinel features to avoid overfitting (#4622)
- [ ] Allow specifying probability measure for features (#4605)
- [ ] extra_trees by feature (#4700)
- [ ] Compute partial dependencies from learned trees (#4578)
- [ ] Boosting a linear model (Single-leaf trees with one-variable linear models in roots, like gblinear) (#4459)
- [ ] Exact control with `min_child_sample` (#5236)
- [ ] WebAssembly support (#5372)
- [ ] Support custom objective in refit (#5609)
- [ ] Multiple trees in a single boosting round (#6294)
- [ ] Allow JSON special characters in feature names (#6202)
- [x] Decouple boosting types (#3128, #2991)
- [x] Expose number of bins used by the model while binning continuous features to C API (#3406)
- [x] Add C API function that returns all parameter names with their aliases (#2633)
- [x] Pre-defined bin_upper_bounds (#1829)
- [x] Setup editorconfig (#2401)
- [x] Colsample by node (#2315)
- [x] Smarter Backoffs for MPI ring connection (#2348)
- [x] UTF-8 support for model file (#2478)
New algorithms:
- [ ] Regularized Greedy Forest (#315)
- [ ] Accelerated Gradient Boosting (#1257)
- [ ] Multi-Layered Gradient Boosting Decision Trees (#1423)
- [ ] Adaptive neural tree (#1542)
- [ ] Probabilistic Forecasting (#3200)
- [ ] Probabilistic Random Forest (#1946)
- [ ] Sparrow (#2001)
- [ ] Minimal Variance Sampling (MVS) in Stochastic Gradient Boosting (#2644)
- [ ] Investigate possibility of borrowing some features/ideas from Explainable Boosted Machines (#3905)
- [ ] Feature Cycling as an option instead of Random Feature Sampling (#4066)
- [ ] Periodic Features (#4281)
- [ ] GPBoost (#2790)
- [x] Piece-wise linear tree (#1315)
- [x] Extremely randomized trees (#2583)
Objective and metric functions:
- [ ] Multi-output regression (#524)
- [ ] Earth Mover Distance (#1256)
- [ ] Cox Proportional Hazard Regression (#1837)
- [ ] Native support for Focal Loss (#3706)
- [ ] Ranking metric for regression objective (#1911)
- [ ] Density estimation (#2056)
- [ ] Adding correlation metrics (#4209)
- [ ] Add parameter to control maximum group size for Lambdarank (#5053)
- [x] Precision recall AUC (#3026)
- [x] AUC Mu (#2344)
Python package:
- [ ] Support access to constructed Dataset (#5191)
- [ ] Support complex data types in categorical columns of pandas DataFrame (#2134)
- [ ] First-class support for different data types (do not convert everything to float32/64 to save memory) (#3459, #3386)
- [ ] Efficient native support of pandas.DataFrame with mixed dense and sparse columns (#4153)
- [ ] Include init_score on the Python Booster class (#4065)
- [ ] Better support for Tree Plot with multi class (#3061)
- [ ] Support specifying number of iterations in dataset evaluation (#4210)
- [ ] Compute metrics not on each iteration but with some fixed step (#4107)
- [x] Support saving and loading CVBooster (#3556)
- [x] Add a function to plot tree with a case (#4784)
- [x] Allow custom loggers that don't inherit from `logging.Logger` (#4783)
- [x] Ensure all callbacks are pickleable (#5080)
- [x] Add support for pandas nullable types to the sklearn api (#4173)
- [x] Support weight in refit (#3038)
- [x] Keep cv predicted values (#283)
- [x] Feature importance in CV (#1445)
- [x] Log redirect in python (#1493)
- [x] Make _CVBooster public for better stacking experience (#2105)
Dask:
- [ ] Investigate how the gap between local and Dask predictions can be decreased (#3835)
- [ ] Allow customization of `num_threads` (#3714)
- [ ] Add support for early stopping (#3712)
- [ ] Support `init_model` (#4063)
- [ ] Make Dask training resilient to worker restarts during network setup (#3775)
- [ ] GPU support (#3776)
- [ ] Support MPI in Dask (#3831)
- [ ] Support more operating systems (#3782)
- [ ] Add `LGBMModel` (#3845)
- [ ] Add `train()` function (#3846)
- [ ] Add `cv()` function (#3847)
- [ ] Support asynchronous workflows (#3929)
- [ ] Add `DaskDataset` (#3944)
- [ ] Enable larger `pred_contrib` results for multiclass classification with sparse matrices (#4438)
- [ ] Use or return all workers' eval_set evaluation data (#4392)
- [ ] Drop 'not evaluated' placeholder from dask.py (#4393)
- [x] Support custom objective functions (#3934)
- [x] Resolve differences in result shape between `DaskLGBMClassifier.predict()` and `LGBMClassifier.predict()` (#3881)
- [x] Support custom evaluation metrics (#3956)
- [x] Support all LightGBM parallel tree learners (#3834)
- [x] Support `raw_score` in `predict()` (#3793)
- [x] Support all LightGBM boosting types (#3896)
- [x] Tutorial documentation (#3814)
- [x] Document how to save a Dask model (#3838)
- [x] Support `init_score` (#3807)
- [x] Search for ports only once per IP (#3768)
- [x] Support `pred_leaf` in `predict()` (#3792)
- [x] Decide and document how users should provide a Dask client at training time (#3808)
- [x] Use dictionaries instead of tuples for parts (#3795)
- [x] Remove testing dependency on dask-ml (#3796)
- [x] Support `pred_contrib` in `predict()` (#3713)
- [x] Add support for LGBMRanker (#3708)
- ~Support DataTable in Dask (#3830)~
R package:
- [ ] Add support for specifying training indices in `lgb.cv()` (#3924)
- [ ] Export callback functions (#2479)
- [ ] Plotting in R-package (#1222)
- [ ] Add support for saving weight values of a node in the R-package (#2281)
- [ ] Check parameters in `cb.reset.parameters()` (#2665)
- [ ] Refit method for R-package (#2369)
- [ ] Add the ability to predict on `lgb.Dataset` in `Predictor$predict()` (#2666)
- [ ] Allow use of MPI from the R package (#3364)
- [ ] Allow data to live in memory mapped file (#2184)
- [ ] Add GPU support for CRAN package (#3206)
- [ ] Add CUDA support for CRAN package (#3465)
- [ ] Add CUDA support for CMake-based package (#5378)
- [ ] Add function to generate a list of parameters (#4195)
- [ ] Accept data frames as inputs (#4323)
- [ ] Upgrade documentation site to `pkgdown >2.0` (#4859)
- [ ] Check size of custom objective function output (#4905)
- [ ] Support CSR-format sparse matrices (#4966)
- [x] Add flag for displaying train loss for `lgb.cv()` (#4911)
- [x] Work directly with `readRDS()` and `saveRDS()` (#4296)
- [x] Support trees with linear models at leaves (#3319)
- [x] Add support for non-ASCII feature names (#2983)
- [x] Release to CRAN (#629)
- [x] Exclude training data from being checked for early stopping (#2472)
- [x] first_metric_only parameter for R-package (#2368)
- [x] Build a 32-bit version of LightGBM for the R package (#3187)
- [x] Ability to control the printed messages (#1440)
New language wrappers:
- [ ] MATLAB support (#743)
- [ ] Java support (like xgboost4j) (#909)
- [ ] Go support (predict part can be already found in https://github.com/dmitryikh/leaves package) (#2515)
- [x] Ruby support (#2367)
Input enhancements:
- [ ] Streaming data allocation (improve the sparse streaming support and expose `ChunkedArray` in C API) (#3995, https://github.com/microsoft/LightGBM/pull/3997#issuecomment-791969953)
- [ ] String as categorical input directly (#789)
- [ ] AWS S3 support (#1039)
- [ ] H2O datatable direct support (not via the `to_numpy()` method as it currently is) (#2003)
- [ ] Multiple files as input (#2031)
- [ ] Parquet file support (#1286)
- [ ] Enable use of constructed Dataset in predict() methods (#4546, #1939, #6285)
- [ ] Support scipy sparse arrays (#6352)
- [x] Apache Arrow support (#3369)
- [x] Validation dataset creation via Sequence (#4184)
There’s a reference to minimum variance sampling here:
https://catboost.ai/docs/concepts/algorithm-main-stages_bootstrap-options.html
Although I think it just speeds up training rather than providing out-of-core training.
I would like to tackle the following issues in the Python package. Could we discuss a plan for fixing them? Also, where is the best place to discuss that? IMHO, they can be resolved by improving the `lightgbm.cv()` function.
- #2105: Make _CVBooster public for better stacking experience
- #283: Keep cv predicted values
I want to reopen the above issues, but I cannot do that. Maybe I don't have permission.
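As an illustration of what keeping CV predicted values could look like, here is a minimal sketch (assuming LightGBM >= 3.1, where `lightgbm.cv()` accepts `return_cvbooster=True`, added for #3556) that assembles out-of-fold predictions from the per-fold boosters:

```python
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=500, n_features=10, random_state=42)
folds = list(KFold(n_splits=5, shuffle=True, random_state=42).split(X))

cv_results = lgb.cv(
    {"objective": "regression", "verbosity": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=50,
    folds=folds,
    return_cvbooster=True,
)

# Stitch together out-of-fold predictions from the per-fold boosters.
oof = np.zeros(len(y))
for booster, (_, valid_idx) in zip(cv_results["cvbooster"].boosters, folds):
    oof[valid_idx] = booster.predict(X[valid_idx])
```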
@momijiame Thank you for your interest! I've unlocked those issues for commenting. Please let's continue the discussion there.
> We would like to call a vote here to prioritize these requests.
Let me start.
#2644 (it was proposed by me, so I'm a little bit biased)
Decouple boosting types #3128
GPU binaries release #2263
Enhance parameter tuning guide with more params #2617
Subsampling rows with replacement #1038
Piece-wise linear tree #1315 (also see PR https://github.com/microsoft/LightGBM/pull/3299)
Multi-output regression #524
Cox Proportional Hazard Regression #1837
Based on https://github.com/microsoft/LightGBM/issues/2983#issuecomment-722630931, I've updated this issue's description:
> Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and add new entries here, with links to the issues. The one exception is issues marked `good first issue`; these should be left open so they are discoverable by new contributors.
I think that we should keep `good first issue` issues open, so it's easy for new contributors to find them.
Read from multiple files #2031
Parquet file support #1286
Register custom objective / loss function #3244
Object importance #1460
Read from multiple zipped LibSVM-format text files
Multiple GPU support #620
Multiple GPU support (#620) (from my experience, XGBoost with GPU seems faster than LightGBM with GPU)
For everyone who was voting for multi-GPU support: please try our new experimental CUDA version, which was kindly contributed by our friends from IBM. This version supports multi-GPU training. We would really appreciate any early feedback on this experimental feature (please create new issues, do not comment here).
How to install: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version-experimental.
Argument to specify number of GPUs: https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_gpu.
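For reference, a minimal sketch of what training with the experimental CUDA version might look like; the `device_type` and `num_gpu` parameter names are taken from the linked docs, and this assumes LightGBM was built with CUDA support as described in the installation guide:

```python
import numpy as np
import lightgbm as lgb

# Toy binary classification data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
y = (X[:, 0] > 0).astype(int)

params = {
    "objective": "binary",
    "device_type": "cuda",  # requires a build compiled with CUDA support
    "num_gpu": 2,           # number of GPUs to use for training
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```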
Support ignoring some features during training on constructed dataset #4317
Spike and slab feature sampling priors (feature weighted sampling) #2542
Quantile LightGBM: ensure monotonic #3447
SHAP feature contribution for linear trees #4002
Create dataset from pyarrow tables: #3369
Add support for CRLF line endings or improve documentation and error message #5508
Add parameter to control maximum group size for Lambdarank #5053
Allow training without loading full dataset into memory #5094