
Feature Requests & Voting Hub

guolinke opened this issue 4 years ago • 43 comments

This issue is to maintain all feature requests on one page.

Note to contributors: If you want to work on a requested feature, re-open the linked issue. Everyone is welcome to work on any of the issues below.

Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and add new entries here, with links to the original issues. The one exception is issues marked good first issue; these should be left open so they are discoverable by new contributors.

Call for Voting

We would like to hold a vote here to prioritize these requests. If a feature request is important to you, you can vote for it with the following process:

  1. Get the issue (feature request) number.
  2. Search for that number in this issue to check whether a vote for it already exists.
  3. If the vote exists, add 👍 to it.
  4. If it doesn't, create a new vote by replying to this thread and including the number in your reply.

Discussions

  • Efficiency improvements (#2791)
  • Accuracy improvements (#2790)

Efficiency related

  • [x] Faster LambdaRank (#2701)
  • [ ] Lock-free Inference (#4290)
  • [ ] Faster Split (data partition) (#2782)
  • [ ] NUMA-aware (#1441)
  • [ ] Enable MM_PREFETCH and MM_MALLOC on aarch64 (#4124)
  • [ ] Optimisations for Apple Silicon (#3606)
  • [ ] Continue accelerating ConstructHistogram (#2786)
  • [ ] Accelerate data loading from file (#2788)
  • [ ] Accelerate data loading from Python/R objects (#2789)
  • [ ] Fast feature grouping when the number of features is large (#4037)
  • [ ] Allow training without loading full dataset into memory (#5094)
  • [x] Improve efficiency of tree output renew for L1 regression with new CUDA version (#5459)
  • [ ] Random number generation on CUDA (#5471)

Effectiveness related

  • [ ] Better Regularization for Categorical features (#1934)
  • [ ] Raise a warning when missing values are found in label (#4483)
  • [ ] Support monotone constraints with quantile objective (#3371)
  • [ ] Pairwise Ranking/Scoring in LambdaMART (#6147)

Distributed platform and GPU (OpenCL-based and CUDA)

  • [ ] YARN support (#790)
  • [ ] Multiple GPU support (OpenCL version) (#620)
  • [ ] GPU performance improvement (#768)
  • [ ] Implement workaround for required folder permissions to compile kernels (#2955)
  • [ ] GPU binaries release (#2263)
  • [ ] Build Python wheels that support both GPU and CPU versions out of the box for non-Windows (#4684)
  • [ ] Support for single precision float in CUDA version (#3836)
  • [ ] Support Windows in CUDA version (#3837)
  • [ ] Support LightGBM on macOS with real (possibly external) GPU device (#4333)
  • [ ] Support GPU with the conda-forge package (#5419)
  • [ ] Multi-node, multi-GPU training with CUDA version (#5993)

Maintenance

  • [ ] Run tests with ClangCL and Visual Studio on our CI services (#5280)
  • [ ] Missing values during prediction should throw an Exception if missing data wasn't present during training (#4040)
  • [ ] Better document missing values behavior during prediction (#2921)
  • [ ] Code refactoring (#2341)
  • [ ] Refactor CMakeLists.txt so that it will be possible to build cpp tests with different options, e.g. with OpenMP support (#4125)
  • [ ] Support int64_t data_size_t (#2818)
  • [ ] Unify output of LGBM_BoosterDumpModel and LGBM_BoosterSaveModel (#2604)
  • [ ] More tests (#261, #3841)
  • [ ] Publish lib_lightgbm.dll symbols to Microsoft Symbols Server (#1725)
  • [ ] Enhance parameter tuning guide with more params and scenarios (suggested ranges) for different tasks/datasets (#2617)
  • [ ] Better documentation for loss functions (#4790)
  • [ ] Add a page for listing related projects and links (#4576, original discussion: #4529)
  • [ ] Add BoostFromAverage value as an attribute to LightGBM models (#4313)
  • [ ] Regression tests on Dataset format (#4406)
  • [ ] Regression tests on model files (#4407)
  • [ ] Better warning information when the splitting of a tree is stopped early (#4649)
  • [ ] Add checks for nullptr in the code (#5085)
  • [x] Ensure consistent behavior when multiple parameter aliases given (#5304)
  • [ ] Add support for CRLF line endings or improve documentation and error message (#5508)
  • [ ] Support more build customizations in Conan recipe (#5770)
  • [ ] Support C++20 (#6033)
  • [x] Remove unused-command-line-argument warning with Apple Clang (#1805)
  • [x] CI via GitHub actions (#2353)
  • [x] Debug flag in CMake configuration (#1588)
  • [x] Fix cpp lint problems (#1990)

Python package:

  • [x] Load back saved parameters with save_model to Booster object (#2613)
  • [x] Check input for prediction (#812, #3626)
  • [ ] Refine pandas support (#960)
  • [ ] Refine categorical feature support (#1021)
  • [ ] Auto early stopping in Sklearn API (#3313)
  • [ ] Refactor sklearn wrapper after stabilizing upstream API, public API compatibility tests and official documentation (also after maturing HistGradientBoosting) (#2966, #2628)
  • [ ] Keep constants in sync with C++ library (#4321)
  • [ ] Allow custom objective / metric functions with "objective" and "metric" parameters (#3244)
  • [ ] Replace calls of POINTER() by byref() in Python interface to pass data arrays (#4298)
  • [ ] staged_predict() in the scikit-learn API (#5031)
  • [ ] Make Dataset pickleable (#5098)
  • [ ] Accept polars input (#6204)
  • [ ] Add feature_names_in_ and related APIs to scikit-learn estimators (#6279)
  • [x] Migrate to parametrize_with_checks for scikit-learn integration tests (#2947)

R package:

  • [ ] Rewrite R demos (#1944)
  • [ ] lgb.convert_with_rules() should validate rules (#2682)
  • [ ] Reduce duplication in Makevars.in, Makevars.win (#3249)
  • [ ] Add an R GPU job in CI (#3780)
  • [ ] Improve portability of OpenMP checks in R-package configure on macOS (#4537)
  • [x] Add CI job testing R package on Windows with UCRT toolchain (#4881)
  • [x] Load back saved parameters with save_model to Booster object (#2613)
  • [x] Use macOS 11.x in R 4.x CI jobs (#4990)
  • [x] Add a CI job running rchk (#4400)
  • [x] Factor out custom R interface to lib_lightgbm (#3016)
  • [x] Use commandArgs instead of hardcoded stuff in the installation script (#2441)
  • [x] lgb.convert functions should convert columns of type 'logical' (#2678)
  • [x] lgb.convert functions should warn on unconverted columns of unsupported types (#2681)
  • [x] lgb.prepare() and lgb.prepare2() should be simplified (#2683)
  • [x] lgb.prepare_rules() and lgb.prepare_rules2() should be simplified (#2684)
  • [x] Remove lgb.prepare() and lgb.prepare_rules() (#3075)
  • [x] CRAN-compliant installation configuration (#2960)
  • [x] Add tests on R 4.0 (#3024)
  • [x] Add pkgdown documentation support (#1143)
  • [x] Cover 100% of R-to-C++ calls in R unit tests (#2944)
  • [x] Bump version of pkgdown (#3036)
  • [x] Run R CI in Windows environment (#2335)
  • [x] Add unit tests for best metric iteration/value (#2525)
  • [x] Standardize R code on comma-first (#2373)
  • [x] Add additional linters to CI (#2477)
  • [x] Support roxygen 7.0.0+ (#2569)
  • [x] Run R CI in Linux and Mac environments (#2335)

New features

  • [ ] CoreML support (#1074)
  • [ ] More platforms support (#1129, #4736)
  • [ ] Object importance (#1460)
  • [ ] Include init_score in predict function (#1978)
  • [ ] Hyper-parameter per feature/column (#1938)
  • [ ] Extracting decision path (#2187)
  • [ ] Support for extremely large model (#2265, #3858)
  • [ ] Allow LightGBM to be easily used in external projects via modern CMake style with find_package and target_link_libraries (#4067, #3925)
  • [ ] Recalculate feature importance during the update process of a tree model (#2413)
  • [ ] Merge Dataset objects on condition that they hold same binmapper (#2579)
  • [ ] Spike and slab feature sampling priors (feature weighted sampling) (#2542)
  • [ ] Customizable early stopping tolerance (#2526)
  • [ ] Stop training branch of tree once a specific feature is used (#2518)
  • [ ] Subsampling rows with replacement (#1038)
  • [ ] Arbitrary base learner (#3180)
  • [x] Decouple boosting types (#3128, #2991)
  • [ ] Different quantization techniques (#3707)
  • [ ] SHAP feature contribution for linear trees (#4002)
  • [ ] [SWIG] Add support for int64_t ChunkedArray (#4091)
  • [ ] Monotonicity in quantile regression (#3447, #4201)
  • [ ] Add approx_contrib option for feature contributions (#4219)
  • [ ] Support forced splits with data and voting parallel versions of LightGBM (#4260)
  • [ ] Support ignoring some features during training on constructed dataset (#4317)
  • [ ] Using random uniform sentinel features to avoid overfitting (#4622)
  • [ ] Allow specifying probability measure for features (#4605)
  • [ ] extra_trees by feature (#4700)
  • [ ] Compute partial dependencies from learned trees (#4578)
  • [ ] Boosting a linear model (Single-leaf trees with one-variable linear models in roots, like gblinear) (#4459)
  • [ ] Exact control of min_child_sample (#5236)
  • [ ] WebAssembly support (#5372)
  • [ ] Support custom objective in refit (#5609)
  • [ ] Multiple trees in a single boosting round (#6294)
  • [x] Expose number of bins used by the model while binning continuous features to C API (#3406)
  • [x] Add C API function that returns all parameter names with their aliases (#2633)
  • [x] Pre-defined bin_upper_bounds (#1829)
  • [x] Setup editorconfig (#2401)
  • [x] Colsample by node (#2315)
  • [x] Smarter Backoffs for MPI ring connection (#2348)
  • [x] UTF-8 support for model file (#2478)

New algorithms:

  • [ ] Regularized Greedy Forest (#315)
  • [ ] Accelerated Gradient Boosting (#1257)
  • [ ] Multi-Layered Gradient Boosting Decision Trees (#1423)
  • [ ] Adaptive neural tree (#1542)
  • [ ] Probabilistic Forecasting (#3200)
  • [ ] Probabilistic Random Forest (#1946)
  • [ ] Sparrow (#2001)
  • [ ] Minimal Variance Sampling (MVS) in Stochastic Gradient Boosting (#2644)
  • [ ] Investigate possibility of borrowing some features/ideas from Explainable Boosted Machines (#3905)
  • [ ] Feature Cycling as an option instead of Random Feature Sampling (#4066)
  • [ ] Periodic Features (#4281)
  • [ ] GPBoost (#2790)
  • [x] Piece-wise linear tree (#1315)
  • [x] Extremely randomized trees (#2583)

Objective and metric functions:

  • [ ] Multi-output regression (#524)
  • [ ] Earth Mover Distance (#1256)
  • [ ] Cox Proportional Hazard Regression (#1837)
  • [ ] Native support for Focal Loss (#3706)
  • [ ] Ranking metric for regression objective (#1911)
  • [ ] Density estimation (#2056)
  • [ ] Adding correlation metrics (#4209)
  • [ ] Add parameter to control maximum group size for Lambdarank (#5053)
  • [x] Precision recall AUC (#3026)
  • [x] AUC Mu (#2344)

Python package:

  • [ ] Support access to constructed Dataset (#5191)
  • [ ] Support complex data types in categorical columns of pandas DataFrame (#2134)
  • [ ] First-class support for different data types (do not convert everything to float32/64 to save memory) (#3459, #3386)
  • [ ] Efficient native support of pandas.DataFrame with mixed dense and sparse columns (#4153)
  • [ ] Include init_score on the Python Booster class (#4065)
  • [ ] Better support for Tree Plot with multi class (#3061)
  • [ ] Support specifying number of iterations in dataset evaluation (#4210)
  • [ ] Compute metrics not on each iteration but with some fixed step (#4107)
  • [x] Support saving and loading CVBooster (#3556)
  • [x] Add a function to plot tree with a case (#4784)
  • [x] Allow custom loggers that don't inherit from logging.Logger (#4783)
  • [x] Ensure all callbacks are pickleable (#5080)
  • [x] Add support for pandas nullable types to the sklearn api (#4173)
  • [x] Support weight in refit (#3038)
  • [x] Keep cv predicted values (#283)
  • [x] Feature importance in CV (#1445)
  • [x] Log redirect in python (#1493)
  • [x] Make _CVBooster public for better stacking experience (#2105)

Dask:

  • [ ] Investigate how the gap between local and Dask predictions can be decreased (#3835)
  • [ ] Allow customization of num_threads (#3714)
  • [ ] Add support for early stopping (#3712)
  • [ ] Support init_model (#4063)
  • [ ] Make Dask training resilient to worker restarts during network setup (#3775)
  • [ ] GPU support (#3776)
  • [ ] Support MPI in Dask (#3831)
  • [ ] Support more operating systems (#3782)
  • [ ] Add LGBMModel (#3845)
  • [ ] Add train() function (#3846)
  • [ ] Add cv() function (#3847)
  • [ ] Support asynchronous workflows (#3929)
  • [ ] Add DaskDataset (#3944)
  • [ ] Enable larger pred_contrib results for multiclass classification with sparse matrices (#4438)
  • [ ] Use or return all workers eval_set evaluation data (#4392)
  • [ ] Drop 'not evaluated' placeholder from dask.py (#4393)
  • [x] Support custom objective functions (#3934)
  • [x] Resolve differences in result shape between DaskLGBMClassifier.predict() and LGBMClassifier.predict() (#3881)
  • [x] Support custom evaluation metrics (#3956)
  • [x] Support all LightGBM parallel tree learners (#3834)
  • [x] Support raw_score in predict() (#3793)
  • [x] Support all LightGBM boosting types (#3896)
  • [x] Tutorial documentation (#3814)
  • [x] Document how to save a Dask model (#3838)
  • [x] Support init_score (#3807)
  • [x] Search for ports only once per IP (#3768)
  • [x] Support pred_leaf in predict() (#3792)
  • [x] Decide and document how users should provide a Dask client at training time (#3808)
  • [x] Use dictionaries instead of tuples for parts (#3795)
  • [x] Remove testing dependency on dask-ml (#3796)
  • [x] Support 'pred_contrib' in predict() (#3713)
  • [x] Add support for LGBMRanker (#3708)
  • ~Support DataTable in Dask (#3830)~

R package:

  • [ ] Add support for specifying training indices in lgb.cv() (#3924)
  • [ ] Export callback functions (#2479)
  • [ ] Plotting in R-package (#1222)
  • [ ] Add support for saving weight values of a node in the R-package (#2281)
  • [ ] Check parameters in cb.reset.parameters() (#2665)
  • [ ] Refit method for R-package (#2369)
  • [ ] Add the ability to predict on lgb.Dataset in Predictor$predict() (#2666)
  • [ ] Allow use of MPI from the R package (#3364)
  • [ ] Allow data to live in memory mapped file (#2184)
  • [ ] Add GPU support for CRAN package (#3206)
  • [ ] Add CUDA support for CRAN package (#3465)
  • [ ] Add CUDA support for CMake-based package (#5378)
  • [ ] Add function to generate a list of parameters (#4195)
  • [ ] Accept data frames as inputs (#4323)
  • [ ] Upgrade documentation site to pkgdown >2.0 (#4859)
  • [ ] Check size of custom objective function output (#4905)
  • [ ] Support CSR-format sparse matrices (#4966)
  • [x] Add flag of displaying train loss for lgb.cv() (#4911)
  • [x] Work directly with readRDS() and saveRDS() (#4296)
  • [x] Support trees with linear models at leaves (#3319)
  • [x] Add support for non-ASCII feature names (#2983)
  • [x] Release to CRAN (#629)
  • [x] Exclude training data from being checked for early stopping (#2472)
  • [x] first_metric_only parameter for R-package (#2368)
  • [x] Build a 32-bit version of LightGBM for the R package (#3187)
  • [x] Ability to control the printed messages (#1440)

New language wrappers:

  • [ ] MATLAB support (#743)
  • [ ] Java support (like xgboost4j) (#909)
  • [ ] Go support (the predict part can already be found in the https://github.com/dmitryikh/leaves package) (#2515)
  • [x] Ruby support (#2367)

Input enhancements:

  • [ ] Streaming data allocation (improve the sparse streaming support and expose ChunkedArray in C API) (#3995, https://github.com/microsoft/LightGBM/pull/3997#issuecomment-791969953)
  • [ ] String as categorical input directly (#789)
  • [ ] AWS S3 support (#1039)
  • [ ] H2O datatable direct support (not via the to_numpy() method as it currently is) (#2003)
  • [ ] Multiple file as input (#2031)
  • [ ] Parquet file support (#1286)
  • [x] Apache Arrow support (#3369)
  • [ ] Enable use of constructed Dataset in predict() methods (#4546, #1939, #6285)
  • [ ] Support scipy sparse arrays (#6352)
  • [x] Validation dataset creation via Sequence (#4184)

guolinke commented on Aug 01 '19

There’s a reference to minimum variance sampling here:

https://catboost.ai/docs/concepts/algorithm-main-stages_bootstrap-options.html

Although I think it just speeds up training rather than providing out-of-core training.

onacrame commented on Oct 20 '19

I would like to tackle the following issues in the Python package. Could I discuss a plan for fixing them, and where should that discussion happen? IMHO, they can be resolved by improving the lightgbm.cv() function.

#2105: Make _CVBooster public for better stacking experience
#283: Keep cv predicted values

I want to reopen the above issues, but I cannot do that. Maybe I don't have permission.
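For context, here is a minimal sketch of the stacking workflow these two issues would enable, assuming a LightGBM version where lgb.cv() supports return_cvbooster=True; the data below is a placeholder.

```python
import lightgbm as lgb
import numpy as np

# Placeholder data standing in for a real stacking setup.
rng = np.random.default_rng(0)
X, y = rng.random((500, 10)), rng.random(500)
X_test = rng.random((100, 10))

cv_results = lgb.cv(
    {"objective": "regression", "verbosity": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=50,
    nfold=5,
    return_cvbooster=True,
)

# CVBooster forwards method calls to each fold's Booster and returns a
# list of per-fold results; averaging the per-fold predictions is a
# common way to build stacking features.
cvbooster = cv_results["cvbooster"]
stacking_preds = np.mean(cvbooster.predict(X_test), axis=0)
```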

momijiame commented on Jun 07 '20

@momijiame Thank you for your interest! I've unlocked those issues for commenting. Let's continue the discussion there.

StrikerRUS commented on Jun 10 '20

We would like to hold a vote here to prioritize these requests. If a feature request is important to you, you can vote for it with the following process:

  1. Get the issue (feature request) number.
  2. Search for that number in this issue to check whether a vote for it already exists.
  3. If the vote exists, add 👍 to it.
  4. If it doesn't, create a new vote by replying to this thread and including the number in your reply.

guolinke commented on Aug 05 '20

We would like to hold a vote here

Let me start.

#2644

StrikerRUS commented on Aug 10 '20

It was proposed by me, so I'm a little bit biased:

Decouple boosting types #3128

candalfigomoro commented on Aug 16 '20

GPU binaries release #2263

candalfigomoro commented on Aug 16 '20

Enhance parameter tuning guide with more params #2617

candalfigomoro commented on Aug 16 '20

Subsampling rows with replacement #1038

candalfigomoro commented on Aug 16 '20

Piece-wise linear tree #1315 (also see PR https://github.com/microsoft/LightGBM/pull/3299)

candalfigomoro commented on Aug 16 '20

Multi-output regression #524

candalfigomoro commented on Aug 16 '20

Cox Proportional Hazard Regression #1837

candalfigomoro commented on Aug 16 '20

Based on https://github.com/microsoft/LightGBM/issues/2983#issuecomment-722630931, I've updated this issue's description:

Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and add new entries here, with links to the original issues. The one exception is issues marked good first issue; these should be left open so they are discoverable by new contributors.

I think that we should keep good first issue issues open, so it's easy for new contributors to find them.

jameslamb commented on Nov 07 '20

Read from multiple files #2031

gzls90 commented on Nov 12 '20

Parquet file support #1286

gzls90 commented on Nov 12 '20

Register custom objective / loss function #3244

gzls90 commented on Nov 12 '20

Object importance #1460

gzls90 commented on Nov 12 '20

Read from multiple zipped LibSVM-format text files

wenmin-wu commented on Dec 05 '20

Multiple GPU support #620

gzls90 commented on Dec 08 '20

Multiple GPU support (#620) (In my experience, XGBoost with GPU seems faster than LightGBM with GPU.)

7starsea commented on Dec 15 '20

For everyone who voted for multi-GPU support: please try our new experimental CUDA version, kindly contributed by our friends at IBM. This version supports multi-GPU training. We would really appreciate any early feedback on this experimental feature (please create new issues; do not comment here).

How to install: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version-experimental.

Argument to specify number of GPUs: https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_gpu.
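As a quick illustration, here is a minimal sketch of multi-GPU training with this experimental build, assuming LightGBM was compiled with CUDA support; the data is a placeholder, and num_gpu is the parameter documented at the link above.

```python
import lightgbm as lgb
import numpy as np

# Placeholder regression data; any dataset works for this sketch.
rng = np.random.default_rng(0)
X = rng.random((10_000, 20))
y = rng.random(10_000)

params = {
    "objective": "regression",
    "device_type": "cuda",  # requires a build with USE_CUDA enabled
    "num_gpu": 2,           # number of GPUs (CUDA version only)
    "verbosity": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```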

StrikerRUS commented on Jan 31 '21

Support ignoring some features during training on constructed dataset #4317

jz457365 commented on Jan 13 '22

Spike and slab feature sampling priors (feature weighted sampling) #2542

jz457365 commented on Jan 13 '22

Quantile LightGBM: ensure monotonic #3447

bethrice44 commented on Apr 26 '22

SHAP feature contribution for linear trees #4002

robo-sq commented on May 13 '22

Create dataset from pyarrow tables: #3369
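Since the hub above now marks #3369 as done, here is a minimal sketch of what this enables, assuming a LightGBM version recent enough to include the merged Arrow support; the table contents are placeholders.

```python
import lightgbm as lgb
import numpy as np
import pyarrow as pa

# Placeholder data; in practice the table might come from Parquet or Flight.
rng = np.random.default_rng(0)
table = pa.table({"f0": rng.random(1000), "f1": rng.random(1000)})
labels = pa.chunked_array([rng.integers(0, 2, 1000)])

# The Dataset is built directly from the Arrow table, without a pandas detour.
dataset = lgb.Dataset(table, label=labels)
booster = lgb.train({"objective": "binary", "verbosity": -1}, dataset)
```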

ira-saktor commented on May 25 '22

Add support for CRLF line endings or improve documentation and error message #5508

js850 commented on Oct 23 '22

Add parameter to control maximum group size for Lambdarank #5053

antaradas94 commented on Dec 20 '22

Allow training without loading full dataset into memory #5094

chopeen commented on Dec 21 '22