
Feature Requests & Voting Hub

guolinke opened this issue 4 years ago • 43 comments

This issue is to maintain all feature requests on one page.

Note to contributors: If you want to work on a requested feature, re-open the linked issue. Everyone is welcome to work on any of the issues below.

Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and add new entries here, with links to the original issues. The one exception is issues marked good first issue; these should be left open so they are discoverable by new contributors.

Call for Voting

We would like to hold a vote here to prioritize these requests. If a feature request is important to you, you can vote for it with the following process:

  1. Get the issue (feature request) number.
  2. Search for that number in this issue to check whether a vote for it already exists.
  3. If the vote exists, add 👍 to it.
  4. If it doesn't, create a new vote by replying to this thread and including the number in your reply.

Discussions

  • Efficiency improvements (#2791)
  • Accuracy improvements (#2790)

Efficiency related

  • [x] Faster LambdaRank (#2701)
  • [ ] Lock-free Inference (#4290)
  • [ ] Faster Split (data partition) (#2782)
  • [ ] NUMA-aware (#1441)
  • [ ] Enable MM_PREFETCH and MM_MALLOC on aarch64 (#4124)
  • [ ] Optimisations for Apple Silicon (#3606)
  • [ ] Continue accelerating ConstructHistogram (#2786)
  • [ ] Accelerate data loading from file (#2788)
  • [ ] Accelerate data loading from Python/R objects (#2789)
  • [ ] Fast feature grouping when the number of features is large (#4037)
  • [ ] Allow training without loading full dataset into memory (#5094)
  • [x] Improve efficiency of tree output renew for L1 regression with new CUDA version (#5459)
  • [ ] Random number generation on CUDA (#5471)

Effectiveness related

  • [ ] Better Regularization for Categorical features (#1934)
  • [ ] Raise a warning when missing values are found in label (#4483)
  • [ ] Support monotone constraints with quantile objective (#3371)
  • [ ] Pairwise Ranking/Scoring in LambdaMART (#6147)

Distributed platform and GPU (OpenCL-based and CUDA)

  • [ ] YARN support (#790)
  • [ ] Multiple GPU support (OpenCL version) (#620)
  • [ ] GPU performance improvement (#768)
  • [ ] Implement workaround for required folder permissions to compile kernels (#2955)
  • [ ] GPU binaries release (#2263)
  • [ ] Build Python wheels that support both GPU and CPU versions out of the box for non-Windows (#4684)
  • [ ] Support for single precision float in CUDA version (#3836)
  • [ ] Support Windows in CUDA version (#3837)
  • [ ] Support LightGBM on macOS with real (possibly external) GPU device (#4333)
  • [ ] Support GPU with the conda-forge package (#5419)
  • [ ] Multi-node, multi-GPU training with CUDA version (#5993)

Maintenance

  • [ ] Run tests with ClangCL and Visual Studio on our CI services (#5280)
  • [ ] Missing values during prediction should throw an Exception if missing data wasn't present during training (#4040)
  • [ ] Better document missing values behavior during prediction (#2921)
  • [ ] Code refactoring (#2341)
  • [ ] Refactor CMakeLists.txt so that it will be possible to build cpp tests with different options, e.g. with OpenMP support (#4125)
  • [ ] Support int64_t data_size_t (#2818)
  • [ ] Unify output of LGBM_BoosterDumpModel and LGBM_BoosterSaveModel (#2604)
  • [ ] More tests (#261, #3841)
  • [ ] Publish lib_lightgbm.dll symbols to Microsoft Symbols Server (#1725)
  • [ ] Enhance parameter tuning guide with more params and scenarios (suggested ranges) for different tasks/datasets (#2617)
  • [ ] Better documentation for loss functions (#4790)
  • [ ] Add a page for listing related projects and links (#4576, original discussion: #4529)
  • [ ] Add BoostFromAverage value as an attribute to LightGBM models (#4313)
  • [ ] Regression tests on Dataset format (#4406)
  • [ ] Regression tests on model files (#4407)
  • [ ] Better warning information when the splitting of a tree is stopped early (#4649)
  • [ ] Add checks for nullptr in the code (#5085)
  • [x] Ensure consistent behavior when multiple parameter aliases given (#5304)
  • [ ] Add support for CRLF line endings or improve documentation and error message (#5508)
  • [ ] Support more build customizations in Conan recipe (#5770)
  • [ ] Support C++20 (#6033)
  • [x] Remove unused-command-line-argument warning with Apple Clang (#1805)
  • [x] CI via GitHub actions (#2353)
  • [x] Debug flag in CMake configuration (#1588)
  • [x] Fix cpp lint problems (#1990)

Python package:

  • [x] Load back saved parameters with save_model to Booster object (#2613)
  • [x] Check input for prediction (#812, #3626)
  • [ ] Refine pandas support (#960)
  • [ ] Refine categorical feature support (#1021)
  • [ ] Auto early stopping in Sklearn API (#3313)
  • [ ] Refactor sklearn wrapper after stabilizing upstream API, public API compatibility tests and official documentation (also after maturing HistGradientBoosting) (#2966, #2628)
  • [ ] Keep constants in sync with C++ library (#4321)
  • [ ] Allow custom objective / metric functions with "objective" and "metric" parameters (#3244)
  • [ ] Replace calls of POINTER() by byref() in Python interface to pass data arrays (#4298)
  • [ ] staged_predict() in the scikit-learn API (#5031)
  • [ ] Make Dataset pickleable (#5098)
  • [ ] Accept polars input (#6204)
  • [ ] Add feature_names_in_ and related APIs to scikit-learn estimators (#6279)
  • [x] Migrate to parametrize_with_checks for scikit-learn integration tests (#2947)

R package:

  • [ ] Rewrite R demos (#1944)
  • [ ] lgb.convert_with_rules() should validate rules (#2682)
  • [ ] Reduce duplication in Makevars.in, Makevars.win (#3249)
  • [ ] Add an R GPU job in CI (#3780)
  • [ ] Improve portability of OpenMP checks in R-package configure on macOS (#4537)
  • [x] Add CI job testing R package on Windows with UCRT toolchain (#4881)
  • [x] Load back saved parameters with save_model to Booster object (#2613)
  • [x] Use macOS 11.x in R 4.x CI jobs (#4990)
  • [x] Add a CI job running rchk (#4400)
  • [x] Factor out custom R interface to lib_lightgbm (#3016)
  • [x] Use commandArgs instead of hardcoded stuff in the installation script (#2441)
  • [x] lgb.convert functions should convert columns of type 'logical' (#2678)
  • [x] lgb.convert functions should warn on unconverted columns of unsupported types (#2681)
  • [x] lgb.prepare() and lgb.prepare2() should be simplified (#2683)
  • [x] lgb.prepare_rules() and lgb.prepare_rules2() should be simplified (#2684)
  • [x] Remove lgb.prepare() and lgb.prepare_rules() (#3075)
  • [x] CRAN-compliant installation configuration (#2960)
  • [x] Add tests on R 4.0 (#3024)
  • [x] Add pkgdown documentation support (#1143)
  • [x] Cover 100% of R-to-C++ calls in R unit tests (#2944)
  • [x] Bump version of pkgdown (#3036)
  • [x] Run R CI in Windows environment (#2335)
  • [x] Add unit tests for best metric iteration/value (#2525)
  • [x] Standardize R code on comma-first (#2373)
  • [x] Add additional linters to CI (#2477)
  • [x] Support roxygen 7.0.0+ (#2569)
  • [x] Run R CI in Linux and Mac environments (#2335)

New features

  • [ ] CoreML support (#1074)
  • [ ] More platforms support (#1129, #4736)
  • [ ] Object importance (#1460)
  • [ ] Include init_score in predict function (#1978)
  • [ ] Hyper-parameter per feature/column (#1938)
  • [ ] Extracting decision path (#2187)
  • [ ] Support for extremely large model (#2265, #3858)
  • [ ] Allow LightGBM to be easily used in external projects via modern CMake style with find_package and target_link_libraries (#4067, #3925)
  • [ ] Recalculate feature importance during the update process of a tree model (#2413)
  • [ ] Merge Dataset objects on condition that they hold same binmapper (#2579)
  • [ ] Spike and slab feature sampling priors (feature weighted sampling) (#2542)
  • [ ] Customizable early stopping tolerance (#2526)
  • [ ] Stop training branch of tree once a specific feature is used (#2518)
  • [ ] Subsampling rows with replacement (#1038)
  • [ ] Arbitrary base learner (#3180)
  • [x] Decouple boosting types (#3128, #2991)
  • [ ] Different quantization techniques (#3707)
  • [ ] SHAP feature contribution for linear trees (#4002)
  • [ ] [SWIG] Add support for int64_t ChunkedArray (#4091)
  • [ ] Monotonicity in quantile regression (#3447, #4201)
  • [ ] Add approx_contrib option for feature contributions (#4219)
  • [ ] Support forced splits with data and voting parallel versions of LightGBM (#4260)
  • [ ] Support ignoring some features during training on constructed dataset (#4317)
  • [ ] Using random uniform sentinel features to avoid overfitting (#4622)
  • [ ] Allow specifying probability measure for features (#4605)
  • [ ] extra_trees by feature (#4700)
  • [ ] Compute partial dependencies from learned trees (#4578)
  • [ ] Boosting a linear model (Single-leaf trees with one-variable linear models in roots, like gblinear) (#4459)
  • [ ] Exact control of min_child_sample (#5236)
  • [ ] WebAssembly support (#5372)
  • [ ] Support custom objective in refit (#5609)
  • [ ] Multiple trees in a single boosting round (#6294)
  • [x] Expose number of bins used by the model while binning continuous features to C API (#3406)
  • [x] Add C API function that returns all parameter names with their aliases (#2633)
  • [x] Pre-defined bin_upper_bounds (#1829)
  • [x] Setup editorconfig (#2401)
  • [x] Colsample by node (#2315)
  • [x] Smarter Backoffs for MPI ring connection (#2348)
  • [x] UTF-8 support for model file (#2478)

New algorithms:

  • [ ] Regularized Greedy Forest (#315)
  • [ ] Accelerated Gradient Boosting (#1257)
  • [ ] Multi-Layered Gradient Boosting Decision Trees (#1423)
  • [ ] Adaptive neural tree (#1542)
  • [ ] Probabilistic Forecasting (#3200)
  • [ ] Probabilistic Random Forest (#1946)
  • [ ] Sparrow (#2001)
  • [ ] Minimal Variance Sampling (MVS) in Stochastic Gradient Boosting (#2644)
  • [ ] Investigate possibility of borrowing some features/ideas from Explainable Boosted Machines (#3905)
  • [ ] Feature Cycling as an option instead of Random Feature Sampling (#4066)
  • [ ] Periodic Features (#4281)
  • [ ] GPBoost (#2790)
  • [x] Piece-wise linear tree (#1315)
  • [x] Extremely randomized trees (#2583)

Objective and metric functions:

  • [ ] Multi-output regression (#524)
  • [ ] Earth Mover Distance (#1256)
  • [ ] Cox Proportional Hazard Regression (#1837)
  • [ ] Native support for Focal Loss (#3706)
  • [ ] Ranking metric for regression objective (#1911)
  • [ ] Density estimation (#2056)
  • [ ] Adding correlation metrics (#4209)
  • [ ] Add parameter to control maximum group size for Lambdarank (#5053)
  • [x] Precision recall AUC (#3026)
  • [x] AUC Mu (#2344)

Python package:

  • [ ] Support access to constructed Dataset (#5191)
  • [ ] Support complex data types in categorical columns of pandas DataFrame (#2134)
  • [ ] First-class support for different data types (do not convert everything to float32/64 to save memory) (#3459, #3386)
  • [ ] Efficient native support of pandas.DataFrame with mixed dense and sparse columns (#4153)
  • [ ] Include init_score on the Python Booster class (#4065)
  • [ ] Better support for Tree Plot with multi class (#3061)
  • [ ] Support specifying number of iterations in dataset evaluation (#4210)
  • [ ] Compute metrics not on each iteration but with some fixed step (#4107)
  • [x] Support saving and loading CVBooster (#3556)
  • [x] Add a function to plot tree with a case (#4784)
  • [x] Allow custom loggers that don't inherit from logging.Logger (#4783)
  • [x] Ensure all callbacks are pickleable (#5080)
  • [x] Add support for pandas nullable types to the sklearn api (#4173)
  • [x] Support weight in refit (#3038)
  • [x] Keep cv predicted values (#283)
  • [x] Feature importance in CV (#1445)
  • [x] Log redirect in python (#1493)
  • [x] Make _CVBooster public for better stacking experience (#2105)

Dask:

  • [ ] Investigate how the gap between local and Dask predictions can be decreased (#3835)
  • [ ] Allow customization of num_threads (#3714)
  • [ ] Add support for early stopping (#3712)
  • [ ] Support init_model (#4063)
  • [ ] Make Dask training resilient to worker restarts during network setup (#3775)
  • [ ] GPU support (#3776)
  • [ ] Support MPI in Dask (#3831)
  • [ ] Support more operating systems (#3782)
  • [ ] Add LGBMModel (#3845)
  • [ ] Add train() function (#3846)
  • [ ] Add cv() function (#3847)
  • [ ] Support asynchronous workflows (#3929)
  • [ ] Add DaskDataset (#3944)
  • [ ] Enable larger pred_contrib results for multiclass classification with sparse matrices (#4438)
  • [ ] Use or return all workers eval_set evaluation data (#4392)
  • [ ] Drop 'not evaluated' placeholder from dask.py (#4393)
  • [x] Support custom objective functions (#3934)
  • [x] Resolve differences in result shape between DaskLGBMClassifier.predict() and LGBMClassifier.predict() (#3881)
  • [x] Support custom evaluation metrics (#3956)
  • [x] Support all LightGBM parallel tree learners (#3834)
  • [x] Support raw_score in predict() (#3793)
  • [x] Support all LightGBM boosting types (#3896)
  • [x] Tutorial documentation (#3814)
  • [x] Document how to save a Dask model (#3838)
  • [x] Support init_score (#3807)
  • [x] Search for ports only once per IP (#3768)
  • [x] Support pred_leaf in predict() (#3792)
  • [x] Decide and document how users should provide a Dask client at training time (#3808)
  • [x] Use dictionaries instead of tuples for parts (#3795)
  • [x] Remove testing dependency on dask-ml (#3796)
  • [x] Support 'pred_contrib' in predict() (#3713)
  • [x] Add support for LGBMRanker (#3708)
  • ~Support DataTable in Dask (#3830)~

R package:

  • [ ] Add support for specifying training indices in lgb.cv() (#3924)
  • [ ] Export callback functions (#2479)
  • [ ] Plotting in R-package (#1222)
  • [ ] Add support for saving weight values of a node in the R-package (#2281)
  • [ ] Check parameters in cb.reset.parameters() (#2665)
  • [ ] Refit method for R-package (#2369)
  • [ ] Add the ability to predict on lgb.Dataset in Predictor$predict() (#2666)
  • [ ] Allow use of MPI from the R package (#3364)
  • [ ] Allow data to live in memory mapped file (#2184)
  • [ ] Add GPU support for CRAN package (#3206)
  • [ ] Add CUDA support for CRAN package (#3465)
  • [ ] Add CUDA support for CMake-based package (#5378)
  • [ ] Add function to generate a list of parameters (#4195)
  • [ ] Accept data frames as inputs (#4323)
  • [ ] Upgrade documentation site to pkgdown >2.0 (#4859)
  • [ ] Check size of custom objective function output (#4905)
  • [ ] Support CSR-format sparse matrices (#4966)
  • [x] Add flag of displaying train loss for lgb.cv() (#4911)
  • [x] Work directly with readRDS() and saveRDS() (#4296)
  • [x] Support trees with linear models at leaves (#3319)
  • [x] Add support for non-ASCII feature names (#2983)
  • [x] Release to CRAN (#629)
  • [x] Exclude training data from being checked for early stopping (#2472)
  • [x] first_metric_only parameter for R-package (#2368)
  • [x] Build a 32-bit version of LightGBM for the R package (#3187)
  • [x] Ability to control the printed messages (#1440)

New language wrappers:

  • [ ] MATLAB support (#743)
  • [ ] Java support (like xgboost4j) (#909)
  • [ ] Go support (the predict part can already be found in the https://github.com/dmitryikh/leaves package) (#2515)
  • [x] Ruby support (#2367)

Input enhancements:

  • [ ] Streaming data allocation (improve the sparse streaming support and expose ChunkedArray in C API) (#3995, https://github.com/microsoft/LightGBM/pull/3997#issuecomment-791969953)
  • [ ] String as categorical input directly (#789)
  • [ ] AWS S3 support (#1039)
  • [ ] H2O datatable direct support (not via the to_numpy() method as it currently is) (#2003)
  • [ ] Multiple file as input (#2031)
  • [ ] Parquet file support (#1286)
  • [x] Apache Arrow support (#3369)
  • [ ] Enable use of constructed Dataset in predict() methods (#4546, #1939, #6285)
  • [ ] Support scipy sparse arrays (#6352)
  • [x] Validation dataset creation via Sequence (#4184)

guolinke commented on Aug 01 '19

There’s a reference to minimum variance sampling here:

https://catboost.ai/docs/concepts/algorithm-main-stages_bootstrap-options.html

Although I think it just speeds up training rather than providing out-of-core training.

onacrame commented on Oct 20 '19

I would like to tackle the following issues in the Python package. Could I discuss a plan for fixing them, and where should that discussion happen? IMHO, they can be resolved by improving the lightgbm.cv() function.

#2105: Make _CVBooster public for better stacking experience
#283: Keep cv predicted values

I want to reopen the above issues, but I cannot do that. Maybe I don't have permission.
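For context, here is a minimal sketch of the stacking workflow these two issues would enable, assuming a LightGBM version where lgb.cv() supports return_cvbooster=True; the data below is a placeholder.

```python
import lightgbm as lgb
import numpy as np

# Placeholder data standing in for a real stacking setup.
rng = np.random.default_rng(0)
X, y = rng.random((500, 10)), rng.random(500)
X_test = rng.random((100, 10))

cv_results = lgb.cv(
    {"objective": "regression", "verbosity": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=50,
    nfold=5,
    return_cvbooster=True,
)

# CVBooster forwards method calls to each fold's Booster and returns a
# list of per-fold results; averaging the per-fold predictions is a
# common way to build stacking features.
cvbooster = cv_results["cvbooster"]
stacking_preds = np.mean(cvbooster.predict(X_test), axis=0)
```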

momijiame commented on Jun 07 '20

@momijiame Thank you for your interest! I've unlocked those issues for commenting. Let's continue the discussion there.

StrikerRUS commented on Jun 10 '20

We would like to hold a vote here to prioritize these requests. If a feature request is important to you, you can vote for it with the following process:

  1. Get the issue (feature request) number.
  2. Search for that number in this issue to check whether a vote for it already exists.
  3. If the vote exists, add 👍 to it.
  4. If it doesn't, create a new vote by replying to this thread and including the number in your reply.

guolinke commented on Aug 05 '20

We would like to hold a vote here

Let me start.

#2644

StrikerRUS commented on Aug 10 '20

It was proposed by me, so I'm a little bit biased:

Decouple boosting types #3128

candalfigomoro commented on Aug 16 '20

GPU binaries release #2263

candalfigomoro commented on Aug 16 '20

Enhance parameter tuning guide with more params #2617

candalfigomoro commented on Aug 16 '20

Subsampling rows with replacement #1038

candalfigomoro commented on Aug 16 '20

Piece-wise linear tree #1315 (also see PR https://github.com/microsoft/LightGBM/pull/3299)

candalfigomoro commented on Aug 16 '20

Multi-output regression #524

candalfigomoro commented on Aug 16 '20

Cox Proportional Hazard Regression #1837

candalfigomoro commented on Aug 16 '20

Based on https://github.com/microsoft/LightGBM/issues/2983#issuecomment-722630931, I've updated this issue's description:

Note to maintainers: All feature requests should be consolidated on this page. When new feature request issues are opened, close them and add new entries here, with links to the original issues. The one exception is issues marked good first issue; these should be left open so they are discoverable by new contributors.

I think that we should keep good first issue issues open, so it's easy for new contributors to find them.

jameslamb commented on Nov 07 '20

Read from multiple files #2031

gzls90 commented on Nov 12 '20

Parquet file support #1286

gzls90 commented on Nov 12 '20

Register custom objective / loss function #3244

gzls90 commented on Nov 12 '20

Object importance #1460

gzls90 commented on Nov 12 '20

Read from multiple zipped LibSVM-format text files

wenmin-wu commented on Dec 05 '20

Multiple GPU support #620

gzls90 commented on Dec 08 '20

Multiple GPU support (#620) (In my experience, XGBoost with GPU seems faster than LightGBM with GPU.)

7starsea commented on Dec 15 '20

For everyone who voted for multi-GPU support: please try our new experimental CUDA version, kindly contributed by our friends at IBM. This version supports multi-GPU training. We would really appreciate any early feedback on this experimental feature (please create new issues; do not comment here).

How to install: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version-experimental.

Argument to specify number of GPUs: https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_gpu.
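As a quick illustration, here is a minimal sketch of multi-GPU training with this experimental build, assuming LightGBM was compiled with CUDA support; the data is a placeholder, and num_gpu is the parameter documented at the link above.

```python
import lightgbm as lgb
import numpy as np

# Placeholder regression data; any dataset works for this sketch.
rng = np.random.default_rng(0)
X = rng.random((10_000, 20))
y = rng.random(10_000)

params = {
    "objective": "regression",
    "device_type": "cuda",  # requires a build with USE_CUDA enabled
    "num_gpu": 2,           # number of GPUs (CUDA version only)
    "verbosity": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```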

StrikerRUS commented on Jan 31 '21

Support ignoring some features during training on constructed dataset #4317

jz457365 commented on Jan 13 '22

Spike and slab feature sampling priors (feature weighted sampling) #2542

jz457365 commented on Jan 13 '22

Quantile LightGBM: ensure monotonic #3447

bethrice44 commented on Apr 26 '22

SHAP feature contribution for linear trees #4002

robo-sq commented on May 13 '22

Create dataset from pyarrow tables: #3369
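Since the hub above now marks #3369 as done, here is a minimal sketch of what this enables, assuming a LightGBM version recent enough to include the merged Arrow support; the table contents are placeholders.

```python
import lightgbm as lgb
import numpy as np
import pyarrow as pa

# Placeholder data; in practice the table might come from Parquet or Flight.
rng = np.random.default_rng(0)
table = pa.table({"f0": rng.random(1000), "f1": rng.random(1000)})
labels = pa.chunked_array([rng.integers(0, 2, 1000)])

# The Dataset is built directly from the Arrow table, without a pandas detour.
dataset = lgb.Dataset(table, label=labels)
booster = lgb.train({"objective": "binary", "verbosity": -1}, dataset)
```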

ira-saktor commented on May 25 '22

Add support for CRLF line endings or improve documentation and error message #5508

js850 commented on Oct 23 '22

Add parameter to control maximum group size for Lambdarank #5053

antaradas94 commented on Dec 20 '22

Allow training without loading full dataset into memory #5094

chopeen commented on Dec 21 '22