LightGBM
LightGBM copied to clipboard
[RFC] remove HDFS support?
Proposal
I'm requesting comment on the following proposal:
- remove CMake option
USE_HDFS
- remove all docs and code related to HDFS
And doing all of these only after issuing build-time and run-time deprecation warnings for 1 release.
Summary
6+ years ago, #1243 was merged, adding "experimental" support for Hadoop Filesystem ("HDFS") to LightGBM.
Specifically, that meant adding the ability to provide an HDFS URI to Dataset creation methods that accept files, and have LightGBM read the file from HDFS.
Since then:
- we've received very very few issues / PRs related to that support
- 0 tests have ever been run on that build of the library
- XGBoost has removed their equivalent support (https://github.com/dmlc/xgboost/pull/9504)
I don't believe SynapseML depends on the HDFS build of LightGBM (GitHub search). Hopefully @imatiach-msft or @mhamilton723 can confirm.
Searching the rest of GitHub (GitHub search), I just see other forks of LightGBM and two recipes from repackages:
Motivation
Simplifies the project's public interface (build system options, installation docs, etc.).
Simplifies file-reading C++ code, making it easier to understand and refactor in the future.
Here in 2024, users have other and better options for using LightGBM distributed training on data in HDFS, like:
-
lightgbm.dask
(Dask docs on Hadoop) - Spark via
SynapseML
(https://github.com/microsoft/SynapseML)
@jameslamb Thank you for referring. For Macports: Since we do not enable this option presently, removing it won’t break anything. So we are good.
I'm all for this change, simplification is always good and if we neither have tests nor indication that nobody has used this (extensively), that's even more of an argument :+1:
Thanks for the heads up. will remove the arg from nixpkgs when updating for 4.4.0