[jvm-packages] [breaking] rework xgboost4j-spark and xgboost4j-spark-gpu
About this PR
Features and Fixes
- Introduce an abstract XGBoost Estimator (see the sketch after this list):
  - Create an abstract base class that encapsulates common functionality for all XGBoost estimators.
  - This promotes code reuse and keeps the different estimators consistent.
- Update to the latest XGBoost parameters:
  - Add all XGBoost parameters supported in xgboost4j-spark, based on the upstream XGBoost parameter list.
  - Add setters and getters for these parameters.
  - Remove the deprecated parameters.
- Address missing value handling (see the usage sketch after this list):
  - 0 is often a meaningful value in machine learning, but the old version enforced 0 as the missing value, which could make the XGBoost model inaccurate.
- Remove ETL operations from XGBoost:
  - As a thin wrapper around XGBoost4j, xgboost4j-spark shouldn't embed much ETL itself; doing so makes the code hard to maintain and read.
- Rework the GPU plugin:
  - Improve its stability and make it more readable and maintainable, following the new design pattern of xgboost4j-spark.
- Expand sanity tests for CPU and GPU consistency:
  - Increase the coverage of the sanity tests to verify that training and transform results are consistent between the CPU and GPU implementations.
  - This will help identify and prevent discrepancies or bugs in the GPU support.
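To make the abstract-estimator item concrete, here is a minimal sketch of the idea; all names below are illustrative, not the PR's actual API:

```scala
// Hypothetical sketch of the shared base class; names are illustrative.
abstract class SketchXGBoostEstimator {
  // Common parameter plumbing shared by classifier/regressor/ranker.
  protected var params: Map[String, Any] = Map.empty

  def setParam(name: String, value: Any): this.type = {
    params += (name -> value)
    this
  }

  // Each concrete estimator only declares its task-specific default.
  protected def defaultObjective: String

  // Shared logic: merge the task default with user-supplied parameters.
  def trainingParams: Map[String, Any] =
    Map("objective" -> defaultObjective) ++ params
}

class SketchXGBoostClassifier extends SketchXGBoostEstimator {
  override protected def defaultObjective: String = "binary:logistic"
}

class SketchXGBoostRegressor extends SketchXGBoostEstimator {
  override protected def defaultObjective: String = "reg:squarederror"
}
```

The point is that parameter plumbing and shared training logic live in one place, while each concrete estimator only supplies its task-specific pieces.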
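And for the missing-value item, a usage sketch, assuming the reworked estimator exposes a `setMissing` setter as part of the parameter update described above:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Treat NaN, not 0, as the missing value, so 0 stays a real feature value.
val est = new XGBoostClassifier()
  .setNumRound(10)
  .setMissing(Float.NaN)
```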
Breaking changes
- Remove some unused parameters, such as the Rabit-related parameters, `train_test_ratio`, etc.
- Separate the xgboost4j-spark parameters from the core XGBoost parameters, and make the constructor of an XGBoost estimator accept only XGBoost parameters, not xgboost4j-spark parameters. E.g.,

```scala
val xgbParams: Map[String, Any] = Map(
  "max_depth" -> 5,
  "eta" -> 0.2,
  "objective" -> "binary:logistic"
)
val est = new XGBoostClassifier(xgbParams)
  .setNumRound(10)
  .setDevice("cpu")
  .setNumWorkers(2)
```

or

```scala
val est = new XGBoostClassifier()
  .setMaxDepth(5)
  .setEta(0.2)
  .setObjective("binary:logistic")
  .setNumRound(10)
  .setDevice("cpu")
  .setNumWorkers(2)
```
Test this PR
- Unit tests for Scala 2.12

```bash
mvn clean package
mvn -P gpu clean package
```

- Unit tests for Scala 2.13

```bash
cd xgboost
python dev/change_scala_version.py --scala-version 2.13
cd jvm-packages
mvn clean package
mvn -P gpu clean package
```
Hi @trivialfis @hcho3 @dotbg, could you kindly help review this PR? Thanks very much.
Hi there, the CI has passed. Could you @trivialfis @hcho3 @dotbg help review this PR? Thanks very much.
Apologies for the delay, I will look at it this Friday.
Also, we might need to schedule an online interview, @hcho3 might want to join as well.
Yes, I would love to join.
Some questions about build artifacts and packaging:
- Do we continue to build two variants of `libxgboost4j.so`, one with CUDA and one without?
- Now that the `xgboost4j-gpu` package is gone, where do we put `libxgboost4j.so` (CUDA variant)? Does it go inside the `xgboost4j-spark-gpu` package?
- How long do we plan to distribute the Scala 2.12 and 2.13 variants?
@trivialfis Do you have a particular agenda in mind for the online interview? Would it be primarily a Q&A session? I can set up an Outlook invite to choose the best time for all of us.
For a breaking-changes PR, I am mostly concerned about the user interface, hence the request for documentation:
- How does the service loader work? What's the sequence of events for a CPU pipeline and a GPU pipeline? Also, the combination of RAPIDS SQL and XGBoost parameters (4 cases, with on/off on both sides). (See the sketch after this list.)
- What changes are required from users? Can we emit deprecation warnings, or informative errors, or is it just a hard break?
- The sequence of events when a sparse vector is used. The memory usage profile of various types of data.
- What ETL was removed, and how does it change users' code?
- What migration guide is needed?
- What are the upcoming changes?
- What are the differences between Scala and PySpark?
- How well does it integrate with Spark ML training utilities like CV?
- Any design plan for external memory? It's taking shape in the latest XGBoost.
- Any plan for resilience?
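For reference on the first question, a generic sketch of how JVM `ServiceLoader`-based plugin discovery typically works; the trait name here is hypothetical and may not match the PR's actual interface:

```scala
import java.util.ServiceLoader

// Hypothetical plugin trait; the PR's actual interface may differ.
trait SketchXGBoostPlugin {
  def isEnabled: Boolean
}

object PluginDiscovery {
  // ServiceLoader scans META-INF/services/<fully.qualified.TraitName> on the
  // classpath. If xgboost4j-spark-gpu ships such a file, its GPU plugin is
  // discovered at runtime; otherwise the CPU code path is used.
  def loadPlugin(): Option[SketchXGBoostPlugin] = {
    val it = ServiceLoader.load(classOf[SketchXGBoostPlugin]).iterator()
    while (it.hasNext) {
      val plugin = it.next()
      if (plugin.isEnabled) return Some(plugin)
    }
    None
  }
}
```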
And we will need a clear picture for both CPU and GPU. As for the internal code structure, I don't have any concerns. It looks readable to me, and I'm fairly sure that @wbo4958 does a better job than me on JVM-based projects.
If there's no way to even emit an error for the breaking change, should we publish a different package on Maven?
Sorry for the delay, I was on vacation these days.
> Some questions about build artifacts and packaging:
>
> Do we continue to build two variants of `libxgboost4j.so`, one with CUDA and one without?

[Bobby] Yes, we still need to build two variants of `libxgboost4j.so`.

> Now that the `xgboost4j-gpu` package is gone, where do we put `libxgboost4j.so` (CUDA variant)? Does it go inside the `xgboost4j-spark-gpu` package?

[Bobby] The `libxgboost4j.so` with CUDA will reside in the `xgboost4j` package. `xgboost4j-spark-gpu` now depends on `xgboost4j`/`xgboost4j-spark`.

> How long do we plan to distribute the Scala 2.12 and 2.13 variants?

[Bobby] What do you mean by "how long"?
@trivialfis @hcho3, yeah, let's have an online review session.
> The `libxgboost4j.so` with CUDA will reside in the `xgboost4j` package.

CPU users will complain about the size of the `xgboost4j` package if we choose this route.
> CPU users will complain about the size of the `xgboost4j` package if we choose this route.

Not exactly. When compiling the CPU version, there is no CUDA code in `libxgboost4j.so`, and that artifact is what goes to the public/nightly repositories. For the GPU version, we first compile `libxgboost4j.so` with CUDA and put it into the `xgboost4j` package, but we don't distribute that `xgboost4j` to the Maven repo; instead, we only distribute the `xgboost4j-spark-gpu` package to Maven, which is an uber package that includes `xgboost4j`/`xgboost4j-spark`.
Hey @trivialfis @hcho3, CI passed. Could you help review it? Thanks.
Could you please spell out the implications of the following item:

> Remove any ETL operations in XGBoost

For example, the removed ETL includes caching the dataset and collecting the group column for ranking.
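For illustration only (not from the PR): if dataset caching moves out of the wrapper, the caller-side equivalent could be as simple as the following, assuming a plain Spark DataFrame:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// The wrapper no longer caches internally, so users cache explicitly
// before training if they need to.
def cacheForTraining(df: DataFrame): DataFrame =
  df.persist(StorageLevel.MEMORY_AND_DISK)
```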
@WeichenXu123 @rongou @Craigacp It would be great if we could get some help reviewing the PR when you are available.
I haven't written any Spark or Scala code since Spark 1.4 so I don't think I'll have much ability to review this PR.
I've started reviewing the Spark-related code, but it takes a while.
Hi @hcho3 @trivialfis, could you help review it again? Thanks.
Hi @trivialfis, I've addressed all the review comments.
@wbo4958 Where is the definition of xgboost ranker?
@trivialfis, I will add it in a follow-up PR; I didn't want to blow up this PR.
For the ranking task, I think the library should make sure that records with the same group column value go to the same task; otherwise it will hurt model performance, as we describe in #9491. It looks like this PR hasn't fixed this issue yet.
I will look into the ranking later. In theory, as long as the partitions for each local worker are sorted, the gradient is considered correct.
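As a sketch of the partitioning requirement under discussion, using standard Spark APIs (the column name `group` is illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep all rows of a query group in one partition, and keep groups
// contiguous within each partition, before handing data to the workers.
def prepareForRanking(df: DataFrame, numWorkers: Int): DataFrame =
  df.repartition(numWorkers, col("group"))
    .sortWithinPartitions(col("group"))
```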
Huge thanks to @dotbg @eordentlich for the reviews and for catching some issues early; I wouldn't have been able to do that myself. Could you please help approve the PR for the Spark handling?
I don't have any more comments about logic related to XGBoost itself.
Hi @trivialfis, could you help merge it?