[jvm-packages] [breaking] rework xgboost4j-spark and xgboost4j-spark-gpu
About this PR
Features and Fixes
- Introduce an abstract XGBoost Estimator (see the sketch after this list):
  - Create an abstract base class that encapsulates common functionality for all XGBoost estimators.
  - This promotes code reuse and keeps the different estimators consistent.
- Update to the latest XGBoost parameters:
  - Add all XGBoost parameters supported in xgboost4j-spark, based on the upstream XGBoost parameter list.
  - Add setters and getters for these parameters.
  - Remove the deprecated parameters.
- Address missing value handling (see the usage sketch after this list):
  - 0 is often a meaningful value in machine learning, but the old version enforced 0 as the missing value, which could make the XGBoost model inaccurate.
- Remove ETL operations from XGBoost:
  - As a thin wrapper around XGBoost4j, xgboost4j-spark shouldn't embed much ETL itself; doing so makes the code hard to maintain and read.
- Rework the GPU plugin:
  - Improve its stability and make it more readable and maintainable, following the new design pattern of xgboost4j-spark.
- Expand sanity tests for CPU and GPU consistency:
  - Increase the coverage of the sanity tests to verify that training and transform results are consistent between the CPU and GPU implementations.
  - This will help identify and prevent discrepancies or bugs in the GPU support.
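To make the abstract-estimator item concrete, here is a minimal sketch of the idea; all names below are illustrative, not the PR's actual API:

```scala
// Hypothetical sketch of the shared base class; names are illustrative.
abstract class SketchXGBoostEstimator {
  // Common parameter plumbing shared by classifier/regressor/ranker.
  protected var params: Map[String, Any] = Map.empty

  def setParam(name: String, value: Any): this.type = {
    params += (name -> value)
    this
  }

  // Each concrete estimator only declares its task-specific default.
  protected def defaultObjective: String

  // Shared logic: merge the task default with user-supplied parameters.
  def trainingParams: Map[String, Any] =
    Map("objective" -> defaultObjective) ++ params
}

class SketchXGBoostClassifier extends SketchXGBoostEstimator {
  override protected def defaultObjective: String = "binary:logistic"
}

class SketchXGBoostRegressor extends SketchXGBoostEstimator {
  override protected def defaultObjective: String = "reg:squarederror"
}
```

The point is that parameter plumbing and shared training logic live in one place, while each concrete estimator only supplies its task-specific pieces.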
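And for the missing-value item, a usage sketch, assuming the reworked estimator exposes a `setMissing` setter as part of the parameter update described above:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Treat NaN, not 0, as the missing value, so 0 stays a real feature value.
val est = new XGBoostClassifier()
  .setNumRound(10)
  .setMissing(Float.NaN)
```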
Breaking changes
- Remove some unused parameters, such as the Rabit-related parameters, `train_test_ratio`, etc.
- Separate the xgboost4j-spark parameters from the core XGBoost parameters, and make the constructor of an XGBoost estimator accept only XGBoost parameters, not xgboost4j-spark parameters. E.g.,

```scala
val xgbParams: Map[String, Any] = Map(
  "max_depth" -> 5,
  "eta" -> 0.2,
  "objective" -> "binary:logistic"
)
val est = new XGBoostClassifier(xgbParams)
  .setNumRound(10)
  .setDevice("cpu")
  .setNumWorkers(2)
```

or

```scala
val est = new XGBoostClassifier()
  .setMaxDepth(5)
  .setEta(0.2)
  .setObjective("binary:logistic")
  .setNumRound(10)
  .setDevice("cpu")
  .setNumWorkers(2)
```
Test this PR
- Unit tests for Scala 2.12

```bash
mvn clean package
mvn -P gpu clean package
```

- Unit tests for Scala 2.13

```bash
cd xgboost
python dev/change_scala_version.py --scala-version 2.13
cd jvm-packages
mvn clean package
mvn -P gpu clean package
```
Hi @trivialfis @hcho3 @dotbg, could you kindly help review this PR? Thanks very much.
Hi there, the CI has passed. Could you @trivialfis @hcho3 @dotbg help review this PR? Thanks very much.
Apologies for the delay, I will look at it this Friday.
Also, we might need to schedule an online interview, @hcho3 might want to join as well.
Yes, I would love to join.
Some questions about build artifacts and packaging:
- Do we continue to build two variants of `libxgboost4j.so`, one with CUDA and one without?
- Now that the `xgboost4j-gpu` package is gone, where do we put `libxgboost4j.so` (CUDA variant)? Does it go inside the `xgboost4j-spark-gpu` package?
- How long do we plan to distribute the Scala 2.12 and 2.13 variants?
@trivialfis Do you have a particular agenda in mind for the online interview? Would it be primarily a Q&A session? I can set up an Outlook invite to choose the best time for all of us.
For a breaking-changes PR, I am mostly concerned about the user interface, hence the request for documentation:
- How does the service loader work? What's the sequence of events for a CPU pipeline and a GPU pipeline? Also, the combination of RAPIDS SQL and XGBoost parameters (4 cases, with on/off on both sides). (See the sketch after this list.)
- What changes are required from users? Can we emit deprecation warnings, or informative errors, or is it just a hard break?
- The sequence of events when a sparse vector is used. The memory usage profile of various types of data.
- What ETL was removed, and how does it change users' code?
- What migration guide is needed?
- What are the upcoming changes?
- What are the differences between Scala and PySpark?
- How well does it integrate with Spark ML training utilities like CV?
- Any design plan for external memory? It's taking shape in the latest XGBoost.
- Any plan for resilience?
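For reference on the first question, a generic sketch of how JVM `ServiceLoader`-based plugin discovery typically works; the trait name here is hypothetical and may not match the PR's actual interface:

```scala
import java.util.ServiceLoader

// Hypothetical plugin trait; the PR's actual interface may differ.
trait SketchXGBoostPlugin {
  def isEnabled: Boolean
}

object PluginDiscovery {
  // ServiceLoader scans META-INF/services/<fully.qualified.TraitName> on the
  // classpath. If xgboost4j-spark-gpu ships such a file, its GPU plugin is
  // discovered at runtime; otherwise the CPU code path is used.
  def loadPlugin(): Option[SketchXGBoostPlugin] = {
    val it = ServiceLoader.load(classOf[SketchXGBoostPlugin]).iterator()
    while (it.hasNext) {
      val plugin = it.next()
      if (plugin.isEnabled) return Some(plugin)
    }
    None
  }
}
```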
And we will need a clear picture for both CPU and GPU. As for the internal code structure, I don't have any concerns. It looks readable to me, and I'm fairly sure that @wbo4958 does a better job than me on JVM-based projects.
If there's no way to even emit an error for the breaking change, should we publish a different package on Maven?
Sorry for the delay, I was on vacation these days.
> Some questions about build artifacts and packaging:
>
> Do we continue to build two variants of `libxgboost4j.so`, one with CUDA and one without?

[Bobby] Yes, we still need to build two variants of `libxgboost4j.so`.

> Now that the `xgboost4j-gpu` package is gone, where do we put `libxgboost4j.so` (CUDA variant)? Does it go inside the `xgboost4j-spark-gpu` package?

[Bobby] The `libxgboost4j.so` with CUDA will reside in the `xgboost4j` package. `xgboost4j-spark-gpu` now depends on `xgboost4j`/`xgboost4j-spark`.

> How long do we plan to distribute the Scala 2.12 and 2.13 variants?

[Bobby] What do you mean by "how long"?
@trivialfis @hcho3, yeah, let's have an online review session.
> The `libxgboost4j.so` with CUDA will reside in the `xgboost4j` package.

CPU users will complain about the size of the `xgboost4j` package if we choose this route.
> CPU users will complain about the size of the `xgboost4j` package if we choose this route.

Not exactly. When compiling the CPU version, there is no CUDA code in `libxgboost4j.so`, and that artifact is what goes to the public/nightly repositories. For the GPU version, we first compile `libxgboost4j.so` with CUDA and put it into the `xgboost4j` package, but we don't distribute that `xgboost4j` to the Maven repo; instead, we only distribute the `xgboost4j-spark-gpu` package to Maven, which is an uber package that includes `xgboost4j`/`xgboost4j-spark`.
Hey @trivialfis @hcho3, CI passed. Could you help review it? Thanks.
Could you please spell out the implications of the following item:

> Remove any ETL operations in XGBoost

For example, the removed ETL includes caching the dataset and collecting the group column for ranking.
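For illustration only (not from the PR): if dataset caching moves out of the wrapper, the caller-side equivalent could be as simple as the following, assuming a plain Spark DataFrame:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// The wrapper no longer caches internally, so users cache explicitly
// before training if they need to.
def cacheForTraining(df: DataFrame): DataFrame =
  df.persist(StorageLevel.MEMORY_AND_DISK)
```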
@WeichenXu123 @rongou @Craigacp It would be great if we could get some help reviewing the PR when you are available.
I haven't written any Spark or Scala code since Spark 1.4 so I don't think I'll have much ability to review this PR.
I've started reviewing the Spark-related code, but it takes a while.
Hi @hcho3 @trivialfis, could you help review it again? Thanks.
Hi @trivialfis, I've addressed all the review comments.
@wbo4958 Where is the definition of xgboost ranker?
@trivialfis, I will add it in a follow-up PR; I didn't want to blow up this PR.
For the ranking task, I think the library should make sure that records with the same group column value go to the same task; otherwise it will hurt model performance, as we describe in #9491. It looks like this PR hasn't fixed this issue yet.
I will look into the ranking later. In theory, as long as the partitions for each local worker are sorted, the gradient is considered correct.
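As a sketch of the partitioning requirement under discussion, using standard Spark APIs (the column name `group` is illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep all rows of a query group in one partition, and keep groups
// contiguous within each partition, before handing data to the workers.
def prepareForRanking(df: DataFrame, numWorkers: Int): DataFrame =
  df.repartition(numWorkers, col("group"))
    .sortWithinPartitions(col("group"))
```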
Huge thanks to @dotbg @eordentlich for the reviews and for catching some issues early; I wouldn't have been able to do that myself. Could you please help approve the PR for the Spark handling?
I don't have any more comments about logic related to XGBoost itself.
Hi @trivialfis, could you help merge it?