
[jvm-packages] [breaking] rework xgboost4j-spark and xgboost4j-spark-gpu

Open wbo4958 opened this issue 1 year ago • 20 comments

About this PR

Features and Fixes

  • Introduce an abstract XGBoost Estimator:

    • Create an abstract base class that encapsulates common functionality for all XGBoost estimators.
    • This will promote code reuse and maintain consistency across different estimators.
  • Update to the latest XGBoost parameters

    • Add all the XGBoost parameters supported in XGBoost4j-Spark, based on this.
    • Add setters and getters for these parameters.
    • Remove the deprecated parameters.
  • Address the missing value handling:

    • 0 is often a meaningful value in machine learning, but the old version enforced 0 as the missing value, which could make the XGBoost model inaccurate.
  • Remove any ETL operations in XGBoost

    • As a thin wrapper around XGBoost4j, xgboost4j-spark shouldn't perform much ETL itself; doing so makes the code hard to maintain and read.
  • Rework the GPU plugin:

    • Rework the GPU plugin to improve its stability and make it more readable and maintainable, following the new design pattern of xgboost4j-spark.
  • Expand sanity tests for CPU and GPU consistency:

    • Increase the coverage of sanity tests to verify that the training and transform results are consistent between CPU and GPU implementations.
    • This will help identify and prevent any discrepancies or bugs related to the GPU support.
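The abstract-estimator idea above can be sketched in plain Scala. This is an illustrative sketch only, with no Spark dependency; all class and method names here are hypothetical, not the actual xgboost4j-spark API:

```scala
import scala.collection.mutable

// Hypothetical parameter store shared by all estimators.
trait XGBoostParamsSketch {
  private val params = mutable.Map[String, Any]()
  def set(name: String, value: Any): this.type = { params(name) = value; this }
  def get(name: String): Option[Any] = params.get(name)
}

// Abstract base: accepts raw XGBoost parameters in the constructor and
// exposes fluent setters for the Spark-side parameters, so concrete
// estimators only add task-specific behavior.
abstract class AbstractXGBoostEstimatorSketch(xgbParams: Map[String, Any])
    extends XGBoostParamsSketch {
  xgbParams.foreach { case (k, v) => set(k, v) }
  def setNumRound(n: Int): this.type = set("num_round", n)
  def setDevice(d: String): this.type = set("device", d)
  def setNumWorkers(n: Int): this.type = set("num_workers", n)
}

class ClassifierSketch(p: Map[String, Any]) extends AbstractXGBoostEstimatorSketch(p)
class RegressorSketch(p: Map[String, Any]) extends AbstractXGBoostEstimatorSketch(p)
```

Usage mirrors the constructor-plus-setters style shown in the breaking-changes section below, e.g. `new ClassifierSketch(Map("eta" -> 0.2)).setNumRound(10).setDevice("cpu")`.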

Breaking changes

  • Remove some unused parameters, such as the Rabit-related parameters and train_test_ratio.

  • Separate the XGBoost-Spark parameters from the full set of XGBoost parameters, and make the constructor of the XGBoost estimator accept only the XGBoost parameters rather than the xgboost4j-spark parameters, e.g.:

    val xgbParams: Map[String, Any] = Map(
          "max_depth" -> 5,
          "eta" -> 0.2,
          "objective" -> "binary:logistic"
        )
    val est = new XGBoostClassifier(xgbParams)
          .setNumRound(10)
          .setDevice("cpu")
          .setNumWorkers(2)
    

    or

    val est = new XGBoostClassifier(xgbParams)
          .setMaxDepth(5)
          .setEta(0.2)
          .setObjective("binary:logistic")
          .setNumRound(10)
          .setDevice("cpu")
          .setNumWorkers(2)
    

Test this PR

  • Unit tests for Scala 2.12
mvn clean package
mvn -P gpu clean package
  • Unit tests for Scala 2.13
cd xgboost
python dev/change_scala_version.py --scala-version 2.13

cd jvm-packages
mvn clean package
mvn -P gpu clean package

wbo4958 avatar Jul 25 '24 23:07 wbo4958

Hi @trivialfis @hcho3 @dotbg, could you kindly help review this PR? Thanks very much.

wbo4958 avatar Jul 26 '24 01:07 wbo4958

Hi there, the CI passed. Could you @trivialfis @hcho3 @dotbg help review this PR? Thanks very much.

wbo4958 avatar Jul 29 '24 01:07 wbo4958

Apologies for the delay, I will look at it this Friday.

hcho3 avatar Jul 31 '24 21:07 hcho3

Also, we might need to schedule an online interview, @hcho3 might want to join as well.

Yes, I would love to join.

hcho3 avatar Aug 02 '24 21:08 hcho3

Some questions about build artifacts and packaging:

  • Do we continue to build two variants of libxgboost4j.so, one with CUDA and one without?
  • Now that the xgboost4j-gpu package is gone, where do we put libxgboost4j.so (the CUDA variant)? Does it go inside the xgboost4j-spark-gpu package?
  • How long do we plan to distribute the Scala 2.12 and 2.13 variants?

hcho3 avatar Aug 02 '24 22:08 hcho3

@trivialfis Do you have a particular agenda in mind for the online interview? Would it be primarily a Q&A session? I can set up an Outlook invite to choose the best time for all of us.

hcho3 avatar Aug 02 '24 22:08 hcho3

For a breaking-changes PR, I am mostly concerned about the user interface, hence the request for documentation:

  • How does the service loader work? What's the sequence of events for a CPU pipeline and a GPU pipeline? Also, the combination of RAPIDS SQL and XGBoost parameters (4 cases, with on/off on both sides).
  • What changes are required from users? Can we emit deprecation warnings or informative errors, or is it just a hard break?
  • The sequence of events when a sparse vector is used, and the memory-usage profile of various types of data.
  • What ETL was removed, and how does it change users' code?
  • What migration guide is needed?
  • What are the upcoming changes?
  • What are the differences between Scala and PySpark?
  • How well does it integrate with Spark ML training utilities like CV?
  • Any design plan for external memory? It's taking shape in the latest XGBoost.
  • Any plan for resilience?

And we will need a clear picture for both CPU and GPU. As for the internal code structure, I don't have any concerns. It looks readable to me, and I'm fairly sure that @wbo4958 does a better job than me on JVM-based projects.

If there's no way to even emit an error for the breaking change, should we publish a different package on Maven?
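For context on the service-loader question above: JVM plugin discovery is typically built on java.util.ServiceLoader, which scans the classpath for provider entries under META-INF/services. The sketch below shows the general mechanism only; the trait name and discovery logic are hypothetical, and the real xgboost4j-spark plugin interface may differ:

```scala
import java.util.ServiceLoader
import scala.jdk.CollectionConverters._

// Hypothetical plugin interface. A GPU package would ship an implementing
// class plus a META-INF/services/<fully.qualified.TraitName> file naming it.
trait XGBoostPluginSketch {
  def isEnabled: Boolean
}

object PluginDiscoverySketch {
  // Returns the first enabled plugin found on the classpath, if any.
  // With no GPU jar present, the loader yields nothing and the caller
  // falls back to the CPU code path.
  def discover(): Option[XGBoostPluginSketch] =
    ServiceLoader.load(classOf[XGBoostPluginSketch]).asScala.find(_.isEnabled)
}
```

This is why adding or removing the GPU jar from the classpath can switch the pipeline behavior without any user-code change, which is exactly the on/off combination matrix being asked about.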

trivialfis avatar Aug 03 '24 13:08 trivialfis

Sorry for the delay, I was on vacation these days.

wbo4958 avatar Aug 04 '24 00:08 wbo4958

Some questions about build artifacts and packaging:

  • Do we continue to build two variants of libxgboost4j.so, one with CUDA and one without?

[Bobby] Yes, we still need to build two variants of libxgboost4j.so.

  • Now that xgboost4j-gpu package is gone, where to we put libxgboost4j.so (CUDA variant)? Does it go inside the xgboost4j-spark-gpu package?

[Bobby] The libxgboost4j.so with CUDA will reside in the xgboost4j package. xgboost4j-spark-gpu now depends on xgboost4j/xgboost4j-spark.

  • How long do we plan to distribute the Scala 2.12 and 2.13 variants?

[Bobby] What do you mean by "how long"?

wbo4958 avatar Aug 04 '24 00:08 wbo4958

@trivialfis @hcho3, yeah, let's have an online review session.

wbo4958 avatar Aug 04 '24 00:08 wbo4958

libxgboost4j.so with cuda will reside in xgboost4j package

CPU users will complain about the size of xgboost4j package if we choose this route.

hcho3 avatar Aug 04 '24 01:08 hcho3

libxgboost4j.so with cuda will reside in xgboost4j package

CPU users will complain about the size of xgboost4j package if we choose this route.

Not exactly. When compiling the CPU version, there's nothing CUDA-related in the libxgboost4j.so that gets published to the public/nightly repositories. For the GPU version, we first compile libxgboost4j.so with CUDA and put it into the xgboost4j package, but we don't distribute that xgboost4j to the Maven repo; instead, we only distribute the xgboost4j-spark-gpu package, which is an uber package that includes xgboost4j/xgboost4j-spark.

wbo4958 avatar Aug 05 '24 01:08 wbo4958

Hey @trivialfis @hcho3, CI passed. Could you help review it? Thanks.

wbo4958 avatar Aug 08 '24 05:08 wbo4958

Could you please help address the implications of the following items:

  • Remove any ETL operations in XGBoost

For example, the removed ETL includes caching the dataset and collecting the group information for the ranking case.

wbo4958 avatar Aug 09 '24 08:08 wbo4958

@WeichenXu123 @rongou @Craigacp It would be great if we could get some help reviewing the PR when you are available.

trivialfis avatar Aug 09 '24 09:08 trivialfis

I haven't written any Spark or Scala code since Spark 1.4 so I don't think I'll have much ability to review this PR.

Craigacp avatar Aug 10 '24 21:08 Craigacp

I've started reviewing the Spark-related code, but it takes a while.

dotbg avatar Aug 10 '24 23:08 dotbg

Hi @hcho3 @trivialfis, could you help review it again? Thanks.

wbo4958 avatar Aug 22 '24 03:08 wbo4958

Hi @trivialfis, I've addressed all the comments.

wbo4958 avatar Aug 28 '24 03:08 wbo4958

@wbo4958 Where is the definition of xgboost ranker?

trivialfis avatar Aug 28 '24 19:08 trivialfis

@trivialfis, I will add it in a follow-up PR; I didn't want to blow up this PR.

wbo4958 avatar Aug 28 '24 22:08 wbo4958

For the ranking task, I think the library should make sure that records with the same group column go to the same task; otherwise it will impact the model performance, as we describe in 9491. It looks like this PR doesn't fix this issue yet.

jinmfeng avatar Sep 04 '24 08:09 jinmfeng

I will look into the ranking later. In theory, as long as the partitions for each local worker are sorted, the gradient is considered correct.
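The invariant being discussed is that all records sharing a group (qid) land on the same worker, and that each worker's partition is sorted by group so a group's rows are contiguous. A minimal, Spark-free sketch of that invariant, with illustrative names only (the real implementation would operate on Spark partitions):

```scala
// Illustrative sketch: route rows to workers by hashing the group id,
// so every record of a group ends up on the same worker, then sort each
// worker's slice by group id so groups are contiguous.
final case class RankRow(qid: Int, features: Vector[Double])

def partitionByGroup(rows: Seq[RankRow], numWorkers: Int): Seq[Seq[RankRow]] = {
  require(numWorkers > 0, "need at least one worker")
  rows
    .groupBy(r => math.floorMod(r.qid.hashCode, numWorkers)) // same qid -> same worker
    .toSeq
    .sortBy { case (worker, _) => worker }
    .map { case (_, part) => part.sortBy(_.qid) }            // contiguous groups
}
```

A plain repartition by row count would not give this guarantee, which is the concern raised above: a group split across workers yields incorrect per-group gradients for ranking objectives.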

trivialfis avatar Sep 05 '24 03:09 trivialfis

Huge thanks to @dotbg @eordentlich for the reviews and for catching some issues early; I wouldn't have been able to do that myself. Could you please help approve the PR for the Spark handling?

I don't have any more comments about logic related to XGBoost itself.

trivialfis avatar Sep 10 '24 06:09 trivialfis

Hi @trivialfis, could you help merge it?

wbo4958 avatar Sep 11 '24 07:09 wbo4958