xgboost Xgboost pyspark: Support pyspark Sparse Vector

Open WeichenXu123 opened this issue 2 years ago • 4 comments

Xgboost pyspark: Support pyspark Sparse Vector

Jul 22 '22 11:07 WeichenXu123

@trivialfis @mengxr hopes XGBoost 2.0 can support this on Databricks. The feature requires an unwrap_udt API, which is currently available only on Databricks. Can we add this feature now? I will check whether the code is running in a Databricks environment and, if so, import the Databricks-specific unwrap_udt function.

Btw, when apache/spark 3.4 comes out, the unwrap_udt API will be added there, and then this feature can also be supported on apache/spark.
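To illustrate why unwrapping the vector type matters, here is a minimal sketch of turning unwrapped rows into CSR-style arrays. It relies on the struct layout that PySpark's VectorUDT uses (fields type, size, indices, values, with type 0 for sparse and 1 for dense). The helper name rows_to_csr and the use of plain dicts as rows are hypothetical and only for illustration; the thread does not say how xgboost actually consumes the unwrapped columns.

```python
# Values of the "type" field in the struct form of a PySpark ML vector.
SPARSE, DENSE = 0, 1


def rows_to_csr(rows):
    """Build CSR arrays (indptr, indices, data) from unwrapped vector rows.

    `rows` is an iterable of dicts with the keys "type", "size",
    "indices" and "values" (a hypothetical stand-in for the struct
    columns that unwrap_udt would expose).
    """
    indptr, indices, data = [0], [], []
    for row in rows:
        if row["type"] == SPARSE:
            # Sparse rows already carry explicit column indices.
            indices.extend(row["indices"])
        else:
            # Dense rows store every position, so the indices are 0..n-1.
            indices.extend(range(len(row["values"])))
        data.extend(row["values"])
        indptr.append(len(indices))
    return indptr, indices, data


rows = [
    {"type": SPARSE, "size": 4, "indices": [1, 3], "values": [2.0, 5.0]},
    {"type": DENSE, "size": None, "indices": None, "values": [1.0, 0.0, 3.0]},
]
indptr, indices, data = rows_to_csr(rows)
```

The three lists are in the shape expected by a CSR matrix constructor such as scipy.sparse.csr_matrix, which is left out here to keep the sketch dependency-free.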

Jul 22 '22 11:07 WeichenXu123

Instead of checking whether it is on Databricks, we should just test if this method exists.
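A minimal sketch of that suggestion: probe for the function by trying to import it, rather than inspecting the environment. The helper names find_first_callable and get_unwrap_udt are hypothetical. The first candidate path is where open-source PySpark is expected to expose unwrap_udt from 3.4 on; the second, Databricks-specific path is an assumption and should be checked against the actual runtime.

```python
import importlib

# Candidate locations, tried in order.
# 1) Open-source PySpark (expected from Spark 3.4 on).
# 2) ASSUMED Databricks-only module path; verify before relying on it.
UNWRAP_UDT_CANDIDATES = [
    ("pyspark.sql.functions", "unwrap_udt"),
    ("pyspark.databricks.sql.functions", "unwrap_udt"),
]


def find_first_callable(candidates):
    """Return the first importable callable among (module, attribute) pairs."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module is missing in this environment; try the next candidate.
            continue
        fn = getattr(module, attr, None)
        if callable(fn):
            return fn
    return None


def get_unwrap_udt():
    """Return unwrap_udt if any candidate provides it, else raise."""
    fn = find_first_callable(UNWRAP_UDT_CANDIDATES)
    if fn is None:
        raise RuntimeError(
            "unwrap_udt is not available; sparse vector input cannot be used."
        )
    return fn
```

Because the check is driven by what can be imported, the same code path would work on any runtime that ships the function, without a separate environment test.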

Jul 22 '22 15:07 mengxr

@trivialfis

Can we add this feature now? The code can check whether it is running in a Databricks environment and, if so, import the Databricks-specific unwrap_udt function.

What do you think?

Jul 26 '22 10:07 WeichenXu123

https://github.com/dmlc/xgboost/issues/8108#issuecomment-1192705836 sounds good

Jul 26 '22 19:07 trivialfis

Hi @WeichenXu123, is it possible for us to use mapInArrow instead of mapInPandas? What's the current status of the API, and does it support device memory? https://arrow.apache.org/docs/cpp/memory.html#devices

Aug 11 '22 09:08 trivialfis

Hi @WeichenXu123, is it possible for us to use mapInArrow instead of mapInPandas? What's the current status of the API, and does it support device memory? https://arrow.apache.org/docs/cpp/memory.html#devices

CC @HyukjinKwon

Aug 11 '22 10:08 WeichenXu123

Got reply from @HyukjinKwon:

It should be supported, although nobody has tested it: it exposes the native PyArrow batch instances, so if PyArrow supports it, it should work.

We can try it as a follow-up optimization.

Aug 11 '22 10:08 WeichenXu123
