Xgboost pyspark: Support pyspark Sparse Vector
@trivialfis
@mengxr hopes XGBoost 2.0 can support this on Databricks.
The feature requires the `unwrap_udt` API, which is currently available only on Databricks.
Can we add this feature now? I will check whether the code is running in a Databricks environment and, if so, import the Databricks-specific `unwrap_udt` function.
Btw, when apache/spark 3.4 comes out, the `unwrap_udt` API will be added there, and then apache/spark can also support this feature.
Instead of checking whether it is on Databricks, we should just test if this method exists.
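A minimal sketch of that suggestion: probe for the function and skip any platform check. It assumes `unwrap_udt` lives in `pyspark.sql.functions`, which is where Apache Spark 3.4 puts it. The Databricks-specific import location is not given in this thread, so it is not probed here. The helper name `optional_attr` and the flag `SPARSE_SUPPORTED` are hypothetical, chosen for illustration only.

```python
import importlib
from typing import Any, Optional


def optional_attr(module_name: str, attr: str) -> Optional[Any]:
    """Return module_name.attr if both exist, otherwise None (hypothetical helper)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing, e.g. pyspark is not installed.
        return None
    # The module exists; the attribute may still be absent on older versions.
    return getattr(module, attr, None)


# Assumption: Spark 3.4+ exposes unwrap_udt in pyspark.sql.functions.
unwrap_udt = optional_attr("pyspark.sql.functions", "unwrap_udt")

# Enable the sparse-vector code path only when the function is available.
SPARSE_SUPPORTED = unwrap_udt is not None
```

With this approach the same code works on any runtime that ships the function, and degrades to the dense-only path elsewhere without environment sniffing.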
@trivialfis
Can we add this feature now? The code can check whether it is running in a Databricks environment and, if so, import the Databricks-specific `unwrap_udt` function.
What do you think?
https://github.com/dmlc/xgboost/issues/8108#issuecomment-1192705836 sounds good
Hi @WeichenXu123, is it possible for us to use `mapInArrow` instead of `mapInPandas`? What's the current status of the API, and does it support device memory? https://arrow.apache.org/docs/cpp/memory.html#devices
CC @HyukjinKwon
Got a reply from @HyukjinKwon:
It should be supported, although nobody has tested it. It exposes the native PyArrow batch instances, so if PyArrow supports device memory, it should work.
We can try it as a follow-up optimization.