Zhi Lin

Results: 84 comments by Zhi Lin

You can try using `sdf.mapInPandas` instead of an RDD flatMap. Here is a [doc](https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html). This step is quite similar to the `to_spark` function in the previously mentioned PR (arrow table and...
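As a rough illustration, a `mapInPandas` call might look like the sketch below; the column names, the doubling transform, and the schema string are placeholders rather than anything from the original thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 2.0), (2, 3.0)], ["id", "value"])

def map_func(batches):
    # `batches` is an iterator of pandas DataFrames, one per Arrow record batch.
    for pdf in batches:
        pdf["value"] = pdf["value"] * 2
        yield pdf

# The schema string describes the output columns produced by map_func.
result = sdf.mapInPandas(map_func, schema="id long, value double")
result.show()
```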

Notice that the `x` passed to `map_func` will be an iterator of pandas DataFrames. If this is not clear, please search for `mapInPandas` in the [doc](https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html). The function should look like this one in...
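For example, if the whole partition is needed at once, the function can drain the iterator first. This is just a sketch with an assumed `value` column, not the function from the linked discussion:

```python
import pandas as pd

def map_func(batches):
    # mapInPandas passes an iterator of pandas DataFrames, not a single DataFrame.
    chunks = list(batches)
    if not chunks:  # an empty partition produces no batches
        return
    pdf = pd.concat(chunks, ignore_index=True)
    yield pdf.assign(value=pdf["value"] + 1)
```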

> I wonder what implications I might encounter without using Dataset as the intermediary.

I don't think there is a big difference. We need to use `to_pandas` because the data stored...
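A minimal sketch, assuming the `to_pandas` in question is `ray.data.Dataset.to_pandas` (the blocks of a Ray Dataset are stored in Arrow format, so they have to be converted before pandas-based code can consume them):

```python
import ray

ray.init(ignore_reinit_error=True)
ds = ray.data.range(100)   # small illustrative dataset
pdf = ds.to_pandas()       # materialize the Arrow blocks as one pandas DataFrame
print(pdf.head())
```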

When printing available resources, have all executor actors started? It might take some time. Do all nodes have the same resources (at least CPU and memory)?
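One way to check this from the driver, assuming Ray is already initialized and connected to the cluster:

```python
import ray

ray.init(address="auto", ignore_reinit_error=True)
print(ray.cluster_resources())    # total resources registered by all nodes
print(ray.available_resources())  # resources not currently held by actors/tasks
```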

Hmm, I think you can use 2 GB of memory per core, so that you can use all the cores on your cluster. If that's not enough for your workload, then you have...
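A hedged sketch of that sizing with `raydp.init_spark`; the executor count, core count, and app name are placeholders, not values from this thread:

```python
import ray
import raydp

ray.init(address="auto")
spark = raydp.init_spark(
    app_name="example",
    num_executors=2,
    executor_cores=4,
    executor_memory="8GB",  # 4 cores * 2 GB per core
)
```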

Hi @gbraes, we are very excited to see that you have tried many pipelines with raydp! We have never tested raydp with Kafka, though. Can you please give a...

Hi @gbraes, glad you made it work! This is quite strange, since 0.4.2 just added support for Ray 1.11 and 1.12; there are no major changes. We'll look into...

This might be a good idea! Thanks for your advice. I have one concern, though: Ray Dataset uses the Arrow format while Spark DataFrame uses its own internal format. But we'll...

Hi @Hoeze, `applyInPandas` will start Python workers, and these workers are not connected to Ray. An actor is itself a process, so it's not really possible to 'reuse' its session. In...

Sorry, this file is pretty stale now; we have not updated it for a long time. Please refer to [here](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) to start a Ray cluster on k8s.