Kai-Hsun Chen comments

Results: 327 comments of Kai-Hsun Chen


[Bug] RayJob falsely marked as "Running" when driver fails

Closed by https://github.com/ray-project/kuberay/pull/2579

[Bug] RayJob does not work when `app.kubernetes.io/name` is set

Closed by #2166

[RayTrain] Checkpoint API to recover from checkpoint from previous runs

> It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run. Do you...

[core][experimental] Higher than expected overhead for shared memory channels with NCCL

@stephanie-wang I will take a look at this issue next week. Would you mind pointing out which tests you are referring to? Thanks!

[core][experimental] Higher than expected overhead for shared memory channels with NCCL

I ran [accelerated_dag_gpu_microbenchmark.py](https://github.com/ray-project/ray/blob/master/release/microbenchmark/experimental/accelerated_dag_gpu_microbenchmark.py) on my GPU machine, and got the following results:

```
exec_nccl_gpu per second 5798.01 +- 16.49
exec_ray_dag_gpu_nccl_static_shape_direct_return per second 3041.9 +- 3.4
```

* `benchmark_nccl.py` uses the...
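To put the two throughput figures quoted above on a common scale, they can be converted into per-iteration times. This is only a rough sketch: it assumes each benchmark runs its iterations back to back, so that the inverse of "per second" approximates the time per iteration. The helper name `latency_us` and the variable names are illustrative and not part of the benchmark script.

```python
# Throughput figures copied from the benchmark output above (iterations per second).
NCCL_PER_SEC = 5798.01      # exec_nccl_gpu
RAY_DAG_PER_SEC = 3041.9    # exec_ray_dag_gpu_nccl_static_shape_direct_return


def latency_us(iterations_per_second: float) -> float:
    """Convert a throughput into an approximate per-iteration time in microseconds."""
    return 1e6 / iterations_per_second


nccl_us = latency_us(NCCL_PER_SEC)
ray_dag_us = latency_us(RAY_DAG_PER_SEC)

# Extra time per iteration on the Ray DAG path, and its relative throughput.
gap_us = ray_dag_us - nccl_us
throughput_ratio = RAY_DAG_PER_SEC / NCCL_PER_SEC

if __name__ == "__main__":
    print(f"pure NCCL:  {nccl_us:.1f} us per iteration")
    print(f"Ray DAG:    {ray_dag_us:.1f} us per iteration")
    print(f"difference: {gap_us:.1f} us per iteration")
    print(f"throughput ratio: {throughput_ratio:.2f}")
```

Under that assumption, the figures work out to about 172 µs versus 329 µs per iteration, a gap of roughly 156 µs. The arithmetic says nothing about where the extra time goes.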

[core][experimental] Higher than expected overhead for shared memory channels with NCCL

The main performance difference between pure NCCL and RayCG is due to benchmarking two different aspects.

* For pure NCCL, we only measure the time for NCCL I/O.
* For...

[core][experimental] Higher than expected overhead for shared memory channels with NCCL

I documented the process of verifying the calculation in https://github.com/ray-project/ray/pull/48860#issuecomment-2512348571. Closing this issue. @stephanie-wang, feel free to reopen the issue if I missed anything.

[Feature] Configurable RayCluster readiness definition

https://github.com/ray-project/kuberay/issues/533

[Feature] Configurable RayCluster readiness definition

Chatted with @rueian today. Currently, we redefine "ready" with a new RayCluster condition called `RayClusterReady`. This condition indicates whether all Ray Pods are ready when the RayCluster is first created....

[RayJob] Add spec.backoffLimit for retrying RayJobs with new clusters

I just had a quick glance, and I think we should reuse the code path of the existing state machine. To elaborate, we can add a new state `rayv1.JobDeploymentStatusRestarting`, which...
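The idea of routing retries through the existing state machine can be illustrated with a toy model. This is a minimal sketch and not KubeRay's implementation: the `State` enum, the `Job` dataclass, and the `step` function are hypothetical, the state set is deliberately simplified, and `RESTARTING` merely stands in for the proposed `rayv1.JobDeploymentStatusRestarting`. It also assumes that `backoffLimit` counts the retries allowed before the job is marked failed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    # Simplified, hypothetical subset of deployment states.
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    RESTARTING = "Restarting"  # stand-in for the proposed new state
    COMPLETE = "Complete"
    FAILED = "Failed"


@dataclass
class Job:
    backoff_limit: int                  # assumed: retries allowed before giving up
    state: State = State.INITIALIZING
    failures: int = 0


def step(job: Job, cluster_ready: bool = False,
         job_succeeded: Optional[bool] = None) -> State:
    """Advance the toy state machine by one reconcile-like step."""
    if job.state is State.INITIALIZING:
        if cluster_ready:
            job.state = State.RUNNING
    elif job.state is State.RUNNING:
        if job_succeeded is True:
            job.state = State.COMPLETE
        elif job_succeeded is False:
            job.failures += 1
            # Retry while the failure count is within the limit.
            if job.failures <= job.backoff_limit:
                job.state = State.RESTARTING
            else:
                job.state = State.FAILED
    elif job.state is State.RESTARTING:
        # A fresh cluster would be requested here; afterwards the job
        # re-enters the same path a brand-new job takes.
        job.state = State.INITIALIZING
    return job.state
```

In this sketch, a job with `backoff_limit=1` that fails twice goes through `RUNNING`, `RESTARTING`, `INITIALIZING`, `RUNNING`, and finally `FAILED`. The retry adds a single extra state and reuses the normal initialization path.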