Kai-Hsun Chen
Closed by https://github.com/ray-project/kuberay/pull/2579
Closed by #2166
> It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run. Do you...
@stephanie-wang I will take a look at this issue next week. Would you mind pointing out which tests you are referring to? Thanks!
I ran [accelerated_dag_gpu_microbenchmark.py](https://github.com/ray-project/ray/blob/master/release/microbenchmark/experimental/accelerated_dag_gpu_microbenchmark.py) on my GPU machine and got the following results:

```
exec_nccl_gpu per second 5798.01 +- 16.49
exec_ray_dag_gpu_nccl_static_shape_direct_return per second 3041.9 +- 3.4
```

* `benchmark_nccl.py` uses the...
The main performance difference between pure NCCL and RayCG arises because the two benchmarks measure different things.

* For pure NCCL, we only measure the time for NCCL I/O.
* For...
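
To make the gap easier to reason about, the two throughput figures reported above can be converted into per-iteration latencies. This is only a rough sketch: the constants are copied from the benchmark output quoted earlier, the helper name `per_iteration_us` is made up for illustration, and the calculation says nothing about which component accounts for the extra time.

```python
# Throughput figures (iterations per second) copied from the benchmark output
# quoted above. The +- spread is ignored in this back-of-the-envelope estimate.
NCCL_PER_SEC = 5798.01      # exec_nccl_gpu
RAY_DAG_PER_SEC = 3041.9    # exec_ray_dag_gpu_nccl_static_shape_direct_return


def per_iteration_us(throughput_per_sec: float) -> float:
    """Convert iterations/second into microseconds per iteration.

    Hypothetical helper for illustration; not part of the benchmark script.
    """
    return 1e6 / throughput_per_sec


nccl_us = per_iteration_us(NCCL_PER_SEC)
ray_dag_us = per_iteration_us(RAY_DAG_PER_SEC)
gap_us = ray_dag_us - nccl_us              # extra time per iteration
ratio = NCCL_PER_SEC / RAY_DAG_PER_SEC     # throughput ratio

print(f"pure NCCL : {nccl_us:.1f} us/iter")
print(f"RayCG     : {ray_dag_us:.1f} us/iter")
print(f"gap       : {gap_us:.1f} us/iter ({ratio:.2f}x throughput difference)")
```

With these numbers, pure NCCL takes roughly 172 us per iteration and RayCG roughly 329 us, so the work that only the RayCG benchmark measures costs about 156 us per iteration.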
I documented the process of verifying the calculation in https://github.com/ray-project/ray/pull/48860#issuecomment-2512348571. Closing this issue. @stephanie-wang, feel free to reopen the issue if I missed anything.
https://github.com/ray-project/kuberay/issues/533
Chatted with @rueian today. Currently, we redefine "ready" with a new RayCluster condition called `RayClusterReady`. This condition indicates whether all Ray Pods are ready when the RayCluster is first created....
I just had a quick glance, and I think we should reuse the code path of the existing state machine. To elaborate, we can add a new state `rayv1.JobDeploymentStatusRestarting`, which...