Zhanghao Wu comments

Results: 315 comments of Zhanghao Wu

Add Imagenet Benchmark

> * Could you also post some stats on the dataset? Number of files, size of each file, directory hierarchy (if possible)?

Great suggestions! Added the information about the dataset...

Occasionally fail to launch azure cluster

After digging into the problem, it seems to be caused by [these lines](https://github.com/ray-project/ray/blob/5ea565317a8104c04ae7892bb9bb41c6d72f12df/python/ray/autoscaler/_private/commands.py#L641-L649) of Ray, where `ray` thinks the head node launch failed due to the timeout (50...

Optimizing & Provisioning Retries at the granularity of regions/zones

Thanks for the fantastic and quick work! Just scanned the PR and have several questions: 1. What are the bugs fixed in this PR? It would be nice to have...

[Ray Autoscaler] `ray status`, Accelerator Placement Group

Did you launch the task with `sky launch` using the generated Ray program? Ray's resource allocation is a bookkeeping system. Despite `num_gpus=2` for a pg, you may...
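To illustrate what "bookkeeping" means here: Ray treats a request such as `num_gpus=2` as a logical quantity used for scheduling, and does not by itself limit physical usage. The sketch below is a toy ledger in plain Python, not Ray's implementation. `ResourceLedger` and its methods are hypothetical names made up for this example.

```python
class ResourceLedger:
    """Toy model of logical-resource bookkeeping (hypothetical, not Ray's code)."""

    def __init__(self, total):
        # Logical capacity per resource name, e.g. {"GPU": 4}.
        self.available = dict(total)

    def try_reserve(self, request):
        # Admit the request only if every requested quantity fits the ledger.
        if any(self.available.get(name, 0) < amount
               for name, amount in request.items()):
            return False
        for name, amount in request.items():
            self.available[name] -= amount
        return True

    def release(self, request):
        # Return previously reserved quantities to the ledger.
        for name, amount in request.items():
            self.available[name] = self.available.get(name, 0) + amount


ledger = ResourceLedger({"GPU": 4})
first = ledger.try_reserve({"GPU": 2})   # fits: 4 -> 2
second = ledger.try_reserve({"GPU": 2})  # fits: 2 -> 0
third = ledger.try_reserve({"GPU": 1})   # rejected: nothing left in the ledger
```

Nothing in the ledger inspects real devices. It only tracks numbers, which is why a reservation and the actual hardware usage can diverge.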

Gracefully handle long cluster names & for spot, show FAILED_CONTROLLER instead

`test_id` was added so that a second smoke test does not reuse a cluster that the first smoke test launched and failed to terminate. Also, for the `test_spot_recovery`, we have to...
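One way such a `test_id` could work is sketched below. It assumes the id is a short random suffix appended to the cluster name, which the comment does not confirm. `make_test_id` and `cluster_name` are hypothetical helpers, not functions from the project's test suite.

```python
import uuid


def make_test_id(length=8):
    # Assumption: a short random hex string is enough to tell test runs apart.
    return uuid.uuid4().hex[:length]


def cluster_name(test_name, test_id):
    # Appending the id means a cluster left over from an earlier run,
    # which carries a different id, will not be picked up by name.
    return f"{test_name}-{test_id}"


test_id = make_test_id()
name = cluster_name("smoke", test_id)
```

Each test session would generate its own `test_id` once and reuse it for all cluster names in that session.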

Gracefully handle long cluster names & for spot, show FAILED_CONTROLLER instead

Setting `FAILED_NO_RESOURCE` for the long cluster name could be a bit tricky: when no clouds are specified, it is possible that AWS failed because resources were unavailable and the...

Too many `ray job` related commands block newly arrived `ray job`

> We can try to directly query against redis. It's a hack as it assumes ray job internals (redis key formats, etc.).

Good point! I am thinking that we can...

Too many `ray job` related commands block newly arrived `ray job`

I asked them to remove the `JobUpdateEvent` from the skylet and restart the controller, and it seems that it is now able to run for about 2 hours and is...

Too many `ray job` related commands block newly arrived `ray job`

It seems that Ray's job_manager is buggy. They do call `ray.actor.exit_actor()` in `JobSupervisor.run`, and afterwards the actor can no longer be found with `ray.get_actor` (it raises a ValueError). The actor...

Too many `ray job` related commands block newly arrived `ray job`

Just found another easier way to reproduce the error: ``` for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'; sleep...