Zhanghao Wu
> * Could you also post some stats on the dataset? Number of files, size of each file, directory hierarchy (if possible)?

Great suggestions! Added the information about the dataset...
After digging into the problem, it seems to be caused by [these lines](https://github.com/ray-project/ray/blob/5ea565317a8104c04ae7892bb9bb41c6d72f12df/python/ray/autoscaler/_private/commands.py#L641-L649) in Ray, where `ray` thinks that launching the head node failed because of the timeout (50...
Thanks for the fantastic and quick work! Just scanned the PR and have several questions: 1. What are the bugs fixed in this PR? It would be nice to have...
Did you launch the task with `sky launch` using the generated ray program? Ray's resource allocation is a bookkeeping system. Despite `num_gpus=2` for a pg, you may...
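To illustrate what "bookkeeping" means here, below is a toy sketch in plain Python. It is not Ray's implementation: `LogicalResourceLedger` and its methods are hypothetical names used only for illustration. The point is that the scheduler tracks declared resource counts and does not itself check what a task actually uses on the machine.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalResourceLedger:
    """Toy stand-in for a scheduler's resource accounting (hypothetical, not Ray code)."""

    total: dict
    reserved: dict = field(default_factory=dict)

    def available(self, name: str) -> float:
        # Availability is purely arithmetic on declared numbers.
        return self.total.get(name, 0) - self.reserved.get(name, 0)

    def reserve(self, name: str, amount: float) -> bool:
        # Grant the request only if the declared count still fits.
        if amount > self.available(name):
            return False
        self.reserved[name] = self.reserved.get(name, 0) + amount
        return True


ledger = LogicalResourceLedger(total={"GPU": 4})
granted = ledger.reserve("GPU", 2)  # analogous to declaring num_gpus=2
# The ledger only records the request. Nothing in it inspects or limits
# which devices the task really touches once it is running.
```

In this sketch a second request for 3 GPUs would be rejected because only 2 remain in the ledger, regardless of how busy the physical devices are.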
`test_id` was added to prevent the second smoke test from reusing a cluster that was launched in the first smoke test and never terminated. Also, for the `test_spot_recovery`, we have to...
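A minimal sketch of the idea, assuming the ID is simply appended to the cluster name. The helper names `new_test_id` and `make_cluster_name` are hypothetical, and the actual smoke tests may build names differently.

```python
import uuid


def new_test_id() -> str:
    # Short random suffix generated once per test run (assumed format).
    return uuid.uuid4().hex[:8]


def make_cluster_name(test_name: str, test_id: str) -> str:
    # A different test_id gives a different cluster name, so a leftover
    # cluster from an earlier run is not picked up by accident.
    return f"{test_name}-{test_id}"


first_run = make_cluster_name("smoke", "a1b2c3d4")
second_run = make_cluster_name("smoke", "e5f6a7b8")
```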
Setting `FAILED_NO_RESOURCE` for the long cluster name could be a bit tricky: when no clouds are specified, it is possible that AWS failed because resources were unavailable and the...
> We can try to directly query against redis. It's a hack as it assumes ray job internals (redis key formats, etc.).

Good point! I am thinking that we can...
I asked them to remove the `JobUpdateEvent` from the skylet and restart the controller, and it seems that it is now able to run for about 2 hours and is...
It seems that Ray's job_manager is buggy. They do call `ray.actor.exit_actor()` in `JobSupervisor.run`, and afterwards the actor can no longer be found with `ray.get_actor`, which raises a ValueError. The actor...
Just found another, easier way to reproduce the error: ``` for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'; sleep...