Zhanghao Wu

209 issues by Zhanghao Wu

Our autostop only considers the `run` section, but the `setup` section can take a very long time, so the cluster can be stopped while setup is still running.

P1
Initial-User-Issue
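A minimal sketch of the intended idleness check, assuming each job records whether it is in setup, running, or done. `JobState`, `should_autostop`, and their fields are hypothetical names for illustration, not SkyPilot's autostop code. The point is that a job still in setup counts as activity, just like a job in its run section.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobState:
    # Hypothetical job record. phase is one of "SETUP", "RUNNING", "DONE".
    phase: str
    last_active: float  # Unix timestamp of the job's last activity


def should_autostop(
    jobs: List[JobState],
    idle_minutes: float,
    cluster_started_at: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if the cluster has been idle for at least idle_minutes."""
    now = time.time() if now is None else now
    # A job in setup keeps the cluster busy, not only a job in its run section.
    if any(j.phase in ("SETUP", "RUNNING") for j in jobs):
        return False
    # Idle time is measured from the latest activity (or cluster start if no jobs).
    last_activity = max([cluster_started_at] + [j.last_active for j in jobs])
    return now - last_activity >= idle_minutes * 60
```

With this check, a long setup never counts toward the idle timer; the timer only starts once every job has left both setup and run.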

Currently, we cache the spot job status locally whenever the user calls `sky spot status`, which can return a somewhat confusing job table, as some of the jobs may still...

enhancement

Although `aws s3 ls s3://imagenet-bucket` can list the files in the bucket (which I do not own), mounting the bucket with the following YAML raises an `AccessDenied` error. ```yaml resources:...

bug

I am testing transferring a tar file from a bucket to EBS. With `aws s3 cp`, the throughput stays around 200 MB/s. When I mount the bucket with our...

We may want to expose our Python APIs, make them easier to use, and add them to our documentation. This may be related to #871.

Initial-User-Issue
feature-request

Some inference tasks can run on several different resources, e.g., V100:8, V100:4, or V100:1. Depending on availability, the user wants Sky to fail over from 8 to 1....

Initial-User-Issue
feature-request
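One way to express the requested behavior is an ordered list of candidate resources, tried from most to least preferred. This is a minimal sketch assuming a hypothetical `try_launch(spec)` callback that returns whether provisioning succeeded; neither it nor `launch_with_failover` is an existing Sky API.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def launch_with_failover(
    candidates: Sequence[str],
    try_launch: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Try each resource spec in order of preference.

    Returns the first spec that launched (or None) and the specs attempted.
    """
    attempted: List[str] = []
    for spec in candidates:
        attempted.append(spec)
        if try_launch(spec):  # hypothetical provisioning callback
            return spec, attempted
    return None, attempted
```

For example, if only `V100:4` is available, `launch_with_failover(["V100:8", "V100:4", "V100:1"], lambda s: s in {"V100:4"})` tries `V100:8` first and then settles on `V100:4` without ever trying `V100:1`.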

To test how well goofys performs on a real deep learning workload, I made this benchmark. ## ImageNet Dataset Information ### Stats * 1M training images, and 50K val...

A GCP user without permission to set IAM policies (the permission shown in the following figure) cannot call setIAMPolicy for the ray-autoscaler service account, which causes the following...

Our current optimizer does not consider regions, so it ignores the cost of moving data between regions. This matters especially when we are trying to support chain DAGs or recovery...

enhancement
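To illustrate what region-aware placement could look like for a chain DAG, here is a small dynamic program that picks one region per task, minimizing compute cost plus the cost of moving each task's output to the next task's region. It is a sketch under stated assumptions: per-region compute costs are known, and inter-region transfer uses a single flat per-GB egress price. `place_chain` and all of its inputs are hypothetical, not the Sky optimizer.

```python
from typing import Dict, List, Tuple


def place_chain(
    compute_cost: List[Dict[str, float]],  # per task: region -> compute cost
    output_gb: List[float],                 # GB passed from task i to task i+1
    egress_per_gb: float,                   # assumed flat inter-region price
) -> Tuple[float, List[str]]:
    """Choose a region per task in a chain, minimizing compute + transfer cost."""
    # best[r] = (cost, placement) of the cheapest placement of tasks 0..i ending in r
    best = {r: (c, [r]) for r, c in compute_cost[0].items()}
    for i in range(1, len(compute_cost)):
        new_best = {}
        for r, run_cost in compute_cost[i].items():
            candidates = []
            for prev, (cost, path) in best.items():
                # Moving task i-1's output costs egress only if the region changes.
                move = 0.0 if prev == r else output_gb[i - 1] * egress_per_gb
                candidates.append((cost + move, path))
            cost, path = min(candidates, key=lambda c: c[0])
            new_best[r] = (cost + run_cost, path + [r])
        best = new_best
    return min(best.values(), key=lambda c: c[0])
```

Because the transfer cost in a chain depends only on consecutive tasks, keeping the cheapest placement per ending region at each step is sufficient (a Viterbi-style recursion), so the work grows linearly with the number of tasks and quadratically with the number of regions.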

It would be great to have a tutorial in our documentation for using TPU nodes and TPU VMs. Currently, we only have a single line in the...

P0