`sky launch` takes ~5s to print out the optimizer table, which is slow
- this doesn't include the catalog files refresh, which adds a few more seconds
Test:
- Based off master fe1452054
- Branch/commit adding a simple cProfile block around `sky launch`: https://github.com/skypilot-org/skypilot/commit/bfee932d76b222c133cca222fc66c4e8d665d16b
- How to get results: type `sky launch`, press ctrl-c as soon as the confirmation prompt shows
- Results: https://gist.github.com/concretevitamin/8072424ccb90543bac0df159e6222b78
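For context, such a profiling block can be sketched roughly like this (an illustrative helper, not the exact code in the linked commit):

```python
import cProfile
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and print cumulative stats restricted to 'sky' frames."""
    profiler = cProfile.Profile()
    try:
        return profiler.runcall(fn, *args, **kwargs)
    finally:
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        # Only show frames whose path matches 'sky', like the
        # "List reduced ... due to restriction <'sky'>" line below.
        stats.print_stats('sky')
```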
Top few lines
1215743 function calls (1189185 primitive calls) in 4.759 seconds
Ordered by: cumulative time
List reduced from 5569 to 331 due to restriction <'sky'>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.758 4.758 /skypilot/sky/cli.py:805(_launch_with_confirm)
1 0.000 0.000 4.351 4.351 /skypilot/sky/optimizer.py:103(optimize)
1 0.000 0.000 4.351 4.351 /skypilot/sky/optimizer.py:934(_optimize_dag)
1 0.001 0.001 4.275 4.275 /skypilot/sky/optimizer.py:231(_estimate_nodes_cost_or_time)
646 0.002 0.000 2.648 0.004 /skypilot/sky/clouds/service_catalog/__init__.py:22(_map_clouds_catalog)
1 0.000 0.000 1.929 1.929 /skypilot/sky/optimizer.py:1144(_fill_in_launchable_resources)
169 0.001 0.000 1.676 0.010 /skypilot/sky/resources.py:944(get_reservations_available_resources)
121 0.000 0.000 1.675 0.014 /skypilot/sky/clouds/gcp.py:936(get_reservations_available_resources)
121 0.000 0.000 1.675 0.014 /skypilot/sky/clouds/utils/gcp_utils.py:121(list_reservations_for_instance_type_in_zone)
1 0.000 0.000 1.674 1.674 /skypilot/sky/clouds/utils/gcp_utils.py:129(_list_reservations_for_instance_type)
1 0.000 0.000 1.674 1.674 /skypilot/sky/utils/subprocess_utils.py:144(run_with_retries)
1 0.001 0.001 1.674 1.674 /skypilot/sky/skylet/log_lib.py:129(run_with_log)
1 0.000 0.000 1.657 1.657 /skypilot/sky/skylet/log_lib.py:105(process_subprocess_stream)
1 0.000 0.000 1.640 1.640 /skypilot/sky/skylet/log_lib.py:53(_handle_io_stream)
6 0.000 0.000 1.388 0.231 /skypilot/sky/clouds/cloud.py:321(get_feasible_launchable_resources)
6 0.000 0.000 1.384 0.231 /skypilot/sky/clouds/service_catalog/__init__.py:180(get_default_instance_type)
1 0.000 0.000 0.788 0.788 /skypilot/sky/clouds/aws.py:419(_get_feasible_launchable_resources)
1 0.000 0.000 0.787 0.787 /skypilot/sky/clouds/aws.py:338(get_default_instance_type)
1 0.000 0.000 0.752 0.752 /skypilot/sky/clouds/service_catalog/aws_catalog.py:215(get_default_instance_type)
102 0.000 0.000 0.731 0.007 /skypilot/sky/clouds/service_catalog/aws_catalog.py:154(_get_df)
1 0.009 0.009 0.731 0.731 /skypilot/sky/clouds/service_catalog/aws_catalog.py:93(_fetch_and_apply_az_mapping)
292 0.020 0.000 0.681 0.002 /skypilot/sky/clouds/service_catalog/common.py:146(_get_instance_type)
1 0.000 0.000 0.676 0.676 /skypilot/sky/clouds/aws.py:595(get_current_user_identity)
168 0.000 0.000 0.626 0.004 /skypilot/sky/resources.py:890(get_cost)
168 0.000 0.000 0.626 0.004 /skypilot/sky/clouds/service_catalog/__init__.py:152(get_hourly_cost)
168 0.001 0.000 0.621 0.004 /skypilot/sky/clouds/service_catalog/common.py:236(get_hourly_cost_impl)
6 0.000 0.000 0.541 0.090 /skypilot/sky/optimizer.py:1086(_make_launchables_for_valid_region_zones)
1 0.000 0.000 0.524 0.524 /skypilot/sky/clouds/oci.py:300(_get_feasible_launchable_resources)
1 0.000 0.000 0.523 0.523 /skypilot/sky/clouds/oci.py:167(get_default_instance_type)
175 0.001 0.000 0.520 0.003 /skypilot/sky/resources.py:1090(copy)
181 0.001 0.000 0.518 0.003 /skypilot/sky/resources.py:48(__init__)
1 0.000 0.000 0.514 0.514 /skypilot/sky/clouds/service_catalog/oci_catalog.py:101(get_default_instance_type)
121 0.000 0.000 0.507 0.004 /skypilot/sky/clouds/gcp.py:274(instance_type_to_hourly_cost)
121 0.000 0.000 0.504 0.004 /skypilot/sky/clouds/service_catalog/gcp_catalog.py:214(get_hourly_cost)
16 0.000 0.000 0.487 0.030 /skypilot/sky/clouds/service_catalog/oci_catalog.py:35(_get_df)
181 0.000 0.000 0.446 0.002 /skypilot/sky/resources.py:553(_validate_and_set_region_zone)
168 0.000 0.000 0.445 0.003 /skypilot/sky/clouds/service_catalog/__init__.py:127(validate_region_zone)
168 0.001 0.000 0.440 0.003 /skypilot/sky/clouds/service_catalog/common.py:166(validate_region_zone_impl)
146 0.000 0.000 0.436 0.003 /skypilot/sky/clouds/cloud.py:463(validate_region_zone)
168 0.019 0.000 0.392 0.002 /skypilot/sky/clouds/service_catalog/common.py:352(_filter_region_zone)
...
Looks like at least a few things to optimize:
- list_reservations_for_instance_type_in_zone
- _fetch_and_apply_az_mapping
- ...
Still relevant in a many-cloud setting. With the latest #3280 merged and 6 clouds enabled, `sky launch` takes ~3.8s on my laptop.
@concretevitamin If there is no active development on this issue, I'd like to take it up.
I just reproduced this issue with 7 providers enabled on my test branch https://github.com/aylei/skypilot/commit/ffdec30999f0553c363912ec6b9eec82f210335b (based off f0ebf13b) and have some thoughts on the solution; correct me if I'm wrong:
> sky check
...
🎉 Enabled clouds 🎉
✔ AWS
✔ Azure
✔ Cudo
✔ GCP
✔ Kubernetes
✔ Paperspace
✔ RunPod
> SKYPILOT_TIMELINE_FILE_PATH=timeline.json sky launch --gpus H100:8
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
-------------------------------------------------------------------------------------------------------
RunPod 8x_H100_SECURE 128 640 H100:8 CA 35.92 ✔
Paperspace H100x8 128 640 H100:8 East Coast (NY2) 47.60
GCP a3-highgpu-8g 208 1872 H100:8 us-central1-a 87.83
AWS p5.48xlarge 192 2048 H100:8 us-east-1 98.32
-------------------------------------------------------------------------------------------------------
Aborted!
> head -15 profile_stats.txt
2225298 function calls (2193864 primitive calls) in 3.466 seconds
Ordered by: cumulative time
List reduced from 8052 to 7609 due to restriction <'sky'>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.466 3.466 /Users/aylei/repo/skypilot-org/skypilot/sky/cli.py:557(_launch_with_confirm)
38/2 0.000 0.000 3.315 1.658 /Users/aylei/repo/skypilot-org/skypilot/sky/utils/common_utils.py:376(_record)
1 0.000 0.000 3.313 3.313 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:108(optimize)
1 0.000 0.000 3.311 3.311 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:993(_optimize_dag)
1 0.000 0.000 3.302 3.302 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:240(_estimate_nodes_cost_or_time)
1 0.000 0.000 3.085 3.085 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:1257(_fill_in_launchable_resources)
7 0.000 0.000 3.008 0.430 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/cloud.py:371(get_feasible_launchable_resources)
136 0.000 0.000 2.720 0.020 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:21(_map_clouds_catalog)
6 0.000 0.000 2.426 0.404 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:247(get_instance_type_for_accelerator)
According to the stats (profile_stats.txt), `_fill_in_launchable_resources` takes 3.085s out of 3.466s. Focusing on this particular function in the timeline events (timeline.json), the main bottleneck I've identified is that the `get_feasible_launchable_resources` method of each cloud is processed serially:
> cat timeline.json| jq -r '.traceEvents.[] | "\(.name), \(.ph): \(.ts)"' | grep -v get_df
sky.backends.backend_utils.refresh_cluster_status_handle, B: 1734446409752685.000
sky.backends.backend_utils.refresh_cluster_status_handle, E: 1734446409754911.000
sky.optimizer.Optimizer.optimize, B: 1734446409755096.000
sky.optimizer._fill_in_launchable_resources, B: 1734446409757335.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446409757601.000
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, B: 1734446410329253.250
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, E: 1734446410337647.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446410337653.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446410337664.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, B: 1734446410337706.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, E: 1734446410634607.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446410634616.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446410683903.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, B: 1734446410683945.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, E: 1734446410696483.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446410696488.750
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446410696500.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, B: 1734446410696539.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, E: 1734446411555284.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446411555307.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446411555342.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, B: 1734446411555458.000
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, B: 1734446411561200.750
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, E: 1734446412795312.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, E: 1734446412810339.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446412810345.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446412831575.250
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, B: 1734446412831607.000
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, E: 1734446412834569.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446412834573.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B: 1734446412839013.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, B: 1734446412839048.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, E: 1734446412841263.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E: 1734446412841266.750
sky.optimizer._fill_in_launchable_resources, E: 1734446412842773.000
sky.optimizer.Optimizer.optimize, E: 1734446413068050.000
Looking at the individual providers, the slowest one is AWS: `get_user_identities` consistently takes >1 second across my 5 runs. Meanwhile, the `_unsupported_features_for_resources` check for Kubernetes involves RPC calls to list Nodes, which can have unpredictable latency (in my test I used a local Kind cluster on my laptop, so the network latency was stable and minimal).
Based on the above investigation, I'd like to break this issue down into several sub-issues:

- make `get_feasible_launchable_resources` concurrent, which I think is the most crucial part as it affects scalability in multi-cloud scenarios;
- dive into the `get_feasible_launchable_resources` implementation of each provider and investigate whether there are available optimizations, especially for AWS and K8S;
- ~~(for long-term observability) replace the home-made timeline.Event with the OpenTelemetry trace standard, so that performance can be analyzed and tracked more easily.~~ UPDATE (2024-12-20): My previous understanding was wrong; I just read the standard that timeline.py follows and found it can already be visualized by various tools. It is nice and we really should document it in our dev guide~
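The first sub-issue could be sketched roughly like this (assuming the per-cloud calls are thread-safe; the wiring and function name here are illustrative, not the actual optimizer code):

```python
import concurrent.futures


def fill_in_launchable_resources_parallel(clouds, resources):
    """Query each cloud's feasible resources concurrently instead of serially.

    The per-cloud latencies in the timeline above (AWS ~1.25s, Cudo ~0.85s,
    ...) then overlap, so total wall time approaches max(latencies) rather
    than their sum.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(clouds)) as pool:
        futures = {
            pool.submit(cloud.get_feasible_launchable_resources, resources): cloud
            for cloud in clouds
        }
        results = {}
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]] = fut.result()
        return results
```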
@concretevitamin @Michaelvll would you please kindly review this approach?
That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. : )
Thanks @Michaelvll! I will push forward.
With https://github.com/skypilot-org/skypilot/pull/4483 close to being merged, I'd like to share my thoughts on the AWS and K8S parts.
As far as I understand, in order to build an accurate optimizer table, the AWS identity and K8S node info are necessary. Since they are now fetched via RPC in parallel, let's focus on the slower AWS call (the bottleneck in the parallelized path, costing ~1s). A natural idea is to cache the result between `sky launch` calls in sqlite or the local filesystem. However, I think there is no obviously correct cache key. One possible solution is to hash the output of `aws configure list` as the cache key:
$ aws configure list
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************ISWR shared-credentials-file
secret_key ****************5EEV shared-credentials-file
region us-west-2 config-file ~/.aws/config
$ aws configure list help
NAME
list -
DESCRIPTION
Lists the profile, access key, secret key, and region configuration
information used for the specified profile. For each configuration
item, it shows the value, where the configuration value was retrieved,
and the configuration variable name.
For example, if you provide the AWS region in an environment variable,
this command shows you the name of the region you've configured, that
this value came from an environment variable, and the name of the
environment variable.
For temporary credential methods such as roles and IAM Identity Center,
this command displays the temporarily cached access key and secret
access key.
This works since changes to the AWS credentials would eventually be reflected in the output, and the trailing characters of the keys displayed make collisions rare. However, this is not an authoritative solution and I don't think we can expect the semantics of this command to be stable. Personally I think the current approach is fine and the above optimization does not pay off.
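For reference, the hashing step of that idea could look like this (a sketch; splitting out the pure fingerprint function is my choice, and as noted above the stability of the `aws configure list` output format is not guaranteed):

```python
import hashlib
import subprocess


def credentials_fingerprint(configure_list_output: str) -> str:
    """Hash the `aws configure list` output to fingerprint the active creds.

    Any change to the profile, keys, or region changes the output and thus
    the key, invalidating a cached get-caller-identity result.
    """
    return hashlib.sha256(configure_list_output.encode()).hexdigest()


def aws_identity_cache_key() -> str:
    """Run `aws configure list` and fingerprint its output."""
    out = subprocess.run(['aws', 'configure', 'list'],
                         capture_output=True, text=True, check=True).stdout
    return credentials_fingerprint(out)
```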
I think I am stuck here; what do you recommend? @Michaelvll, any thoughts would be highly appreciated!
As for K8S, node info could be cached with a TTL. However, node info is highly dynamic (a hypothetical but reasonable case: when `sky launch` does not find any fitting node, the user may modify labels of existing nodes or add a new node to the cluster and try again), so a command-line flag to refresh the cache would be a must-have, and in my view the added mental burden would outweigh the performance gain.
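If the TTL route were taken anyway, the refresh escape hatch mentioned above might look like this (a sketch; the class and flag names are illustrative):

```python
import time


class TTLCache:
    """Cache a single fetched value with a TTL.

    Passing refresh=True forces a re-fetch, which could be wired to a
    hypothetical `--refresh-node-cache` CLI flag.
    """

    def __init__(self, fetch, ttl_seconds: float):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = float('-inf')

    def get(self, refresh: bool = False):
        # Re-fetch when forced or when the cached value has expired.
        if refresh or time.monotonic() - self._fetched_at > self._ttl:
            self._value = self._fetch()
            self._fetched_at = time.monotonic()
        return self._value
```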
Hi @aylei, thanks for the analysis. Actually, using `aws configure list` is not that bad an idea: `get-caller-identity` is on the critical path of many SkyPilot operations, such as `sky status` and `sky exec`, so if caching can save us 1s on many of these operations, it might be good to try. : )
Sounds good! I will investigate it.
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.