skypilot icon indicating copy to clipboard operation
skypilot copied to clipboard

`sky launch` takes ~5s to print out optimizer table, which is slow

Open concretevitamin opened this issue 1 year ago • 1 comments

sky launch takes ~5s to print out optimizer table, which is slow

  • this doesn't include catalog files refresh, which add a few seconds more

Test:

  • Based off master fe1452054
  • Branch/commit adding a simple cprofile block around sky launch: https://github.com/skypilot-org/skypilot/commit/bfee932d76b222c133cca222fc66c4e8d665d16b
  • How to get results: type sky launch, press ctrl-c as soon as confirmation prompt shows
  • Results: https://gist.github.com/concretevitamin/8072424ccb90543bac0df159e6222b78

Top few lines

         1215743 function calls (1189185 primitive calls) in 4.759 seconds

   Ordered by: cumulative time
   List reduced from 5569 to 331 due to restriction <'sky'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.758    4.758 /skypilot/sky/cli.py:805(_launch_with_confirm)
        1    0.000    0.000    4.351    4.351 /skypilot/sky/optimizer.py:103(optimize)
        1    0.000    0.000    4.351    4.351 /skypilot/sky/optimizer.py:934(_optimize_dag)
        1    0.001    0.001    4.275    4.275 /skypilot/sky/optimizer.py:231(_estimate_nodes_cost_or_time)
      646    0.002    0.000    2.648    0.004 /skypilot/sky/clouds/service_catalog/__init__.py:22(_map_clouds_catalog)
        1    0.000    0.000    1.929    1.929 /skypilot/sky/optimizer.py:1144(_fill_in_launchable_resources)
      169    0.001    0.000    1.676    0.010 /skypilot/sky/resources.py:944(get_reservations_available_resources)
      121    0.000    0.000    1.675    0.014 /skypilot/sky/clouds/gcp.py:936(get_reservations_available_resources)
      121    0.000    0.000    1.675    0.014 /skypilot/sky/clouds/utils/gcp_utils.py:121(list_reservations_for_instance_type_in_zone)
        1    0.000    0.000    1.674    1.674 /skypilot/sky/clouds/utils/gcp_utils.py:129(_list_reservations_for_instance_type)
        1    0.000    0.000    1.674    1.674 /skypilot/sky/utils/subprocess_utils.py:144(run_with_retries)
        1    0.001    0.001    1.674    1.674 /skypilot/sky/skylet/log_lib.py:129(run_with_log)
        1    0.000    0.000    1.657    1.657 /skypilot/sky/skylet/log_lib.py:105(process_subprocess_stream)
        1    0.000    0.000    1.640    1.640 /skypilot/sky/skylet/log_lib.py:53(_handle_io_stream)
        6    0.000    0.000    1.388    0.231 /skypilot/sky/clouds/cloud.py:321(get_feasible_launchable_resources)
        6    0.000    0.000    1.384    0.231 /skypilot/sky/clouds/service_catalog/__init__.py:180(get_default_instance_type)
        1    0.000    0.000    0.788    0.788 /skypilot/sky/clouds/aws.py:419(_get_feasible_launchable_resources)
        1    0.000    0.000    0.787    0.787 /skypilot/sky/clouds/aws.py:338(get_default_instance_type)
        1    0.000    0.000    0.752    0.752 /skypilot/sky/clouds/service_catalog/aws_catalog.py:215(get_default_instance_type)
      102    0.000    0.000    0.731    0.007 /skypilot/sky/clouds/service_catalog/aws_catalog.py:154(_get_df)
        1    0.009    0.009    0.731    0.731 /skypilot/sky/clouds/service_catalog/aws_catalog.py:93(_fetch_and_apply_az_mapping)
      292    0.020    0.000    0.681    0.002 /skypilot/sky/clouds/service_catalog/common.py:146(_get_instance_type)
        1    0.000    0.000    0.676    0.676 /skypilot/sky/clouds/aws.py:595(get_current_user_identity)
      168    0.000    0.000    0.626    0.004 /skypilot/sky/resources.py:890(get_cost)
      168    0.000    0.000    0.626    0.004 /skypilot/sky/clouds/service_catalog/__init__.py:152(get_hourly_cost)
      168    0.001    0.000    0.621    0.004 /skypilot/sky/clouds/service_catalog/common.py:236(get_hourly_cost_impl)
        6    0.000    0.000    0.541    0.090 /skypilot/sky/optimizer.py:1086(_make_launchables_for_valid_region_zones)
        1    0.000    0.000    0.524    0.524 /skypilot/sky/clouds/oci.py:300(_get_feasible_launchable_resources)
        1    0.000    0.000    0.523    0.523 /skypilot/sky/clouds/oci.py:167(get_default_instance_type)
      175    0.001    0.000    0.520    0.003 /skypilot/sky/resources.py:1090(copy)
      181    0.001    0.000    0.518    0.003 /skypilot/sky/resources.py:48(__init__)
        1    0.000    0.000    0.514    0.514 /skypilot/sky/clouds/service_catalog/oci_catalog.py:101(get_default_instance_type)
      121    0.000    0.000    0.507    0.004 /skypilot/sky/clouds/gcp.py:274(instance_type_to_hourly_cost)
      121    0.000    0.000    0.504    0.004 /skypilot/sky/clouds/service_catalog/gcp_catalog.py:214(get_hourly_cost)
       16    0.000    0.000    0.487    0.030 /skypilot/sky/clouds/service_catalog/oci_catalog.py:35(_get_df)
      181    0.000    0.000    0.446    0.002 /skypilot/sky/resources.py:553(_validate_and_set_region_zone)
      168    0.000    0.000    0.445    0.003 /skypilot/sky/clouds/service_catalog/__init__.py:127(validate_region_zone)
      168    0.001    0.000    0.440    0.003 /skypilot/sky/clouds/service_catalog/common.py:166(validate_region_zone_impl)
      146    0.000    0.000    0.436    0.003 /skypilot/sky/clouds/cloud.py:463(validate_region_zone)
      168    0.019    0.000    0.392    0.002 /skypilot/sky/clouds/service_catalog/common.py:352(_filter_region_zone)
...

Looks like at least a few things to optimize:

  • list_reservations_for_instance_type_in_zone
  • _fetch_and_apply_az_mapping
  • ...

concretevitamin avatar Feb 14 '24 21:02 concretevitamin

Still relevant in a many-cloud setting. With latest #3280 merged and 6 clouds enabled, sky launch takes ~3.8s on my laptop.

concretevitamin avatar Mar 07 '24 23:03 concretevitamin

@concretevitamin If there is no active development on this issue, I'd like to take it up.

aylei avatar Dec 17 '24 10:12 aylei

I just reproduced this issue with 7 providers enabled on my test branch https://github.com/aylei/skypilot/commit/ffdec30999f0553c363912ec6b9eec82f210335b (based off f0ebf13b) and get some thoughts on the solution, correct me if I wrong:

> sky check
...
🎉 Enabled clouds 🎉
  ✔ AWS
  ✔ Azure
  ✔ Cudo
  ✔ GCP
  ✔ Kubernetes
  ✔ Paperspace
  ✔ RunPod

> SKYPILOT_TIMELINE_FILE_PATH=timeline.json sky launch --gpus H100:8
Considered resources (1 node):
-------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE         vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN
-------------------------------------------------------------------------------------------------------
 RunPod       8x_H100_SECURE   128     640       H100:8         CA                 35.92         ✔
 Paperspace   H100x8           128     640       H100:8         East Coast (NY2)   47.60
 GCP          a3-highgpu-8g    208     1872      H100:8         us-central1-a      87.83
 AWS          p5.48xlarge      192     2048      H100:8         us-east-1          98.32
-------------------------------------------------------------------------------------------------------
Aborted!

> head -15 profile_stats.txt
         2225298 function calls (2193864 primitive calls) in 3.466 seconds

   Ordered by: cumulative time
   List reduced from 8052 to 7609 due to restriction <'sky'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.466    3.466 /Users/aylei/repo/skypilot-org/skypilot/sky/cli.py:557(_launch_with_confirm)
     38/2    0.000    0.000    3.315    1.658 /Users/aylei/repo/skypilot-org/skypilot/sky/utils/common_utils.py:376(_record)
        1    0.000    0.000    3.313    3.313 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:108(optimize)
        1    0.000    0.000    3.311    3.311 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:993(_optimize_dag)
        1    0.000    0.000    3.302    3.302 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:240(_estimate_nodes_cost_or_time)
        1    0.000    0.000    3.085    3.085 /Users/aylei/repo/skypilot-org/skypilot/sky/optimizer.py:1257(_fill_in_launchable_resources)
        7    0.000    0.000    3.008    0.430 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/cloud.py:371(get_feasible_launchable_resources)
      136    0.000    0.000    2.720    0.020 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:21(_map_clouds_catalog)
        6    0.000    0.000    2.426    0.404 /Users/aylei/repo/skypilot-org/skypilot/sky/clouds/service_catalog/__init__.py:247(get_instance_type_for_accelerator)

According to the stats (profile_stats.txt) _fill_in_launchable_resources takes 3.085s out of 3.466s. Focusing this particular function in timeline events (timeline.json), the main bottleneck I've identified is that the get_feasible_launchable_resources method of each cloud is processed in serial:

> cat timeline.json| jq -r '.traceEvents.[] | "\(.name), \(.ph): \(.ts)"' | grep -v get_df
sky.backends.backend_utils.refresh_cluster_status_handle, B:  1734446409752685.000
sky.backends.backend_utils.refresh_cluster_status_handle, E:  1734446409754911.000
sky.optimizer.Optimizer.optimize, B:  1734446409755096.000
sky.optimizer._fill_in_launchable_resources, B:  1734446409757335.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446409757601.000
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, B:  1734446410329253.250
sky.clouds.kubernetes.Kubernetes._get_feasible_launchable_resources, E:  1734446410337647.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410337653.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410337664.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, B:  1734446410337706.000
sky.clouds.gcp.GCP._get_feasible_launchable_resources, E:  1734446410634607.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410634616.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410683903.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, B:  1734446410683945.000
sky.clouds.azure.Azure._get_feasible_launchable_resources, E:  1734446410696483.250
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446410696488.750
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446410696500.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, B:  1734446410696539.000
sky.clouds.cudo.Cudo._get_feasible_launchable_resources, E:  1734446411555284.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446411555307.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446411555342.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, B:  1734446411555458.000
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, B:  1734446411561200.750
sky.clouds.service_catalog.aws_catalog._fetch_and_apply_az_mapping, E:  1734446412795312.250
sky.clouds.aws.AWS._get_feasible_launchable_resources, E:  1734446412810339.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412810345.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446412831575.250
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, B:  1734446412831607.000
sky.clouds.runpod.RunPod._get_feasible_launchable_resources, E:  1734446412834569.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412834573.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, B:  1734446412839013.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, B:  1734446412839048.000
sky.clouds.paperspace.Paperspace._get_feasible_launchable_resources, E:  1734446412841263.000
sky.clouds.cloud.Cloud.get_feasible_launchable_resources, E:  1734446412841266.750
sky.optimizer._fill_in_launchable_resources, E:  1734446412842773.000
sky.optimizer.Optimizer.optimize, E:  1734446413068050.000

Looking at the individual providers, the slowest one is AWS. get_user_identities constantly takes >1 second in my 5 runs. Meanwhile, the _unsupported_features_for_resources of Kubernetes involves RPC calls to list Nodes, which would have unpredictable latency (in my test, I used a local Kind cluster on my laptop, so the network latency is stable and minimal).

According to above investigation, I'd like to break down this issue into several sub issues:

  1. make get_feasible_launchable_resources concurrent, which I think is the most crucial part as it affects the scalability in multi-cloud scenario;

  2. dive into the get_feasible_launchable_resources impl of each provider and investigate whether there are available optimizations, especially for AWS and K8S;

  3. ~~(for long-term observability) replace the home-made timeline.Event with trace standards of OpenTelemetry, so that the performance can be analyzed and tracked more easily.~~ UPDATE(2024-12-20): My previous understanding is wrong, just read the standards timeline.py follows and found it can be visualized by various tools. It is nice and we really need to document it in our dev guide~

@concretevitamin @Michaelvll would you please kindly review this approach?

aylei avatar Dec 17 '24 15:12 aylei

That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. : )

Michaelvll avatar Dec 17 '24 23:12 Michaelvll

That sounds great @aylei! Thanks for sharing the detailed analysis. Let's work on the proposals you mentioned. : )

Thanks @Michaelvll! I will push forward.

aylei avatar Dec 18 '24 13:12 aylei

With https://github.com/skypilot-org/skypilot/pull/4483 close to merge, I'd like to share my thoughts on the aws and k8s part.

As far as I understand, in order to build an accurate optimizer table, AWS identity and K8S node info are necessary. Since they are now fetched via RPC in parallel, let's focus on the slower AWS call (which is the bottle in parallelization, cost ~1s). An natural idea is to cache the result between sky launch calls in sqlite or local filesystem. However, I think there is no reasonable cache key option. A possible solution is to hash the output of aws configure list as the cache key:

$ aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                <not set>             None    None
access_key     ****************ISWR shared-credentials-file
secret_key     ****************5EEV shared-credentials-file
    region                us-west-2      config-file    ~/.aws/config

$ aws configure list help
NAME
       list -

DESCRIPTION
       Lists the profile, access key, secret key, and region configuration
       information used for the specified profile. For each configuration
       item, it shows the value, where the configuration value was retrieved,
       and the configuration variable name.

       For example, if you provide the AWS region in an environment variable,
       this command shows you the name of the region you've configured, that
       this value came from an environment variable, and the name of the
       environment variable.

       For temporary credential methods such as roles and IAM Identity Center,
       this command displays the temporarily cached access key and secret
       access key is displayed.

This works since changes of aws credentials would finally reflected in the output and the last 8 letters of the key displayed indicate rare collision rate. However, this is not an authoritative solution and I don't think we can expect the semantics of this command to be stable. Personally I think the current approach is fine and the above optimization does not payoff.

I think I am stuck here, what do you recommend? @Michaelvll Any thoughts would be highly appreciated!

aylei avatar Dec 20 '24 06:12 aylei

As for k8s, node info can be cached with a TTL, however, since node info is highly dynamic (a hypothetical but reasonable case is when sky launch does not find any node fit, the user may modify labels of existing nodes or add a new node to the cluster and try again). Therefore, a command-line-flag to refresh the cache would be a must-have, by which the added mental burden would beat the performance gain in my point of view.

aylei avatar Dec 20 '24 13:12 aylei

Hi @aylei, thanks for the analysis. Actually, it seems like using the aws configure list is not that bad idea, as get-caller-identity is actually on the critical path of many of SkyPilot operations, such as sky status, sky exec, if the caching can save us 1s on many of the operations, it might be good to try. : )

Michaelvll avatar Dec 21 '24 02:12 Michaelvll

Hi @aylei, thanks for the analysis. Actually, it seems like using the aws configure list is not that bad idea, as get-caller-identity is actually on the critical path of many of SkyPilot operations, such as sky status, sky exec, if the caching can save us 1s on many of the operations, it might be good to try. : )

Sounds good! I will investigate it.

aylei avatar Dec 21 '24 13:12 aylei

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Jun 21 '25 02:06 github-actions[bot]

This issue was closed because it has been stalled for 10 days with no activity.

github-actions[bot] avatar Jul 01 '25 02:07 github-actions[bot]