dstack Issues with RunPod as an offline provider

Once RunPod was added, we observed a few issues with its functionality.

1. The availability information is accurate only at the moment the API call is made.
2. The list of offers does not show all possible offers.
3. What problems need to be resolved for RunPod to become an online provider?

In two issues, the integrity tests were adjusted to match the current availability (dstackai/gpuhunt#56, dstackai/gpuhunt#58).

Apr 11 '24 11:04 TheBits

@TheBits Regarding "2. The list of offers does not show all possible offers": are all possible offers not present even during the API call?

Apr 11 '24 12:04 Bihan

@Bihan Here's what @TheBits meant: The static catalog now includes only the offers that were available at the time of its generation. Later, when a user invokes dstack run, it shows the available offers based on the static catalog, which may no longer be up to date. Thus, the user may not see some of the offers that are actually available.

Apr 11 '24 13:04 peterschmidt85

@TheBits Regarding "3. What problems need to be resolved for RunPod to become an online provider?":

A. Best resolution: to make RunPod an online provider, RunPod should provide a single API that returns the available machines together with their GPU counts and datacenters. The current API requires the GPU count as a mandatory variable, and it returns no datacenter information.

B. When the user provides both gpu_count and region filters, e.g. dstack run . -b runpod -r EU-SE-1 --gpu 1, we can make it online. But for a plain dstack run . -b runpod, we need to call the API multiple times to cover the datacenters and GPU counts. These multiple calls create a performance issue.

Note: I have observed that RunPod has changed its web layout. It may have changed the API too. I am checking.

Apr 11 '24 13:04 Bihan

@peterschmidt85 Yes that is true.

Apr 11 '24 13:04 Bihan

There is also option C – make our version of the static catalog that includes all the offers. Each night we will trigger RunPod's API and check that our internal catalog includes it. If we see a new offer, we add it to our catalog. How about that? That would take a bit more effort but won't require an API from RunPod.
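A minimal sketch of the nightly merge described in option C, assuming offers can be compared by GPU name, GPU count, and region. The Offer class and merge_offers function are hypothetical names for illustration and are not part of gpuhunt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    # Hypothetical offer identity; the real catalog may use other fields.
    gpu_name: str
    gpu_count: int
    region: str


def merge_offers(catalog: set[Offer], observed: set[Offer]) -> set[Offer]:
    """Add offers seen in this run to the catalog and return the new ones.

    Offers are never removed, so the catalog accumulates every offer
    that has been observed in any run.
    """
    new_offers = observed - catalog
    catalog |= new_offers  # updates the caller's set in place
    return new_offers
```

Because nothing is ever removed, the catalog converges toward the full list of offers over several runs, at the cost of also listing offers that are unavailable at a given moment.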

Apr 11 '24 13:04 peterschmidt85

@peterschmidt85 Yes, we can do that, but what if the catalog changes faster than the trigger interval?

I want to share an idea to make Runpod online. Basically the idea is to follow the flow in which Runpod's web console works.

E.g., Case A: the user requests all offers with dstack run . -b runpod. List catalog offers online using RunPod's get gpu types. The API's response is:

    [
      {
        "maxGpuCount": 8,
        "id": "NVIDIA A100 80GB PCIe",
        "displayName": "A100 80GB",
        "manufacturer": "Nvidia",
        "memoryInGb": 80,
        "cudaCores": 0,
        "secureCloud": true,
        "communityCloud": true,
        "securePrice": 1.89,          # price for gpu_count = 1
        "communityPrice": 1.59,       # price for gpu_count = 1
        "communitySpotPrice": 0.89    # price for gpu_count = 1
      },
      {...},
      {..}
    ]

This response does not provide the location, but we don't need it because the user has not supplied a region argument and we only need the cheapest offer. The region field can have the value "Any".

Case B: the user requests offers with a gpu argument, dstack run . -b runpod --gpu 3. List catalog offers as above with region = "Any". If the first offer has a machine with gpu count = 3, start provisioning; otherwise automatically move on to subsequent offers with gpu_count = 3.
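The two cases above could be sketched roughly as follows. This is only an illustration, not dstack or gpuhunt code: the function name cheapest_offers is hypothetical, the field names are taken from the sample response, and the total price is assumed to scale linearly with the GPU count, which the API response does not confirm.

```python
from typing import Optional

PRICE_KEYS = ("securePrice", "communityPrice", "communitySpotPrice")


def cheapest_offers(gpu_types: list[dict], gpu_count: Optional[int] = None) -> list[dict]:
    """Build offers with region "Any" from a get-gpu-types style response,
    sorted from cheapest to most expensive."""
    # Without a --gpu filter, default to a single GPU (assumption).
    count = gpu_count if gpu_count is not None else 1
    offers = []
    for gpu in gpu_types:
        if count > gpu["maxGpuCount"]:
            continue  # this GPU type cannot satisfy the requested count
        for key in PRICE_KEYS:
            unit_price = gpu.get(key)
            if not unit_price:
                continue  # price missing or zero: treat as not offered
            offers.append({
                "gpu": gpu["id"],
                "gpu_count": count,
                "region": "Any",
                "price_type": key,
                # ASSUMPTION: price grows linearly with the GPU count.
                "price": round(unit_price * count, 2),
            })
    return sorted(offers, key=lambda offer: offer["price"])
```

Provisioning would then try the offers in order and fall through to the next one when an offer turns out to be unavailable.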

I can explore the cases and try implementation.

Apr 11 '24 14:04 Bihan

This response does not provide location, but we don't need it because user has not supplied region argument and we only need the cheapest offer. The region field can have value "Any"

Not sure I'm fond of this one TBH.

Apr 11 '24 14:04 peterschmidt85

@peterschmidt85 Getting datacenter information requires 8 API calls (one per datacenter) and takes 2s. If 2s is acceptable performance, then I can make RunPod online.
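A rough sketch of the per-datacenter approach, to show where the multiple API calls come from. The fetch_gpu_types callable is a hypothetical stand-in for the RunPod API query and is passed in by the caller; it is not a real RunPod or gpuhunt function.

```python
from typing import Callable


def collect_offers(
    datacenters: list[str],
    fetch_gpu_types: Callable[[str], list[dict]],
) -> list[dict]:
    """Query each datacenter once and tag every result with its region."""
    offers = []
    for datacenter in datacenters:
        # One API call per datacenter: 8 datacenters means 8 calls.
        for gpu in fetch_gpu_types(datacenter):
            offers.append({**gpu, "region": datacenter})
    return offers
```

The number of calls grows with the number of datacenters, which is where the reported 2s comes from.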

Apr 11 '24 15:04 Bihan

@Bihan But what about the option C I suggested above?

Apr 11 '24 15:04 peterschmidt85

@peterschmidt85 What if Runpod changes its catalog before the trigger happens?

Apr 11 '24 15:04 Bihan

In my opinion, 2 seconds is not a significant lag.

@peterschmidt85 Regarding option C: the number of offers with availability fluctuates frequently. At night, there were 207 offers. Right now, the number of offers is between 185 and 189.

Apr 11 '24 15:04 TheBits

This issue is stale because it has been open for 30 days with no activity.

May 12 '24 01:05 peterschmidt85

@peterschmidt85 The solution is to implement RunPod as an online provider. However, that requires an API that returns all machine types across all datacenters, and RunPod does not offer one.

We do have a workaround for implementing RunPod as an online provider, but it comes with a performance cost: it takes 2s to respond with all the offers.

May 13 '24 05:05 Bihan

Currently dstack uses the gpuhunt RunPod catalog, which is collected daily. It includes only the offers available at the time of catalog generation. Since RunPod availability changes throughout the day, some offers may appear or disappear by the time a user runs dstack run.

A potentially good and simple solution is to collect the RunPod catalog more frequently (e.g. every hour). Some offers might still be missing, but that won't be critical. The specific interval is to be determined.

Making RunPod an online provider is not an option at the moment.

Jun 11 '24 05:06 r4victor

@r4victor This means we need to modify the GitHub workflow to collect the RunPod catalog every hour, while the existing jobs continue to run daily as before. Should I modify the workflow?

Jun 11 '24 06:06 Bihan

@Bihan, yeah, one possible solution would be to separate the backend catalogs. This would require refactoring gpuhunt and introducing a new catalog version (v2), since the catalogs would be stored differently.

We can also trigger the Collect and publish catalogs workflow more frequently for all providers (e.g. every hour).

@peterschmidt85, the latter solution won't cost us much and I'd recommend it since it's trivial to start with.

Jun 12 '24 09:06 r4victor

Agree!

Jun 12 '24 10:06 peterschmidt85