
Revise universal cloud regions and machine types

Open 0x2b3bfa0 opened this issue 4 years ago • 9 comments

Cloud vendors are famous for deprecating old offerings and creating new ones at breakneck speed, and users will have a better experience if they adhere to the most recent vendor-specific regions and machines. For those users that don't care about the details, we can provide some reference values for each vendor as part of our own documentation, but hardcoding them as we're doing now might not be a good idea.

0x2b3bfa0 avatar Jul 04 '21 03:07 0x2b3bfa0

Are we speaking about types and regions?

DavidGOrtega avatar Jul 09 '21 12:07 DavidGOrtega

Yes. By the way, this doesn't have to be a breaking change: the "drop support" in the title might be a bit dramatic, but we can just move them to a separate file inside the package (e.g. aws/provider.go for code and aws/aliases.go for abstract regions and machine types) and discontinue them without breaking anything.

0x2b3bfa0 avatar Jul 09 '21 13:07 0x2b3bfa0

The mapping looks like a handy feature. In some cases, I don't care where the job is running and cross-cloud "interoperability" helps me.

At the same time, we should fully support all the real regions, give users full control, and avoid "being smart" about the mapping when they ask for something specific. I don't see how this "smart logic" conflicts with full control. If a user specifies an actual region such as us-west-2, then we should support it properly; if a user specifies us-west, CML can do the smart mapping and ensure cross-cloud interoperability.
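A minimal sketch of how aliases and exact regions could coexist, assuming a lookup table per vendor. The `genericRegions` table, its entries, and the `ResolveRegion` name are hypothetical and only illustrate the idea; they are not the provider's actual code.

```go
package main

import "fmt"

// genericRegions maps cross-cloud aliases to vendor-specific regions.
// Hypothetical, illustrative entries; not the provider's real table.
var genericRegions = map[string]map[string]string{
	"aws": {"us-west": "us-west-1", "us-east": "us-east-1"},
	"gcp": {"us-west": "us-west1", "us-east": "us-east1"},
}

// ResolveRegion translates a generic alias into a vendor region and passes
// every other value through untouched, so exact regions keep working.
func ResolveRegion(cloud, region string) string {
	if exact, ok := genericRegions[cloud][region]; ok {
		return exact
	}
	return region
}

func main() {
	fmt.Println(ResolveRegion("aws", "us-west"))   // generic alias: mapped
	fmt.Println(ResolveRegion("aws", "us-west-2")) // exact region: unchanged
}
```

Because unknown values fall through unchanged, new vendor regions would work without touching the alias table.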

dmpetrov avatar Oct 05 '21 16:10 dmpetrov

Instances

If we provide default, generic instance types, they should be documented somewhere and guarantee a minimum of power/capacity across clouds; providing “roughly 3 GPU devices” is not enough.

Turning it into a documentation problem

We can offer an interactive cloud instance finder as part of cml.dev/doc, so users can find the most appropriate instance for their use case.

[Screenshot, 2021-10-05 at 22:50:50]

Turning it into a code problem

We can allow users to specify CPU/GPU/RAM/HDD as separate values and try to find a greater or equal instance. While this approach looks fairly convenient, it is by no means a substitute for documentation. If I were a user, I would not like to blindly try value combinations until the program stopped complaining.
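The "greater or equal" search could look roughly like the sketch below. The `Spec` and `Instance` types, the `exampleCatalog` entries, and the hourly prices are all hypothetical placeholders; a real catalog would have to come from vendor data.

```go
package main

import (
	"fmt"
	"sort"
)

// Spec holds the resources a task asks for, or an instance offers.
type Spec struct {
	CPUs, GPUs, MemoryGB, DiskGB int
}

// Instance is a catalog entry; names and prices below are made up.
type Instance struct {
	Name string
	Spec
	HourlyPrice float64
}

// Satisfies reports whether s offers at least as much as want in every field.
func (s Spec) Satisfies(want Spec) bool {
	return s.CPUs >= want.CPUs && s.GPUs >= want.GPUs &&
		s.MemoryGB >= want.MemoryGB && s.DiskGB >= want.DiskGB
}

// exampleCatalog is a hypothetical catalog used only for illustration.
var exampleCatalog = []Instance{
	{Name: "example-gpu", Spec: Spec{CPUs: 8, GPUs: 1, MemoryGB: 64, DiskGB: 512}, HourlyPrice: 3.00},
	{Name: "example-small", Spec: Spec{CPUs: 2, MemoryGB: 4, DiskGB: 64}, HourlyPrice: 0.05},
	{Name: "example-medium", Spec: Spec{CPUs: 4, MemoryGB: 16, DiskGB: 128}, HourlyPrice: 0.20},
}

// FindInstance returns the cheapest catalog entry that satisfies want.
func FindInstance(catalog []Instance, want Spec) (Instance, error) {
	var candidates []Instance
	for _, inst := range catalog {
		if inst.Satisfies(want) {
			candidates = append(candidates, inst)
		}
	}
	if len(candidates) == 0 {
		return Instance{}, fmt.Errorf("no instance offers at least %+v", want)
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].HourlyPrice < candidates[j].HourlyPrice
	})
	return candidates[0], nil
}

func main() {
	inst, err := FindInstance(exampleCatalog, Spec{CPUs: 4, MemoryGB: 8})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(inst.Name)
}
```

Even with such a search, the error path only says that nothing matched, which is why a documented table of the available instances would still be needed.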

Keeping the current approach

We can also keep using a simple, curated table of cheap instances and document the exact characteristics of each one. Types like m or l may make sense for t-shirts, but they definitely don't for cloud instances; at least, not without a supporting table in the documentation.

0x2b3bfa0 avatar Oct 05 '21 20:10 0x2b3bfa0

Regions

Offering universal regions might be, by far, less useful for the end user.

We can provide a default value for each cloud vendor for users who don't care, but those who do will surely want to specify the exact region, and not a ballpark approximation with a demi-continental precision.

0x2b3bfa0 avatar Oct 05 '21 21:10 0x2b3bfa0

> We can provide a default value for each cloud vendor for users who don't care, but those who do will surely want to specify the exact region, and not a ballpark approximation with a demi-continental precision.

They can still set the country that they want. I like the idea of approximating the closest machine. One feature that I have been dreaming about is not having to specify the cloud at all, and having the tool suggest the cheapest one.

DavidGOrtega avatar Oct 06 '21 16:10 DavidGOrtega

Useful ideas from this Slack thread:

Common format

To my mind, universal, human-readable quantities provide the best user experience, allowing users to specify the task requirements. Our code would be in charge of finding the cheapest/smallest instance that fulfills those requirements.

Still, this approach has a really important downside: there is no way to specify an exact instance type.

resources:
  cores: 8
  memory: 16
  accelerators:
    nvidia-tesla-k80: 4
  storage: 512

Memory and storage requirements could be expressed in gigabytes or, perhaps, in arbitrary units like the ones parsed by size.go in docker/go-units.
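To illustrate what accepting both plain gigabytes and explicit units might involve, here is a standard-library-only sketch. It is a simplified stand-in, not the docker/go-units implementation (that package offers functions such as `FromHumanSize` and `RAMInBytes` for this purpose). The `ParseSize` name, the unit table, and the rule that a bare number means gigabytes are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unitBytes lists the accepted suffixes. Treating a bare number as
// gigabytes is an assumption of this sketch.
var unitBytes = map[string]int64{
	"":    1e9,
	"mb":  1e6,
	"gb":  1e9,
	"tb":  1e12,
	"mib": 1 << 20,
	"gib": 1 << 30,
	"tib": 1 << 40,
}

// ParseSize converts values such as "16", "16GB" or "512 GiB" into bytes.
func ParseSize(s string) (int64, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	// Split the trailing letters (the unit) from the leading number.
	number := strings.TrimRight(s, "abcdefghijklmnopqrstuvwxyz")
	unit := s[len(number):]
	n, err := strconv.ParseInt(strings.TrimSpace(number), 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid size %q", s)
	}
	factor, ok := unitBytes[unit]
	if !ok {
		return 0, fmt.Errorf("unknown unit %q in size %q", unit, s)
	}
	return n * factor, nil
}

func main() {
	for _, value := range []string{"16", "512 GiB", "16 parsecs"} {
		bytes, err := ParseSize(value)
		fmt.Println(value, bytes, err)
	}
}
```

With a parser like this, `memory: 16` and `memory: 16GB` in the resource block above would mean the same thing, while `memory: 16GiB` would request the binary quantity.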

0x2b3bfa0 avatar Nov 10 '21 18:11 0x2b3bfa0

https://github.com/iterative/terraform-provider-iterative/pull/525#issuecomment-1106217871

Automatically retrieving cloud instance pricing data:

EDIT: moved to #564

0x2b3bfa0 avatar Apr 25 '22 14:04 0x2b3bfa0

https://github.com/iterative/terraform-provider-iterative/pull/533#issue-1214184818

#533 removes the GPU model selectors for Kubernetes, because AKS sets the accelerator node label to nvidia and it's not possible to specify a different value when creating the cluster; i.e. az aks create --nodepool-labels can't override the accelerator label with more granular values like nvidia-tesla-v100 mentioned in the official documentation.

https://github.com/iterative/terraform-provider-iterative/blob/765225c9b967a25cbdb8719370b3899665e024b4/.github/workflows/smoke.yml#L97

We can use weighted affinity, prioritizing the exact GPU model but tolerating any NVIDIA GPU device to avoid errors. It's a compromise between granularity and usability.

0x2b3bfa0 avatar Apr 27 '22 17:04 0x2b3bfa0