terraform-provider-iterative Revise universal cloud regions and machine types

Cloud vendors are famous for deprecating old offerings and creating new ones at breakneck speed, and users will have a better experience if they stick to the most recent vendor-specific regions and machine types. For users who don't care about the details, we can provide some reference values for each vendor as part of our own documentation, but hardcoding them as we're doing now might not be a good idea.

Jul 04 '21 03:07 0x2b3bfa0

Are we speaking about types and regions?

Jul 09 '21 12:07 DavidGOrtega

Yes. By the way, this doesn't have to be a breaking change: the "drop support" in the title might be a bit dramatic, but we can just move them to a separate file inside the package (e.g. aws/provider.go for code and aws/aliases.go for abstract regions and machine types) and discontinue them without breaking anything.

Jul 09 '21 13:07 0x2b3bfa0

The mapping looks like a handy feature. In some cases, I don't care where the job is running and cross-cloud "interoperability" helps me.

At the same time, we should fully support all the real regions and give users full control instead of "being smart" about the mapping. I don't see how this "smart logic" conflicts with full control: if a user specifies an actual region such as us-west-2, we should support it properly; if a user specifies us-west, CML can do the smart mapping and ensure cross-cloud interoperability.
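A minimal sketch of that behavior, assuming a hypothetical `ResolveRegion` helper and an illustrative alias table (neither is taken from the provider's code): generic aliases are mapped per vendor, and anything else is passed through untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// genericRegions maps a cloud-agnostic alias to one concrete region per
// vendor. The table is an illustrative assumption, not the provider's own.
var genericRegions = map[string]map[string]string{
	"us-west": {"aws": "us-west-1", "az": "westus2"},
	"us-east": {"aws": "us-east-1", "az": "eastus"},
}

// ResolveRegion returns the vendor-specific region for a generic alias and
// passes any other value through unchanged, so exact regions keep working.
func ResolveRegion(cloud, region string) string {
	if byCloud, ok := genericRegions[strings.ToLower(region)]; ok {
		if concrete, ok := byCloud[cloud]; ok {
			return concrete
		}
	}
	return region
}

func main() {
	fmt.Println(ResolveRegion("aws", "us-west"))   // generic alias: mapped
	fmt.Println(ResolveRegion("aws", "us-west-2")) // exact region: passed through
}
```

With this shape, the alias table is the only part that needs maintenance when vendors add or retire regions.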

Oct 05 '21 16:10 dmpetrov

Instances

If we provide default, generic instance types, they should be documented somewhere and guarantee a minimum of power/capacity across clouds; providing “roughly 3 GPU devices” is not enough.

Turning it into a documentation problem

We can offer an interactive cloud instance finder as part of cml.dev/doc, so users can find the most appropriate instance for their use case.

[Screenshot 2021-10-05 at 22:50:50]

Turning it into a code problem

We can allow users to specify CPU/GPU/RAM/HDD as separate values and try to find an instance that meets or exceeds all of them. While this approach looks fairly convenient, it's by no means a substitute for documentation. If I were a user, I would not like to blindly try value combinations until the program stopped complaining.
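As a rough sketch of the "greater or equal" lookup, assuming a hypothetical in-memory catalog (the `Instance` and `Requirements` types, the instance names and the prices are all made up for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Instance describes one catalog entry; the entries used in main are
// invented for illustration and are not real vendor offerings.
type Instance struct {
	Name      string
	Cores     int
	MemoryGB  int
	GPUs      int
	HourlyUSD float64
}

// Requirements holds the minimum resources requested by the user.
type Requirements struct {
	Cores, MemoryGB, GPUs int
}

// FindInstance returns the cheapest instance whose resources are all greater
// than or equal to the requirements, or an error if none qualifies.
func FindInstance(catalog []Instance, req Requirements) (Instance, error) {
	candidates := []Instance{}
	for _, i := range catalog {
		if i.Cores >= req.Cores && i.MemoryGB >= req.MemoryGB && i.GPUs >= req.GPUs {
			candidates = append(candidates, i)
		}
	}
	if len(candidates) == 0 {
		return Instance{}, errors.New("no instance satisfies the requirements")
	}
	// Cheapest first; the stable sort keeps catalog order between equal prices.
	sort.SliceStable(candidates, func(a, b int) bool {
		return candidates[a].HourlyUSD < candidates[b].HourlyUSD
	})
	return candidates[0], nil
}

func main() {
	catalog := []Instance{
		{"example-small", 2, 4, 0, 0.05},
		{"example-large", 8, 32, 0, 0.40},
		{"example-gpu", 8, 64, 1, 0.90},
	}
	match, err := FindInstance(catalog, Requirements{Cores: 4, MemoryGB: 16})
	fmt.Println(match.Name, err)
}
```

The error path is where the documentation concern shows up: a bare "no instance satisfies the requirements" tells the user nothing about which value to relax.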

Keeping the current approach

We can also keep using a simple, curated table of cheap instances and document the exact characteristics of each one. Types like m or l may make sense for t-shirts, but they definitely don't for cloud instances; at least, not without a supporting table in the documentation.

Oct 05 '21 20:10 0x2b3bfa0

Regions

Offering universal regions is probably far less useful for the end user than offering universal instance types.

We can provide a default value for each cloud vendor for users who don't care, but those who do will surely want to specify the exact region, and not a ballpark approximation with a demi–continental precision.

Oct 05 '21 21:10 0x2b3bfa0

We can provide a default value for each cloud vendor for users who don't care, but those who do will surely want to specify the exact region, and not a ballpark approximation with a demi–continental precision.

They can still set the country that they want. I like the idea of approximating the closest machine. One feature that I have been dreaming about is not having to specify the cloud at all, with the tool suggesting the cheapest one.

Oct 06 '21 16:10 DavidGOrtega

Useful ideas from this Slack thread:

Common format

To my mind, universal, human–readable quantities provide the best user experience, allowing users to specify the task requirements. Our code would be in charge of finding the cheapest/smallest instance that fulfills those requirements.

Still, this approach has a really important downside: there is no way to specify an exact instance type.

resources:
  cores: 8
  memory: 16
  accelerators:
    nvidia-tesla-k80: 4
  storage: 512

Memory and storage requirements could be expressed in gigabytes or, perhaps, in arbitrary units like the ones parsed by size.go in docker/go-units.
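To illustrate both options at once, here is a standard-library-only sketch of a hypothetical `ParseSize` helper: a bare number is read as gigabytes (interpreted as GiB here, which is an assumption), while an explicit suffix selects another unit. A library such as docker/go-units could take its place.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// suffixes lists the accepted units, longest first, so that "GiB" is matched
// before the bare "B" it also ends with.
var suffixes = []struct {
	name   string
	factor int64
}{
	{"TiB", 1 << 40}, {"GiB", 1 << 30}, {"MiB", 1 << 20}, {"KiB", 1 << 10}, {"B", 1},
}

// ParseSize converts strings such as "16GiB" or "512 MiB" to bytes.
// A value without a suffix is assumed to be expressed in GiB.
func ParseSize(s string) (int64, error) {
	number, factor := strings.TrimSpace(s), int64(1<<30)
	for _, suffix := range suffixes {
		if strings.HasSuffix(number, suffix.name) {
			number = strings.TrimSpace(strings.TrimSuffix(number, suffix.name))
			factor = suffix.factor
			break
		}
	}
	n, err := strconv.ParseInt(number, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid size %q", s)
	}
	return n * factor, nil
}

func main() {
	for _, s := range []string{"16", "16GiB", "512 MiB"} {
		bytes, err := ParseSize(s)
		fmt.Println(s, bytes, err)
	}
}
```

Accepting both forms keeps the YAML above valid as written while letting users be explicit when they need to.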

Nov 10 '21 18:11 0x2b3bfa0

https://github.com/iterative/terraform-provider-iterative/pull/525#issuecomment-1106217871

Automatically retrieving cloud instance pricing data:

EDIT: moved to #564

Apr 25 '22 14:04 0x2b3bfa0

https://github.com/iterative/terraform-provider-iterative/pull/533#issue-1214184818

#533 removes the GPU model selectors for Kubernetes, because AKS sets the accelerator node label to nvidia and it's not possible to specify a different value when creating the cluster; i.e. az aks create --nodepool-labels can't override the accelerator label with more granular values like nvidia-tesla-v100 mentioned in the official documentation.

https://github.com/iterative/terraform-provider-iterative/blob/765225c9b967a25cbdb8719370b3899665e024b4/.github/workflows/smoke.yml#L97

We can use weighted affinity, prioritizing the exact GPU model but tolerating any NVIDIA GPU device to avoid errors. It's a compromise between granularity and usability.
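A sketch of what the weighted affinity could look like, using small local structs that mirror the JSON shape of Kubernetes node affinity (a real implementation would use the types from k8s.io/api/core/v1). The `GPUAffinity` helper and the chosen weights are assumptions; the `accelerator` label key and the `nvidia` value come from the AKS behavior described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the Kubernetes node affinity types, limited to the
// fields needed for preferred (soft) scheduling terms.
type Requirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values"`
}

type Term struct {
	MatchExpressions []Requirement `json:"matchExpressions"`
}

type PreferredTerm struct {
	Weight     int  `json:"weight"`
	Preference Term `json:"preference"`
}

type NodeAffinity struct {
	Preferred []PreferredTerm `json:"preferredDuringSchedulingIgnoredDuringExecution"`
}

// GPUAffinity prefers nodes labeled with the exact GPU model (high weight)
// and falls back to nodes carrying the generic "nvidia" label (low weight).
func GPUAffinity(model string) NodeAffinity {
	term := func(weight int, value string) PreferredTerm {
		return PreferredTerm{
			Weight: weight,
			Preference: Term{MatchExpressions: []Requirement{
				{Key: "accelerator", Operator: "In", Values: []string{value}},
			}},
		}
	}
	return NodeAffinity{Preferred: []PreferredTerm{term(100, model), term(1, "nvidia")}}
}

func main() {
	out, err := json.MarshalIndent(GPUAffinity("nvidia-tesla-v100"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because preferred terms are soft, they only rank nodes; making sure the pod actually lands on a GPU node would still need a separate hard constraint, such as a GPU resource request.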

Apr 27 '22 17:04 0x2b3bfa0
