terraform-provider-iterative
Revise universal cloud regions and machine types
Cloud vendors are famous for deprecating old offerings and creating new ones at breakneck speed, and users will have a better experience if they stick to the most recent vendor-specific regions and machines. For those users who don't care about the details, we can provide some reference values for each vendor as part of our own documentation, but hardcoding them as we're doing now might not be a good idea.
Are we speaking about types and regions?
Yes. By the way, this doesn't have to be a breaking change: the "drop support" in the title might be a bit dramatic, but we can just move them to a separate file inside the package (e.g. aws/provider.go for code and aws/aliases.go for abstract regions and machine types) and discontinue them without breaking anything.
The mapping looks like a handy feature. In some cases, I don't care where the job is running and cross-cloud "interoperability" helps me.
At the same time, we should fully support all the real regions rather than "being smart" about the mapping, and provide full control. I don't see how this "smart logic" contradicts full control: if a user specifies an actual region like us-west-2, we should support it properly; if a user uses us-west, CML can do the smart mapping and ensure cross-cloud interoperability.
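One way to reconcile the two behaviors described above is a simple pass-through-or-alias lookup. This is a minimal sketch, not the provider's actual implementation; the alias table and the `resolveRegion` helper are hypothetical names for illustration.

```go
package main

import "fmt"

// Hypothetical alias table: generic regions map to a vendor-specific
// default, while exact vendor regions pass through untouched.
var awsRegionAliases = map[string]string{
	"us-west": "us-west-2",
	"us-east": "us-east-1",
	"eu-west": "eu-west-1",
}

// resolveRegion honors exact vendor regions as-is and applies the
// alias mapping only when the user specified a generic region.
func resolveRegion(region string, exact map[string]bool, aliases map[string]string) string {
	if exact[region] {
		return region // full control: real regions are never remapped
	}
	if mapped, ok := aliases[region]; ok {
		return mapped // cross-cloud interoperability for generic names
	}
	return region // unknown values are passed through to the vendor API
}

func main() {
	exact := map[string]bool{"us-west-2": true, "us-east-1": true, "eu-west-1": true}
	fmt.Println(resolveRegion("us-west-2", exact, awsRegionAliases)) // us-west-2
	fmt.Println(resolveRegion("us-west", exact, awsRegionAliases))   // us-west-2
}
```

With this shape, discontinuing the generic names later only means shrinking the alias table; exact regions keep working unchanged.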
Instances
If we provide default, generic instance types, they should be documented somewhere and guarantee a minimum of power/capacity across clouds; providing “roughly 3 GPU devices” is not enough.
Turning it into a documentation problem
We can offer an interactive cloud instance finder as part of cml.dev/doc, so users can find the most appropriate instance for their use case.

Turning it into a code problem
We can allow users to specify CPU/GPU/RAM/HDD as separate values and try to find a greater-or-equal instance. While this approach looks fairly convenient, it's by no means a substitute for documentation. If I were a user, I would not like to blindly try value combinations until the program stopped complaining.
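The greater-or-equal matching above can be sketched as a filter over a priced catalogue. The `instance` struct, the catalogue entries, and the prices here are all illustrative assumptions, not real provider data.

```go
package main

import "fmt"

// Hypothetical catalogue entry; real data would come from vendor APIs.
type instance struct {
	name        string
	cores       int
	memoryGiB   int
	gpus        int
	hourlyPrice float64 // illustrative numbers, not real pricing
}

// cheapestMatch returns the cheapest instance whose resources are all
// greater than or equal to the requested values, or "" if none fits.
func cheapestMatch(catalogue []instance, cores, memoryGiB, gpus int) string {
	best := ""
	bestPrice := 0.0
	for _, i := range catalogue {
		if i.cores < cores || i.memoryGiB < memoryGiB || i.gpus < gpus {
			continue
		}
		if best == "" || i.hourlyPrice < bestPrice {
			best, bestPrice = i.name, i.hourlyPrice
		}
	}
	return best
}

func main() {
	catalogue := []instance{
		{"m5.2xlarge", 8, 32, 0, 0.384},
		{"p3.2xlarge", 8, 61, 1, 3.06},
		{"t3.large", 2, 8, 0, 0.083},
	}
	fmt.Println(cheapestMatch(catalogue, 8, 16, 0)) // m5.2xlarge
}
```

Even with such matching in place, the chosen instance should be reported back to the user, which is exactly why this can't replace documentation.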
Keeping the current approach
We can also keep using a simple, curated table of cheap instances and document the exact characteristics of each one. Types like m or l may make sense for t-shirts, but they definitely don't for cloud instances; at least, not without a supporting table in the documentation.
Regions
Offering universal regions might be, by far, less useful for the end user.
We can provide a default value for each cloud vendor for users who don't care, but those who do will surely want to specify the exact region, and not a ballpark approximation with a demi-continental precision.
They can still set up the country that they want. I like the idea of approximating to the closest machine. One feature that I have been dreaming about is being able to skip specifying the cloud entirely and have the tool suggest the cheapest one.
Useful ideas from this Slack thread:
Common format
To my mind, universal, human-readable quantities provide the best user experience, allowing users to specify the task requirements. Our code would be in charge of finding the cheapest/smallest instance that fulfills those requirements.
Still, this approach has a really important downside: there is no way to specify an exact instance type.
```yaml
resources:
  cores: 8
  memory: 16
  accelerators:
    nvidia-tesla-k80: 4
  storage: 512
```
Memory and storage requirements could be expressed in gigabytes or, perhaps, in arbitrary human-readable units like those parsed by docker/go-units→size.go
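For illustration, here is a stdlib-only sketch of that kind of parsing; docker/go-units provides similar helpers (e.g. RAMInBytes in size.go), so the `parseRAM` function below is just a hypothetical stand-in that treats both decimal and binary suffixes as powers of 1024.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRAM converts human-readable sizes like "16GB" or "512MiB" into
// bytes, treating decimal and binary suffixes alike as powers of 1024
// (a simplification; real libraries distinguish the two).
func parseRAM(s string) (int64, error) {
	suffixes := []struct {
		unit string
		mult int64
	}{
		{"TiB", 1 << 40}, {"GiB", 1 << 30}, {"MiB", 1 << 20}, {"KiB", 1 << 10},
		{"TB", 1 << 40}, {"GB", 1 << 30}, {"MB", 1 << 20}, {"KB", 1 << 10},
		{"B", 1},
	}
	s = strings.TrimSpace(s)
	for _, suf := range suffixes {
		if strings.HasSuffix(s, suf.unit) {
			n, err := strconv.ParseFloat(strings.TrimSuffix(s, suf.unit), 64)
			if err != nil {
				return 0, err
			}
			return int64(n * float64(suf.mult)), nil
		}
	}
	// A bare number defaults to bytes.
	return strconv.ParseInt(s, 10, 64)
}

func main() {
	bytes, _ := parseRAM("16GB")
	fmt.Println(bytes) // 17179869184
}
```

Accepting such strings in the `memory` and `storage` fields would keep the config human-readable while remaining unambiguous internally.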
https://github.com/iterative/terraform-provider-iterative/pull/525#issuecomment-1106217871
Automatically retrieving cloud instance pricing data:
EDIT: moved to #564
https://github.com/iterative/terraform-provider-iterative/pull/533#issue-1214184818
#533 removes the GPU model selectors for Kubernetes, because AKS sets the accelerator node label to nvidia and it's not possible to specify a different value when creating the cluster; i.e. az aks create --nodepool-labels can't override the accelerator label with more granular values like nvidia-tesla-v100 mentioned in the official documentation.
https://github.com/iterative/terraform-provider-iterative/blob/765225c9b967a25cbdb8719370b3899665e024b4/.github/workflows/smoke.yml#L97
We can use weighted affinity, prioritizing the exact GPU model but tolerating any NVIDIA GPU device to avoid errors. It's a compromise between granularity and usability.
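The weighted affinity above could look roughly like the following pod-spec fragment; this is a sketch, not the provider's actual manifest, and the `accelerator` label key and `nvidia-tesla-v100` value are examples taken from the discussion rather than guaranteed cluster labels.

```yaml
affinity:
  nodeAffinity:
    # Hard requirement: any node advertising an accelerator label.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: accelerator
              operator: Exists
    # Soft preference: the exact GPU model, when the cluster exposes it.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: accelerator
              operator: In
              values: [nvidia-tesla-v100]
```

On clusters like AKS that only label nodes with the generic nvidia value, the soft preference simply never matches and the pod still schedules onto any GPU node.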