
readme intro suggestions

Open dberenbaum opened this issue 2 years ago • 0 comments

TPI is a Terraform plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.

  • Lower cost with spot recovery: transparent data checkpoint/restore & auto-respawning of low-cost spot/preemptible instances
  • No cloud vendor lock-in: switch between clouds with just one line thanks to unified abstraction
  • No waste: auto-cleanup unused resources (terminate compute instances upon task completion/failure & remove storage upon download of results), pay only for what you use
  • Developer-first experience: one-command data sync & code execution with no external server, making the cloud feel like a laptop

Sorry for not doing this earlier, but hopefully it's still valuable to discuss.

I think the first sentence should focus on user workflow and not tools (Terraform). What about something like "Run your ML training in the cloud... without needing to be a cloud expert"?

The individual bullets feel more like they belong under "Why TPI?" They describe its particular benefits over other solutions more than they explain what TPI does or when to use it. As a new reader, I wouldn't understand enough to care about those benefits until I knew more about what it does. Some ideas for bullets here (the ones in brackets are probably less essential to the basic workflow, even though they provide major benefits):

  • Configure everything (commands, data to sync, cloud resource requirements) in a single file.
  • Upload the data and run the job in the cloud with a single command.
  • [Get live logs and outputs from your local machine.]
  • [Keep the job running even if it's interrupted or your local machine shuts down.]
  • Automatically download the results and tear down the cloud resources when complete.

Also consider embedding https://www.youtube.com/watch?v=2fEgO8SazSE. I think the video gives a great, succinct description of how to use TPI to scale up your ML training.

dberenbaum · May 06 '22 15:05