
`task` bucket usage vs "directory" within a bucket

Open DavidGOrtega opened this issue 2 years ago • 26 comments

We are generating a bucket for every task; however, this approach has several drawbacks and a bug:

  • The number of tasks we can run is capped by the provider's bucket quota
  • Removal is tricky, and maybe not desirable for the user; they might want to keep the tasks for a long time
  • If your quota is depleted, you cannot delete tasks, even to create new ones (bug)

A better approach would be to let the user specify a bucket or, failing that, create a default .tpi bucket.

DavidGOrtega avatar Nov 25 '21 13:11 DavidGOrtega

Limits

As per the research below, users can have at least 100 concurrent tasks running in any of the supported providers. Upon request, limits can be increased up to 250 per region in the worst case scenario.

Highly parallelizable tasks, like hyperparameter optimization, can use the parallelism argument to launch up to thousands of machines with shared storage, without exceeding any of these limits.
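For example, a minimal sketch of such a fan-out; the machine type and script are placeholders, and parallelism is the argument mentioned above:

resource "iterative_task" "grid_search" {
  cloud       = "aws"
  machine     = "m"
  parallelism = 50   # 50 machines, but still a single task: one bucket, one orchestrator

  # Every machine shares the task's storage, so checkpoints and results
  # land in one bucket instead of 50 separate ones.
  script = <<-END
    #!/bin/bash
    python train.py
  END
}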

Moreover, the number of concurrent tasks is usually bound to the orchestration limits, not to the storage limits.

Storage

aws

By default, you can create up to 100 buckets in each of your AWS accounts. If you need additional buckets, you can increase your account bucket limit to a maximum of 1,000 buckets by submitting a service limit increase. There is no difference in performance whether you use many buckets or just a few.

az

Number of storage accounts per region per subscription, including standard and premium storage accounts: 250.

gcp

There are no limits on the number of buckets you can create in Google Cloud Storage. (Jeff Terrace, Senior Software Engineer at Google, Google Cloud Storage)

k8s

There is no limit for the number of persistent volume claims beyond the provisioning limits of the underlying storage class.

Orchestration

aws

Auto Scaling groups per Region: 200 [...] To request an increase, use the Auto Scaling Limits form.

az

Maximum number of scale sets in a region: 2,500

gcp

Instance groups: Quota [...] This quota is per project. (empirically, defaults to 100)

k8s

There is no limit for the number of jobs that can be created in a cluster.

0x2b3bfa0 avatar Nov 25 '21 23:11 0x2b3bfa0

Deletion

Tasks have been designed to be deleted as soon as they finish or, rather, as soon as the user realizes that they have finished; e.g. the next morning. Even in moderately big teams, it would be unusual to have more than 100 concurrent tasks when following this approach.[^1]

Teams with more than ~20 data scientists will necessarily be backed by specialized DevOps engineers, who already have the ability to increase those limits, and even to overcome them with workarounds.[^3]

[^1]: Tasks, not machines; as per https://github.com/iterative/terraform-provider-iterative/issues/299#issuecomment-979514335, the latter have much higher limits.
[^3]: In the worst case, many of these limits can be avoided by using multiple regions, accounts or other provider-specific partitions.

0x2b3bfa0 avatar Nov 25 '21 23:11 0x2b3bfa0

Persistence

Task storage is not meant to replace persistent data/model storage and versioning tools like DVC. It's only meant to share state (e.g. checkpoints) between several short–lived machines.

Even DVC, which is a data–oriented tool, doesn't[^1] include a mechanism to create (and much less delete) persistent storage resources. This is something that most organizations prefer to manage separately and in often disparate ways.

Because of the diversity and complexity involved in persistent storage provisioning, it would be risky to include such a feature as part of this project.

Logs

Likewise, task storage is not meant to replace log management platforms.[^2] For the same reasons as data, log ingestion, storage and monitoring should be performed by means of specialized tools.

Responsibility

Infrastructure management tools have the responsibility of deleting all the resources they create, even when those resources are meant to exist for long periods of time. Creating long–lived resources implies providing a way of deleting them.[^3]

[^1]: At least, that's what the official documentation says.
[^2]: For example, Amazon CloudWatch, Azure Monitor or Google Cloud Logging; links courtesy of @casperdcl.
[^3]: Requiring ClickOps to delete automatically created resources is wrong on so many levels. Big organizations won't run a tool that exhibits such behavior, and smaller ones shouldn't, even if they were willing to.

0x2b3bfa0 avatar Nov 26 '21 03:11 0x2b3bfa0

Here we go again! I'm having serious issues with this today. I can't even imagine people adopting this after the issues they might run into. Right now I don't have any more buckets, because I use this in the CI and we don't have an effective destroy in place. I have requested more buckets from AWS and I'm still waiting for approval! So I'm totally locked. Seriously, I wouldn't use something like this.

DavidGOrtega avatar Nov 29 '21 13:11 DavidGOrtega

Even in moderately big teams, it would be unusual to have more than 100 concurrent tasks when following this approach.

Who says? I can show you cases where 100 is super small. As we have been discussing, a team with 3 models and a matrix can launch more than those 100 in a breeze, and we don't have a perfect destroy in the CI yet.

DavidGOrtega avatar Nov 29 '21 13:11 DavidGOrtega

As stated on https://github.com/iterative/terraform-provider-iterative/issues/299#issuecomment-979514335, matrix use cases would benefit from parallelism and would only consume a single bucket and a single orchestration resource for all the created machines. See 0x2b3bfa0/cml-use-case-matrix-task for a rudimentary example.

0x2b3bfa0 avatar Nov 29 '21 18:11 0x2b3bfa0

Unfortunately, quota requests take time. In the meantime, https://github.com/iterative/terraform-provider-iterative/issues/314 should be addressed in order to prevent the bug or, at least, mitigate it. Let's demote this discussion to important after extracting the critical issue.

0x2b3bfa0 avatar Nov 29 '21 18:11 0x2b3bfa0

Comments on automatic deletion of task resources belong to https://github.com/iterative/terraform-provider-iterative/issues/289

0x2b3bfa0 avatar Nov 29 '21 18:11 0x2b3bfa0

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account: buckets to use, security groups/firewall rules, instance roles/service accounts, images/AMIs, etc.

Many of these are already set up/created for organizations, and this tool is for helping to manage short-lived instances. It should manage as few "infrastructure" pieces as possible while still being easy to use, with a low barrier of entry for more agile users?

dacbd avatar Nov 29 '21 20:11 dacbd

matrix use cases would benefit from parallelism and would only consume a single bucket

Can you please write down an example of what that looks like?

I do not think that 0x2b3bfa0/cml-use-case-matrix-task resembles a useful or realistic case, and I do not fully understand the usage of task there. We are using the task to launch the runners, but this approach is not yet right (data sync is totally useless). To make such a case interesting and useful, the runners should recover the previous data folder when they start after a spot termination, and that does not happen, since the workdir changes on every runner startup. It also assumes that the training job will end before the workflow timeout.

DavidGOrtega avatar Nov 29 '21 20:11 DavidGOrtega

Comments on file persistence for matrix tasks probably belong to https://github.com/0x2b3bfa0/cml-use-case-matrix-task/issues/1

0x2b3bfa0 avatar Nov 29 '21 23:11 0x2b3bfa0

Comments on GitHub's 72 hour timeout probably belong to https://github.com/0x2b3bfa0/cml-use-case-matrix-task/issues/2

0x2b3bfa0 avatar Nov 30 '21 00:11 0x2b3bfa0

@0x2b3bfa0 you were the one pointing to that example as a reason why this PR should be deprecated. I was trying to clarify that the example is not only unrealistic, but also carries all the drawbacks that we have always had with the runners.

DavidGOrtega avatar Nov 30 '21 08:11 DavidGOrtega

👍🏼 Yes, the example above is the closest we can get to running matrix jobs with an official CI/CD self-hosted runner, and it has the same limitations as the previous approach regarding the 72 hour limit and workflows having to be restarted with every new machine.

Definitely far from ideal, but it's all we can do with official runners. Other solutions involve using the iterative_task resource directly, as you envision. I would prefer not to divert this conversation beyond the central “one or many buckets” discussion, but I can't agree more with you on the limitations of official self-hosted runners.

0x2b3bfa0 avatar Nov 30 '21 15:11 0x2b3bfa0

Architecture

Tasks have been designed to be completely ephemeral, and able to run in a pristine cloud account without additional configuration. Treating storage[^1] as an ephemeral resource may be an unusual choice, but there is no other way of avoiding a separate installation process.

If we wanted to use a shared bucket for all the tasks, it would make sense[^2] to embrace the official providers instead of writing our own, and just publish a module meant to deploy persistent resources — like an object storage bucket or an instance orchestrator — to be used by every task. GitHub recommends something similar, but it's still overcomplicated.
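As a rough sketch of that alternative (names illustrative; aws_s3_bucket is the official AWS provider's resource, while the task wiring is hypothetical):

# Long-lived resources, deployed once with the official providers.
resource "aws_s3_bucket" "tpi_tasks" {
  bucket = "tpi-shared-tasks"   # illustrative name; shared by every task
}

# Hypothetically, each task would then reference the shared bucket instead
# of creating its own, e.g. through a storage attribute like the one
# proposed later in this thread.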

Designing a “new” task orchestrator out of cloud primitives (virtual machine orchestrator, queue, log aggregator, object storage, et cetera) would imply reinventing the heptagonal wheel as previously stated, and ultimately lead us to consider nodeless Kubernetes solutions based on Elotl Kip (source code) and Cluster API spot instances.

[^1]: See https://github.com/iterative/cml/issues/561#issuecomment-871019350 for context on the choice of object storage over other types.
[^2]: The current implementation still [ab]uses Terraform, ignoring the official Provider Design Principles in atrocious ways.

0x2b3bfa0 avatar Dec 03 '21 00:12 0x2b3bfa0

I'm inclined to think that it's fine to use ephemeral buckets to cache data and keep artifacts until users “harvest” them. Still, treating object storage buckets as an ephemeral resource looks like a pretty unusual practice, as @dmpetrov pointed out.

Pinging @duijf, @JIoJIaJIu and @shcheklein for a ~second~ sixth opinion as requested. It would be awesome to have more feedback on the possible alternatives:

  1. Use ephemeral buckets for every task and delete them as soon as users harvest the results
  2. Require users to provide an existing bucket and store artifacts in separate “directories” for each task

There might be other alternatives I overlooked, though.

0x2b3bfa0 avatar Dec 03 '21 00:12 0x2b3bfa0

  1. Use ephemeral buckets for every task and delete them as soon as users harvest the results

Please note, an ephemeral bucket means an actual bucket (not a key/path in an existing bucket). An ephemeral bucket is supposed to be created at the "root" with a temporary name like s3://xpd-my-test-30g0bew1pcghg and deleted once the job is done.

  1. Require users to provide an existing bucket and store artifacts in separate “directories” for each task

This might be a path in an existing bucket, with a user-specified path/key like s3://iterative-ai/ml/segment/dmpetrov/ and a directory name like xpd-my-test-30g0bew1pcghg.

dmpetrov avatar Dec 03 '21 09:12 dmpetrov

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

@dacbd would you prefer to have an ephemeral / temporary bucket like s3://xpd-my-test-30g0bew1pcghg for each task or a temp directory in a user specified path like s3://iterative-ai/ml/segment/xpd-my-test-30g0bew1pcghg?

A separate question: would you prefer to keep the output/state/logs of the task after it is done (or has failed), or to remove the bucket or directory xpd-my-test-30g0bew1pcghg?

dmpetrov avatar Dec 03 '21 09:12 dmpetrov

My 2 cents:

  • Try to play nice with existing infrastructure.
  • Directories within a bucket sounds like the best way to go.

More background below :)

Playing nice with existing infra

  • Medium - large orgs / teams probably already have data infrastructure + policies, etc. in place. Teams may have compliance requirements to document / track the purposes of their buckets, keep access logs, etc.
  • Mature ops teams already have tools to deal with these things. They already have these access control policies codified in CloudFormation / Terraform / Pulumi / etc. They probably don't want to move them to yet another tool. They probably want to define a role + limited set of buckets using the tools they already use and use TPI for the stuff it's good at.
  • If TPI "owned" the entire resource creation process / lifecycle, be aware that you are going to get a bunch of requests to expose different things that people care about, which can significantly increase the scope of the project. If you make it work with existing cloud resources, then you can sidestep this problem and say "you can always create resources manually".

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

Very much agreed with this. Hopefully, being able to specify existing resources isn't mutually exclusive with some sort of onboarding experience where TPI can abstract some stuff for users who are new to this.

Buckets + quotas

The quota problems look pretty serious to me. Even if you can boost your quotas, I wouldn't bet on it that people would be happy to give up significant portions of their quota just for TPI.

From the outside, it looks like a pretty arbitrary decision to have a bucket per task, which also adds a lot of extra moving parts + papercuts. I would seriously consider going for directories in a bucket that already exists.

Cleanup

A separate question: would you prefer to keep the output/state/logs of the task after it is done (or has failed), or to remove the bucket or directory xpd-my-test-30g0bew1pcghg?

Cleaning up everything seems like a good default, but this should probably be configurable. "Always clean up", "Clean up on success" and "Never clean up" all make sense to me. (Not sure if there is a use case for "Only clean up on failure".)
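A hypothetical sketch of what such a policy could look like; the cleanup attribute is invented for illustration and does not exist in TPI:

resource "iterative_task" "example" {
  cloud   = "aws"
  cleanup = "on_success"   # invented attribute: "always" | "on_success" | "never"

  script = <<-END
    #!/bin/bash
    python train.py
  END
}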

duijf avatar Dec 03 '21 10:12 duijf

  • Try to play nice with existing infrastructure.
  • Directories within a bucket sounds like the best way to go.

I have the same feeling. My understanding: you create and delete a bucket when you provision a new resource like a database or a new system deployment. An experiment / train-task is not a new resource, it is just a run.

Terraform backend might be a good analogy - it is using an existing path, and it does not destroy the bucket:

terraform {
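  # The state is stored at a key inside a pre-existing bucket;
  # "terraform destroy" never deletes the bucket itself.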
  backend "s3" {
    bucket         = "terraform-up-and-running-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-2"
  }
}

dmpetrov avatar Dec 03 '21 17:12 dmpetrov

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

@dacbd would you prefer to have an ephemeral / temporary bucket like s3://xpd-my-test-30g0bew1pcghg for each task or a temp directory in a user specified path like s3://iterative-ai/ml/segment/xpd-my-test-30g0bew1pcghg?

A separate question: would you prefer to keep the output/state/logs of the task after it is done (or has failed), or to remove the bucket or directory xpd-my-test-30g0bew1pcghg?

@dmpetrov I think that an ephemeral bucket is a fine default, but given a path it should use a directory at the base of that path: given s3://reducedredunancy/bucket, it would use s3://reducedredunancy/bucket/xpd-my-test-30g0bew1pcghg; given s3://30daylifecycle/policy/bucket, it would use s3://30daylifecycle/policy/bucket/xpd-my-test-30g0bew1pcghg.

and I'll reiterate @duijf in:

I am in favor of being able to specify as many (pre-existing) resources as possible for greater management of the overall cloud provider account

Very much agreed with this. Hopefully, being able to specify existing resources isn't mutually exclusive with some sort of onboarding experience where TPI can abstract some stuff for users who are new to this.

dacbd avatar Dec 03 '21 21:12 dacbd

Thank you very much for the thorough feedback! ❤️

A good compromise would be adding a storage attribute to the task resource with the following behavior:

  1. When unset, create/delete ephemeral buckets as we do now
  2. When set, use the given prefix to create/delete “directories”

Examples

Cloud providers

storage = "bucket/path/prefix" to create “directories” on the specified bucket and (preferably) under the specified prefix.

Kubernetes

storage = "azurefile:30" to create a Persistent Volume Claim with a size of 30 GB (if applicable) from the azurefile Storage Class.

Limitations

  • Any resource specified through the storage attribute should already exist; i.e. be externally managed.
  • Leaving data in the cloud after destroying a task would only be possible when specifying the storage attribute.

Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box.

0x2b3bfa0 avatar Dec 07 '21 16:12 0x2b3bfa0

Probably related to the API proposal on https://github.com/iterative/terraform-provider-iterative/issues/307#issuecomment-979684223: we now have enough storage-related attributes to consider grouping them all into a single block.

0x2b3bfa0 avatar Dec 07 '21 16:12 0x2b3bfa0

Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box.

I think that when using a predefined bucket it would be easy enough for the user to ensure persistence before running terraform destroy; or, since task tears down the instance when execution is complete, they can persist data by simply not running terraform destroy?

dacbd avatar Dec 07 '21 16:12 dacbd

Still, leaving data in the cloud after destroying the task comes with some challenges. Not sure if we should support this persistence use case out of the box.

🤔 To my mind, it's quite the opposite.

I understand where the motivation comes from: TF destroy should clean up allocated resources. But I'm not sure how this applies to our use case. I consider the logs, data and config files that TPI copies to the cloud to be logs, and removing logs looks like a rather strange practice.

TF itself also does not follow its own rules of destroying resources. See example with TF backend https://github.com/iterative/terraform-provider-iterative/issues/299#issuecomment-985704953

It feels like we are introducing artificial rules here 🙂 I'd suggest providing maximum flexibility for users and not destroying logs until users directly ask for it (in config).

dmpetrov avatar Dec 07 '21 16:12 dmpetrov

related: terraform import existing resources to avoid the problem of creating & destroying things ourselves.
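For instance, a minimal sketch with the official AWS provider; the bucket name is illustrative:

# Declare the pre-existing bucket, then adopt it into state instead of creating it:
#   terraform import aws_s3_bucket.tasks existing-task-bucket
resource "aws_s3_bucket" "tasks" {
  bucket = "existing-task-bucket"
}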

casperdcl avatar Dec 08 '21 20:12 casperdcl