
filesystem: support btrfs/xfs

Open casperdcl opened this issue 2 years ago • 11 comments

The usual default is ext4, which doesn't support reflinks. It would be great to make it easy to choose (or maybe default to) a different filesystem.

Probably needs block-based storage (e.g. AWS EBS) and formatting.

  • related https://github.com/iterative/cml/issues/561#issuecomment-871019350
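For context, a quick way to probe reflink support on a given filesystem (paths are illustrative; `cp --reflink=always` fails on ext4 but succeeds on btrfs or XFS formatted with `reflink=1`):

```shell
# Probe reflink support on the filesystem backing mktemp's directory.
src=$(mktemp) && dst="${src}.clone"
if cp --reflink=always "$src" "$dst" 2>/dev/null; then
    echo "reflink supported on $(stat -f -c %T "$src")"
else
    echo "no reflink support on $(stat -f -c %T "$src") (e.g. ext4)"
fi
rm -f "$src" "$dst"
```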

casperdcl avatar Dec 07 '21 06:12 casperdcl

Does this really belong to the CML repository? 🤔

0x2b3bfa0 avatar Dec 07 '21 10:12 0x2b3bfa0

Well, even though the implementation would need to be done in TPI, it would also need to be exposed in CML (unless we make it the default behaviour).

casperdcl avatar Dec 07 '21 10:12 casperdcl

If custom images are/become supported, would that be better to handle on the image front?

dacbd avatar Dec 07 '21 16:12 dacbd

Images could handle formatting, but I'm not sure whether that would be enough; surely the underlying filesystem needs to support it (i.e. be block-like)?

casperdcl avatar Dec 07 '21 17:12 casperdcl

I don't follow; can't you build an image using a different FS? Yes, the FS needs to support reflinks for DVC to take advantage of them. Setting this up makes sense to me on long-lived systems to reduce disk space consumed, but does it make sense for a CML runner, where the main feature is the ephemeral aspect of the training instance?

dacbd avatar Dec 07 '21 17:12 dacbd

This belongs to terraform-provider-iterative

DavidGOrtega avatar Dec 07 '21 17:12 DavidGOrtega

Note: as per https://github.com/iterative/cml/issues/561#issuecomment-871019350, we've chosen to use object storage instead of block storage for caching.

If custom images are/become supported, would that be better to handle on the image front?

Yes, if you're willing to build custom images and use the same disk for both operating system and data. We already support this scenario.

[...] surely the underlying filesystem needs to support it (i.e. be block-like)?

Block filesystems only work on block devices... or on loop devices. 🤔 What other scenarios do you have in mind? Putting a block filesystem on top of object storage? 🙃
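To make the loop-device idea concrete, a sketch (the paths, size, and xfsprogs dependency are assumptions; the commented steps need root):

```shell
# Create a sparse backing file on the (possibly ext4) root disk.
truncate -s 1G /tmp/cache.img

# The rest needs root and xfsprogs; shown for illustration only:
#   losetup /dev/loop0 /tmp/cache.img
#   mkfs.xfs -m reflink=1 /dev/loop0     # reflink-capable filesystem
#   mkdir -p /mnt/dvc-cache && mount /dev/loop0 /mnt/dvc-cache
```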

0x2b3bfa0 avatar Dec 08 '21 15:12 0x2b3bfa0

I don't follow, can't you build an image using a different FS? Yes, the fs needs to support reflinks for DVC to take advantage, setting this up makes sense to me on long-lived systems to reduce disk space consumed, but does this make sense for a cml runner use where the main feature is the ephemeral aspect of the training instance?

For a CML runner, no, this isn't a requirement. But that's not the use case we are talking about here. In DVC, we would like to be able to start a (potentially long-lived) machine and run a lot of DVC experiments on it that all share a common cache and benefit from being able to reflink to/from that common cache. So yes, having access to standardized images that use btrfs/xfs instead of ext4 would be very nice to have.
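On a reflink-capable filesystem, DVC's documented `cache.type` option can then prefer reflinks and fall back to copies where the filesystem doesn't support them; as a `.dvc/config` fragment:

```ini
[cache]
# reflink where the filesystem supports it, plain copy otherwise
    type = reflink,copy
```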

Yes, if you're willing to build custom images and use the same disk for both operating system and data. We already support this scenario.

If we (on the DVC side) need to figure out how to make our own default images for each (aws/gcs/etc) platform, then we can do that, but this is still something I would expect to be provided by TPI.

And just to be clear, I understand that block volumes can only be attached to a single machine instance at a time. I'm not talking about having multiple machine instances sharing a DVC cache. I'm talking about having multiple jobs running (either sequentially or in parallel) on the single machine instance, and being able to take advantage of having an FS that supports reflinks on that single instance.

pmrowla avatar Dec 09 '21 03:12 pmrowla

I guess we have really enjoyed how independent each of these tools are, and I could see how dvc exp and TPI's task could work together, but I'm missing what the Iterative vision for this is really trying to accomplish.

IMO this is reaching beyond what a "terraform provider" should try to do. If I were to set up a long-lived instance for a data scientist, I would probably use the regular aws/gcp terraform provider, configure it with something like Ansible, and they could use a remote connection with VS Code.

I guess having these premade images would be nice, and having an additional data disk that outlives the instance and can be mounted to another machine later could also be nice, but keeping large EBS disks can get pretty pricey over the long term.

dacbd avatar Dec 09 '21 16:12 dacbd

I guess we have really enjoyed how independent each of these tools are, and I could see how dvc exp and TPI's task could work together, but I'm missing what the Iterative vision for this is really trying to accomplish.

This does not necessarily have to be provided by iterative_task.

IMO this is reaching beyond what a "terraform provider" should try to do. If I were to set up a long-lived instance for a data scientist, I would probably use the regular aws/gcp terraform provider, configure it with something like Ansible, and they could use a remote connection with VS Code.

This is also more along the lines of what the DVC team was thinking. So terraform-provider-iterative would be used to define the cloud-agnostic config for provisioning standardized machines across aws/gcp/etc (i.e. using iterative_machine strictly for machine provisioning, startup, and teardown, but doing everything else separately from Terraform).
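A sketch of that split, using `iterative_machine` only for provisioning (attribute names follow TPI's machine resource docs as best recalled; the device path and script body are assumptions, not a tested configuration):

```hcl
resource "iterative_machine" "trainer" {
  cloud         = "aws"
  region        = "us-west"
  instance_type = "m"

  # Format and mount a reflink-capable data volume at boot;
  # everything else (Ansible, DVC setup) happens outside Terraform.
  startup_script = <<-END
    #!/bin/sh
    mkfs.xfs -m reflink=1 /dev/nvme1n1   # assumed secondary volume
    mkdir -p /data && mount /dev/nvme1n1 /data
  END
}
```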

pmrowla avatar Dec 10 '21 01:12 pmrowla

I think this can be wrapped up as a feature for mounting cache/persistent volumes. It doesn't quite enable what @pmrowla described, but it's a step in that direction.

dacbd avatar Oct 17 '22 14:10 dacbd