[Core feature] Support S3FS-like mountable file systems and S3 interchangeably

Open kumare3 opened this issue 2 years ago • 8 comments

Motivation: Why do you think this is important?

It is possible to mount S3 using S3FS. This would let users write plain file system operations to read and write raw data, while Flyte reads and writes input/output metadata. Today this can be achieved using flytekitplugins-k8s (raw pods). Another approach is to let platform maintainers switch on support for a mountable file system. In that case Flyteplugins (pods) can simply translate the s3://... URIs to their local filesystem equivalents. This would also make it easy to interoperate with systems that do not support mountable file systems.

  • [ ] Document how to use S3FS etc using pods
  • [ ] Modify Flyteplugins to support this and document the protocol with a usecase sample
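
The URI translation mentioned above could look roughly like the sketch below. It is only an illustration, not Flyteplugins code: the function name `s3_uri_to_mount_path`, the mount root `/mnt/s3`, and the `<mount_root>/<bucket>/<key>` layout are all assumptions made for the example.

```python
from pathlib import PurePosixPath

S3_PREFIX = "s3://"


def s3_uri_to_mount_path(uri: str, mount_root: str = "/mnt/s3") -> str:
    """Map s3://bucket/key to <mount_root>/bucket/key (hypothetical layout)."""
    if not uri.startswith(S3_PREFIX):
        # Not an S3 URI: return it unchanged so local paths keep working.
        return uri
    bucket, _, key = uri[len(S3_PREFIX):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    # Strip leading slashes so the key can never reset the path to the root.
    return str(PurePosixPath(mount_root) / bucket / key.lstrip("/"))
```

With such a mapping, task code that already uses ordinary file operations would not need to change; only the path it receives would differ depending on whether the platform has the mount enabled.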

Goal: What should the final outcome look like, ideally?

  • [ ] Users should have documentation explaining how these file systems can be mounted.
  • [ ] Users with existing workflows should be able to transparently use S3FS without changing code

Describe alternatives you've considered

NA

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

kumare3 avatar Dec 10 '21 05:12 kumare3

I agree! we should look into K8s PVs and PVCs through CSIs (e.g. s3 csi) to offer this regardless of the target container language/SDK.

EngHabu avatar Dec 10 '21 05:12 EngHabu

I agree! we should look into K8s PVs and PVCs through CSIs (e.g. s3 csi) to offer this regardless of the target container language/SDK.

I know you want this 😉. But I want to make it swappable in Python first. That way we can say that, for other languages, this is the only supported method.

kumare3 avatar Dec 10 '21 06:12 kumare3

A few things to note here. s3fs is uncomfortably slow and is not POSIX compliant. In our case, we need a fast, POSIX-compliant shared file system. ObjectiveFS achieves this, but it is a SaaS product and thus not free to use. The middle ground, I believe, is to support access to the /dev/fuse device from inside Flyte task pod containers, allowing users to create mounts with whatever backend they need. I achieved this by running the containers as privileged, but it could be done more securely by exposing the host mount utilities and privileges if they exist, which we will do before moving these containers into production.

Our use case is mounting multiple terabytes of data for computational tasks. These are generally not workflow inputs but rather large supporting datasets. If we were to attach EBS volumes, we would need to create them on the fly or keep a cache of volumes, which is unwieldy and expensive. If we were to use EFS, our costs would be 10x those of S3 and operations would be slow. Some fast shared file system is needed, with POSIX compliance being nearly necessary to trust the filesystem (https://github.com/s3fs-fuse/s3fs-fuse#limitations).

AidanAbd avatar Dec 10 '21 15:12 AidanAbd

s3fs is uncomfortably slow and is not POSIX compliant.

I agree with this point, but the advantages of S3 are that it is cheap and scalable. In most of our cases, the CPU/GPU is the bottleneck, so we are happy with the performance of S3. POSIX compliance is not mandatory; a file-like object is enough for data processing and training. If throughput matters, compression should mitigate the speed issue. The compression ratio of most dataset files is between 1/2 and 1/20.

I am trying to subclass FlyteFile so that it reads through S3FS and skips copying the data to the local filesystem. Automatic compression is another feature we would like.
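
As a minimal sketch of the "file-like object plus compression" idea, using only the standard library: the in-memory `io.BytesIO` buffer below stands in for a remote S3 object, and the helper name `count_lines_gz` is made up for the example. It does not involve FlyteFile or S3FS itself.

```python
import gzip
import io


def count_lines_gz(fileobj) -> int:
    """Count lines in a gzip stream read from any binary file-like object."""
    # gzip.open accepts an already-open file object, so the data is
    # decompressed as it is read and never copied to a local file first.
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as text:
        return sum(1 for _ in text)


# Stand-in for a remote object: a gzip payload held in memory.
raw = ("row,value\n" * 1000).encode("utf-8")
payload = gzip.compress(raw)
remote_like = io.BytesIO(payload)

n_lines = count_lines_gz(remote_like)
ratio = len(payload) / len(raw)  # compressed size relative to raw size
```

The same function would work on any object that exposes a binary `read`, which is all that sequential data processing typically needs. The ratio here comes from highly repetitive synthetic data and says nothing about real datasets.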

highfly22 avatar Aug 05 '22 03:08 highfly22

@highfly22 could you join slack.Flyte.org and help with ideas around this. Welcome to the community

kumare3 avatar Aug 07 '22 16:08 kumare3

@highfly22 could you join slack.Flyte.org and help with ideas around this. Welcome to the community

Sure. I wrote a gist to demo my idea.

https://gist.github.com/highfly22/54f9d976254ff373037aac34196dcaaf

highfly22 avatar Aug 09 '22 03:08 highfly22

cc @wild-endeavor / @pingsutw wdyt?

kumare3 avatar Aug 09 '22 05:08 kumare3

@highfly22 Good idea. One suggestion: we can add an open method to dataPersistence and update your code to:

ff = to_python_value_orig(ctx, lv, expected_python_type)
ff.open_auto = types.MethodType(ctx.file_access.open(uri), ff)
return ff

Feel free to open a PR for it ❤️

pingsutw avatar Aug 09 '22 08:08 pingsutw

This work has been superseded by the move to fsspec that happened around the Flyte 1.5 release.
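
For readers unfamiliar with fsspec, the sketch below shows the general idea of one code path serving several storage backends. It assumes the `fsspec` package is installed and uses its built-in in-memory filesystem as a stand-in for a remote store; reading real `s3://` URLs would additionally need the `s3fs` package and credentials. The helper `read_text` is a name chosen for the example, and none of this is flytekit code.

```python
import os
import tempfile

import fsspec


def read_text(url: str) -> str:
    """Read a text file through fsspec; the URL scheme selects the backend."""
    with fsspec.open(url, mode="rt", encoding="utf-8") as f:
        return f.read()


# A local file, addressed by a plain path.
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, "hello.txt")
with open(local_path, "w", encoding="utf-8") as f:
    f.write("hello from disk")

# The in-memory filesystem stands in for a remote store such as s3://.
with fsspec.open("memory://hello.txt", mode="wt", encoding="utf-8") as f:
    f.write("hello from memory")

local_text = read_text(local_path)
memory_text = read_text("memory://hello.txt")
```

The calling code is identical for both locations, which is the kind of interchangeability this issue originally asked for.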

eapolinario avatar Aug 07 '23 17:08 eapolinario