[Core feature] Support S3FS-like mountable file systems and S3 interchangeably
Motivation: Why do you think this is important?
It is possible to mount S3 using S3FS. This would let users write ordinary file system operations to read and write raw data, while Flyte reads and writes input/output metadata. Today this can be achieved using flytekitplugins-k8s (raw pods). Another method could be to allow platform maintainers to switch on support for a mountable file system. In that case Flyteplugins (pods) can simply translate the s3://... URIs to local filesystem equivalents. This would also make it possible to easily interoperate with systems that do not support mountable file systems.
- [ ] Document how to use S3FS etc. using pods
- [ ] Modify Flyteplugins to support this and document the protocol with a use-case sample
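The URI translation mentioned above could look roughly like the following sketch. It is not Flyteplugins code: the mount root `/mnt/s3`, the `<mount_root>/<bucket>/<key>` layout, and both function names are assumptions made only for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mount root; a platform maintainer would configure the real one.
MOUNT_ROOT = PurePosixPath("/mnt/s3")


def s3_uri_to_local_path(uri, mount_root=MOUNT_ROOT):
    """Map s3://bucket/key to <mount_root>/bucket/key (assumed layout)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in URI: {uri!r}")
    # Reject keys that would escape the bucket directory under the mount.
    if key.startswith("/") or ".." in key.split("/"):
        raise ValueError(f"unsafe key in URI: {uri!r}")
    return mount_root / bucket / key


def local_path_to_s3_uri(path, mount_root=MOUNT_ROOT):
    """Inverse mapping; raises ValueError if path is not under mount_root."""
    relative = PurePosixPath(path).relative_to(mount_root)
    bucket, *key_parts = relative.parts
    return f"s3://{bucket}/" + "/".join(key_parts)


local = s3_uri_to_local_path("s3://my-bucket/data/train.csv")
round_trip = local_path_to_s3_uri(local)
```

Keeping the mapping reversible means outputs written under the mount can still be recorded as s3:// URIs, so systems without a mount can read the same metadata.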
Goal: What should the final outcome look like, ideally?
- [ ] Users should have documentation that explains how these file systems can be mounted.
- [ ] Users with existing workflows should be able to transparently use S3FS without changing code
Describe alternatives you've considered
NA
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes
I agree! We should look into K8s PVs and PVCs through CSI drivers (e.g. an S3 CSI driver) to offer this regardless of the target container language/SDK.
I know you want this 😉. But I want to make it swappable in Python first. That way we can say that for other languages this is the only supported method.
A few things to note here. s3fs is uncomfortably slow and is not POSIX compliant. In our case, we need a fast, POSIX-compliant shared file system. ObjectiveFS achieves this, but it is a SaaS product and thus is not free to use. The middle ground, I believe, is to support access to the /dev/fuse device from inside Flyte task pod containers, allowing users to create mounts with whatever backend they need. I achieved this by running the containers as privileged, but it could be done more securely by exposing the host mount utilities and privileges if they exist, which we will do before moving these containers into production.
Our use case is mounting multiple terabytes of data for computational tasks. These are generally not workflow inputs but rather large supporting datasets. If we were to attach EBS volumes, we would need to create them on the fly or keep a cache of volumes, which is unwieldy and expensive. If we were to use EFS, our costs would be 10x those of S3 and operations would be slow. A fast shared file system is needed, with POSIX compliance being nearly necessary to trust the filesystem (https://github.com/s3fs-fuse/s3fs-fuse#limitations).
> s3fs is uncomfortably slow and is not POSIX compliant.
I agree with this point, but S3 has the advantages of being cheap and scalable. In most of our cases, the CPU/GPU is the bottleneck, so we are happy with the performance of S3. POSIX compliance is not mandatory; a file-like object is enough for data processing and training. If throughput matters, compression should mitigate the speed issue: the compression ratio of most dataset files is between 1/2 and 1/20.
I am trying to subclass FlyteFile so that it reads through S3FS and skips copying data to the local filesystem. Auto-compression is another feature we would like.
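As a rough illustration of the "file-like object plus compression" idea, the sketch below streams gzip-compressed lines from a binary stream using only the Python standard library. The in-memory `io.BytesIO` buffer is a stand-in for an object opened from S3, and the helper names are hypothetical; this is not the code from the gist.

```python
import gzip
import io


def compress_lines(lines):
    """Return gzip-compressed bytes for the given text lines."""
    buffer = io.BytesIO()
    with gzip.open(buffer, mode="wt", encoding="utf-8") as writer:
        for line in lines:
            writer.write(line + "\n")
    return buffer.getvalue()


def iter_lines(raw_stream):
    """Lazily yield decoded lines from a gzip-compressed binary stream."""
    with gzip.open(raw_stream, mode="rt", encoding="utf-8") as reader:
        for line in reader:
            yield line.rstrip("\n")


rows = [f"sample-{i},0.5" for i in range(1000)]
payload = compress_lines(rows)

# Stand-in for a remote object; any readable binary file-like object works here.
remote_object = io.BytesIO(payload)
decoded = list(iter_lines(remote_object))
```

Because the reader only needs sequential reads, no POSIX semantics are required, and fewer bytes cross the network than with the uncompressed file.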
@highfly22 could you join slack.Flyte.org and help with ideas around this? Welcome to the community!
Sure. I wrote a gist to demo my idea.
https://gist.github.com/highfly22/54f9d976254ff373037aac34196dcaaf
cc @wild-endeavor / @pingsutw wdyt?
@highfly22 Good idea. One suggestion: we can add an open method to dataPersistence and update your code to:
```python
ff = to_python_value_orig(ctx, lv, expected_python_type)
ff.open_auto = types.MethodType(ctx.file_access.open(uri), ff)
return ff
```
Feel free to open a PR for it ❤️
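A self-contained sketch of the suggestion above, for readers without access to the gist. `FakeFileAccess` and `FakeFile` are hypothetical stand-ins, not flytekit classes, and the `open` method on the persistence layer is the proposed addition rather than an existing API. Note that `types.MethodType` expects a callable as its first argument, so this sketch binds a small function that opens the URI lazily instead of passing an already-opened stream.

```python
import io
import types


class FakeFileAccess:
    """Hypothetical stand-in for a data persistence layer with an open method."""

    def __init__(self, blobs):
        self._blobs = blobs  # maps URI -> bytes, simulating remote objects

    def open(self, uri, mode="rb"):
        if mode != "rb":
            raise ValueError("this sketch only supports binary reads")
        return io.BytesIO(self._blobs[uri])


class FakeFile:
    """Hypothetical stand-in for a FlyteFile-like value that only knows its URI."""

    def __init__(self, uri):
        self.uri = uri


def attach_open(file_value, file_access):
    """Bind an open_auto method that streams the file without a local copy."""

    def open_auto(self, mode="rb"):
        return file_access.open(self.uri, mode)

    file_value.open_auto = types.MethodType(open_auto, file_value)
    return file_value


access = FakeFileAccess({"s3://bucket/a.txt": b"hello"})
ff = attach_open(FakeFile("s3://bucket/a.txt"), access)
with ff.open_auto() as stream:
    content = stream.read()
```

Binding the method per instance leaves the class itself untouched, which matches the monkey-patching style of the snippet above.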
This work has been superseded by the move to fsspec that happened around the Flyte 1.5 release.