Fix: FlyteDirectory fails for multiple files with the same name
Tracking issue
Why are the changes needed?
Persisting FlyteDirectory doesn't work correctly when there are multiple files with the same name in the directory.
Let's have two files:
- /home/user/my_dir/file.txt
- /home/user/my_dir/subdir/file.txt (with identical content in both files)
And a trivial workflow:
@task()
def first_task(fd: fl.FlyteDirectory) -> None:
fd.download()
@fl.workflow
def my_workflow(fd: fl.FlyteDirectory) -> None:
first_task(fd)
Running with pyflyte run ... --fd /home/user/my_dir results in only file.txt being uploaded in the flyte S3 bucket. The file subdir/file.txt is skipped. This is of course not as expected.
In addition, note that if the files have a different content, this raises an exception in get_upload_signed_url!
USER:EntityAlreadyExists: error=None, cause=<_InactiveRpcError of RPC that terminated with:
status = StatusCode.ALREADY_EXISTS
details = "file already exists at location [s3://flyte/flytesnacks/development/ABSCCKRIMI35Z5XNGNQTO7FZIQ======/file.txt], specify a matching hash if you wish to rewrite"
debug_error_string = "UNKNOWN:Error received from peer {grpc_status:6, grpc_message:"file already exists at location [s3://flyte/flytesnacks/development/ABSCCKRIMI35Z5XNGNQTO7FZIQ======/file.txt], specify a matching hash if you wish to rewrite"}"
Our current workaround is to pack a directory into an archive and use FlyteFile.
What changes were proposed in this pull request?
This bugfix is based on the observation that get_upload_signed_url is only provided file basename and file content hash to determine the URL. This leads to the both files ending up in the same location (the directory is flatten). To fix that, I add the path relative to the local root to the filename. This seems to work and both files are uploaded.
Disclaimer: I have no idea if there are any side effects of this fix. It can be that some other functionality gets broken. Or there might be a better fix. I would appreciate a review from the maintainers:).
How was this patch tested?
Empirical test: For the minimal example above this works, the call to super()._put in FlyteFS._put() returns two objects: ['s3://flyte/flytesnacks/development/QLGCX23Y7HX3FI2LIEGVF5OW7Q======/subdir/file.txt', 's3://flyte/flytesnacks/development/QLGCX23Y7HX3FI2LIEGVF5OW7Q======/file.txt']. This works both when the file content is identical and when it's different.
Setup process
Screenshots
Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [ ] All new and existing tests passed.
- [ ] All commits are signed-off.
Related PRs
Docs link
Summary by Bito
This pull request fixes a bug in the FlyteDirectory persistence mechanism that prevented the upload of multiple files with the same name and content. By including the relative path of files in the upload process, it enhances the Flyte framework's ability to manage directories with duplicate files effectively.
Bito Automatic Review Failed - Technical Failure
Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:
Agent Run ID: c11d12ad-43fc-4e4e-8926-11ca82d446d0
Bito Automatic Review Failed - Technical Failure
Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:
Agent Run ID: f1aa79e7-8a74-4d65-85f4-69c4726bf53c
Bito Automatic Review Failed - Technical Failure
Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:
Agent Run ID: 5d0c7d53-4c07-4184-886c-62f22316e875
Codecov Report
:x: Patch coverage is 11.11111% with 8 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 75.76%. Comparing base (eb5a67f) to head (3d4e564).
:warning: Report is 21 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| flytekit/remote/remote_fs.py | 11.11% | 8 Missing :warning: |
:exclamation: There is a different number of reports uploaded between BASE (eb5a67f) and HEAD (3d4e564). Click for more details.
HEAD has 98 uploads less than BASE
Flag BASE (eb5a67f) HEAD (3d4e564) 100 2
Additional details and impacted files
@@ Coverage Diff @@
## master #3325 +/- ##
==========================================
- Coverage 85.26% 75.76% -9.50%
==========================================
Files 386 215 -171
Lines 30276 22504 -7772
Branches 2969 2954 -15
==========================================
- Hits 25814 17051 -8763
- Misses 3615 4583 +968
- Partials 847 870 +23
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
I'm not sure why the single test of attr_access_sd in integration_test_codecov is failing. Insights would be appreciated.