flytekit icon indicating copy to clipboard operation
flytekit copied to clipboard

Fix: FlyteDirectory fails for multiple files with the same name

Open mys007 opened this issue 3 months ago • 5 comments

Tracking issue

Why are the changes needed?

Persisting FlyteDirectory doesn't work correctly when there are multiple files with the same name in the directory.

Let's have two files:

  • /home/user/my_dir/file.txt
  • /home/user/my_dir/subdir/file.txt (with identical content in both files)

And a trivial workflow:

@task()
def first_task(fd: fl.FlyteDirectory) -> None:
    fd.download()

@fl.workflow
def my_workflow(fd: fl.FlyteDirectory) -> None:
   first_task(fd)

Running with pyflyte run ... --fd /home/user/my_dir results in only file.txt being uploaded in the flyte S3 bucket. The file subdir/file.txt is skipped. This is of course not as expected.

In addition, note that if the files have a different content, this raises an exception in get_upload_signed_url!

USER:EntityAlreadyExists: error=None, cause=<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.ALREADY_EXISTS
        details = "file already exists at location [s3://flyte/flytesnacks/development/ABSCCKRIMI35Z5XNGNQTO7FZIQ======/file.txt], specify a matching hash if you wish to rewrite"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:6, grpc_message:"file already exists at location [s3://flyte/flytesnacks/development/ABSCCKRIMI35Z5XNGNQTO7FZIQ======/file.txt], specify a matching hash if you wish to rewrite"}"

Our current workaround is to pack a directory into an archive and use FlyteFile.

What changes were proposed in this pull request?

This bugfix is based on the observation that get_upload_signed_url is only provided file basename and file content hash to determine the URL. This leads to the both files ending up in the same location (the directory is flatten). To fix that, I add the path relative to the local root to the filename. This seems to work and both files are uploaded.

Disclaimer: I have no idea if there are any side effects of this fix. It can be that some other functionality gets broken. Or there might be a better fix. I would appreciate a review from the maintainers:).

How was this patch tested?

Empirical test: For the minimal example above this works, the call to super()._put in FlyteFS._put() returns two objects: ['s3://flyte/flytesnacks/development/QLGCX23Y7HX3FI2LIEGVF5OW7Q======/subdir/file.txt', 's3://flyte/flytesnacks/development/QLGCX23Y7HX3FI2LIEGVF5OW7Q======/file.txt']. This works both when the file content is identical and when it's different.

Setup process

Screenshots

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [ ] All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This pull request fixes a bug in the FlyteDirectory persistence mechanism that prevented the upload of multiple files with the same name and content. By including the relative path of files in the upload process, it enhances the Flyte framework's ability to manage directories with duplicate files effectively.

mys007 avatar Sep 02 '25 23:09 mys007

Bito Automatic Review Failed - Technical Failure

Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:

Agent Run ID: c11d12ad-43fc-4e4e-8926-11ca82d446d0

flyte-bot avatar Sep 03 '25 00:09 flyte-bot

Bito Automatic Review Failed - Technical Failure

Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:

Agent Run ID: f1aa79e7-8a74-4d65-85f4-69c4726bf53c

flyte-bot avatar Sep 03 '25 08:09 flyte-bot

Bito Automatic Review Failed - Technical Failure

Bito encountered technical difficulties while generating code feedback . To retry, type /review in a comment and save. If the issue persists, contact [email protected] and provide the following details:

Agent Run ID: 5d0c7d53-4c07-4184-886c-62f22316e875

flyte-bot avatar Sep 03 '25 23:09 flyte-bot

Codecov Report

:x: Patch coverage is 11.11111% with 8 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 75.76%. Comparing base (eb5a67f) to head (3d4e564). :warning: Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
flytekit/remote/remote_fs.py 11.11% 8 Missing :warning:

:exclamation: There is a different number of reports uploaded between BASE (eb5a67f) and HEAD (3d4e564). Click for more details.

HEAD has 98 uploads less than BASE
Flag BASE (eb5a67f) HEAD (3d4e564)
100 2
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3325      +/-   ##
==========================================
- Coverage   85.26%   75.76%   -9.50%     
==========================================
  Files         386      215     -171     
  Lines       30276    22504    -7772     
  Branches     2969     2954      -15     
==========================================
- Hits        25814    17051    -8763     
- Misses       3615     4583     +968     
- Partials      847      870      +23     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Sep 05 '25 12:09 codecov[bot]

I'm not sure why the single test of attr_access_sd in integration_test_codecov is failing. Insights would be appreciated.

mys007 avatar Sep 05 '25 20:09 mys007