AWS S3 sync does not sync all the files
We have several hundred thousand files, and S3 sync generally copies them reliably. However, we have noticed several files that were changed about a year ago; their contents differ between source and destination, but they never sync or update.
The source and destination timestamps also differ, yet the sync never happens. S3 has the more recent file.
The command is as follows: aws s3 sync s3://source /local-folder --delete
All the files that do not sync have the same date but are spread across multiple different folders.
Is there an S3 touch command to change the timestamp and possibly get the files to sync again?
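There is no dedicated "touch" among the s3 commands, but one commonly suggested approximation is to copy an object over itself with --metadata-directive REPLACE, which rewrites the object and refreshes its Last-Modified. A rough sketch, with placeholder paths:

```
# Copying an object over itself with REPLACE metadata refreshes Last-Modified.
# Paths are placeholders; point them at the objects that refuse to sync.
aws s3 cp s3://source/path/to/file s3://source/path/to/file --metadata-directive REPLACE

# Or for a whole prefix:
aws s3 cp s3://source/prefix/ s3://source/prefix/ --recursive --metadata-directive REPLACE
```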
You can possibly use --exact-timestamps to work around this, though that may result in excess uploads if you're uploading.
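For reference, the flag goes on the sync invocation itself; the bucket and path below are placeholders:

```
# With --exact-timestamps, same-sized files are transferred whenever the
# timestamps differ, instead of only when the destination copy is older.
aws s3 sync s3://source /local-folder --delete --exact-timestamps
```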
To help in reproducing, could you get me some information about one of the files that isn't syncing? (Commands for collecting these are sketched after the list.)
- What is the exact file size locally?
- What is the exact file size in S3?
- What is the last modified time locally?
- What is the last modified time in S3?
- Is the local file a symlink / behind a symlink?
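One way to collect those details (bucket, key, and local path below are placeholders):

```
# Size and Last-Modified of the object in S3
aws s3api head-object --bucket my-bucket --key path/to/file

# Size and modification time of the local copy
stat /local-folder/path/to/file

# Shows whether the local path is a symlink (or sits behind one)
ls -l /local-folder/path/to/file
```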
Example command run: aws s3 sync s3://bucket/ /var/www/folder/ --delete
Several files are missing.
- Exact local size: 2625
- Exact S3 size: 2625
- Exact timestamp local: 06-Jan-2017 9:32:31
- Exact timestamp S3: 20-Jun-2017 10:14:57
- Normal file in both S3 and local
There are several cases like that in a listing of around 50,000 files. However, all of the files missing from the sync have various timestamps on 20 Jun 2017.
Using --exact-timestamps causes many more files to be downloaded even though their contents are identical. However, it still misses the files in the example above.
Same issue here.
aws s3 sync dist/ s3://bucket --delete did not update s3://bucket/index.html from dist/index.html.
dist/index.html and s3://bucket/index.html have the same file size, but their modification times are different.
Actually, sometimes awscli did upload the file, but sometimes not.
Same here, --exact-timestamps doesn't help - index.html is not overwritten.
We experienced this issue as well today/last week. Again, index.html has the same file size, but the contents and modified times are different.
Is anybody aware of a workaround for this?
I just ran into this. Same problem as reported by @icymind and @samdammers: the contents of my (local) index.html file had changed, but its file size was the same as the earlier copy in S3. The `aws s3 sync` command didn't upload it. My "workaround" was to delete index.html from S3, and then run the sync again (which then uploaded it as if it were a new file, I guess).
Server: EC2 linux
Version: aws-cli/1.16.108 Python/2.7.15 Linux/4.9.62-21.56.amzn1.x86_64 botocore/1.12.98
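The delete-and-resync workaround above, spelled out with placeholder bucket and local directory names:

```
# Remove the stale object so the next sync treats it as a new file
aws s3 rm s3://my-bucket/index.html

# Re-run the sync; index.html is uploaded again
aws s3 sync dist/ s3://my-bucket --delete
```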
After running aws s3 sync over 270 TB of data I lost a few GB of files. Sync didn't copy files with special characters at all.
Example of file /data/company/storage/projects/1013815/3.Company Estimates/B. Estimates
Had to use cp -R -n
Same issue here: an XML file of the same size but a different timestamp is not synced correctly.
I was able to reproduce this issue
Download the attached bug.tar.gz file and then:
tar -zxvf bug.tar.gz
aws s3 sync a/ s3://<some-bucket-name>/<some_dir>/ --delete
aws s3 sync b/ s3://<some-bucket-name>/<some_dir>/ --delete
you'll see that even though repomd.xml in directories a and b differs in contents and timestamps, attempting to sync b doesn't do anything.
Tested on:
aws-cli/1.16.88 Python/2.7.15 Darwin/16.7.0 botocore/1.12.78
aws-cli/1.16.109 Python/2.7.5 Linux/3.10.0-693.17.1.el7.x86_64 botocore/1.12.99
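To confirm that the second sync left the old contents in place, you can stream the object back and diff it against both local copies (this assumes repomd.xml sits directly under <some_dir>; adjust the key if it is nested):

```
# After syncing b/, the object still matches a/repomd.xml rather than b/repomd.xml
aws s3 cp s3://<some-bucket-name>/<some_dir>/repomd.xml - | diff - a/repomd.xml
aws s3 cp s3://<some-bucket-name>/<some_dir>/repomd.xml - | diff - b/repomd.xml
```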
I'm seeing the same issue: trying to sync a directory of files from S3, in which one file was updated, to a local directory. That file does not get updated in the local directory.
I'm seeing this too. In my case it's a react app with index.html that refers to generated .js files. I'm syncing them with the --delete option to delete old files which are no longer referred to. The index.html is sometimes not uploaded, resulting in an old index.html which points to .js files which no longer exist.
Hence my website stops working!!!
I'm currently clueless as to why this is happening.
Does anyone have any ideas or workarounds ?
We have the same problem, but just found a workaround. I know, it is not the best way, but it works:
aws s3 cp s3://SRC s3://DEST ...
aws s3 sync s3://SRC s3://DEST ... --delete
It seems to us that the copy works fine, so first we copy, and after that we use the sync command to delete files that are no longer present. Hopefully the issue will be fixed ASAP.
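A concrete version of that two-step workaround (SRC and DEST are placeholder bucket names):

```
# Step 1: copy everything; cp does not compare timestamps, so changed files
# of identical size are still overwritten.
aws s3 cp s3://SRC s3://DEST --recursive

# Step 2: sync with --delete only to remove files no longer present in the source.
aws s3 sync s3://SRC s3://DEST --delete
```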
I added --exact-timestamps to my pipeline and the problem hasn't recurred. But it was intermittent in the first place, so I can't be sure it fixed it. If it happens again I'll go with @marns93's suggestion.
We've met this problem and --exact-timestamps resolves our issue. I'm not sure if it's exactly the same problem.
I'm seeing this issue, and it's very obvious because each call only has to copy a handful (under a dozen) files.
The situation in which it happens is just like reported above: if the folder being synced into contains a file with different file contents but identical file size, sync will skip copying the new updated file from S3.
We ended up changing scripts to aws s3 cp --recursive to fix it, but this is a nasty bug -- for the longest time we thought we had some kind of race condition in our own application, not realizing that aws-cli was simply choosing not to copy the updated file(s).
I saw this as well with an html file
aws-cli/1.16.168 Python/3.6.0 Windows/2012ServerR2 botocore/1.12.158
I copy pasted the s3 sync command from a GitHub gist and it had --size-only set on it. Removing that fixed the problem!
Just ran into this issue with build artifacts being uploaded to a bucket. Our HTML tended to only change hash codes for asset links and so size was always the same. S3 sync was skipping these if the build was too soon after a previous one. Example:
10:01 - Build 1 runs
10:05 - Build 2 runs
10:06 - Build 1 is uploaded to S3
10:10 - Build 2 is uploaded to S3
Build 2 has HTML files with a timestamp of 10:05, however the HTML files uploaded to s3 by build 1 have a timestamp of 10:06 as that's when the objects were created. This results in them being ignored by s3 sync as remote files are "newer" than local files.
I'm now using s3 cp --recursive followed by s3 sync --delete as suggested earlier.
Hope this might be helpful to someone.
I had the same issue earlier this week; I was not using --size-only. Our index.html was different by a single character (. went to #), so the size was the same, but the timestamp on s3 was 40 minutes earlier than the timestamp of the new index.html. I deleted the index.html as a temporary workaround, but it's infeasible to double check every deployment.
The same here: files with the same name but with different timestamps and content are not synced from S3 to local, and --delete does not help.
We experience the same issue. An index.html with same size but newer timestamp is not copied.
This issue was reported over a year ago. Why is it not fixed?
Actually it makes the sync command useless.
--exact-timestamps fixed the issue
I am also affected by this issue. I added --exact-timestamps and it seemed to fix the files I was looking at, though I have not done an exhaustive search. I have on the order of 100k files and 20 GB, a lot less than the others in here.
I have faced the same issue: aws s3 sync skips some files, even with different contents and different dates. The log shows those skipped files as synced, but actually they are not.
But when I run aws s3 sync again, those files get synced. Very weird!
I had this issue when building a site with Hugo and I finally figured it out. I use submodules for my Hugo theme and was not pulling them down on CI. This was causing warnings in Hugo but not failures.
# On local
| EN
-------------------+-----
Pages | 16
Paginator pages | 0
Non-page files | 0
Static files | 7
Processed images | 0
Aliases | 7
Sitemaps | 1
Cleaned | 0
# On CI
| EN
-------------------+-----
Pages | 7
Paginator pages | 0
Non-page files | 0
Static files | 2
Processed images | 0
Aliases | 0
Sitemaps | 1
Cleaned | 0
Once I updated the submodules everything worked as expected.
We've also been affected by this issue, so much so that a platform went down for ~18 hours after a new vendor/autoload.php file didn't sync, and was out of date with vendor/composer/autoload_real.php so the whole app couldn't load.
This is a very strange problem, and I can't believe the issue has been open for this long.
Why would a sync not use hashes instead of last modified? Makes 0 sense.
For future Googlers, a redacted error I was getting:
---
PHP message: PHP Fatal error: Uncaught Error: Class 'ComposerAutoloaderInitXXXXXXXXXXXXX' not found in /xxx/xxx/vendor/autoload.php:7
Stack trace:
#0 /xxx/xxx/bootstrap/app.php(3): require_once()
#1 /xxx/xxx/public/index.php(14): require('/xxx/xxx...')
#2 {main}
thrown in /xxx/xxx/vendor/autoload.php on line 7" while reading response header from upstream: ...
---
The same problem, not all files are synced, --exact-timestamps didn't help.
aws --version
aws-cli/1.18.22 Python/2.7.13 Linux/4.14.152-127.182.amzn2.x86_64 botocore/1.15.22
I cannot believe this ticket has been open so long... same problem here. Where is Amazon's customer obsession?
I can't believe this ticket was not closed some time ago. As far as I can tell, it works as designed, but users (including me) make assumptions about how it should work and are then surprised when it doesn't behave how they expected.
- When a file is synced or copied to s3, the timestamp it receives on the bucket is the date it was copied, which is always newer than the date of the source file. This is just how s3 works.
- Files are only synced if the size changes, or the timestamp on the target is older than the source.
- This means that if source files are updated but the size of the files remains unchanged and the dates on those changed files pre-date when they were last copied, s3 sync will not sync them again.
- Using --exact-timestamps only works when copying from S3 to local. It is deliberately not enabled for local to S3 because the timestamps are never equal, so setting it when syncing from local to S3 has no effect.
- I don't think S3 calculates hashes for uploaded files, so there's no way of avoiding file size and last uploaded date as checks.
Bottom line is that it works as intended, but there are various use cases where this is not desirable. As mentioned above I've worked around it using s3 cp --recursive
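Roughly illustrated, the check described above looks like this (not the actual implementation; bucket, key, and path are placeholders, and GNU stat/date are assumed):

```
# Upload only if the sizes differ or the local file is newer than the object.
local_file=/local-folder/index.html
bucket=my-bucket
key=index.html

local_size=$(stat -c %s "$local_file")
local_mtime=$(stat -c %Y "$local_file")
remote_size=$(aws s3api head-object --bucket "$bucket" --key "$key" --query ContentLength --output text)
remote_mtime=$(date -d "$(aws s3api head-object --bucket "$bucket" --key "$key" --query LastModified --output text)" +%s)

if [ "$local_size" != "$remote_size" ] || [ "$local_mtime" -gt "$remote_mtime" ]; then
  echo "sync would upload $local_file"
else
  echo "sync would skip $local_file"   # same size and local copy not newer
fi
```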
@jam13 thanks for the explanation, now it all makes sense in hindsight!
Nevertheless, I'd argue that this is currently poorly documented: I would have expected a fat red warning in the documentation stating that --exact-timestamps only works from S3 to local, and for the CLI to bail out instead of silently ignoring the parameter. An optional hash-based comparison mode is also necessary to implement a reliably working synchronisation mode.
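As a sketch of the kind of hash-based check being suggested: for objects uploaded in a single part (no multipart, no SSE-KMS) the ETag is the MD5 of the contents, so comparing it against a local MD5 catches same-size changes. Bucket, key, and path below are placeholders.

```
# Copy the file only when its MD5 differs from the object's ETag.
# Note: the ETag equals the MD5 only for single-part, non-KMS uploads.
local_md5=$(md5sum /local-folder/index.html | awk '{print $1}')
remote_etag=$(aws s3api head-object --bucket my-bucket --key index.html \
  --query ETag --output text | tr -d '"')

if [ "$local_md5" != "$remote_etag" ]; then
  aws s3 cp /local-folder/index.html s3://my-bucket/index.html
fi
```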