
AWS S3 sync does not sync all the files

Open webdigi opened this issue 7 years ago • 80 comments

We have several hundred thousand files, and S3 sync reliably syncs most of them. However, we have noticed several files that were changed about a year ago which differ between source and destination but never sync or update.

The source and destination timestamps also differ, but the sync never happens. S3 has the more recent file.

The command is as follows: aws s3 sync s3://source /local-folder --delete

All the files that do not sync have the same date but are spread across multiple different folders.

Is there an S3 touch command to change the timestamp and possibly get the files to sync again?
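
Perhaps copying each object onto itself with replaced metadata would bump the S3 timestamp; a sketch (the path is a placeholder):

# Hypothetical S3 "touch": a self-copy with REPLACE metadata gives the
# object a new Last-Modified timestamp without changing its contents
aws s3 cp s3://source/path/to/file s3://source/path/to/file --metadata-directive REPLACE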

webdigi avatar Apr 18 '18 16:04 webdigi

You can possibly use --exact-timestamps to work around this, though that may result in excess uploads if you're uploading.
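
For example, applied to the command above:

aws s3 sync s3://source /local-folder --delete --exact-timestamps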

To help in reproducing, could you get me some information about one of the files that isn't syncing?

  • What is the exact file size locally?
  • What is the exact file size in S3?
  • What is the last modified time locally?
  • What is the last modified time in S3?
  • Is the local file a symlink / behind a symlink?

JordonPhillips avatar Apr 27 '18 17:04 JordonPhillips

Example command run aws s3 sync s3://bucket/ /var/www/folder/ --delete

Several files are missing. Details for one of them:

  • Exact local size: 2625
  • Exact size in S3: 2625
  • Exact timestamp local: 06-Jan-2017 9:32:31
  • Exact timestamp S3: 20-Jun-2017 10:14:57
  • Normal file in both S3 and local (not a symlink)

There are several cases like that in a list of around 50,000 files. However, all of the files missing from the sync have S3 timestamps at various times on 20 Jun 2017.

Using --exact-timestamps causes many more files to be downloaded even though their contents are identical. However, it still misses the ones in the example above.

webdigi avatar Apr 30 '18 15:04 webdigi

Same issue here: aws s3 sync dist/ s3://bucket --delete did not overwrite s3://bucket/index.html with dist/index.html.

dist/index.html and s3://bucket/index.html have the same file size, but their modification times differ.

Actually, sometimes awscli did upload the file, but sometimes it did not.

overcache avatar Jul 03 '18 07:07 overcache

Same here, --exact-timestamps doesn't help - index.html is not overwritten.

zyv avatar Jul 26 '18 12:07 zyv

We experienced this issue as well today/last week. Again, index.html is the same file size, but the contents and modified times are different.

samdammers avatar Aug 20 '18 06:08 samdammers

Is anybody aware of a workaround for this?

stephram avatar Aug 28 '18 03:08 stephram

I just ran into this. Same problem as reported by @icymind and @samdammers: the contents of my (local) index.html file had changed, but its file size was the same as the earlier copy in S3. The aws s3 sync command didn't upload it. My "workaround" was to delete index.html from S3, and then run the sync again (which then uploaded it as if it were a new file, I guess).
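
In commands, the workaround was roughly this (bucket and paths are placeholders):

aws s3 rm s3://my-bucket/index.html
aws s3 sync ./dist s3://my-bucket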

lylejohnson avatar Sep 07 '18 21:09 lylejohnson

Server: EC2 Linux
Version: aws-cli/1.16.108 Python/2.7.15 Linux/4.9.62-21.56.amzn1.x86_64 botocore/1.12.98

After running aws s3 sync over 270T of data, I lost a few GB of files. Sync didn't copy files with special characters at all.

Example of file /data/company/storage/projects/1013815/3.Company Estimates/B. Estimates

Had to use cp -R -n

smxdevst avatar Feb 20 '19 19:02 smxdevst

Same issue here: an XML file with the same size but a different timestamp is not synced correctly.

I was able to reproduce this issue

Download the attached bug.tar.gz and then:

tar -zxvf bug.tar.gz
aws s3 sync a/ s3://<some-bucket-name>/<some_dir>/ --delete
aws s3 sync b/ s3://<some-bucket-name>/<some_dir>/ --delete

You'll see that even though the repomd.xml files in directories a and b differ in contents and timestamps, attempting to sync b doesn't do anything.
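
If it helps, a --dryrun on the second sync confirms that no operations are planned:

aws s3 sync b/ s3://<some-bucket-name>/<some_dir>/ --delete --dryrun
# prints nothing: the sizes match, and the S3 copies' Last-Modified
# (set at upload time) is newer than the local mtimes preserved by tar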

Tested on:
aws-cli/1.16.88 Python/2.7.15 Darwin/16.7.0 botocore/1.12.78
aws-cli/1.16.109 Python/2.7.5 Linux/3.10.0-693.17.1.el7.x86_64 botocore/1.12.99

checkmypi avatar Feb 21 '19 12:02 checkmypi

I'm seeing the same issue: trying to sync a directory of files from S3 to a local directory, where one file was updated. That file does not get updated in the local directory.

chrispruitt avatar Mar 07 '19 18:03 chrispruitt

I'm seeing this too. In my case it's a react app with index.html that refers to generated .js files. I'm syncing them with the --delete option to delete old files which are no longer referred to. The index.html is sometimes not uploaded, resulting in an old index.html which points to .js files which no longer exist.

Hence my website stops working!!!

I'm currently clueless as to why this is happening.

Does anyone have any ideas or workarounds?

lqueryvg avatar Mar 15 '19 18:03 lqueryvg

We have the same problem, but just found a workaround. I know, it is not the best way, but it works:

aws s3 cp s3://SRC s3://DEST ...
aws s3 sync s3://SRC s3://DEST ... --delete

It seems to us that the copy works fine, so first we copy, and after that we use the sync command to delete files which are no longer present. Hope that the issue will be fixed asap.

marns93 avatar Mar 27 '19 07:03 marns93

I added --exact-timestamps to my pipeline and problem hasn't recurred. But, it was intermittent in the first place so I can't be sure it fixed it. If it happens again I'll go with @marns93 's suggestion.

lqueryvg avatar Mar 27 '19 09:03 lqueryvg

We've met this problem and --exact-timestamps resolves our issue. I'm not sure if it's exactly the same problem.

JasonQSY avatar Mar 31 '19 01:03 JasonQSY

I'm seeing this issue, and it's very obvious because each call only has to copy a handful (under a dozen) files.

The situation in which it happens is just like reported above: if the folder being synced into contains a file with different file contents but identical file size, sync will skip copying the new updated file from S3.

We ended up changing scripts to aws s3 cp --recursive to fix it, but this is a nasty bug -- for the longest time we thought we had some kind of race condition in our own application, not realizing that aws-cli was simply choosing not to copy the updated file(s).
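
i.e. we replaced the sync with something like this (bucket and paths are placeholders):

aws s3 cp s3://my-bucket/prefix/ /local/dir/ --recursive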

elliot-nelson avatar May 01 '19 17:05 elliot-nelson

I saw this as well with an html file

aws-cli/1.16.168 Python/3.6.0 Windows/2012ServerR2 botocore/1.12.158

benjamin-issa avatar Jun 27 '19 18:06 benjamin-issa

I copy pasted the s3 sync command from a GitHub gist and it had --size-only set on it. Removing that fixed the problem!
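
For reference, that flag makes sync compare file sizes only, so any edit that keeps the byte count identical is skipped:

aws s3 sync dist/ s3://bucket --size-only
# a changed index.html with the same size as before will NOT be uploaded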

nabilfreeman avatar Sep 16 '19 21:09 nabilfreeman

Just ran into this issue with build artifacts being uploaded to a bucket. Our HTML tended to only change hash codes for asset links and so size was always the same. S3 sync was skipping these if the build was too soon after a previous one. Example:

10:01 - Build 1 runs
10:05 - Build 2 runs
10:06 - Build 1 is uploaded to s3
10:10 - Build 2 is uploaded to s3

Build 2 has HTML files with a timestamp of 10:05, however the HTML files uploaded to s3 by build 1 have a timestamp of 10:06 as that's when the objects were created. This results in them being ignored by s3 sync as remote files are "newer" than local files.

I'm now using s3 cp --recursive followed by s3 sync --delete as suggested earlier.
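
i.e. something like this (paths are placeholders):

aws s3 cp dist/ s3://my-bucket/ --recursive
aws s3 sync dist/ s3://my-bucket/ --delete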

Hope this might be helpful to someone.

jam13 avatar Sep 23 '19 11:09 jam13

I had the same issue earlier this week; I was not using --size-only. Our index.html was different by a single character (. went to #), so the size was the same, but the timestamp on s3 was 40 minutes earlier than the timestamp of the new index.html. I deleted the index.html as a temporary workaround, but it's infeasible to double check every deployment.

jay-w-jensen avatar Oct 02 '19 14:10 jay-w-jensen

The same here: files with the same name but different timestamps and content are not synced from S3 to local, and --delete does not help.

sabretus avatar Oct 11 '19 17:10 sabretus

We experience the same issue. An index.html with same size but newer timestamp is not copied.

This issue was reported over a year ago. Why is it not fixed?

Actually it makes the sync command useless.

magraeber avatar Oct 16 '19 07:10 magraeber

--exact-timestamps fixed the issue

Rimce avatar Nov 12 '19 21:11 Rimce

I am also affected by this issue. I added --exact-timestamps and it seemed to fix the files I was looking at, though I have not done an exhaustive search. I have on the order of 100k files and 20 GB, a lot less than the others in here.

tompetrillo avatar Jan 28 '20 06:01 tompetrillo

I have faced the same issue: aws s3 sync skips some files, even with different contents and different dates. The log shows those skipped files as synced, but actually they are not. Yet when I run aws s3 sync again, those files get synced. Very weird!

jason-beijing avatar Jan 29 '20 03:01 jason-beijing

I had this issue when building a site with Hugo and I finally figured it out. I use submodules for my Hugo theme and was not pulling them down on CI. This was causing warnings in Hugo but not failures.

# On local
                   | EN
-------------------+-----
  Pages            | 16
  Paginator pages  |  0
  Non-page files   |  0
  Static files     |  7
  Processed images |  0
  Aliases          |  7
  Sitemaps         |  1
  Cleaned          |  0

# On CI
                   | EN  
-------------------+-----
  Pages            |  7  
  Paginator pages  |  0  
  Non-page files   |  0  
  Static files     |  2  
  Processed images |  0  
  Aliases          |  0  
  Sitemaps         |  1  
  Cleaned          |  0  

Once I updated the submodules everything worked as expected.

cbelsole avatar Feb 23 '20 13:02 cbelsole

We've also been affected by this issue, so much so that a platform went down for ~18 hours after a new vendor/autoload.php file didn't sync, and was out of date with vendor/composer/autoload_real.php so the whole app couldn't load.

This is a very strange problem, and I can't believe the issue has been open for this long.

Why would a sync not use hashes instead of last modified? Makes 0 sense.

For future Googlers, a redacted error I was getting:

---
PHP message: PHP Fatal error:  Uncaught Error: Class 'ComposerAutoloaderInitXXXXXXXXXXXXX' not found in /xxx/xxx/vendor/autoload.php:7
Stack trace:
#0 /xxx/xxx/bootstrap/app.php(3): require_once()
#1 /xxx/xxx/public/index.php(14): require('/xxx/xxx...')
#2 {main}
  thrown in /xxx/xxx/vendor/autoload.php on line 7" while reading response header from upstream: ...
---

darrynten avatar Mar 11 '20 08:03 darrynten

The same problem, not all files are synced, --exact-timestamps didn't help.

aws --version
aws-cli/1.18.22 Python/2.7.13 Linux/4.14.152-127.182.amzn2.x86_64 botocore/1.15.22

applerom avatar Mar 18 '20 13:03 applerom

I cannot believe this ticket has been open for so long... Same problem here. Where is Amazon's customer obsession?

bobye avatar Mar 21 '20 20:03 bobye

I can't believe this ticket was not closed some time ago. As far as I can tell, it works as designed, but users (including me) make assumptions about how it should work and are then surprised when it doesn't behave how they expected.

  • When a file is synced or copied to s3, the timestamp it receives on the bucket is the date it was copied, which is always newer than the date of the source file. This is just how s3 works.
  • Files are only synced if the size changes, or the timestamp on the target is older than the source.
  • This means that if source files are updated but the size of the files remains unchanged and the dates on those changed files pre-date when they were last copied, s3 sync will not sync them again.
  • Using --exact-timestamps only works when copying from s3 to local. It is deliberately not enabled for local to s3 because the timestamps are never equal. So setting it when syncing from local to s3 has no effect.
  • I don't think s3 calculates hashes for uploaded files, so there's no way of avoiding file size and last uploaded date as checks.

Bottom line is that it works as intended, but there are various use cases where this is not desirable. As mentioned above, I've worked around it using s3 cp --recursive.
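
Paraphrasing the default comparison as a tiny shell predicate (illustrative only, not the CLI's actual code):

# copy when sizes differ, or when the source is newer than the target
should_sync() {
  local src_size=$1 dst_size=$2 src_mtime=$3 dst_mtime=$4  # mtimes as epoch seconds
  [ "$src_size" -ne "$dst_size" ] || [ "$src_mtime" -gt "$dst_mtime" ]
}

should_sync 2625 2625 100 200 && echo "would copy" || echo "skipped"
# -> "skipped": same size and the target (200) is newer, exactly the case in this thread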

jam13 avatar Mar 23 '20 10:03 jam13

@jam13 thanks for the explanation, now it all makes sense in hindsight!

Nevertheless, I'd argue that this is currently poorly documented (I would have expected a fat red warning in the documentation stating that --exact-timestamps only works from s3 to local, and for the s3 CLI to bail out instead of silently ignoring the parameter), and that an optional hash-based comparison mode is necessary for a reliably working synchronisation.
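
Until that exists, a per-file check can approximate it for single-part uploads, where the S3 ETag is the object's plain MD5 (bucket and key below are placeholders; multipart ETags are not plain MD5s):

# compare the local file's MD5 with the S3 ETag; re-upload on mismatch
etag=$(aws s3api head-object --bucket my-bucket --key index.html --query ETag --output text | tr -d '"')
md5=$(md5sum index.html | awk '{print $1}')
[ "$etag" = "$md5" ] || aws s3 cp index.html s3://my-bucket/index.html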

zyv avatar Mar 23 '20 11:03 zyv