
Broken symlink

scottcupit opened this issue on Nov 02, 2018 · 2 comments

I am trying to use s4cmd to back up a very large (2.5 TB) NFS-mounted drive (AWS EFS) to AWS S3. The server performing the backup has read-only access to this data and is an isolated machine dedicated to the backup. The NFS data contains symlinks that are broken on the backup server but exist on the production servers. When s4cmd comes across a broken symlink, it can't follow it, and the program dies.

I get this exception:

    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/usr/local/bin/s4cmd.py", line 520, in run
        self.__class__.__dict__[func_name](self, *args, **kargs)
      File "/usr/local/bin/s4cmd.py", line 129, in wrapper
        ret = func(*args, **kargs)
      File "/usr/local/bin/s4cmd.py", line 1317, in upload
        fsize = os.path.getsize(source)
      File "/usr/lib/python3.6/genericpath.py", line 50, in getsize
        return os.stat(filename).st_size
    FileNotFoundError: [Errno 2] No such file or directory: '/mnt/efs-prod/REDACTED'

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/usr/local/bin/s4cmd.py", line 529, in run
        fail('[OSError] %d: %s' % (e.errno, e.strerror))
      File "/usr/local/bin/s4cmd.py", line 189, in fail
        raise RuntimeError(status)
    RuntimeError: 1

The program processes about ten more files, then prints this exception and quits:

    [Thread Failure] [Errno 2] No such file or directory: '/mnt/efs-prod/REDACTED'
    [Runtime Exception] 1
    Traceback (most recent call last):
      File "/usr/local/bin/s4cmd.py", line 1928, in main
        CommandHandler(opt).run(args)
      File "/usr/local/bin/s4cmd.py", line 1557, in run
        CommandHandler.__dict__[cmd + '_handler'](self, args)
      File "/usr/local/bin/s4cmd.py", line 129, in wrapper
        ret = func(*args, **kargs)
      File "/usr/local/bin/s4cmd.py", line 1690, in dsync_handler
        self.s3handler().dsync_files(source, target)
      File "/usr/local/bin/s4cmd.py", line 129, in wrapper
        ret = func(*args, **kargs)
      File "/usr/local/bin/s4cmd.py", line 1004, in dsync_files
        pool.join()
      File "/usr/local/bin/s4cmd.py", line 594, in join
        self.tasks.join()
      File "/usr/local/bin/s4cmd.py", line 469, in join
        fail('[Thread Failure] ', exc_info=self.exc_info)
      File "/usr/local/bin/s4cmd.py", line 189, in fail
        raise RuntimeError(status)
    RuntimeError: 1

At this point, s4cmd has only processed about half of the tasks it reported at the start, and I can confirm that not all of the data made it to S3.
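The root cause is easy to reproduce outside of s4cmd: os.path.getsize() stats the symlink's target, so it raises FileNotFoundError on a dangling link even though the link itself is still present. A minimal sketch (the paths here are made up for illustration):

    import os
    import tempfile

    # Create a symlink whose target does not exist -- a "broken" (dangling) link.
    d = tempfile.mkdtemp()
    link = os.path.join(d, 'dangling')
    os.symlink(os.path.join(d, 'no-such-target'), link)

    print(os.path.lexists(link))  # True:  the link itself exists
    print(os.path.exists(link))   # False: its target does not
    os.path.getsize(link)         # raises FileNotFoundError, matching the traceback above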

This is the command I am running:

    s4cmd dsync --recursive --force --sync-check --verbose /mnt/efs-prod/REDACTED s3://REDACTED

I am running Ubuntu 18.04 and s4cmd 2.1.0.

To install s4cmd, I ran the following commands:

    apt install python3-pip
    pip3 install s4cmd

Is there a workaround?

Is there an option to not follow symlinks?

Thanks in advance for your help!

scottcupit avatar Nov 02 '18 23:11 scottcupit

In case anyone comes across the same issue, here is the workaround I implemented.

Use this at your own risk. This works in my scenario, but I have not tested this solution for all possibilities.

In s4cmd.py, version 2.1.0, the following code begins at line 1315:

    # Initialization: Set up multithreaded uploads.
    if not mpi:
      fsize = os.path.getsize(source)
      md5cache = LocalMD5Cache(source)

      # optional checks
      if self.opt.dry_run:
        message('%s => %s', source, target)
        return
      elif self.opt.sync_check and self.sync_check(md5cache, obj):
        message('%s => %s (synced)', source, target)
        return
      elif not self.opt.force and obj:
        raise Failure('File already exists: %s' % target)

      if fsize < self.opt.max_singlepart_upload_size:
        data = self.read_file_chunk(source, 0, fsize)
        self.s3.put_object(Bucket=s3url.bucket,
                           Key=s3url.path,
                           Body=data,
                           Metadata={'md5': md5cache.get_md5(),
                                     'privilege': self.get_file_privilege(source)})
        message('%s => %s', source, target)
        return

      # Here we need to have our own md5 value because multipart upload calculates
      # different md5 values.
      response = self.s3.create_multipart_upload(Bucket=s3url.bucket,
                                                 Key=s3url.path,
                                                 Metadata={'md5': md5cache.get_md5(),
                                                           'privilege': self.get_file_privilege(source)})
      upload_id = response['UploadId']

      for args in self.get_file_splits(upload_id, source, target, fsize, self.opt.multipart_split_size):
        self.pool.upload(*args)
      return

I changed it to this code:

    # Initialization: Set up multithreaded uploads.
    if not mpi:
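      # os.path.exists() follows symlinks, so it returns False for a dangling
      # link; warn and skip instead of letting os.path.getsize() raise.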
      if not os.path.exists(source):
        message('WARNING %s => %s (broken link)', source, target)
        return
      else:
        fsize = os.path.getsize(source)
        md5cache = LocalMD5Cache(source)

        # optional checks
        if self.opt.dry_run:
          message('%s => %s', source, target)
          return
        elif self.opt.sync_check and self.sync_check(md5cache, obj):
          message('%s => %s (synced)', source, target)
          return
        elif not self.opt.force and obj:
          raise Failure('File already exists: %s' % target)

        if fsize < self.opt.max_singlepart_upload_size:
          data = self.read_file_chunk(source, 0, fsize)
          self.s3.put_object(Bucket=s3url.bucket,
                             Key=s3url.path,
                             Body=data,
                             Metadata={'md5': md5cache.get_md5(),
                                       'privilege': self.get_file_privilege(source)})
          message('%s => %s', source, target)
          return

        # Here we need to have our own md5 value because multipart upload calculates
        # different md5 values.
        response = self.s3.create_multipart_upload(Bucket=s3url.bucket,
                                                   Key=s3url.path,
                                                   Metadata={'md5': md5cache.get_md5(),
                                                             'privilege': self.get_file_privilege(source)})
        upload_id = response['UploadId']

        for args in self.get_file_splits(upload_id, source, target, fsize, self.opt.multipart_split_size):
          self.pool.upload(*args)
        return
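As an alternative that avoids patching s4cmd, you can pre-scan the tree and review (or exclude) dangling links before the backup runs. A rough sketch; find_broken_symlinks is just an illustrative helper name, and the mount path is from my setup:

    import os

    def find_broken_symlinks(root):
      # os.walk() does not follow symlinks by default; a dangling link cannot be
      # classified as a directory, so it shows up in filenames.
      broken = []
      for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
          path = os.path.join(dirpath, name)
          if os.path.islink(path) and not os.path.exists(path):
            broken.append(path)
      return broken

    for path in find_broken_symlinks('/mnt/efs-prod'):
      print('broken:', path)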

scottcupit avatar Nov 05 '18 20:11 scottcupit

Interesting bug. I'm not particularly sure there is a non-hacky way of handling this, so I think a flag-based approach would be good. Also, since it's a symlink-specific fix, it might make sense to pair this change with os.path.islink('file.lnk') to ensure it only affects symlinks (or maybe not).
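A minimal sketch of that combined check (untested, same style as the snippet above):

    # Skip only dangling symlinks: the link itself exists, but its target is missing.
    if os.path.islink(source) and not os.path.exists(source):
      message('WARNING %s => %s (broken symlink)', source, target)
      return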

Thanks for reporting this bug @scottcupit. 👍

FYI @rozuur

navinpai avatar Nov 06 '18 22:11 navinpai