
restic fails on repo mounted via CIFS/samba on Linux using go 1.14 build

Open · MichaelEischer opened this issue · 41 comments

This issue is a summary of https://forum.restic.net/t/prune-fails-on-cifs-repo-using-go-1-14-build/2579 , intended as a reference for the underlying problem.

tl;dr Workaround: Setting the environment variable GODEBUG to asyncpreemptoff=1 restores the pre-Go 1.14 behavior and fixes the problem.
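For example, using the prune invocation from this report (the repository path is a placeholder):

```shell
# One-off: disable asynchronous preemption for this restic run only
GODEBUG=asyncpreemptoff=1 restic prune -r /path/to/repository/on/CIFS/share

# Or export it for the whole shell session
export GODEBUG=asyncpreemptoff=1
```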

Output of restic version

restic 0.9.6 (v0.9.6-137-gc542a509) compiled with go1.14 on linux/amd64
Linux kernel version: 5.5.9

How did you run restic exactly?

restic prune -r /path/to/repository/on/CIFS/share

Relevant log excerpt:

Load(<data/216ea5f2d2>, 591, 4564169) returned error, retrying after 720.254544ms: open /mnt/nas/redacted/reponame/data/21/216ea5f2d21b458a7b913609ddef2a6ac4788b4bad5481b2916558d2ce1bef04: interrupted system call

Prune failed in the end

Further relevant log excerpts:

Load(<data/2e9db0642e>, 591, 4758136) returned error, retrying after 552.330144ms: read /mnt/nas/redacted/reponame/data/2e/2e9db0642e0fb67b959aa1d91c0d70daa8331ad246c5eeb8582ba2a14f24680f: interrupted system call
List(data) returned error, retrying after 282.818509ms: lstat /mnt/nas/redacted/reponame/data/64: interrupted system call
List(data) returned error, retrying after 492.389441ms: readdirent: interrupted system call
Save(<data/f0f5102554>) returned error, retrying after 552.330144ms: chmod /mnt/nas/redacted/reponame/data/f0/f0f51025542c0287943ef3816e642586be46ae10dc9efbcfa7b305d9e093dbd4: interrupted system call

What backend/server/service did you use to store the repository?

Local backend stored on a CIFS share

Expected behavior

No warnings, prune should complete.

Actual behavior

Prune failed.

Steps to reproduce the behavior

Build restic using Go 1.14 and store the backup repository on a CIFS share.

Do you have any idea what may have caused this?

This issue is a side effect of asynchronous preemption in Go 1.14. The release notes state the following:

This means that programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors. Those programs will have to handle those errors in some way, most likely looping to try the system call again.

Go configures signal handlers to restart syscalls where possible, and the standard library also retries syscalls when necessary. That is, there should only be issues when directly calling low-level syscalls, and in that case the caller should handle EINTR itself. However, restic only uses Go standard library functions, which should already handle EINTR where necessary.

The first prune error message points to an os.Open call (via fs.Open) in the Load function of the local backend. So it looks like a Go standard library call fails. However, the manpage for signal (man 7 signal) states that the open syscall, which is called underneath, is always restarted when SA_RESTART is used, as Go does. So this seems to be a bug in the Linux kernel. Adding a loop around the call to fs.Open to repeat it as long as EINTR is returned fixes that one call, but fixing all problematic calls would mean adding lots of ugly loops and playing whack-a-mole.

The manpages of lstat, readdir and chmod don't even list EINTR as a possible errno.

Do you have an idea how to solve the issue?

Setting the environment variable GODEBUG to asyncpreemptoff=1 restores the pre Go 1.14 behavior and fixes the problem.

Go relies on the assumption that the kernel properly restarts syscalls when told to do so. As the latter is obviously not the case here, the proper fix would be to submit a bug report to the Linux kernel.

A short-term solution would be to add a note to the restic documentation that mentions the compatibility problem with CIFS mounts.

MichaelEischer avatar Mar 21 '20 13:03 MichaelEischer

Hi, I can confirm this: I was unable to make a 400 GB backup on a Samba share, and it's reproducible (I tried three times). When I run a check it finds errors; if I run a rebuild-index after the check, it generates pages of errors and the repository ends up corrupted. I thought it was my Samba configuration or the destination disk, so I copied large amounts of data without any problems. Now I use rest-server and it works, so there is a problem around Samba. Note: I tried with and without export GODEBUG=asyncpreemptoff=1; without it, it crashes after a second. I also don't see any errors in journalctl on either the server or the client.

In that configuration the data and restic are on a local disk and the restic destination repository is on the Samba share. Something like "Samba write".

restic 0.10.0 compiled with go1.15.2 on linux/amd64
go version go1.15.2 linux/amd64


I did another backup from a music server (Bluesound) where the data is remote and restic and the repo are local. Something like "Samba read". Without GODEBUG=asyncpreemptoff=1 I got a few warnings, but the repo check was OK. With GODEBUG=asyncpreemptoff=1 I had no problems.

restic 0.10.0 compiled with go1.15.2 on linux/arm64
go version go1.15.2 linux/arm64

rbolog avatar Oct 04 '20 21:10 rbolog

Reported by @Mikescher in #2968

Summary

After upgrading to restic 0.10.0 I get read errors when backing up a mounted samba share. After downgrading to 0.9.6 the errors are once again gone.

Output of restic version

restic 0.10.0 compiled with go1.15.2 on linux/amd64   (before)
restic 0.9.6 compiled with go1.13.4 on linux/amd64    (after downgrade)

How did you run restic exactly?

I have a local restic repository and a mounted samba/cifs share (from my NAS) that I want to backup. I simply run restic -r {local_repo} backup /mnt/nas_hektor --cleanup-cache --exclude-file {exclusions}.txt

Output:

repository ac08473e opened successfully, password is correct
error: NodeFromFileInfo: Listxattr: xattr.list /mnt/nas_hektor/{...} : interrupted system call
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: lstat /mnt/nas_hektor/{...}: interrupted system call
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: NodeFromFileInfo: Listxattr: xattr.list /mnt/nas_hektor/{...} : interrupted system call
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: NodeFromFileInfo: Listxattr: xattr.list /mnt/nas_hektor/{...} : interrupted system call
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: NodeFromFileInfo: Listxattr: xattr.list /mnt/nas_hektor/{...} : interrupted system call
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: Readdirnames /mnt/nas_hektor/{...} failed: readdirent: no such file or directory
error: NodeFromFileInfo: Listxattr: xattr.list /mnt/nas_hektor/{...} : interrupted system call
error: NodeFromFileInfo: Listxattr: xattr.list /mnt/nas_hektor/{...} : interrupted system call

Files:         834 new,     0 changed, 17209 unmodified
Dirs:           46 new,  1191 changed,     0 unmodified
Added to the repo: 251.044 GiB

processed 18043 files, 8.624 TiB in 3:40:12
snapshot db1a5445 saved
Warning: failed to read all source data during backup

The /var/log/syslog is sprinkled many of the following (seemingly relevant) errors:

Oct  2 01:42:04 omv-{name} kernel: [21401.666794] CIFS VFS: Send error in read = -4
Oct  2 01:05:45 omv-{name} kernel: [19223.167367] CIFS VFS: \\{nas}.fritz.box\Hektor Close interrupted close

What backend/server/service did you use to store the repository?

The repository is simply stored on the local filesystem, only the actual to-be-backed-up data is mounted via cifs

Expected behavior

Well, no errors..

Actual behavior

restic displays errors while running, ends with an error message, and actually skips these files in the snapshot. Calling restic again backs the files up without problems (I can also stat/list the files/directories in my terminal without problems). It seems to fail at random; after 3 invocations of restic backup I had all files backed up (after that I downgraded for the other backups I wanted to do).

Steps to reproduce the behavior

Not really sure if it's reproducible for everyone who backs up (a lot of) data from a cifs mountpoint. I tried to eliminate other problem sources (see below) but I don't know how much it depends on my setup/network/whatever

Do you have any idea what may have caused this?

No idea. I can say that I tried a lot of other things before downgrading restic, because it really sounds like a Samba problem: I did a full software update on my machine, restarted my Synology NAS, restarted my machine multiple times, tried tweaking Samba mount options, etc. But because a simple restic downgrade eliminated all problems, it seems restic at least started triggering some cifs bug in 0.10.0.

Btw, both machines are in the same local network, connected by a Gbit switch, and stand only a meter from each other, so the network should be very stable.

Do you have an idea how to solve the issue?

Nope, sorry.

Did restic help you today? Did it make you happy in any way?

Well not happy but I'm still convinced that you simply managed to trigger some stupid bug in mount.cifs and because I know how annoying such stuff is I feel kinda sorry to report it 😨

MichaelEischer avatar Oct 07 '20 09:10 MichaelEischer

Analysis by @greatroar in #2968:

Oct  2 01:42:04 omv-{name} kernel: [21401.666794] CIFS VFS: Send error in read = -4

That's an EINTR.

This might be the new async preemption in the Go 1.14 runtime. The best way to tell would be to compile restic 0.10.0 with Go 1.13.4 and try again. But before you go installing compilers, maybe try setting ~GOMAXPROCS=1~ GODEBUG=asyncpreemptoff=1:

$ export GODEBUG=asyncpreemptoff=1
$ restic backup /mnt/nas_hektor --cleanup-cache --exclude-file exclusions.txt

Related: golang/go#39237, golang/go#40870 (especially this comment).

I think we ~should~ could put an EINTR loop around xattr, or maybe get the maintainers of pkg/xattr to do that. As for the Readdirnames failures, the Go people have decided that's a kernel bug that they're not going to work around, unless someone shows that it isn't.

MichaelEischer avatar Oct 07 '20 09:10 MichaelEischer

The consensus among maintainers (expressed at #3061) is to wait until Go 1.16, which will include a thorough fix in the stdlib. If that's right, may I suggest adding a summary of/link to the workaround at the top and pinning this issue? It comes up regularly and will continue to do so until ca. February.

greatroar avatar Nov 06 '20 10:11 greatroar

I'll pin this issue for now

fd0 avatar Nov 06 '20 10:11 fd0

Sorry if this isn't related to the same underlying Go issues, but I'm seeing an error when I try to back up under Windows 10 20H2 with restic 0.11:

repository 94fe90ee opened successfully, password is correct
creating VSS snapshot for [c:\]
error: failed to create snapshot for [c:\]: VSS error: async operation pending

Restic continues working and backs up all files that it can access, but obviously those in use are inaccessible because it doesn't believe it has a working shadow copy. However, the shadow copy was created by the OS and is available while restic runs.

JPPhoto avatar Nov 27 '20 17:11 JPPhoto

@JPPhoto Please open a separate issue and fill out the issue form. This issue is exclusively about a Linux specific issue with Samba.

MichaelEischer avatar Nov 30 '20 22:11 MichaelEischer

using GODEBUG and related nouser_xattr from https://github.com/restic/restic/issues/1800 helped me fix the problems in my script (https://forum.restic.net/t/error-lstat-errno-524-during-backup/3272)

uok avatar Dec 06 '20 12:12 uok

The GODEBUG workaround helped me fix my backup. I use restic to back up my NAS to object storage, and this problem kept me from making backups for months. export GODEBUG=asyncpreemptoff=1

CyberKiller40 avatar Jan 25 '21 16:01 CyberKiller40

Go 1.16 has been released in the meantime. Could anyone try whether this finally solves the compatibility problems with CIFS on Linux?

MichaelEischer avatar Feb 27 '21 23:02 MichaelEischer

Unfortunately, the problem is not solved. In fact it seems (if anything) to have been replaced by something worse.

Using

restic 0.12.0 (v0.12.0-11-g814a399e4) compiled with go1.16 on linux/amd64

and with a clean copy of a repo similar to the one in my original forum post, I first ran restic forget --prune --max-unused=0. This appeared to complete normally and there were no errors logged. Paranoid person that I am, however, I then ran restic check on the pruned repo. This produced a series of errors ending with

 error: error loading index 55d2df1b: load <index/55d2df1b4a>: invalid data returned
 error: error loading index 594fd4a0: load <index/594fd4a025>: invalid data returned
 Fatal: LoadIndex returned errors 

So I tried restic rebuild-index (which in the old days of the Go 1.14 problems had effectively repaired the repo). This produced multiple errors of the form

reindexing pack file ced69db1d699800b56f73678b8974c8487f114ae02f220c08561ecc9cf8cb38f with unexpected size 0 instead of 4250647
reindexing pack file 73c818e87931931b1d4721a99e57148c54560ff44d420956e1da9f04be73bf3d with unexpected size 4250647 instead of 420237

A second attempt to rebuild the index resulted in multiple lines of the form

adding pack file to index ced69db1d699800b56f73678b8974c8487f114ae02f220c08561ecc9cf8cb38f
adding pack file to index 3c350cd1f0d6b938793bdd57a4180264f8b00e642783c10a633e630e05dd240a
adding pack file to index d34953733e994019a21c73ad699cc7950af2c91f9d5d4a05ef726f63b658aa44

There was no explicit indication that either rebuild-index run had failed, but a restic check produced several hundred error messages ending with

tree d13ef09a: file "b98af7035914f872_0" blob e7b7edb0ef1394d41bc61c9e485e77192ff326500556c36ae0613ff30c9976bf not found in index
 tree d13ef09a: file "e1e489cb2437551c_0" blob 16630545bb3558e2229696fa5a98c17f8ffe9599e317dde28674481bc3739a2c not found in index
[0:07] 100.00%  37 / 37 snapshots
Fatal: repository contains errors

All of the above was with GODEBUG unset. However, at no time did I see any of the "interrupted system call" messages previously characteristic of the issue.

At this point I gave up. I thought at first that (since I had to reestablish my CIFS configuration to do this test) I might have fouled up access permissions or something similar. So (again starting with a clean copy of the repo, obviously) I repeated the above sequence of events using

restic 0.9.6 compiled with go1.14.4 on linux/amd64

and GODEBUG=asyncpreemptoff=1 as before. Worked perfectly (no errors).

Finally, just to take my Go 1.16 restic build out of the equation, I reran the original sequence of commands on a clean copy of the repo with the filesystem mounted over NFS (instead of CIFS). Again, everything worked perfectly.

Possibly the original problem is indeed fixed but the fix introduced another issue?

underhillian avatar Mar 07 '21 17:03 underhillian

Unfortunately, the problem is not solved. In fact it seems (if anything) to have been replaced by something worse.

That seems indeed to be the case. I've encountered a similar (?) issue with Go 1.16.1 on Ubuntu 20.04 running via WSL (Linux kernel 4.19.128) with a local Samba share (a plain Linux system/VM should work just as well). For testing I created a new repository and backed up a few gigabytes of data. For me the issues so far were slow operations (at least I had that impression) and the silent corruption of a pack file! The pack file ended up with a few 64 KiB ranges which just contain null bytes instead of the actual data.

So this looks like either a Go or a kernel problem. So far I haven't had time to assemble a more compact test case for reproducing it.

MichaelEischer avatar Mar 16 '21 21:03 MichaelEischer

So far I haven't had time to assemble a more compact test case for reproducing it.

I have a small test repo (a few text files, a few snapshots, <1.5Mb) that I created for a different purpose. I would be glad to do some testing if that would be helpful. Alternatively, possibly I could provide a copy of the repo to you...it doesn't contain any sensitive information.

underhillian avatar Mar 24 '21 12:03 underhillian

I have a small test repo (a few text files, a few snapshots, <1.5Mb) that I created for a different purpose. I would be glad to do some testing if that would be helpful.

Thanks for the offer, however, I had a different type of test case in mind: I wanted to simplify the way to create a damaged repository / files. So that ideally should just involve running a small go program and not restic as a whole. Such a small test case would then allow digging into where exactly things go wrong.

MichaelEischer avatar Mar 27 '21 11:03 MichaelEischer

Having just run this over an SMB share, it worked fine for me (though a small 6 GB backup). Is this still an issue, and if so, is there anything I can do to help with testing?

liambresnahan avatar Apr 21 '21 21:04 liambresnahan

For me everything works fine on a default Debian 10 installation with: restic 0.12.0 compiled with go1.16.3 on linux/amd64

I backed up my data to a fresh repo on an SMB share and ran the following commands without any issues:

restic -r /mnt/backup/repo forget --keep-last 1 --prune
restic -r /mnt/backup/repo check

I even did a test restore without any issues.

mwarth avatar May 03 '21 10:05 mwarth

I'm running restic on a Mac, but the backup is mounted via an SMB share from a NAS running Linux. Does this issue apply in my situation?

csss1234 avatar Jun 22 '21 21:06 csss1234

@csss1234 No. This issue only applies when the host running restic uses Linux.

MichaelEischer avatar Jun 24 '21 19:06 MichaelEischer

@underhillian Which Kernel version are you using?

Did anyone else notice problems with restic + CIFS + go 1.16?

We could add a check to prevent using a repository stored on a CIFS mount on Linux. That would basically enforce the remark that currently exists in the documentation. The big downside of this workaround is that we'd have no clue at all whether that option will still be necessary in the future or not.

diff --git a/internal/backend/local/local.go b/internal/backend/local/local.go
index 0410e51b..d58d4880 100644
--- a/internal/backend/local/local.go
+++ b/internal/backend/local/local.go
@@ -5,6 +5,8 @@ import (
        "io"
        "os"
        "path/filepath"
+       "runtime"
+       "strings"
        "syscall"

        "github.com/restic/restic/internal/errors"
@@ -31,6 +33,24 @@ const defaultLayout = "default"
 // Open opens the local backend as specified by config.
 func Open(ctx context.Context, cfg Config) (*Local, error) {
        debug.Log("open local backend at %v (layout %q)", cfg.Path, cfg.Layout)
+
+       if runtime.GOOS == "linux" {
+               dbg, _ := os.LookupEnv("GODEBUG")
+               hasAsyncPreempt := !strings.Contains(dbg, "asyncpreemptoff=1")
+               if hasAsyncPreempt {
+                       var stat syscall.Statfs_t
+                       err := syscall.Statfs(cfg.Path, &stat)
+                       if err != nil {
+                               return nil, err
+                       }
+                       const CIFS_MAGIC_NUMBER = 0xff534d42
+                       if stat.Type == CIFS_MAGIC_NUMBER {
+                               return nil, errors.Fatal("Storing a repository on CIFS requires disabling " +
+                                       "asynchronous preemption by setting the environment variable GODEBUG to 'asyncpreemptoff=1'.")
+                       }
+               }
+       }
+
        l, err := backend.ParseLayout(ctx, &backend.LocalFilesystem{}, cfg.Layout, defaultLayout, cfg.Path)
        if err != nil {
                return nil, err

MichaelEischer avatar Jul 16 '21 22:07 MichaelEischer

Which Kernel version are you using?

I use Arch Linux (rolling updates) and update regularly so my kernel is almost never more than a couple of weeks behind linux-stable.

As best as I can tell from my trusty restic backups 😉, at the time of my last test as reported above (Mar 07), I was running 5.11.2 (although it might have been a minor release or two later than that... can't be more precise at this point).

underhillian avatar Jul 22 '21 14:07 underhillian

I just upgraded a server at home from Ubuntu LTS 18.04 to 20.04 and now I can't run backups anymore of files on my NAS accessed via a cifs-mounted path.

I also upgraded restic before trying so I don't know which upgrade caused the problems.

I don't have the repository stored on the cifs-mounted volume; I am only trying to create a backup of files that are mounted on a cifs file system.

This is all over the restic backup output now:

error: NodeFromFileInfo: Listxattr: xattr.list ...redacted... : input/output error

If I run stat on one of the failed files I don't get any errors.

$ stat "...redacted..."
  File: ...redacted...
  Size: 5493072         Blocks: 10736      IO Block: 1048576 regular file
Device: 35h/53d Inode: 399240      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: ( 1000/ thomasf)   Gid: (    0/    root)
Access: 2021-06-04 15:31:45.395788000 +0000
Modify: 2008-01-27 19:17:46.000000000 +0000
Change: 2008-01-27 19:17:46.000000000 +0000
 Birth: -

My kernel is

Linux gems 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

thomasf avatar Aug 07 '21 12:08 thomasf

I also upgraded restic before trying so I don't know which upgrade caused the problems.

Please specify the restic version. Lots of Linux distros ship outdated restic versions; your best course of action is to simply use the official restic binaries published here on GitHub.

rawtaz avatar Aug 07 '21 12:08 rawtaz

0.12.1 via the self-update subcommand

I also upgraded restic before trying so I don't know which upgrade caused the problems.

Please specify the restic version. Lots of Linux distros ship outdated restic versions; your best course of action is to simply use the official restic binaries published here on GitHub.

thomasf avatar Aug 07 '21 12:08 thomasf

Hmm, when I write a test specifically targeting the same file that fails with an i/o error inside restic, it does not cause an i/o error. I might have missed something.

restic debug log:

2021/08/08 08:56:38 archiver/archiver.go:353	archiver.(*Archiver).Save	56	/media/dubcube/homes/thomasf/Pictures/_VERY_RANDOM/1999/08/17/jw41.jpg target "/media/dubcube/homes/thomasf/Pictures/_VERY_RANDOM/1999/08/17/jw41.jpg", previous <nil>
2021/08/08 08:56:38 archiver/archiver.go:384	archiver.(*Archiver).Save	56	  /media/dubcube/homes/thomasf/Pictures/_VERY_RANDOM/1999/08/17/jw41.jpg regular file
2021/08/08 08:56:38 restic/node.go:622	restic.(*Node).fillExtendedAttributes	61	fillExtendedAttributes(/media/dubcube/homes/thomasf/Pictures/_VERY_RANDOM/1999/08/17/jw40.jpg) [] Listxattr: xattr.list /media/dubcube/homes/thomasf/Pictures/_VERY_RANDOM/1999/08/17/jw40.jpg : input/output error
2021/08/08 08:56:38 archiver/file_saver.go:145	archiver.(*FileSaver).saveFile	60	/media/dubcube/homes/thomasf/Pictures/_VERY_RANDOM/1999/08/17/jw41.jpg
2021/08/08 08:56:38 archiver/archiver.go:492	archiver.(*Archiver).Save	56	return after 0.003

simple test:

package restic

import (
	"os"
	"syscall"
	"testing"

	"github.com/davecgh/go-spew/spew"
	"github.com/restic/restic/internal/restic"
)

const (
	fn = "/media/dubcube/homes/thomasf/Pictures/_VERY_RANDOM/1999/08/17/jw41.jpg"
)

func TestNodeFromFileInfo(t *testing.T) {

	fi, err := os.Stat(fn)
	if err != nil {
		t.Fatal(err)
	}

	stat, ok := toStatT(fi.Sys())
	t.Log(ok, stat)
	if !ok {
		t.Fail()
	}

	node, err := restic.NodeFromFileInfo(fn, fi)
	if err != nil {
		t.Fatal(err)
	}
	t.Log(spew.Sdump(node))
}

type statT syscall.Stat_t

func toStatT(i interface{}) (*statT, bool) {
	s, ok := i.(*syscall.Stat_t)
	if ok && s != nil {
		return (*statT)(s), true
	}
	return nil, false
}

test output:

$ ssh gems ./foo.test -test.v
=== RUN   TestNodeFromFileInfo
    random_test.go:24: true &{49 18317912 1 33261 1000 0 0 0 34898 1048576 72 {1622819926 309775400} {1283468376 0} {1283468376 0} [0 0 0]}
    random_test.go:33: (*restic.Node)(0xc0000d4640)(-rwxr-xr-x  1000     0  34898 2010-09-02 22:59:36 +0000 UTC jw41.jpg)

--- PASS: TestNodeFromFileInfo (0.02s)
PASS

thomasf avatar Aug 08 '21 09:08 thomasf

Is always the same file affected or does the error affect different files each time? The testcase probably only works in the former case.

Another major difference from the situation in restic is that the overall load is completely different: restic usually keeps the system quite busy, whereas the testcase is a single command that doesn't put any noteworthy stress on the system. As we know the CIFS problems are triggered by Go's async preemption feature, your testcase would have to run the problematic syscall at the moment the preemption signal arrives. For that, the testcase would at least have to repeat the operation over and over again (probably on different files).

Please test whether setting GODEBUG=asyncpreemptoff=1 solves the problem for you.

MichaelEischer avatar Aug 08 '21 16:08 MichaelEischer

GODEBUG=asyncpreemptoff=1 does not seem to help.

It looks like it's different files every run.

It is a pretty slow home server (Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz) so it is probably more vulnerable to timing than the average modern machine.

I also tried setting all the archiver concurrencies in the code to 1 and adding a few time.Sleep(200ms) calls here and there around the areas where the errors appear. With the sleeps the error rate maybe went down a bit, but it was far from disappearing.

I will try to set up a smaller test repo where I don't have tens of thousands of files for every run and see if I can narrow anything down.

thomasf avatar Aug 08 '21 18:08 thomasf

I have tried a bunch of different cifs mount settings now without any luck.

A flag that enabled you to just skip extended attributes would also work (at least for me). It is only the listxattr syscall that fails, and I don't think I use extended attributes on my cifs mounts.

I just did this to see what would happen, and now I can at least run my backups without many thousands of errors:

func (node *Node) fillExtendedAttributes(path string) error {
	return nil
}
So this is probably not the same fault that this issue is really about, but it is somewhat related. As far as I can understand, the syscalls do not time out at all; they actually fail with an I/O error, which is what strace also says.

Update:

After changing the code to not do the listxattr syscall, I could run my ~60k-file, ~3 TB backups from the cifs mount, with or without GODEBUG=asyncpreemptoff=1, without any unexpected errors.

thomasf avatar Aug 09 '21 07:08 thomasf

Did you try the nouser_xattr mount option?

It is a pretty slow home server (Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz) so it is probably more vulnerable to timing than the average modern machine.

The symptoms sound like a serious bug in either the CIFS server or client. Or maybe it is also a network problem. Although the CPU of the server isn't the fastest, it's also not the slowest one either.

MichaelEischer avatar Aug 09 '21 20:08 MichaelEischer

Did you try the nouser_xattr mount option?

I missed that one and it also seems to work. I have been running through a few tens of thousands of files already and it should definitely have failed by now.

It is a pretty slow home server (Intel(R) Core(TM)2 Duo CPU E6850 @ 3.00GHz) so it is probably more vulnerable to timing than the average modern machine.

The symptoms sound like a serious bug in either the CIFS server or client. Or maybe it is also a network problem. Although the CPU of the server isn't the fastest, it's also not the slowest one either.

I ran a few network load tests and even upgraded the switch firmware so it's probably not that. It's probably some interaction between the samba/kernel version of my qnap NAS and the CIFS client in the latest Ubuntu LTS kernel.

In any case nouser_xattr seems to work, so it's all good for now.

thomasf avatar Aug 10 '21 08:08 thomasf

After many months of using nouser_xattr this problem still occurs from time to time. Same dataset, roughly same changes every day - most days no errors, some days there are. I still cannot see a pattern why this is happening :-(

uok avatar Aug 10 '21 10:08 uok