SQLite3 error on 1.10.0

Open florianmski opened this issue 1 month ago • 19 comments

Describe the bug: After updating from 1.9.2 to the latest 1.10.1 I'm getting the following logs. Backrest won't start and loops on these same 4 lines.

2025-11-01T13:37:27.939+0100    INFO    backrest version 1.10.1@b9181dc00d7c595c0ff1eae333c6ecdecee96b74, using log directory: /data/processlogs
2025-11-01T13:37:28.194+0100    INFO    restic binary "/bin/restic" in $PATH matches required version 0.18.1, it will be used for backrest commands
2025-11-01T13:37:33.269+0100    WARN    operation log may be corrupted, if errors recur delete the file "/data/oplog.sqlite" and restart. Your backups stored in your repos are safe.
2025-11-01T13:37:33.269+0100    FATAL    error creating oplog: run multiline query: sqlite3: disk I/O error

I've tried following the instructions in the warning and deleting oplog.sqlite, but I still see the same error: the file is recreated but stays empty.

Platform Info

  • OS and Architecture: Linux 3.10.108 x86_64 GNU/Linux synology_braswell_716+
  • Backrest Version: 1.10.0 and 1.10.1

Additional context: Looking at the changes that went into 1.10.0, it seems a different SQLite library is now being used; maybe that's where the problem comes from? I'm running backrest on a quite old Synology and so far it has been working great. I would be tempted to say it's a problem with my docker / filesystem setup, but it works well with other containers, and that also wouldn't explain why it works on 1.9.2 but not 1.10.1. Moving back to 1.9.2 works again, so I've got a workaround for now at least. Thanks for the great software and let me know if I can help pin down the issue further.

florianmski avatar Nov 01 '25 13:11 florianmski

Sorry you're running into this & thanks for the detailed report / steps you've taken to root cause the versions involved.

It definitely could be the case that the sqlite driver is now incompatible with certain filesystems / OSes which isn't great. Interested to track down what exactly you're running into -- are there any details about the filesystem you're running on you can share? is it ext4?

garethgeorge avatar Nov 01 '25 23:11 garethgeorge

The filesystem I'm using is btrfs

florianmski avatar Nov 02 '25 14:11 florianmski

Does any of the debugging advice in https://stackoverflow.com/questions/9993555/disk-i-o-error-with-sqlite apply to your situation? You might try fully removing both the database and all shm files associated with it if you haven't already.
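For anyone following along, here is a minimal sketch of which files "the database and all shm files" usually means. It assumes SQLite's standard sidecar naming (`-wal`, `-shm`, `-journal` appended to the database path) and the `/data/oplog.sqlite` path from the log above; the helper name `sqliteFiles` is made up for illustration and is not part of backrest.

```go
package main

import "fmt"

// sqliteFiles returns the database path plus the sidecar files SQLite may
// create next to it: "-wal" and "-shm" in WAL mode, "-journal" in
// rollback-journal mode. (Hypothetical helper, not backrest code.)
func sqliteFiles(dbPath string) []string {
	return []string{
		dbPath,
		dbPath + "-wal",
		dbPath + "-shm",
		dbPath + "-journal",
	}
}

func main() {
	// Print the candidates to delete while backrest is stopped.
	for _, p := range sqliteFiles("/data/oplog.sqlite") {
		fmt.Println(p)
	}
}
```

Stop the container before removing anything, and pass each path to `os.Remove` (or delete by hand) only once you are sure nothing has the database open.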

garethgeorge avatar Nov 02 '25 23:11 garethgeorge

I have the same issue: v1.9.2 is working fine but 1.10.0 is not. I already deleted all the sqlite files and the cache, but nothing helps.

steinbrueckri avatar Nov 03 '25 14:11 steinbrueckri

I very much wonder if this is an issue with how I've configured the driver https://github.com/garethgeorge/backrest/blob/93becf3e328be1ae132a3c386c204c97648fa6cd/internal/oplog/sqlitestore/sqlitestore.go#L63-L71

@steinbrueckri if you can provide any additional details about the hardware you're running. Primarily OS version, CPU family, and filesystem the data volume containing the oplog is running on.

Perhaps I'm being too optimistic about WAL support on lowest common denominator hardware. cc @ncruces who was able to give me some very helpful tips re: the initial integration, curious if you have any hardening recs here / a most stable subset of features I can enable and expect to work for all users?

Might abandon WAL if that improves reliability.

garethgeorge avatar Nov 03 '25 21:11 garethgeorge

On the surface that looks OK. I would definitely expect WAL to work if vfs.SupportsSharedMemory is true.

There's a “problem” with setting PRAGMAs that way, which is they may only apply to a single connection from the pool. That happens with all SQLite drivers, which is why you should preferably set PRAGMAs in the DSN. But at least for WAL mode the change is persistent, so that shouldn't be the issue.
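A rough sketch of what "set PRAGMAs in the DSN" can look like. It assumes the driver accepts a repeated `_pragma` query parameter on a `file:` DSN, which is my reading of the go-sqlite3 driver docs; the helper name `pragmaDSN` and the specific pragma values are illustrative only. The sketch only builds the string, so it runs without the driver installed.

```go
package main

import (
	"fmt"
	"net/url"
)

// pragmaDSN builds a "file:" DSN that carries each pragma as a repeated
// _pragma query parameter, so that every connection the pool opens applies
// the same settings. The path is used as-is (not escaped) in this sketch.
func pragmaDSN(path string, pragmas ...string) string {
	q := url.Values{}
	for _, p := range pragmas {
		q.Add("_pragma", p) // values are percent-encoded by Encode below
	}
	return "file:" + path + "?" + q.Encode()
}

func main() {
	dsn := pragmaDSN("/data/oplog.sqlite", "journal_mode(wal)", "busy_timeout(10000)")
	fmt.Println(dsn)
	// The string would then be passed to sql.Open with the driver registered.
}
```

The design point is that the DSN is evaluated for every new connection, whereas a `PRAGMA` run through `db.Exec` only reaches whichever pooled connection happened to execute it.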

sqlite3: disk I/O error is not a lot to go with. This is a problem I struggled with for some time. The next release should improve matters by wrapping the underlying OS error that caused the I/O error. I plan to cut a tag as soon as the next SQLite version is released, which I expected to be late October.

If you can build and test against main, that would be very helpful.

PS: this is where I improved error reporting: https://github.com/ncruces/go-sqlite3/pull/327

ncruces avatar Nov 03 '25 22:11 ncruces

@steinbrueckri if you can provide any additional details about the hardware you're running. Primarily OS version, CPU family, and filesystem the data volume containing the oplog is running on.

I'm running backrest in a container on my Synology NAS and I mount /data, /cache, etc. from /volume1, like this:

  backrest:
    image: garethgeorge/backrest:v1.9.2
    container_name: backrest
    hostname: backrest
    volumes:
      - ./data:/data
      - ./config:/config
      - ./cache:/cache
      - ./tmp:/tmp
      - ./ssh:/root/.ssh
      - /volume1:/userdata # Mount local paths to backup
    environment:
      - BACKREST_DATA=/data
      - BACKREST_CONFIG=/config/config.json
      - XDG_CACHE_HOME=/cache
      - TMPDIR=/tmp
      - TZ=Europe/Berlin
    ports:
      - "9898:9898"
    restart: unless-stopped

Linux NAS02 3.10.108 #69057 SMP Mon Jul 21 23:25:00 CST 2025 x86_64 GNU/Linux synology_braswell_916+

steinbrueckri@NAS02:~$ sudo docker ps
CONTAINER ID   IMAGE                                       COMMAND                  CREATED        STATUS                  PORTS                                       NAMES
b76fad857f62   garethgeorge/backrest:v1.9.2                "/sbin/tini -- /dock…"   17 hours ago   Up 17 hours             0.0.0.0:9898->9898/tcp, :::9898->9898/tcp   backrest
1cb9c07653fb   lscr.io/linuxserver/openssh-server:latest   "/init"                  17 hours ago   Up 17 hours             0.0.0.0:2222->2222/tcp, :::2222->2222/tcp   openssh-server
47c529491ab2   containrrr/watchtower:latest                "/watchtower --label…"   17 hours ago   Up 17 hours (healthy)   8080/tcp
steinbrueckri@NAS02:~$ df -T | grep vol
/dev/mapper/cachedev_0 btrfs    33719184264 7023942064 26695242200  21% /volume1

steinbrueckri avatar Nov 04 '25 07:11 steinbrueckri

Thanks for chiming in ncruces@ really appreciate it.

On the surface that looks OK. I would definitely expect WAL to work if vfs.SupportsSharedMemory is true.

Agreed-- it's surprising. I think it's interesting that both reports so far are synology devices and both are using a btrfs filesystem. I'll try to reproduce on my own system to see if this is possibly a btrfs issue, but it seems this kernel is from ~2013. It's very possible there's either a buggy implementation OR unsupported feature.

vfs.SupportsSharedMemory looks like it's a property of the build-- so I suppose my code isn't actually checking if the filesystem / OS supports shared memory / locking.

There's a “problem” with setting PRAGMAs that way, which is they may only apply to a single connection from the pool. That happens with all SQLite drivers, which is why you should preferably set PRAGMAs in the DSN. But at least for WAL mode the change is persistent, so that shouldn't be the issue.

This is a good callout, thanks!

sqlite3: disk I/O error is not a lot to go with. This is a problem I struggled with for some time. The next release should improve matters by wrapping the underlying OS error that caused the I/O error. I plan to cut a tag as soon as the next SQLite version is released, which I expected to be late October. If you can build and test against main, that would be very helpful.

PS: this is where I improved error reporting: ncruces/go-sqlite3#327

Great, yes happy to see if we can get some traces here. I can create a branch with some snapshots that are linked against head for the purposes of debugging, it may require building some docker images assuming both users in this thread are using dockerized installs. I should be able to get this together ~Wednesday.

I think I can also add a BACKREST_BASIC_SQLITE or such environment variable (or a safemode fallback if database init fails?) to try to unblock these users.

garethgeorge avatar Nov 04 '25 09:11 garethgeorge

Before too much work goes into creating a branch, if this is not super urgent, the current target for SQLite release is November 5.

ncruces avatar Nov 04 '25 10:11 ncruces

Oh, sorry. No need for more tests.

I'm using OFD locks, which were introduced on Linux 3.15: https://man7.org/linux/man-pages/man2/fcntl_locking.2.html

I need to add this to my documentation, because I often forget. Sorry.

If the user is building your software from scratch they can try the sqlite3_flock or sqlite3_dotlk build tags. These should work. The second one should work anywhere regardless of OS version or architecture.

But if they do, it is important to understand that they can never open the databases concurrently with another process (like the sqlite3 CLI) or they will irreparably corrupt data.
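To make the version cutoff concrete, here is a small sketch that checks a kernel release string (the `uname -r` style value, e.g. `3.10.108` from the reports above) against the 3.15 threshold. The function name `supportsOFDLocks` is hypothetical, and the check is only a heuristic: it assumes the kernel version reflects feature availability, which vendor kernels with backports may not honour.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportsOFDLocks reports whether a kernel release string such as
// "3.10.108" or "6.15.7-generic" is at least Linux 3.15, the release
// that introduced open file description (OFD) locks.
func supportsOFDLocks(release string) (bool, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("parse major version: %w", err)
	}
	// The minor field may carry a suffix (e.g. "15-rc1"); keep leading digits.
	end := 0
	for end < len(parts[1]) && parts[1][end] >= '0' && parts[1][end] <= '9' {
		end++
	}
	minor, err := strconv.Atoi(parts[1][:end])
	if err != nil {
		return false, fmt.Errorf("parse minor version: %w", err)
	}
	return major > 3 || (major == 3 && minor >= 15), nil
}

func main() {
	for _, release := range []string{"3.10.108", "6.15.7"} {
		ok, err := supportsOFDLocks(release)
		fmt.Println(release, ok, err)
	}
}
```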

ncruces avatar Nov 04 '25 10:11 ncruces

Just because I mentioned the release: https://github.com/ncruces/go-sqlite3/releases/tag/v0.30.0

But I'm fully convinced the issue is Linux 3.10.

ncruces avatar Nov 05 '25 12:11 ncruces

Sorry for the late reply -- brilliant, yes I think your understanding makes sense to me and thanks for getting the patch release out. Backrest can go ahead and take advantage of that.

If the user is building your software from scratch they can try the sqlite3_flock or sqlite3_dotlk build tags. These should work. The second one should work anywhere regardless of OS version or architecture.

I read this blurb in the docs at initial setup time but really didn't quite grok the distinctions / particularly what it'd mean for older platform support. I'm torn between going with sqlite3_dotlk (which sounds like it should offer the best breadth of support + performance?) and adopting sqlite3_flock which will provide the best protection.

I think the changes I'll go with for the next patch release will be:

  1. Update to the newest ncruces/go-sqlite3 to get the improved error messages.
  2. Most likely adopt the sqlite3_dotlk package for maximum compatibility... I've been considering for a while and should perhaps prioritize a database backup feature as a mitigation of the risk re: opening and corrupting it.

Just want to add on, this wrinkle seems to be entirely my fault re: integration, really impressed for the most part with how seamless the move to ncruces/go-sqlite3 has been and that your library has dramatically simplified my dependency management while maintaining a very high standard of platform support compared to the modernc alternatives.

garethgeorge avatar Nov 05 '25 20:11 garethgeorge

About the build tags, I did lots of writing, but it's somewhat hard to convey knowledge.

For why I try to avoid POSIX locks, and prefer OFD, you could skim this rant from SQLite devs.

The point of build tags is that the default (no tags) is strictly compatible with the default SQLite locking protocol for each platform.

This basically means if someone decided to use the sqlite3 CLI to inspect a backrest database, they won't corrupt it. It also means they can use sqlite3_rsync or Litestream.

If you decide to go with sqlite3_dotlk (or sqlite3_flock) using the CLI on your database risks corruptions, and the backup tools don't work.
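As an illustration of the general idea behind dot-file locking (not the library's actual implementation), the sketch below takes a lock by exclusively creating a file next to the database. The lock file name and helper names are hypothetical. It also shows why the scheme is purely cooperative: a process that never looks for the lock file, such as a tool using the default locking protocol, is not stopped by it.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// tryDotLock attempts to take a lock by exclusively creating lockPath.
// It returns false (and no error) when the file already exists, meaning
// some other cooperating holder has the lock.
func tryDotLock(lockPath string) (bool, error) {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if errors.Is(err, fs.ErrExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, f.Close()
}

// releaseDotLock drops the lock by deleting the lock file.
func releaseDotLock(lockPath string) error {
	return os.Remove(lockPath)
}

// demo takes the lock, tries to take it again, then releases and retakes it,
// returning the outcome of each of the three attempts in order.
func demo() ([]bool, error) {
	dir, err := os.MkdirTemp("", "dotlock")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	lock := filepath.Join(dir, "oplog.sqlite.lock") // hypothetical name

	first, err := tryDotLock(lock)
	if err != nil {
		return nil, err
	}
	second, err := tryDotLock(lock)
	if err != nil {
		return nil, err
	}
	if err := releaseDotLock(lock); err != nil {
		return nil, err
	}
	third, err := tryDotLock(lock)
	if err != nil {
		return nil, err
	}
	return []bool{first, second, third}, nil
}

func main() {
	results, err := demo()
	fmt.Println(results, err)
}
```

Because this relies only on exclusive file creation, it works on old kernels and unusual filesystems, which matches the "should work anywhere" remark earlier in the thread.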

ncruces avatar Nov 05 '25 22:11 ncruces

Adopted sqlite3_dotlk in https://github.com/garethgeorge/backrest/commit/3d4cc3806c6145650d1dd08e1343ae7ab8b62fca and added some quick and dirty backups in https://github.com/garethgeorge/backrest/commit/5c93d99a404fa028a7a5a37ec39b19d13d34b736, which run when backrest is restarted if there hasn't been a backup recently OR if the latest version's schema differs from that of the version that created the last backup.
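The backup trigger described above can be sketched roughly as follows. This is not the code from the linked commit: the function name, the integer schema versions, the 24-hour threshold, and the "no previous backup means back up" rule are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// needsBackup reports whether a backup should run at startup: when no
// backup exists yet (assumed rule), when the last one is older than maxAge,
// or when it was written under a different schema version.
func needsBackup(haveBackup bool, age, maxAge time.Duration, backupSchema, currentSchema int) bool {
	if !haveBackup {
		return true
	}
	return age > maxAge || backupSchema != currentSchema
}

func main() {
	maxAge := 24 * time.Hour // hypothetical threshold for "recently"

	fmt.Println(needsBackup(true, 2*time.Hour, maxAge, 5, 5))  // recent, same schema
	fmt.Println(needsBackup(true, 48*time.Hour, maxAge, 5, 5)) // stale backup
	fmt.Println(needsBackup(true, 2*time.Hour, maxAge, 4, 5))  // schema changed
}
```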

garethgeorge avatar Nov 13 '25 05:11 garethgeorge

But I'm fully convinced the issue is Linux 3.10.

I’m having the same issue on Linux 6.15.7, Ubuntu 24.04. Oracle Cloud arm instance with ext4 for the filesystem.

fhuhne avatar Nov 14 '25 04:11 fhuhne

I doubt it's the same issue, even if the log is the same. Please try again with a version that's built against ncruces/go-sqlite3 v0.30.1 and check the more detailed error message. It's very unlikely that the issue is locking on Linux 6.15.

ncruces avatar Nov 14 '25 10:11 ncruces

But I'm fully convinced the issue is Linux 3.10.

I’m having the same issue on Linux 6.15.7, Ubuntu 24.04. Oracle Cloud arm instance with ext4 for the filesystem.

Easiest way to try this out will be one of the snapshot builds on https://github.com/garethgeorge/backrest/actions/runs/19321366837 if you are running backrest baremetal, otherwise I'm expecting to get the patch release out in the upcoming week.

garethgeorge avatar Nov 15 '25 06:11 garethgeorge

Running on docker, will wait for the next release then and report back if it’s fixed.

fhuhne avatar Nov 15 '25 07:11 fhuhne

But I'm fully convinced the issue is Linux 3.10.

I’m having the same issue on Linux 6.15.7, Ubuntu 24.04. Oracle Cloud arm instance with ext4 for the filesystem.

At least I can confirm that, on my side, it only fails on the system running a 3.10 kernel.

Image

steinbrueckri avatar Nov 17 '25 09:11 steinbrueckri