Backups created on one instance cannot be restored to another: "An internal error has occurred"
This issue is raised from a conversation on the InfluxDB Community Slack with @samhld. I've been trying to transfer bucket data between InfluxDB instances using the backup/restore functionality and am seeing problems with the restore operation.
The influx restore call appears to fail:
❯ influx restore --full . -t wJMugsVT2fp8MBiFqqPcOVO2yYvUSdAAU0Ou9xbbf7RTJ293ewGMJqOjeAs9EP6edth-8C1b_ssoPNgGQ8s27g==
2022-07-28T17:30:43.318507Z info Restoring full metadata from local backup {"log_id": "0byTbrxG000", "path": "20220728T003806Z.bolt.gz"}
Error: Failed to upload local KV backup at "20220728T003806Z.bolt.gz": An internal error has occurred - check server logs.
See 'influx restore -h' for help
The server logs then show a slightly less confusing message:
❯ sudo journalctl -fu influxdb
-- Logs begin at Sun 2022-07-10 19:19:14 PDT. --
Jul 28 08:55:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T15:55:10.151870Z lvl=info msg="Retention policy deletion check (start)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=start
Jul 28 08:55:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T15:55:10.152082Z lvl=info msg="Retention policy deletion check (end)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=end op_elapsed=0.232ms
Jul 28 09:25:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T16:25:10.151421Z lvl=info msg="Retention policy deletion check (start)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=start
Jul 28 09:25:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T16:25:10.151927Z lvl=info msg="Retention policy deletion check (end)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=end op_elapsed=0.516ms
Jul 28 09:55:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T16:55:10.151855Z lvl=info msg="Retention policy deletion check (start)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=start
Jul 28 09:55:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T16:55:10.152669Z lvl=info msg="Retention policy deletion check (end)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=end op_elapsed=0.833ms
Jul 28 10:25:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T17:25:10.152179Z lvl=info msg="Retention policy deletion check (start)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=start
Jul 28 10:25:10 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T17:25:10.152724Z lvl=info msg="Retention policy deletion check (end)" log_id=0bxTmYe0000 service=retention op_name=retention_delete_check op_event=end op_elapsed=0.566ms
Jul 28 10:25:45 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T17:25:45.734845Z lvl=warn msg="internal error not returned to client" log_id=0bxTmYe0000 handler=error_logger error="unable to open boltdb: invalid database"
Jul 28 10:30:43 hitl-nuc influxd-systemd-start.sh[11661]: ts=2022-07-28T17:30:43.388630Z lvl=warn msg="internal error not returned to client" log_id=0bxTmYe0000 handler=error_logger error="unable to open boltdb: invalid database"
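For what it's worth, a basic integrity check on the metadata archive can at least rule out simple file corruption at the compression layer (a sketch using the filename from the error above; gzip -t validates only the gzip stream, not the BoltDB structure inside it):
❯ gzip -t 20220728T003806Z.bolt.gz && echo "gzip stream OK"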
AFAIK I'm running the latest versions of InfluxDB and Influx CLI:
❯ influxd version
InfluxDB v2.3.0+SNAPSHOT.090f681737 (git: 090f681737) build_date: 2022-06-16T19:33:50Z
❯ influx version
Influx CLI 2.3.0 (git: 88ba346) build_date: 2022-04-06T19:30:53Z
@samhld indicated that he has escalated this to the engineering team and suggested that I file a report here for further investigation. Please let me know what additional information would be helpful in the debugging process.
Are you trying to restore into an existing database or a new one?
Are you able to share the backup with us? We can provide a private SFTP to upload it to - that would allow us to try the restore and fully diagnose the issue.
@lesam This is an existing database instance whose dataset intersects with, but is not a subset of, the backup dataset. That is, it already contains some of the data in the backup from previous restore attempts and backup approaches, but it also holds fresh data that it generated itself and that is not in the backup.
Sure! I don't think we have a problem with sharing it for diagnostic purposes. IIRC it's ~100 GB, though... as long as you are okay with that.
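If it helps characterize the overlap, comparing the bucket lists of the two instances is straightforward (a sketch; the hosts and tokens here are placeholders):
❯ influx bucket list --host http://source-instance:8086 -t <source-token>
❯ influx bucket list --host http://dest-instance:8086 -t <dest-token>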
@neilbalch Where is the influxdb data being stored? On a local file system? Network file system? Virtual machine / container?
My intuition is that the error is related to the overlapping data / shards / retention policies, but I would not expect the "unable to open boltdb: invalid database" error in that case. I would also think you'd be having much bigger issues if the boltdb metadata store were actually corrupt.
I am also working on getting something set up so you can send the backup for us to analyze.
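One way to test the corruption hypothesis directly would be to decompress the backup's bolt file and run it through bbolt's consistency check (a sketch; assumes the bbolt CLI from go.etcd.io/bbolt, installable with go install go.etcd.io/bbolt/cmd/bbolt@latest):
❯ gunzip -k 20220728T003806Z.bolt.gz   # -k keeps the original .gz
❯ bbolt check 20220728T003806Z.bolt    # prints OK or lists inconsistencies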
@gwossum
Where is the influxdb data being stored? On a local file system? Network file system? Virtual machine / container?
Local NTFS disk on bare-metal (i.e. not virtualized) Ubuntu 18.04
I am also working on getting something setup so you can send the backup for us to analyze.
Thank you! Yeah, I'm also fairly confused by this. If the boltdb store were actually corrupt, I would expect InfluxDB as a whole to be unresponsive, but the localhost:8086 admin page and write/query operations work normally.
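To be concrete about "work normally", these are the kinds of checks I mean (a sketch; both influx ping and the /health endpoint are standard in InfluxDB 2.x):
❯ influx ping                            # prints OK when the instance is reachable
❯ curl -s http://localhost:8086/health   # returns a JSON health summary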
@neilbalch I'm working on getting the sftp site set up.
While that's going on, would it be possible for you to try restoring the backup into a brand-new InfluxDB instance? If that works, that's a strong indicator that it's the overlapping data / shards / metadata causing the issue.
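Something along these lines would do it (a sketch, not exact steps; the port, credentials, token, and backup path are all placeholders, and note that a full restore needs a valid operator token on the target, hence the influx setup first):
❯ docker run -d --name influx-restore-test -p 8087:8086 influxdb:2.3.0
❯ influx setup --host http://localhost:8087 -u test -p testpassword \
    -o test-org -b test-bucket -t my-operator-token -f
❯ influx restore --full /path/to/backup --host http://localhost:8087 -t my-operator-token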
Regarding Linux with NTFS, it's possible the backup is corrupt. NTFS has issues creating symlinks, which the internal InfluxDB snapshotting service (used for backups) relies on by default. This was fixed for Windows hosts with this PR, but I do not believe that PR would correct Linux issues, since it checks the running OS to make the decision, not the underlying file system.
Also, Linux NTFS support can be a bit iffy, depending on which driver you use and which version. BoltDB has known issues on file systems that don't faithfully support all mmap semantics. A Linux host with an NTFS file system isn't a combination we test or necessarily support. You might try a native Linux file system like ext4 or xfs and see if you get different results.
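Two quick checks that would help narrow that down (a sketch; /path/to/backup is a placeholder): whether the backup tree contains any symlinks NTFS may have mangled, and what the kernel actually reports the file system as (ntfs-3g mounts show up as fuseblk, for example):
❯ find /path/to/backup -type l                   # list any symlinks in the backup tree
❯ findmnt -n -o FSTYPE --target /path/to/backup  # file system type backing the path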
@neilbalch I have the sftp set up. Can I email the details to the email in your GH profile?
@gwossum Hrm, interesting! Yes, that email would work.
I'm skeptical that this is an NTFS issue, though... Both the backup and the InfluxDB instance store their data on separate NTFS-formatted disks, and only one operation (restoring from the backup) is failing.
@gwossum Good morning! The SFTP file transfer has finished and you should see two backups on the server: 20220728 and the more recent 20220808. Both were created with influx backup in the exact same way, but on different dates.
Curious to hear what you can find from them.
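For reference, both were produced with an invocation along these lines (token elided; the destination directories are the ones named above):
❯ influx backup ./20220728 -t <token>
❯ influx backup ./20220808 -t <token>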