
[1.8.9]: troubles with restore + tsi1 + insane memory usage

Open sahib opened this issue 4 years ago • 3 comments

Hello,

I have a couple of weird issues restoring a backup made on our production instance (v1.8.6) to one of our staging environments (v1.8.9). This is an automated process in our case and had worked fine until 2-3 weeks ago.

Since I was hit by #21991, I updated from 1.8.6 to 1.8.9; otherwise I was not even able to start the backup. Long term we want to upgrade to 2.0, but until that happens we're stuck with 1.8.x for some time. The whole idea was to test the switch to tsi1 on staging before doing it on our prod instance, hence the backup/restore investigation listed below. The underlying goal is to save some memory on our prod instance.

I'm happy to deliver more info if needed. Any idea how to progress here?


Steps to reproduce:

  1. Take backup of prod instance using influxd backup -portable /some/dir
  2. Transfer to staging instance using rsync.
  3. Try to restore using influxd restore -portable /var/lib/influxdb/backup using various config options.

(NOTE: actual commands are slightly longer, due to dockerized env, but effectively the same)
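The three steps above can be sketched roughly as follows (a sketch only: `/some/dir` is the placeholder path from step 1, and `staging-host` is a hypothetical hostname standing in for the staging instance):

```shell
# 1. On the prod instance (v1.8.6): take a portable backup.
influxd backup -portable /some/dir

# 2. Transfer the backup directory to staging ("staging-host" is an
#    assumed hostname; the real setup is dockerized).
rsync -avz /some/dir/ staging-host:/var/lib/influxdb/backup/

# 3. On the staging instance (v1.8.9): restore from the transferred dir.
influxd restore -portable /var/lib/influxdb/backup
```

These are CLI fragments that require running InfluxDB instances, so they are illustrative rather than directly runnable here.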

Expected behavior:

Restore would work with 1.8.6 or at least 1.8.9.

Actual behavior:

It does not.

  1. The 1.8.6 restore fails immediately with an error message similar to the ones in ticket #9968.

  2. The 1.8.9 restore at first seems to work, but eats huge amounts of memory (I had to grow the instance to 32G plus swap to get further). After importing roughly 20G of data, it crashed with an out-of-memory error (see log), even though the instance still had memory available at that point.

  3. After setting the index back to "inmem", the excessive memory consumption was gone, but the restore still crashed halfway through (see other log).

  4. After reading up on this, I followed a few suggestions:

     • Add vm.max_map_count=2048000 to /etc/sysctl.conf and activate it.
     • Set "max-concurrent-compactions" to 0.

     With this setup the restore worked (in the sense that the restore command returned successfully), but it still produced an OOM error shortly after. After a restart of the influxd process the data was (mostly?) there, though. I'm not 100% certain the two changes above actually had an effect; it may have been luck. I forgot to save that log, but it looked pretty much like the previous ones, apart from the timestamps.

  5. When restarting influxd with tsi1 enabled, the insane memory consumption happens again, so this seems to be a more general issue in our case.

In all of cases 2 to 4 above I also see plenty of log lines like this:

lvl=warn msg="Error while freeing cold shard resources" service=store error="engine is closed" db_shard_id=23510

Environment info:

  • Linux 5.4.0-1029-aws x86_64
  • InfluxDB v1.8.9 (git: 1.8 d9b56321d579)
  • I use the *-alpine variant of the docker images.
  • The size of the backup is roughly 31G.
  • The cardinality of our series is 4206 (as shown by SHOW SERIES CARDINALITY), which does not seem that high...
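For reference, the cardinality figure above can be double-checked per database from the `influx` CLI (a sketch; "mydb" is a hypothetical database name, and the EXACT variant is more expensive but avoids estimation):

```shell
# Estimated series cardinality for one database.
influx -execute 'SHOW SERIES CARDINALITY ON "mydb"'

# Exact (non-estimated) count; heavier to compute.
influx -execute 'SHOW SERIES EXACT CARDINALITY ON "mydb"'
```

These are CLI fragments that need a running influxd to answer, so no output is shown here.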

Config:

Config is pretty much default, except the modifications described above.
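For completeness, the non-default settings described above would look roughly like this (a sketch of config fragments; the exact file locations are assumptions based on the default layout):

```toml
# /etc/influxdb/influxdb.conf fragment
[data]
  index-version = "tsi1"           # was "inmem"; the setting under test
  max-concurrent-compactions = 0   # 0 = the default heuristic (half of GOMAXPROCS)
```

```
# /etc/sysctl.conf fragment; activate with `sysctl -p`
vm.max_map_count = 2048000
```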

sahib avatar Oct 11 '21 11:10 sahib

I have the same error: 2022/03/30 21:41:37 Error writing: [DebugInfo: worker #0, dest url: http://test217:8086] Invalid write response (status 500): {"error":"engine is closed"}

tomchon avatar Mar 30 '22 13:03 tomchon

Hello folks,

We can confirm: this problem persists.

After a disaster recovery, we cannot import the data from one node to another node. InfluxDB consumes all swap/RAM and after a while the error described above occurs.

tuxracer1337 avatar Jul 25 '23 10:07 tuxracer1337

+1, this hit us badly as well. The only solution that worked for us was the line-protocol export/import:

influx_inspect export -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal -out /var/lib/influxdb/influx-backup/backup-${GC_ID} -compress -database metrics -retention autogen

influx -import -compressed -path /backups/${GC_ID}-influx-backup/data/influxdb/influx-backup/backup-${GC_ID}
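As an aside, since the original goal was to test the switch to tsi1: the TSI index can also be built offline with `influx_inspect buildtsi` instead of letting influxd build it at startup (a sketch under assumptions: influxd must be stopped first, the data/wal paths match the defaults used above, and influxd runs as the "influxdb" user):

```shell
# Stop influxd first; buildtsi writes TSI index files next to the shards.
influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal

# Fix ownership afterwards if the tool was run as root (assumption about
# the service user).
chown -R influxdb:influxdb /var/lib/influxdb
```

This is a CLI fragment that operates on real shard directories, so it is illustrative rather than runnable here.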

rosscdh avatar Feb 09 '24 22:02 rosscdh