Data loss on reboot.
Steps to reproduce:
- InfluxDB 2.7.5 installed on Ubuntu 20.04.6 in a VMware virtual machine.
- A bucket with no retention policy that keeps data forever and gets updated once a minute (by Telegraf).
Expected behaviour: When rebooting the machine using 'sudo reboot', pending data should be written to disk and should survive the reboot.
Actual behaviour: In some situations (not always) recent data is lost. The amount of missing data can range from an hour up to a whole week! It appears that cached data is not flushed out to persistent storage when the system is rebooted.
When data is written, it is stored in an in-memory cache and also appended to a write-ahead log (WAL) on disk. Periodically, the cache is snapshotted to disk as TSM files; if InfluxDB is shut down abruptly before that happens, the WAL is read back into the in-memory cache on start-up. So the scenario you describe should not usually lead to data loss.
Did you verify the presence of the data using queries before the reboot?
Here are better explanations of the WAL and cache: https://docs.influxdata.com/influxdb/v2/reference/internals/storage-engine/#write-ahead-log-wal.
To understand why you are seeing data loss on reboot, you should:
- Verify the data are present before the reboot.
- See what files are in your WAL and TSM directories before and after the reboot (see the sketch after this list).
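A minimal sketch of that file check, assuming the default Ubuntu engine path /var/lib/influxdb/engine (which matches the paths in your logs) and the bucket ID from this thread; substitute your own values:

```shell
# Snapshot the on-disk state of one bucket before the reboot, then re-run after it.
ENGINE=/var/lib/influxdb/engine   # adjust if your engine path differs
BUCKET=7400f96f21ce1327           # the bucket's internal ID, as seen under $ENGINE/data

# List TSM and WAL files with sizes and timestamps.
find "$ENGINE/data/$BUCKET" -name '*.tsm' -exec ls -l {} + > tsm-files-before.txt
find "$ENGINE/wal/$BUCKET"  -name '*.wal' -exec ls -l {} + > wal-files-before.txt

# Checksums show whether the TSM files themselves changed across the reboot.
find "$ENGINE/data/$BUCKET" -name '*.tsm' -exec md5sum {} + > tsm-md5-before.txt
```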
In other words, it looks like you are losing data that should have been persisted to the disk as WAL files, or even as TSM files, which is odd.
It is unlikely that week-old data is still in the WAL; with the default configuration, that data should have been written to the TSM files which are the permanent storage for data. So you may be losing data that is in TSM files, which points to something wrong with the file system.
Here are the parameters which control when data moves from the cache to TSM files:
- storage-cache-max-memory-size
- storage-cache-snapshot-memory-size
- storage-cache-snapshot-write-cold-duration
There are also parameters that control how data is written to the WAL (for example, storage-wal-fsync-delay). A sketch of how these options can be set follows.
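For reference, these options can be supplied to influxd as command-line flags or as INFLUXD_-prefixed environment variables; the values below are purely illustrative, so check influxd --help or the configuration docs for the actual defaults:

```shell
# Illustrative only: start influxd with explicit cache snapshot settings.
influxd \
  --storage-cache-max-memory-size=1073741824 \
  --storage-cache-snapshot-memory-size=26214400 \
  --storage-cache-snapshot-write-cold-duration=10m0s

# The same options can be set via the environment, e.g.:
export INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION=10m0s
```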
And, of course, it is best to shut down InfluxDB before a reboot.
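One way to make that shutdown explicit, assuming the stock influxdb systemd unit installed by the Ubuntu package:

```shell
# Stop the service and confirm it exited cleanly before rebooting.
sudo systemctl stop influxdb
systemctl status influxdb --no-pager   # should report "inactive (dead)"
sudo reboot
```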
I re-tested today. After the reboot, only data from the last two days were visible. Everything older than that is gone. The *.tsm files before and after the reboot are identical (md5). What differs are the fields.idx{l} files. In both situations there were two *.wal files in <bucketid/autogen/highest_number>. One of them is identical as well; the other one is slightly bigger than before the reboot. New data has most likely arrived in between.
This InfluxDB instance runs as a daemon controlled by systemd, so on reboot it should receive a proper notification to clean things up. The strangest thing is that another bucket on the very same instance doesn't show these problems.
fields.idx is rewritten on shutdown and startup, so it would be expected to change. The fields.idxl file is consolidated into a more compact representation in fields.idx.
In the logs on restart, do you see the files which contain the missing data being re-opened? Something like this:
2024-07-17T19:31:54.081632Z info Opened file {"log_id": "0qSYidNl000", "engine": "tsm1", "service": "filestore", "path": "/home/davidby/.influxdb/data/DST_TEST/autogen/133/000000001-000000001.tsm", "id": 0, "duration": "0.084ms"}
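If the service runs under systemd, those start-up lines can be pulled from the journal; a sketch, assuming the unit is named influxdb:

```shell
# Show the "Opened file" entries from the current boot's influxd start-up.
journalctl -u influxdb -b | grep "Opened file"
```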
Yes. All the files are opened.
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/131/000000015-000000002.tsm id=0 duration=8.091ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/160/000000020-000000002.tsm id=0 duration=32.630ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/150/000000019-000000002.tsm id=0 duration=36.753ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/180/000000019-000000002.tsm id=0 duration=44.064ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/140/000000020-000000002.tsm id=0 duration=150.393ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/210/000000020-000000002.tsm id=0 duration=180.253ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/170/000000020-000000002.tsm id=0 duration=81.524ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/190/000000020-000000002.tsm id=0 duration=634.996ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/200/000000019-000000002.tsm id=0 duration=312.513ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/230/000000019-000000002.tsm id=0 duration=228.602ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/250/000000020-000000002.tsm id=0 duration=168.714ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000006-000000001.tsm id=5 duration=95.667ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/220/000000020-000000002.tsm id=0 duration=552.318ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000003-000000001.tsm id=2 duration=57.136ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000002-000000001.tsm id=1 duration=135.037ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000004-000000001.tsm id=3 duration=63.056ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000001-000000001.tsm id=0 duration=195.773ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000005-000000001.tsm id=4 duration=54.510ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/240/000000020-000000002.tsm id=0 duration=445.985ms
Early this morning the latest *-000000001.tsm files got compacted into a single *-000000002.tsm. Since then, new data has been written into *.wal file(s). Currently two of them exist: one is 10 MB in size and was last updated about 3 hours ago; the other is 6 MB and growing.
BTW, the parameters you mentioned are all at their defaults.
If I may summarize:
- All your data appears to be present in the file system after a reboot, because the files are unchanged or larger.
- Your reboots should gracefully terminate influxd
- The data loss is sometimes recent data, and sometimes older data
- In one case, older data were purged, but the most recent 2 days were kept
- In other cases, recent data was lost, from the most recent 2 hours to the most recent week, but older data were kept.
- Sometimes no data is lost
The symptoms certainly seem varied and odd. Perhaps try running influxd inspect report-tsm and see if the results differ before and after a reboot, and if the reported time range of the files has changed. Depending on the size of your instance, there may be other influxd inspect commands you can run that produce additional diagnostics.
With the wide differences in symptoms, it's hard to think of a single cause for this.
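A minimal sketch of that before/after comparison with report-tsm, redirecting to files so the two runs can be diffed:

```shell
# Capture the TSM report before the reboot...
influxd inspect report-tsm > report-before.txt
# ...reboot, then capture it again and compare.
influxd inspect report-tsm > report-after.txt
# Note: the Load Time column will always differ; focus on the file list
# and the Min/Max time ranges.
diff report-before.txt report-after.txt
```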
The 3rd point is not correct. I always lose old data. How much recent data is kept varies, but 2-3 days is 'normal'. 'influxd inspect report-tsm' gives this:
DB RP Shard File Series New Min Time Max Time Load Time
7400f96f21ce1327 autogen 131 000000015-000000002.tsm 567 567 2024-04-16T13:01:22.617446002Z 2024-04-21T23:59:50Z 248.818µs
7400f96f21ce1327 autogen 140 000000020-000000002.tsm 578 11 2024-04-22T00:00:00Z 2024-04-28T23:59:50Z 182.284µs
7400f96f21ce1327 autogen 150 000000019-000000002.tsm 586 8 2024-04-29T00:00:00Z 2024-05-05T23:59:50Z 181.991µs
7400f96f21ce1327 autogen 160 000000020-000000002.tsm 578 0 2024-05-06T00:00:00Z 2024-05-12T23:59:50Z 185.042µs
7400f96f21ce1327 autogen 170 000000020-000000002.tsm 567 0 2024-05-13T00:00:00Z 2024-05-19T23:59:50Z 265.961µs
7400f96f21ce1327 autogen 180 000000019-000000002.tsm 578 0 2024-05-20T00:00:00Z 2024-05-26T23:59:57.60829245Z 204.708µs
7400f96f21ce1327 autogen 190 000000020-000000002.tsm 589 11 2024-05-27T00:00:00Z 2024-06-02T23:59:50Z 155.707µs
7400f96f21ce1327 autogen 200 000000019-000000002.tsm 578 0 2024-06-03T00:00:00Z 2024-06-09T23:59:50Z 123.33µs
7400f96f21ce1327 autogen 210 000000020-000000002.tsm 578 0 2024-06-10T00:00:00Z 2024-06-16T23:59:50Z 201.546µs
7400f96f21ce1327 autogen 220 000000020-000000002.tsm 578 0 2024-06-17T00:00:00Z 2024-06-23T23:59:50Z 144.605µs
7400f96f21ce1327 autogen 230 000000019-000000002.tsm 578 0 2024-06-24T00:00:00Z 2024-06-30T23:59:57.16992822Z 123.592µs
7400f96f21ce1327 autogen 240 000000020-000000002.tsm 578 0 2024-07-01T00:00:00Z 2024-07-07T23:59:50Z 121.468µs
7400f96f21ce1327 autogen 250 000000020-000000002.tsm 578 0 2024-07-08T00:00:00Z 2024-07-14T23:59:50Z 132.125µs
7400f96f21ce1327 autogen 260 000000009-000000002.tsm 578 0 2024-07-15T00:00:00Z 2024-07-18T01:40:40Z 121.943µs
7400f96f21ce1327 autogen 260 000000011-000000001.tsm 578 0 2024-07-18T01:40:47.93204727Z 2024-07-18T10:47:20Z 47.504µs
7400f96f21ce1327 autogen 260 000000012-000000001.tsm 570 0 2024-07-18T10:47:29.823678733Z 2024-07-18T19:50:50Z 52.851µs
7400f96f21ce1327 autogen 260 000000013-000000001.tsm 578 0 2024-07-18T19:51:00Z 2024-07-19T05:16:40Z 60.502µs
Summary: Files: 17
Time Range: 2024-04-16T13:01:22.617446002Z - 2024-07-19T05:16:40Z
Duration: 2248h15m17.382553998s
So theoretically there should be data available from midday 2024-04-16 to now. But the data shown starts at 2024-07-15T00:00:00. So it looks like only the last shard (260) gets presented and the older files are ignored.
I wrote the third point, that sometimes no data was lost, because you had written:
In some situations (not always) recent data is lost.
In any event, we see the shards being opened.
- If you are using the TSI index, try influxd inspect dump-tsi using the various flags to verify that the data you think is present is indexed. Skip this if you are using an in-mem index.
- Use influxd inspect verify-tsm to check that your TSM files are all valid.
- Use influxd inspect dump-tsm to see if the data is present (be careful which flags you use; this can output a lot of data). A sketch of these invocations follows the list.
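A rough sketch of those checks; the required flags (engine or file paths) vary between versions, so treat these invocations as assumptions and consult influxd inspect <subcommand> --help first:

```shell
# Check TSM file integrity.
influxd inspect verify-tsm

# Dump the TSI index and TSM contents; both can be extremely verbose,
# so redirect to files and filter afterwards.
influxd inspect dump-tsi > tsi-dump.txt
influxd inspect dump-tsm > tsm-dump.txt
```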
I copied the few remaining data over to a newly created bucket. Then I tried to delete the old one, which did not work from the WebUI. There was no reaction when confirming the click on the trashcan icon. It was, however, possible to delete the bucket from the command line, but I had to supply the 'Auth Token' as a parameter to 'influx bucket delete'; otherwise the command would hang indefinitely without any error message. After that procedure, it appears to work. Will need more time to verify.
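For anyone who hits the same hang, a sketch of the command-line deletion with an explicit token; the bucket name, organization, and token variable here are placeholders:

```shell
# Delete the broken bucket, passing the API token explicitly
# (without it, the command hung with no error message).
influx bucket delete \
  --name my-corrupt-bucket \
  --org my-org \
  --token "$INFLUX_TOKEN"
```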
Unfortunately, another issue showed up recently: whenever I update influxdb2 to any version newer than 2.7.6, some Flux queries take ages to complete. I guess I should open a new ticket for that.
After several weeks of testing I would say that the problem has gone away. To wrap it up:
- for some unknown reason a bucket got corrupted
- this corruption caused any operation on this bucket to become extremely slow or even hang indefinitely
- influxdb2 seems to process shutdown for each bucket sequentially in creation order
- thus, the problem with this single bucket prevented proper shutdown for any other bucket created later
- buckets created before the bad one were unaffected
- removing the corrupted bucket forcefully made things work again
I'll close this ticket now. Thanks for your attention!