Data loss on reboot.
Steps to reproduce:
- InfluxDB 2.7.5 installed on Ubuntu 20.04.6 in a VMware virtual machine.
- A bucket with no retention policy that keeps data forever and gets updated once a minute (by Telegraf).
Expected behaviour: When rebooting the machine using 'sudo reboot', pending data should be written to disk and should survive the reboot.
Actual behaviour: In some situations (not always) recent data is lost. The amount of missing data can range from an hour up to a whole week! It appears that cached data is not flushed out to persistent storage when the system is rebooted.
When data is written, it is stored in an in-memory cache and also appended to a write-ahead log (WAL) on disk. Periodically, the cache is snapshotted to disk as TSM files; if InfluxDB is shut down abruptly before that happens, the WAL is read back into the in-memory cache on start-up. So the scenario you describe should not usually lead to data loss.
Did you verify the presence of the data using queries before the reboot?
Here are better explanations of the WAL and cache: https://docs.influxdata.com/influxdb/v2/reference/internals/storage-engine/#write-ahead-log-wal.
To understand why you are seeing data loss on reboot, you should:
- Verify the data are present before the reboot.
- See what files are in your WAL and TSM directories before and after the reboot (see the sketch after this list).
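A minimal sketch of that file check, assuming the default Ubuntu engine path /var/lib/influxdb/engine (which matches the paths in your logs) and the bucket ID from this thread; substitute your own values:

```shell
# Snapshot the on-disk state of one bucket before the reboot, then re-run after it.
ENGINE=/var/lib/influxdb/engine   # adjust if your engine path differs
BUCKET=7400f96f21ce1327           # the bucket's internal ID, as seen under $ENGINE/data

# List TSM and WAL files with sizes and timestamps.
find "$ENGINE/data/$BUCKET" -name '*.tsm' -exec ls -l {} + > tsm-files-before.txt
find "$ENGINE/wal/$BUCKET"  -name '*.wal' -exec ls -l {} + > wal-files-before.txt

# Checksums show whether the TSM files themselves changed across the reboot.
find "$ENGINE/data/$BUCKET" -name '*.tsm' -exec md5sum {} + > tsm-md5-before.txt
```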
In other words, it looks like you are losing data that should have been persisted to the disk as WAL files, or even as TSM files, which is odd.
It is unlikely that week-old data is still in the WAL; with the default configuration, that data should have been written to the TSM files which are the permanent storage for data. So you may be losing data that is in TSM files, which points to something wrong with the file system.
Here are the parameters which control when data moves from the cache to TSM files:
- storage-cache-max-memory-size
- storage-cache-snapshot-memory-size
- storage-cache-snapshot-write-cold-duration
There are also parameters that control how data is written to the WAL (for example, storage-wal-fsync-delay). A sketch of how these options can be set follows.
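For reference, these options can be supplied to influxd as command-line flags or as INFLUXD_-prefixed environment variables; the values below are purely illustrative, so check influxd --help or the configuration docs for the actual defaults:

```shell
# Illustrative only: start influxd with explicit cache snapshot settings.
influxd \
  --storage-cache-max-memory-size=1073741824 \
  --storage-cache-snapshot-memory-size=26214400 \
  --storage-cache-snapshot-write-cold-duration=10m0s

# The same options can be set via the environment, e.g.:
export INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION=10m0s
```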
And, of course, it is best to shut down InfluxDB before a reboot.
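One way to make that shutdown explicit, assuming the stock influxdb systemd unit installed by the Ubuntu package:

```shell
# Stop the service and confirm it exited cleanly before rebooting.
sudo systemctl stop influxdb
systemctl status influxdb --no-pager   # should report "inactive (dead)"
sudo reboot
```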
I re-tested today. After the reboot, only data from the last two days were visible. Everything older than that is gone. The *.tsm files before and after the reboot are identical (md5). What differs are the fields.idx{l} files. In both situations there were two *.wal files in <bucketid/autogen/highest_number>. One of them is identical as well; the other one is slightly bigger than before the reboot. New data has most likely arrived in between.
This InfluxDB instance runs as a daemon controlled by systemd, so on reboot it should receive a proper notification to clean things up. The strangest thing is that another bucket on the very same instance doesn't show these problems.
fields.idx is rewritten on shutdown and startup, so it would be expected to change. The fields.idxl file is consolidated into a more compact representation in fields.idx.
In the logs on restart, do you see the files which contain the missing data being re-opened? Something like this:
2024-07-17T19:31:54.081632Z info Opened file {"log_id": "0qSYidNl000", "engine": "tsm1", "service": "filestore", "path": "/home/davidby/.influxdb/data/DST_TEST/autogen/133/000000001-000000001.tsm", "id": 0, "duration": "0.084ms"}
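If the service runs under systemd, those start-up lines can be pulled from the journal; a sketch, assuming the unit is named influxdb:

```shell
# Show the "Opened file" entries from the current boot's influxd start-up.
journalctl -u influxdb -b | grep "Opened file"
```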
Yes. All the files are opened.
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/131/000000015-000000002.tsm id=0 duration=8.091ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/160/000000020-000000002.tsm id=0 duration=32.630ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/150/000000019-000000002.tsm id=0 duration=36.753ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/180/000000019-000000002.tsm id=0 duration=44.064ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/140/000000020-000000002.tsm id=0 duration=150.393ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/210/000000020-000000002.tsm id=0 duration=180.253ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/170/000000020-000000002.tsm id=0 duration=81.524ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/190/000000020-000000002.tsm id=0 duration=634.996ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/200/000000019-000000002.tsm id=0 duration=312.513ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/230/000000019-000000002.tsm id=0 duration=228.602ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/250/000000020-000000002.tsm id=0 duration=168.714ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000006-000000001.tsm id=5 duration=95.667ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/220/000000020-000000002.tsm id=0 duration=552.318ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000003-000000001.tsm id=2 duration=57.136ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000002-000000001.tsm id=1 duration=135.037ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000004-000000001.tsm id=3 duration=63.056ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000001-000000001.tsm id=0 duration=195.773ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/260/000000005-000000001.tsm id=4 duration=54.510ms
service=filestore path=/var/lib/influxdb/engine/data/7400f96f21ce1327/autogen/240/000000020-000000002.tsm id=0 duration=445.985ms
Early this morning the latest *-000000001.tsm files got compacted into a single *-000000002.tsm. Since then, new data has been written into *.wal file(s). Currently two of them exist: one is 10 MB in size and was last updated about 3 hours ago; the other is 6 MB and growing.
BTW, the parameters you mentioned are all at their defaults.
If I may summarize:
- All your data appears to be present in the file system after a reboot, because the files are unchanged or larger.
- Your reboots should gracefully terminate influxd
- The data loss is sometimes recent data, and sometimes older data
- In one case, older data were purged, but the most recent 2 days were kept
- In other cases, recent data was lost, from the most recent 2 hours to the most recent week, but older data were kept.
- Sometimes no data is lost
The symptoms certainly seem varied and odd. Perhaps try running influxd inspect report-tsm and see if the results differ before and after a reboot, and if the reported time range of the files has changed. Depending on the size of your instance, there may be other influxd inspect commands you can run that produce additional diagnostics.
With the wide differences in symptoms, it's hard to think of a single cause for this.
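A minimal sketch of that before/after comparison with report-tsm, redirecting to files so the two runs can be diffed:

```shell
# Capture the TSM report before the reboot...
influxd inspect report-tsm > report-before.txt
# ...reboot, then capture it again and compare.
influxd inspect report-tsm > report-after.txt
# Note: the Load Time column will always differ; focus on the file list
# and the Min/Max time ranges.
diff report-before.txt report-after.txt
```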
The 3rd point is not correct. I always lose old data. How much recent data is kept varies, but 2-3 days is 'normal'. 'influxd inspect report-tsm' gives this:
DB RP Shard File Series New Min Time Max Time Load Time
7400f96f21ce1327 autogen 131 000000015-000000002.tsm 567 567 2024-04-16T13:01:22.617446002Z 2024-04-21T23:59:50Z 248.818µs
7400f96f21ce1327 autogen 140 000000020-000000002.tsm 578 11 2024-04-22T00:00:00Z 2024-04-28T23:59:50Z 182.284µs
7400f96f21ce1327 autogen 150 000000019-000000002.tsm 586 8 2024-04-29T00:00:00Z 2024-05-05T23:59:50Z 181.991µs
7400f96f21ce1327 autogen 160 000000020-000000002.tsm 578 0 2024-05-06T00:00:00Z 2024-05-12T23:59:50Z 185.042µs
7400f96f21ce1327 autogen 170 000000020-000000002.tsm 567 0 2024-05-13T00:00:00Z 2024-05-19T23:59:50Z 265.961µs
7400f96f21ce1327 autogen 180 000000019-000000002.tsm 578 0 2024-05-20T00:00:00Z 2024-05-26T23:59:57.60829245Z 204.708µs
7400f96f21ce1327 autogen 190 000000020-000000002.tsm 589 11 2024-05-27T00:00:00Z 2024-06-02T23:59:50Z 155.707µs
7400f96f21ce1327 autogen 200 000000019-000000002.tsm 578 0 2024-06-03T00:00:00Z 2024-06-09T23:59:50Z 123.33µs
7400f96f21ce1327 autogen 210 000000020-000000002.tsm 578 0 2024-06-10T00:00:00Z 2024-06-16T23:59:50Z 201.546µs
7400f96f21ce1327 autogen 220 000000020-000000002.tsm 578 0 2024-06-17T00:00:00Z 2024-06-23T23:59:50Z 144.605µs
7400f96f21ce1327 autogen 230 000000019-000000002.tsm 578 0 2024-06-24T00:00:00Z 2024-06-30T23:59:57.16992822Z 123.592µs
7400f96f21ce1327 autogen 240 000000020-000000002.tsm 578 0 2024-07-01T00:00:00Z 2024-07-07T23:59:50Z 121.468µs
7400f96f21ce1327 autogen 250 000000020-000000002.tsm 578 0 2024-07-08T00:00:00Z 2024-07-14T23:59:50Z 132.125µs
7400f96f21ce1327 autogen 260 000000009-000000002.tsm 578 0 2024-07-15T00:00:00Z 2024-07-18T01:40:40Z 121.943µs
7400f96f21ce1327 autogen 260 000000011-000000001.tsm 578 0 2024-07-18T01:40:47.93204727Z 2024-07-18T10:47:20Z 47.504µs
7400f96f21ce1327 autogen 260 000000012-000000001.tsm 570 0 2024-07-18T10:47:29.823678733Z 2024-07-18T19:50:50Z 52.851µs
7400f96f21ce1327 autogen 260 000000013-000000001.tsm 578 0 2024-07-18T19:51:00Z 2024-07-19T05:16:40Z 60.502µs
Summary: Files: 17
Time Range: 2024-04-16T13:01:22.617446002Z - 2024-07-19T05:16:40Z
Duration: 2248h15m17.382553998s
So theoretically there should be data available from midday 2024-04-16 to now. But the data shown starts at 2024-07-15T00:00:00. So it looks like only the last shard (260) gets presented and the older files are ignored.
I wrote the third point, that sometimes no data was lost, because you had written:
In some situations (not always) recent data is lost.
In any event, we see the shards being opened.
- If you are using the TSI index, try influxd inspect dump-tsi using the various flags to verify that the data you think is present is indexed. Skip this if you are using an in-mem index.
- Use influxd inspect verify-tsm to check that your TSM files are all valid.
- Use influxd inspect dump-tsm to see if the data is present (be careful which flags you use; this can output a lot of data). A sketch of these invocations follows the list.
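A rough sketch of those checks; the required flags (engine or file paths) vary between versions, so treat these invocations as assumptions and consult influxd inspect <subcommand> --help first:

```shell
# Check TSM file integrity.
influxd inspect verify-tsm

# Dump the TSI index and TSM contents; both can be extremely verbose,
# so redirect to files and filter afterwards.
influxd inspect dump-tsi > tsi-dump.txt
influxd inspect dump-tsm > tsm-dump.txt
```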
I copied the few remaining data over to a newly created bucket. Then I tried to delete the old one, which did not work from the WebUI. There was no reaction when confirming the click on the trashcan icon. It was, however, possible to delete the bucket from the command line, but I had to supply the 'Auth Token' as a parameter to 'influx bucket delete'; otherwise the command would hang indefinitely without any error message. After that procedure, it appears to work. Will need more time to verify.
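For anyone who hits the same hang, a sketch of the command-line deletion with an explicit token; the bucket name, organization, and token variable here are placeholders:

```shell
# Delete the broken bucket, passing the API token explicitly
# (without it, the command hung with no error message).
influx bucket delete \
  --name my-corrupt-bucket \
  --org my-org \
  --token "$INFLUX_TOKEN"
```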
Unfortunately, another issue showed up recently: whenever I update influxdb2 to any version newer than 2.7.6, some Flux queries take ages to complete. I guess I should open a new ticket for that.
After several weeks of testing I would say that the problem has gone away. To wrap it up:
- for some unknown reason a bucket got corrupted
- this corruption caused any operation on this bucket to become extremely slow or even hang indefinitely
- influxdb2 seems to process shutdown for each bucket sequentially in creation order
- thus, the problem with this single bucket prevented proper shutdown for any other bucket created later
- buckets created before the bad one were unaffected
- removing the corrupted bucket forcefully made things work again
I'll close this ticket now. Thanks for your attention!