
Unjustified "No space left on device" error

Open undercover87 opened this issue 1 year ago • 3 comments

My Environment

  • ArangoDB Version: 3.11.8
  • Deployment Mode: Single Server
  • Deployment Strategy: Kubernetes
  • Configuration:
  • Infrastructure: own
  • Operating System: Alpine Linux 3.16.9
  • Total RAM in your machine: 128 GB
  • Disks in use:
  • Used Package: Docker - official Docker library

Component, Query & Data

Affected feature: ArangoDB

Size of your Dataset on disk: dozens of GB uncompressed

Steps to reproduce

Not sure

Problem: I'm getting a "No space left on device" error occasionally, but nearly daily, while actively using the db (inserting/deleting data). When I check, the storage space isn't full. I also see that some .sst files and .log files (in the journals folder) grow really big, up to 4.6 GB. In any case, I increased the storage space and still see the error.

We've been running this same setup for a while and never got this error in the past. I'm not sure what triggers it.

Some of the db logs that are suspicious:

ERROR [fae2c] {rocksdb} RocksDB encountered a background error during a compaction operation: IO error: No space left on device: While appending to file: /var/lib/arangodb3/engine-rocksdb/002971.sst: No space left on device; The database will be put in read-only mode, and subsequent write errors are likely. It is advised to shut down this instance, resolve the error offline and then restart it.

This one is from another day:

2024-12-10T10:18:51Z [1] ERROR [a5ba8] {engines} unable to apply revision tree removals for _system/_statisticsRaw: Tried to remove key that is not present.
2024-12-10T10:18:51Z [1] ERROR [fdfa7] {engines} unable to apply revision tree updates for _system/_statisticsRaw: Tried to remove key that is not present.
2024-12-10T10:18:51Z [1] WARNING [33691] {engines} _system/_statisticsRaw: caught exception during revision tree serialization: Tried to remove key that is not present. (exception location: /work/ArangoDB/arangod/RocksDBEngine/RocksDBMetaCollection.cpp:1533). Please report this error to arangodb.com

Expected result:

undercover87 avatar Dec 12 '24 15:12 undercover87

Hi, please note that most Linux filesystems keep a space reserve for the root user. You could use tune2fs to adjust that amount: https://askubuntu.com/questions/1014132/can-i-make-the-system-reserve-for-partitions-smaller However, please note that once the disk is completely full, SSH access may stop working.
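For example (on an ext2/3/4 filesystem; /dev/sdX is a placeholder, substitute your actual device), the root reserve could be inspected and lowered roughly like this:

```
# Show the current reserved block count for the root user
tune2fs -l /dev/sdX | grep -i 'reserved block'

# Lower the reserve from the default 5% to 1% of the filesystem
tune2fs -m 1 /dev/sdX
```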

Please note as well that deleted files keep their disk space as long as a process holds an open handle on them. So df -h may show you free space while some big, already-deleted file is on the edge of its way to nirvana, but still kept around (and still occupying disk space) by a running process.
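This effect is easy to demonstrate; here is a minimal sketch (Python, assuming a POSIX filesystem) showing that an unlinked file's space survives until the last open handle is closed:

```python
import os
import tempfile

# Create a 1 MiB file and keep the handle open.
f = tempfile.NamedTemporaryFile(delete=False)
f.write(b"x" * 1024 * 1024)
f.flush()
path = f.name

os.unlink(path)                  # the directory entry is gone...
assert not os.path.exists(path)

# ...but the inode and its blocks stay allocated, because
# our still-open handle keeps them alive.
size = os.fstat(f.fileno()).st_size
print(size)

f.close()                        # only now is the space actually freed
```

On a real server, `lsof +L1` lists open files whose link count is zero, i.e. deleted files that are still occupying disk space.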

If you have high write/delete rates, you may want to work with the WAL configuration - see https://docs.arangodb.com/3.12/components/arangodb-server/options/ for more details.

dothebart avatar Dec 12 '24 19:12 dothebart

Alright, so to sum up:

With tune2fs: is the idea to increase the reserved space, so that ArangoDB can continue doing basic tasks even once the storage space is depleted? (My understanding is that currently, once the storage space is depleted, ArangoDB can't even flush/delete older WAL files, and so it can't recover.) EDIT: unless you simply meant reducing the reserved space to gain some storage space.

For the WAL configuration, the idea would be to:

  • accelerate the flushing of wal files say with --rocksdb.auto-flush-check-interval
  • accelerate the deletion of unused wal files with --rocksdb.wal-file-timeout and/or --rocksdb.wal-file-timeout-initial
  • and put a limit to the size of the wal archives with --rocksdb.wal-archive-size-limit

do I get it right?
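For reference, those options would be passed at server startup; a sketch with purely illustrative values (not recommendations - check the defaults in the documentation first):

```
arangod \
  --rocksdb.auto-flush-check-interval 30 \
  --rocksdb.wal-file-timeout 10 \
  --rocksdb.wal-file-timeout-initial 60 \
  --rocksdb.wal-archive-size-limit 2147483648
```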

Also, another question: if I make the storage space "large enough", will this solve the issue?

undercover87 avatar Dec 13 '24 14:12 undercover87

I meant reducing the reserved space - or at least being aware that it exists. Your monitoring (Prometheus? collectd? Zabbix? Grafana?) should show you something like a saw-tooth pattern of used disk space.

You can either lengthen the time available for the ramp by adding more storage space, or shorten the ramping time by running compaction more frequently. Please note that compaction uses disk & CPU resources and may reduce the system resources available to normal operations.

dothebart avatar Dec 13 '24 16:12 dothebart