OOM when deleting old snapshots from an S3 repository
CrateDB version
4.8.1
CrateDB setup information
CrateDB Cloud CR0 instance - 2 vCPU, 2 GiB RAM, 4 GiB storage.
CRATE_HEAP_SIZE: 512m
CRATE_JAVA_OPTS="-Dcom.sun.management.jmxremote.port=6666 -Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.rmi.port=6666 -Djava.rmi.server.hostname=127.0.0.1
-javaagent:/var/lib/crate/crate-jmx-exporter-1.0.0.jar=7071 -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/resource/heapdump -Dlog4j2.formatMsgNoLookups=true"
Steps to Reproduce
This particular database does not have a lot of data - 27k records, about 600MiB storage used - but has a few very wide tables with 1000 columns.
Can provide snapshot with all the data to restore. Can provide heap dump.
Both too large to attach to ticket.
Expected Result
Creating a snapshot succeeds, but then DROP SNAPSHOT fails with an OOM after hammering the GC for some time.
It should probably circuit-break and fail the DROP instead?
Actual Result
OOM
@SStorm Sorry for the delay, this slipped through. Could you provide me the heap dump please? And also access to the snapshot if possible. Thank you!
Snapshot and heap dump shared privately.
After looking into the heap dump and the provided logs, we found that OOM exceptions were caused not only by snapshot deletion but also by regular writes, because this node was under very high memory pressure. It runs with only 512MB of heap, which is below our recommended minimum of 1GB for anything beyond very simple evaluation or minimal usage. Tuning our circuit breaker logic to be more accurate in its estimation could take a lot of effort, which we think isn't worth it for this rather unusual scenario. Instead, we advise increasing the heap to at least 1GB.
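For reference, a minimal sketch of what that change could look like, assuming a deployment where the environment of the CrateDB process is under your control. It reuses the CRATE_HEAP_SIZE variable from the setup information above and only changes the value from 512m to 1g, which is half of the 2 GiB RAM on the reported instance. How the heap is adjusted on a managed CrateDB Cloud instance is not covered in this thread.

```shell
# Sketch only: raise the heap from the reported 512m to the advised minimum of 1g.
# Assumption: this is set in the environment before the CrateDB process starts.
export CRATE_HEAP_SIZE=1g
```

The other options in CRATE_JAVA_OPTS, including the heap dump flags, can stay as they are.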