Fluree crash leaves Athens in bad state
Problem
Fluree crashes with OOM on a 4GB AWS instance (t1a.medium) with a smallish graph, and Athens (apparently) doesn't try to reconnect. docker-compose apparently doesn't try to bring it back up automatically either. docker-compose restart fixes the problem.
Granted, I'm using docker-compose up -d athens to avoid using nginx, so that may have something to do with it, but it's doubtful.
Screenshots/Demo
fluree_1 | 2022-01-25 18:54:20,953 ERROR f.db.ledger.transact - Fatal error, after an error processing a block an unexpected error happened trying to remove the involved transactions from raft state: ("503d895ee4aed8a0dc1d0e0a918f36e633ada861510d0343af8ecca23d684d28") - clojure.lang.ExceptionInfo: Command timed out.\n at fluree.raft.events$register_callback_event$fn__64081$state_machine__5237__auto____64094$fn__64097.invoke(events.clj:130)\n at fluree.raft.events$register_callback_event$fn__64081$state_machine__5237__auto____64094.invoke(events.clj:122)\n at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:978)\n at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:977)\n at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:982)\n at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:980)\n at clojure.core.async$ioc_alts_BANG_$fn__5466.invoke(async.clj:421)\n at clojure.core.async$do_alts$fn__5405$fn__5408.invoke(async.clj:288)\n at clojure.core.async.impl.channels.ManyToManyChannel$fn__797.invoke(channels.clj:265)\n at clojure.lang.AFn.run(AFn.java:22)\n at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n at clojure.core.async.impl.concurrent$counted_thread_factory$reify__635$fn__636.invoke(concurrent.clj:29)\n at clojure.lang.AFn.run(AFn.java:22)\n at java.base/java.lang.Thread.run(Thread.java:829)\n
fluree_1 | 2022-01-25 18:54:55,705 INFO fluree.db.server - SHUTDOWN Start -
fluree_1 | 2022-01-25 18:55:26,356 INFO fluree.db.ledger.stats - Memory: {"used":"0.7 GB","committed":"1.6 GB","max":"2.0 GB","init":"1.0 GB","time":"2022-01-25T18:55:26.204571Z"} -
...
fluree_1 | 2022-01-25 19:01:23,192 INFO fluree.db.ledger.stats - Group state: {"version":3,"leases":{"servers":{"myserver":{"id":"myserver","expire":1643137282289}}},"_work":{"networks":{"events":"myserver"}},"networks":{"events":{"dbs":{"log":{"status":"ready","block":1147,"index":742,"indexes":{"1":1642711235209,"353":1642803692749,"742":1642832535977}}}}}} -
fluree_1 | #
fluree_1 | # There is insufficient memory for the Java Runtime Environment to continue.
fluree_1 | # Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
fluree_1 | # An error report file with more information is saved as:
fluree_1 | # /opt/fluree/hs_err_pid1.log
fluree_1 | [thread 52 also had an error]
fluree_1 | Java version 11.
...
athens_1 | 19:04:40.676 WARN [async-dispatch-3] fluree.db.util.log - "Server contact error: " "xhttp error - http://fluree:8090/fdb/health - Don't know how to convert into class java.lang.String" {:url "http://fluree:8090/fdb/health", :error :xhttp/unknown-error}
athens_1 | 19:05:48.132 WARN [async-dispatch-4] fluree.db.util.log - "Connection has gone stale. Perhaps network conditions are poor. Disconnecting socket."
...
fluree_1 | 2022-01-25 19:04:43,188 INFO fluree.db.server - JVM arguments: {:jvm "OpenJDK 64-Bit Server VM", :input ["-Xmx2g" "-Xms1g" "-XX:+UseG1GC" "-XX:MaxGCPauseMillis=50" "-Dfdb-storage-file-root=/var/lib/fluree/" "-Dfdb-group-log-directory=/var/lib/fluree/group/" "-Dfdb.properties.file=./fluree_sample.properties" "-Dfdb.log.ansi" "-Dlogback.configurationFile=./logback.xml"]} -
fluree_1 | 2022-01-25 19:04:43,202 INFO fluree.db.server - Memory Info: {:used 0.3 GB, :committed 1.7 GB, :max 2.0 GB, :init 1.0 GB, :time 2022-01-25T19:04:43.194182Z} -
# While SSH'ed into the machine
curl localhost:3010
# =>
# curl: (7) Failed to connect to localhost port 3010: Connection refused
Athens Version v2.0.0-beta.12
I think what's happening here is:
- the java process in the fluree container says it doesn't have enough memory, and kills itself
- the athens process in the athens container tries to connect to fluree, but can't, and just hangs there indefinitely
- the fluree container tries to restart its java process continuously, maybe succeeding, maybe failing
- the athens process doesn't try to connect again
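For illustration, the missing behaviour in the last bullet (trying to connect again) is usually a retry loop with exponential backoff. The sketch below is not Athens code (Athens is written in Clojure) and does not use any Fluree API; `connect_with_retry`, its parameters, and the `connect` callable are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional


def connect_with_retry(
    connect: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call `connect` until it succeeds or `max_attempts` is reached.

    `connect` is a hypothetical callable that returns a connection object
    or raises ConnectionError. The wait between attempts doubles each time,
    capped at `max_delay` seconds.
    """
    delay = base_delay
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            if attempt == max_attempts:
                break  # no point sleeping after the final attempt
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

A client built this way would recover on its own once the database container is healthy again, instead of needing a manual docker-compose restart.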
We've seen a similar problem in our server when we were indeed out of memory due to other things running in the background. So I think the way forward for you is to either increase the memory on that server (we use 8gb in ours, but we have more data too I think), or to check if there's something else eating up the memory in that server.
Makes total sense and lines up with what I was seeing. There’s nothing else on the server so we’ll have to bump memory. Strange though, because not only are we not running nginx, but we don’t have much data yet. I suppose 2GB each for fluree and Athens is pretty paltry for Clojure.
To be honest, it really surprises me that you're running into memory problems on a small graph.
Our team graph only showed those problems after several months of use, and because we were using several gigs of memory in other background processes in that machine.
Fluree itself needs about 1gb to run (but this can be adjusted I think) and the Athens server needs about 2gb (we've spent 0 effort optimising this yet).
Also surprised! Yeah, we don't have any background processes other than those required to run Linux and Docker. I wonder if it's either 1) there's native memory used in addition to heap memory, or 2) because htop says there's 3.8 GB of total memory, not 4GB. grep MemTotal /proc/meminfo says 3989320 kB, which seems right. Perhaps we can try giving -Xmx a buffer value of, say, 10%.
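To make the arithmetic concrete, here is a rough sketch of the numbers in this thread. It assumes one reading of the 10% buffer: keep the sum of the heap limits under 90% of MemTotal. The 2 GB Fluree heap comes from the -Xmx2g in the log above, and the 2 GB for Athens is the estimate given earlier in the thread; the function names are made up for this example.

```python
def parse_memtotal_kib(meminfo_line: str) -> int:
    """Extract the number from a line like 'MemTotal:        3989320 kB'."""
    return int(meminfo_line.split()[1])


def kib_to_gib(kib: int) -> float:
    """Convert kibibytes (the 'kB' unit in /proc/meminfo) to gibibytes."""
    return kib / (1024 * 1024)


def heaps_fit(total_kib: int, heaps_gib: list, buffer_fraction: float = 0.10) -> bool:
    """True if the summed heap limits fit in total memory minus the buffer.

    This only counts heap. Each JVM also uses native memory on top of -Xmx,
    so a True result is a necessary condition, not a guarantee.
    """
    budget_gib = kib_to_gib(total_kib) * (1 - buffer_fraction)
    return sum(heaps_gib) <= budget_gib


total = parse_memtotal_kib("MemTotal:        3989320 kB")
print(f"{kib_to_gib(total):.1f} GiB")   # prints "3.8 GiB", matching htop
print(heaps_fit(total, [2.0, 2.0]))     # prints "False": 4.0 GiB of heap > ~3.4 GiB budget
```

So the 3.8 GB in htop is the same figure as MemTotal expressed in GiB, and two 2 GB heaps already exceed the machine's memory before any native allocation or OS overhead is counted.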
My server happened to crash; I tried the solutions listed but fluree keeps being unhealthy, and I hadn't set up the backup yet. At this point it appears the backup utility can't connect to the fluree database and can't produce the backup. Is my data lost forever?