Online Blazegraph backup corrupts the blazegraph.jnl file

Open Valcyclovir opened this issue 8 months ago • 3 comments

Issue description

The Blazegraph Backup API (/blazegraph/backup) produces a corrupted blazegraph.jnl file when used for online backups while OTNode (OriginTrail V8 Node) is running. The backup file, generated with block=true and with or without compress=true, fails to restore properly, resulting in a java.lang.IllegalStateException: Invalid data checksum error when starting Blazegraph with the restored journal. This causes OTNode to fail with a "Cannot connect to Triple store" error and Blazegraph to return HTTP 503 Service Unavailable for SPARQL queries.

Expected behavior

The Backup API should produce a consistent, uncorrupted blazegraph.jnl file that can be restored to a functional Blazegraph instance, allowing OTNode to connect to the triple store and SPARQL queries to execute without errors.

Actual behavior

The backup file (blazegraph-backup.jnl or blazegraph-backup.jnl.gz) is corrupted. When used to replace the active blazegraph.jnl, Blazegraph fails to start, logging a java.lang.IllegalStateException: Invalid data checksum from address: 72130541568, size: 1104. OTNode reports "Cannot connect to Triple store (OtBlazegraph), repository: privateCurrent, located at: http://localhost:9999/ retry number: 2/10". SPARQL queries to http://localhost:9999/blazegraph/namespace/dkg/sparql return HTTP 503 Service Unavailable.

Steps to reproduce the problem

  1. Restart Blazegraph and OTNode:
systemctl restart blazegraph otnode
  2. Run the Backup API command:
BLAZE_URL="http://localhost:9999/blazegraph/backup?block=true&compress=true"
BLAZE_OUTPUT_FILE="/root/blazegraph-backup.jnl.gz"
curl -X POST --data-urlencode "file=${BLAZE_OUTPUT_FILE}" "${BLAZE_URL}"

Alternatively, do not compress:

BLAZE_URL="http://localhost:9999/blazegraph/backup?block=true"
BLAZE_OUTPUT_FILE="/root/blazegraph-backup.jnl"
curl -X POST --data-urlencode "file=${BLAZE_OUTPUT_FILE}" "${BLAZE_URL}"

If compressed, decompress the backup:

gunzip /root/blazegraph-backup.jnl.gz
  3. Stop Blazegraph and OTNode, then replace the active blazegraph.jnl with blazegraph-backup.jnl:
systemctl stop blazegraph otnode
mv /root/ot-node/blazegraph.jnl /root/ot-node/blazegraph.jnl.bak
mv /root/blazegraph-backup.jnl /root/ot-node/blazegraph.jnl
  4. Restart both services:
systemctl restart blazegraph
sleep 5s
systemctl restart otnode
  5. Observe the OTNode error: "Cannot connect to Triple store (OtBlazegraph), repository: privateCurrent, located at: http://localhost:9999/ retry number: 2/10".
  6. Run a SPARQL query:
curl -X POST http://localhost:9999/blazegraph/namespace/dkg/sparql -H "Content-Type: application/sparql-query" --data 'SELECT (COUNT(*) AS ?totalTriples) WHERE { ?s ?p ?o }'

Observe the response: HTTP 503 Service Unavailable.
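
When the compressed variant is used, it may help to rule out damage to the gzip stream itself before decompressing and restoring. The following is a minimal sketch, not part of the original report: the function name verify_gz_backup is hypothetical, and it relies only on the standard gzip -t integrity test. It can only detect a truncated or damaged archive; it cannot detect the journal-level checksum mismatch described in this issue, so a passing check does not mean the journal is restorable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a compressed backup before restoring it.
# This only validates the gzip container, not the Blazegraph journal inside.
verify_gz_backup() {
  local file="$1"

  # Reject a missing or zero-length file early.
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi

  # gzip -t reads the whole stream and verifies its CRC without writing output.
  if ! gzip -t "$file" 2>/dev/null; then
    echo "gzip stream is damaged: $file" >&2
    return 1
  fi

  echo "gzip stream OK: $file"
}
```

Possible usage with the path from the steps above: verify_gz_backup /root/blazegraph-backup.jnl.gz && gunzip /root/blazegraph-backup.jnl.gz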

Specifications

Node version: OriginTrail node v8.0.11
Platform: Ubuntu 24.04 LTS
Node wallet: 0xe5Cc7fd75E87fD26EB6557236FE29566365Ba267
Node libp2p identity: 37

Error logs

Blazegraph logs (after restoring backup and restarting):

May 14 17:33:15 othub3 java[11878]: ERROR: Banner.java:134: Could not resolve name for host: java.net.UnknownHostException: othub3: othub3: Name or service not known
May 14 17:33:15 othub3 java[11878]: WARN : Banner.java:136: Falling back to null
May 14 17:33:15 othub3 java[11878]: WARN : NanoSparqlServer.java:517: Starting NSS
May 14 17:33:15 othub3 java[11878]: WARN : WebAppContext.java:554: Failed startup of context o.e.j.w.WebAppContext@5b94b04d{Bigdata,/blazegraph,jar:file:/root/ot-node/blazegraph.jar!/war,UNAVAILABLE}{jar:file:/root/ot-node/blazegraph.jar!/war}
May 14 17:33:15 othub3 java[11878]: java.lang.RuntimeException: java.lang.RuntimeException: addr=-19608250 : cause=java.lang.IllegalStateException: Invalid data checksum from address: 72130541568, size: 1104
May 14 17:33:15 othub3 java[11878]: at com.bigdata.rdf.sail.webapp.BigdataRDFServletContextListener.openIndexManager(BigdataRDFServletContextListener.java:816)
...
Caused by: java.lang.IllegalStateException: Invalid data checksum from address: 72130541568, size: 1104
May 14 17:33:15 othub3 java[11878]: at com.bigdata.rwstore.RWStore.getData(RWStore.java:2378)
...
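
To check quickly whether a restored journal hits the same failure, the checksum errors can be filtered out of the service log. This is a rough sketch rather than anything from the original report: the function name checksum_errors is hypothetical, and it assumes the log text contains the same "Invalid data checksum from address: ..., size: ..." wording shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print each distinct
# "Invalid data checksum" error (address and size) exactly once.
checksum_errors() {
  grep -oE 'Invalid data checksum from address: [0-9]+, size: [0-9]+' | sort -u
}
```

Assuming Blazegraph runs as a systemd unit named blazegraph, as the systemctl commands in the steps suggest, it could be fed from the journal: journalctl -u blazegraph --no-pager | checksum_errors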

SPARQL query response:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 503 Service Unavailable</title>
</head>
<body><h2>HTTP ERROR 503</h2>
<p>Problem accessing /blazegraph/namespace/dkg/sparql. Reason:
<pre>    Service Unavailable</pre></p><hr><a href="http://eclipse.org/jetty">Powered by Jetty:// 9.4.z-SNAPSHOT</a><hr/>
</body>
</html>

OTNode error:

Cannot connect to Triple store (OtBlazegraph), repository: privateCurrent, located at: http://localhost:9999 retry number: 2/10

Disclaimer

Please be aware that the issue reported on a public repository allows everyone to see your node logs, node details, and contact details. If you have any sensitive information, feel free to share it by sending an email to [email protected].

Valcyclovir, May 15 '25 16:05

Thanks @Valcyclovir for the detailed submission.

@Mihajlo-Pavlovic @marko03kostic @ilijaMar let's hop on this one asap

branarakic, May 16 '25 09:05

I also had no success with the online Blazegraph backup process. It produced a corrupted journal file that could not be used (Blazegraph complained about corruption / an invalid checksum during startup).

botnumberseven, Jun 03 '25 19:06

I could not get the online Blazegraph backup to work, so I'm now using a ZFS snapshot-based backup. blazegraph.jnl is stored on a ZFS partition. To create a backup:

  1. stop node, stop blazegraph
  2. create zfs snapshot (almost instantly)
  3. start blazegraph, start node
  4. backup snapshot to another VPS with zfs partition

Node downtime is measured in seconds. ZFS has its own specifics though, like built-in compression (good) and higher CPU use (not so good).
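
The four steps above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the commenter's actual script: the dataset names tank/otnode and backup/otnode, the host backup-vps, and the helper names run and snapshot_backup are all hypothetical, and the service names are taken from the systemctl commands in the issue. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Hypothetical names; adjust to the actual pool, dataset, and backup host.
DATASET="${DATASET:-tank/otnode}"               # dataset holding blazegraph.jnl
REMOTE_HOST="${REMOTE_HOST:-backup-vps}"        # second VPS with a ZFS pool
REMOTE_DATASET="${REMOTE_DATASET:-backup/otnode}"
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print commands only

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

snapshot_backup() {
  local snap rc=0
  snap="${DATASET}@backup-$(date +%Y%m%d-%H%M%S)"

  # 1. Stop the node, then Blazegraph, so the journal is quiescent.
  run systemctl stop otnode
  run systemctl stop blazegraph

  # 2. Take the snapshot; remember a failure but still restart services.
  run zfs snapshot "$snap" || rc=$?

  # 3. Start Blazegraph first, then the node.
  run systemctl start blazegraph
  run systemctl start otnode

  if [ "$rc" -ne 0 ]; then
    echo "snapshot failed, nothing to send" >&2
    return "$rc"
  fi

  # 4. Copy the snapshot to the other machine (full send; an incremental
  #    send would use "zfs send -i <previous-snapshot>").
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ zfs send $snap | ssh $REMOTE_HOST zfs receive -u $REMOTE_DATASET"
  else
    zfs send "$snap" | ssh "$REMOTE_HOST" zfs receive -u "$REMOTE_DATASET"
  fi
}
```

Calling snapshot_backup with the default DRY_RUN=1 only prints the planned commands; setting DRY_RUN=0 would actually stop the services and take the snapshot, so it is worth reviewing the printed plan first.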

botnumberseven, Jun 11 '25 13:06