orientdb
Full sync fails with "Unexpected end of ZLIB stream"
OrientDB Version: 3.2.10
Java Version: openjdk version "11.0.16.1"
OS: alpine (running in OpenShift)
We have a cluster running with 3 master instances. The database consists of about 1800 files with a total size of 16GB. When another instance (a replica) with an empty database joins the cluster, a full sync is started to replicate the database.
Expected behavior
The full sync should succeed.
Actual behavior
The full sync runs for a while, and then seems to get "stuck" (no log output for some time), after which it fails with the exception
Unexpected end of ZLIB stream
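For illustration only: the JDK raises this kind of error when a compressed stream ends before the inflater has seen its end marker, which would be consistent with a transfer that is cut off part way through. The sketch below is not OrientDB's sync code; the class name `TruncatedZlibDemo` and the 1 MB random payload are assumptions chosen for the demo. It compresses a buffer, cuts the compressed bytes in half, and reports what happens when both versions are inflated.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class, not part of OrientDB.
class TruncatedZlibDemo {

  /** Compresses the given bytes into a complete ZLIB stream. */
  static byte[] compress(byte[] data) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
      out.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** Inflates the whole stream; returns null on success or the EOFException message. */
  static String inflateAndReport(byte[] compressed) {
    try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      byte[] chunk = new byte[4096];
      while (in.read(chunk) != -1) {
        // Discard the inflated bytes; only the end-of-stream behaviour matters here.
      }
      return null;
    } catch (EOFException e) {
      // Raised when the underlying stream ends before the compressed data is complete.
      return e.getMessage();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] payload = new byte[1 << 20];
    new Random(42).nextBytes(payload);
    byte[] complete = compress(payload);
    // Simulate a transfer that stops half way through.
    byte[] truncated = Arrays.copyOf(complete, complete.length / 2);
    System.out.println("complete:  " + inflateAndReport(complete));
    System.out.println("truncated: " + inflateAndReport(truncated));
  }
}
```

If the truncated case reports an EOF error while the complete one does not, the exception is more likely a symptom of the stream being cut short than of corrupt data on the sending side.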
Not an answer for why this happens, but in my experience the sync behaviour with the enterprise agent (which is now open source) installed is far more robust. For a start it will do incremental syncs, but it's also based on a log structured incremental backup rather than a (frankly quite scary) streaming of a full backup zip file across the network. It might be more fruitful investing in adding the agent to your deploys and testing that approach (it will also change the backup process).
@timw thanks for the hint. I tried adding the enterprise agent by copying the jar file into the OrientDB plugins folder. When starting the server, OrientDB tries to install it as a dynamic plugin, but fails to do so with the following error:
2023-03-28 12:57:50:278 INFO Installing dynamic plugin 'agent.jar'... [OServerPluginManager]
2023-03-28 13:00:01:191 SEVER Error on installing dynamic plugin 'enterprise-agent' [OServerPluginManager]
java.lang.ClassNotFoundException: com.orientechnologies.agent.OEnterpriseAgent
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:315)
at com.orientechnologies.orient.server.plugin.OServerPluginManager.startPluginClass(OServerPluginManager.java:265)
at com.orientechnologies.orient.server.plugin.OServerPluginManager.installDynamicPlugin(OServerPluginManager.java:378)
at com.orientechnologies.orient.server.plugin.OServerPluginManager.updatePlugin(OServerPluginManager.java:200)
at com.orientechnologies.orient.server.plugin.OServerPluginManager.updatePlugins(OServerPluginManager.java:305)
at com.orientechnologies.orient.server.plugin.OServerPluginManager.startup(OServerPluginManager.java:91)
...
The problem seems to be that OServerPluginManager tries to load the plugin class (com.orientechnologies.agent.OEnterpriseAgent) without adding the jar file to the classpath.
Any hints on how to fix this? We are using OrientDB 3.2.10 embedded in our own application.
Hi,
Regarding the Unexpected end of ZLIB stream error: this usually happens when the sync fails. The server should then try to restart the sync. If you can reproduce this, it would be useful to have thread dumps taken while the server is stuck.
As for the agent.jar error, that looks strange, like a corrupted jar.
Regards
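Since OrientDB is embedded in the reporter's own application, one way to capture the requested thread dumps is from inside the JVM. The sketch below uses only the standard `java.lang.management` API; the class name `ThreadDumpSketch` is hypothetical, and how it would be triggered (a timer, an admin endpoint, a signal) is left open. Where the JDK tools are available in the container, `jstack <pid>` or `jcmd <pid> Thread.print` would give a similar result.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of OrientDB.
class ThreadDumpSketch {

  /** Returns a textual dump of all live threads with their full stack traces. */
  static String dumpAllThreads() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // Only request lock details when the JVM supports them.
    boolean monitors = threads.isObjectMonitorUsageSupported();
    boolean synchronizers = threads.isSynchronizerUsageSupported();

    StringBuilder dump = new StringBuilder();
    for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
      dump.append('"').append(info.getThreadName()).append("\" ")
          .append(info.getThreadState());
      if (info.getLockName() != null) {
        // The object this thread is blocked or waiting on, if any.
        dump.append(" on ").append(info.getLockName());
      }
      dump.append('\n');
      for (StackTraceElement frame : info.getStackTrace()) {
        dump.append("    at ").append(frame).append('\n');
      }
      dump.append('\n');
    }
    return dump.toString();
  }

  public static void main(String[] args) {
    System.out.println(dumpAllThreads());
  }
}
```

Taking two or three dumps a few seconds apart while the sync appears stuck makes it easier to see which threads are not making progress.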
Hi,
I could not (yet) reproduce this locally, so I cannot provide thread dumps. The database is rather big, so the full sync consists of about 410 chunks of 8MB each. When the sync succeeds, it takes about 16 minutes. We can see that the sync is restarted if it fails, but in most cases it fails again and everything starts all over.
Regarding the agent.jar: I used the one from Maven Central (OrientDB version 3.2.10), so I think it is unlikely that it is corrupted. However, I do not understand how loading the class is supposed to work:
Here, a classloader is created, that will load classes from the agent.jar: https://github.com/orientechnologies/orientdb/blob/2486dd95b4df421b5de9a2e773e3da03928fe027/server/src/main/java/com/orientechnologies/orient/server/plugin/OServerPluginManager.java#L323
But that classloader is not used here, when the class should be loaded: https://github.com/orientechnologies/orientdb/blob/2486dd95b4df421b5de9a2e773e3da03928fe027/server/src/main/java/com/orientechnologies/orient/server/plugin/OServerPluginManager.java#L378
Maybe I am missing something here?
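To make the concern concrete, here is a generic JDK sketch, not the OrientDB code. The class `PluginLoadingSketch`, its method names, and the default path `plugins/agent.jar` are assumptions for illustration. It contrasts the one-argument `Class.forName`, which only consults the caller's class loader, with a lookup through a `URLClassLoader` that has the jar on its search path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration, not the OServerPluginManager implementation.
class PluginLoadingSketch {

  /** Looks the class up through the caller's class loader only; the jar is never consulted. */
  static boolean visibleToDefaultLoader(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  /** Looks the class up through a loader that searches the given jar after its parent. */
  static boolean visibleThroughJarLoader(Path jar, String className) {
    ClassLoader parent = PluginLoadingSketch.class.getClassLoader();
    // The loader is closed here because this is only a visibility check;
    // a real plugin host would keep it open for the plugin's lifetime.
    try (URLClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, parent)) {
      // initialize=false: resolve the class without running its static initializers.
      Class.forName(className, false, loader);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Assumed defaults; pass the real jar path and class name as arguments.
    Path jar = Paths.get(args.length > 0 ? args[0] : "plugins/agent.jar");
    String className =
        args.length > 1 ? args[1] : "com.orientechnologies.agent.OEnterpriseAgent";
    System.out.println("default loader: " + visibleToDefaultLoader(className));
    System.out.println("jar loader:     " + visibleThroughJarLoader(jar, className));
  }
}
```

If the first lookup fails and the second succeeds for the same class name, the jar itself is fine and the `ClassNotFoundException` comes from which loader was asked.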
Hi,
Yep, that looks strange. I will double check.
Regards
Hi,
I changed the plugin loading to use the correct class loader; the fix is already released in 3.2.18. I am keeping this issue open for the other problem.
Hi @tglman, I will definitely have another go at the enterprise agent, so thanks for fixing the class loading problem.
Hi @npomaroli,
One thing you need to be aware of: if you are using Lucene indexes you may have problems, because the agent-based sync does not support them well yet.
Regards
@tglman - slightly tangential, but what is the issue with Lucene indexes and the enterprise agent sync? (We're exploring moving to the enterprise agent for the better sync/backup performance, but Lucene indexes are really core to our use).
Hi @timw,
The issue is that the enterprise agent sync uses the incremental backup as its underlying data extraction, and that backup is based on our paginated storage file management. The Lucene indexes do not use our storage file management; they use the standard Lucene files on the side, so in practice they are not included and need to be rebuilt. The base sync uses the base backup, which just zips the folder.