Full sync fails with "Unexpected end of ZLIB stream"

Open npomaroli opened this issue 1 year ago • 10 comments

OrientDB Version: 3.2.10

Java Version: openjdk version "11.0.16.1"

OS: alpine (running in OpenShift)

We have a cluster running with 3 master instances. The database consists of about 1800 files with a total size of 16GB. When another instance (a replica) with an empty database joins the cluster, a full sync is started to replicate the database.

Expected behavior

The full sync should succeed.

Actual behavior

The full sync runs for a while, and then seems to get "stuck" (no log output for some time), after which it fails with the exception

Unexpected end of ZLIB stream

npomaroli avatar Mar 24 '23 13:03 npomaroli
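
For readers unfamiliar with the message: it comes from the JDK's java.util.zip classes, not from OrientDB itself, and is raised when a compressed stream ends before the decompressor has seen all the data it expects. The following is a minimal sketch that reproduces the message with an in-memory zip archive cut in half. The class and method names (TruncatedZipDemo, buildZip, readAll) are made up for illustration, and this is not how OrientDB streams its backup; it only shows what a reader sees when the tail of a zip stream never arrives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of OrientDB.
public class TruncatedZipDemo {

  // Builds an in-memory zip archive holding one entry of pseudo-random bytes.
  static byte[] buildZip(int payloadSize) {
    byte[] payload = new byte[payloadSize];
    new Random(42).nextBytes(payload);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      zip.putNextEntry(new ZipEntry("data.bin"));
      zip.write(payload);
      zip.closeEntry();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads every entry to its end. Returns null on success,
  // otherwise the exception type and message.
  static String readAll(byte[] archive) {
    byte[] buffer = new byte[8192];
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
      while (zip.getNextEntry() != null) {
        while (zip.read(buffer) != -1) {
          // Discard the data; only reaching the end of the entry matters here.
        }
      }
      return null;
    } catch (IOException e) {
      return e.getClass().getSimpleName() + ": " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    byte[] full = buildZip(64 * 1024);
    // Simulate a transfer that stopped halfway through the archive.
    byte[] truncated = Arrays.copyOf(full, full.length / 2);
    System.out.println("full archive:      " + readAll(full));
    System.out.println("truncated archive: " + readAll(truncated));
  }
}
```

If this matches what happens during the sync, the exception would be a symptom of the receiving side running out of input (for example because the sender stopped or the connection was closed) rather than of corrupt data as such.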

Not an answer for why this happens, but in my experience the sync behaviour with the enterprise agent (which is now open source) installed is far more robust. For a start it will do incremental syncs, but it's also based on a log structured incremental backup rather than a (frankly quite scary) streaming of a full backup zip file across the network. It might be more fruitful investing in adding the agent to your deploys and testing that approach (it will also change the backup process).

timw avatar Mar 27 '23 22:03 timw

@timw thanks for the hint. I tried adding the enterprise agent by copying the jar file into the OrientDB plugins folder. When starting the server, OrientDB tries to install it as a dynamic plugin, but fails to do so with the following error:

2023-03-28 12:57:50:278 INFO  Installing dynamic plugin 'agent.jar'... [OServerPluginManager]
2023-03-28 13:00:01:191 SEVER Error on installing dynamic plugin 'enterprise-agent' [OServerPluginManager]
java.lang.ClassNotFoundException: com.orientechnologies.agent.OEnterpriseAgent
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:315)
	at com.orientechnologies.orient.server.plugin.OServerPluginManager.startPluginClass(OServerPluginManager.java:265)
	at com.orientechnologies.orient.server.plugin.OServerPluginManager.installDynamicPlugin(OServerPluginManager.java:378)
	at com.orientechnologies.orient.server.plugin.OServerPluginManager.updatePlugin(OServerPluginManager.java:200)
	at com.orientechnologies.orient.server.plugin.OServerPluginManager.updatePlugins(OServerPluginManager.java:305)
	at com.orientechnologies.orient.server.plugin.OServerPluginManager.startup(OServerPluginManager.java:91)
	...

The problem seems to be that OServerPluginManager tries to load the plugin class (com.orientechnologies.agent.OEnterpriseAgent) without adding the jar file to the classpath.

Any hints on how to fix this? We are using OrientDB 3.2.10 embedded in our own application.

npomaroli avatar Mar 28 '23 11:03 npomaroli

Hi,

The "Unexpected end of ZLIB stream" error usually happens when the sync fails. In that case the server should try to restart the sync. If you can reproduce this, it would be useful to have thread dumps taken while the server is stuck.

As for the agent.jar error, that looks strange, like a corrupted jar.

Regards

tglman avatar Apr 03 '23 12:04 tglman
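
Since OrientDB is embedded in the reporter's own application here, one way to capture the requested thread dumps is from inside the JVM, which can help when tools such as jstack or jcmd are not available in the container image. The sketch below uses the standard java.lang.management API; the class name ThreadDumpSketch and the idea of triggering it while the sync appears stuck are assumptions for illustration, not something OrientDB provides.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of OrientDB.
public class ThreadDumpSketch {

  // Returns the name, state and stack trace of every live thread in this JVM.
  static String dumpAllThreads() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // Only ask for lock details when the JVM supports reporting them.
    ThreadInfo[] infos = threads.dumpAllThreads(
        threads.isObjectMonitorUsageSupported(),
        threads.isSynchronizerUsageSupported());
    StringBuilder out = new StringBuilder();
    for (ThreadInfo info : infos) {
      out.append('"').append(info.getThreadName()).append("\" ")
          .append(info.getThreadState()).append('\n');
      for (StackTraceElement frame : info.getStackTrace()) {
        out.append("\tat ").append(frame).append('\n');
      }
      out.append('\n');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(dumpAllThreads());
  }
}
```

Taking two or three dumps a few seconds apart while the sync produces no log output makes it easier to tell a thread that is blocked from one that is merely slow.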

Hi,

I could not (yet) reproduce this locally, so I cannot provide thread dumps. The database used is rather big, so the full sync consists of about 410 chunks of 8MB each. When the sync succeeds, it takes about 16 minutes. We can see that the sync is restarted if it fails, but in most cases it fails again and everything starts all over.

Regarding the agent.jar: I used the one from Maven Central (OrientDB version 3.2.10), so I think it is unlikely that it is corrupted. However, I do not understand how loading the class is supposed to work:

Here, a classloader is created, that will load classes from the agent.jar: https://github.com/orientechnologies/orientdb/blob/2486dd95b4df421b5de9a2e773e3da03928fe027/server/src/main/java/com/orientechnologies/orient/server/plugin/OServerPluginManager.java#L323

But that classloader is not used here, when the class should be loaded: https://github.com/orientechnologies/orientdb/blob/2486dd95b4df421b5de9a2e773e3da03928fe027/server/src/main/java/com/orientechnologies/orient/server/plugin/OServerPluginManager.java#L378

Maybe I am missing something here?

npomaroli avatar Apr 03 '23 13:04 npomaroli
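
The general Java behaviour behind this observation can be sketched as follows. The one-argument Class.forName(name) resolves the class through the class loader of the calling class, so a jar that was only handed to a separate URLClassLoader stays invisible unless that loader is passed explicitly. The code below is a simplified illustration under that assumption, with made-up names (PluginLoadingSketch, loaderFor, canLoad); it is not the OServerPluginManager code.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch, not OrientDB's plugin manager.
public class PluginLoadingSketch {

  // Creates a loader that sees the given jar in addition to everything the parent sees.
  static URLClassLoader loaderFor(Path jar, ClassLoader parent) {
    try {
      return new URLClassLoader(new URL[] {jar.toUri().toURL()}, parent);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a usable jar path: " + jar, e);
    }
  }

  // Tries to load the class through the given loader (three-argument Class.forName).
  static boolean canLoad(String className, ClassLoader loader) {
    try {
      Class.forName(className, true, loader);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    if (args.length < 2) {
      System.out.println("usage: PluginLoadingSketch <path-to-jar> <class-name>");
      return;
    }
    ClassLoader appLoader = PluginLoadingSketch.class.getClassLoader();
    // The loader is kept open: classes loaded from the jar may need it later.
    URLClassLoader pluginLoader = loaderFor(Paths.get(args[0]), appLoader);
    System.out.println("via application loader: " + canLoad(args[1], appLoader));
    System.out.println("via plugin loader:      " + canLoad(args[1], pluginLoader));
  }
}
```

Run with the path to a jar that is not on the application classpath and the name of a class inside it, the first lookup is expected to fail and the second to succeed, which matches the ClassNotFoundException in the log above.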

Hi,

Yep, that looks strange. I will double-check.

Regards

tglman avatar Apr 03 '23 14:04 tglman

Hi,

I changed the plugin loading to use the correct class loader; the fix is already released in 3.2.18. I am keeping this issue open for the other problem.

tglman avatar Apr 19 '23 17:04 tglman

Hi @tglman, I will definitely have another go at the enterprise agent, so thanks for fixing the class loading problem.

npomaroli avatar Apr 20 '23 06:04 npomaroli

Hi @npomaroli,

One thing you need to be aware of: if you are using Lucene indexes you may have problems, because the agent-based sync does not support them well yet.

Regards

tglman avatar Apr 21 '23 12:04 tglman

@tglman - slightly tangential, but what is the issue with Lucene indexes and the enterprise agent sync? (We're exploring moving to the enterprise agent for the better sync/backup performance, but Lucene indexes are really core to our use).

timw avatar May 10 '23 09:05 timw

Hi @timw,

The issue is that the enterprise agent sync uses the incremental backup as its underlying data extraction, and that backup is based on our paginated storage file management. The Lucene indexes do not use our storage file management; they use the standard Lucene files on the side, so in practice they are not included and need to be rebuilt. The base sync uses the base backup, which just zips the folder.

tglman avatar May 10 '23 10:05 tglman