opengrok choose different serialization scheme for storing configuration

The XML encoder used for configuration serialization is not very robust (e.g. when the class hierarchy changes or configuration options are removed) and has some quirks (#2002). We should consider using something else (YAML/JSON?).

Also, this serialization is used not only for configuration but also elsewhere (IndexAnalysisSettings).

Aug 31 '18 07:08 vladak

Yes, finally.

Aug 31 '18 07:08 tulinkry

Looks like YAML would be the way to go.

Feb 04 '19 07:02 tulinkry

Also, the configuration should be treated as data, not serialized objects, to avoid security vulnerabilities that might happen when de-serializing XML into Java objects.
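To illustrate the "configuration as data" idea, here is a minimal sketch. It assumes a flat key/value file read with java.util.Properties purely for demonstration; the Settings class and its two options are hypothetical stand-ins, not the real OpenGrok Configuration class. The point is that the loader only ever assigns values to a fixed set of known fields, so the input cannot cause arbitrary objects to be instantiated, and options that have been removed are skipped instead of breaking the load.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class ConfigAsData {
    // Hypothetical, heavily simplified stand-in for the real configuration.
    static final class Settings {
        String sourceRoot = "/var/opengrok/src";
        int hitsPerPage = 25;
        // Keys that were present in the input but are not (or no longer) known.
        final List<String> ignoredKeys = new ArrayList<>();
    }

    // Parse the text as plain key/value data and copy only known keys
    // into the settings object. Nothing in the input names a class or method.
    static Settings load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));

        Settings settings = new Settings();
        for (String key : props.stringPropertyNames()) {
            String value = props.getProperty(key);
            switch (key) {
                case "sourceRoot":
                    settings.sourceRoot = value;
                    break;
                case "hitsPerPage":
                    settings.hitsPerPage = Integer.parseInt(value);
                    break;
                default:
                    // Unknown or removed option: remember it, do not fail.
                    settings.ignoredKeys.add(key);
                    break;
            }
        }
        return settings;
    }
}
```

The same approach should carry over to YAML, JSON or TOML parsers that return plain maps and lists, as long as the parser is not asked to construct arbitrary types named in the input.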

Apr 12 '19 08:04 vladak

The other reason for using something else is performance. Lately, I realized that XMLEncoder does not scale when retrieving the configuration using the RESTful API. When running a multithreaded program where each thread just retrieves the configuration in a loop and the number of threads matches the number of CPUs, the retrieval times shoot up to almost 2 seconds, compared to 0.4 seconds for a single-threaded program. The XML file with the configuration is about 1.38 MB. A jstack snapshot revealed that many of the XMLEncoder processing threads (about 25 of the 32 threads I was using) were waiting on an internal synchronization object, with the top of the stack looking like this:

"http-nio-8080-exec-1427" #29360 daemon prio=5 os_prio=64 cpu=38052.59ms elapsed=2859934.54s tid=0x000000000531c000 nid=0x7981 waiting for monitor entry  [0x00007fff808fa000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.sun.beans.util.Cache.get([email protected]/Cache.java:119)
        - waiting to lock <0x00007ff387d7f320> (a java.lang.ref.ReferenceQueue)
        at com.sun.beans.finder.MethodFinder.findMethod([email protected]/MethodFinder.java:81)
        at java.beans.Statement.getMethod([email protected]/Statement.java:369)
        at java.beans.Statement.invokeInternal([email protected]/Statement.java:273)
        at java.beans.Statement$2.run([email protected]/Statement.java:187)
        at java.security.AccessController.doPrivileged([email protected]/Native Method)
        at java.beans.Statement.invoke([email protected]/Statement.java:184)
        at java.beans.Expression.getValue([email protected]/Expression.java:155)
        at java.beans.Encoder.getValue([email protected]/Encoder.java:105)
        at java.beans.Encoder.get([email protected]/Encoder.java:252)
        at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:112)
        at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
        at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)

Now, I did this exercise in order to simulate the read timeout problems that occur right after running an all-project sync using the sync.py command. This command runs a number of reindex_project.py programs in parallel, and each reindex_project.py retrieves the configuration from the web app at the start. Using --api_timeout with an increased value for the Python tools works as a workaround; however, my expectation is that this should scale.
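A rough way to observe this kind of contention without the web app is sketched below. This is not the program used for the measurements above; it is a standalone approximation that encodes a list of strings (a hypothetical stand-in for the real configuration object) with XMLEncoder from one thread and then from as many threads as there are CPUs. The class name, payload size and iteration count are arbitrary choices, and the timings will differ by machine and JDK version.

```java
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class XmlEncoderContention {
    // Encode the payload once and return the size of the produced XML in bytes.
    static int encodeOnce(Object payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(payload);
        }
        return out.size();
    }

    // Start `threads` workers that each encode the payload `iterations` times;
    // return the elapsed wall-clock time in milliseconds.
    static long timeEncoding(Object payload, int threads, int iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    for (int i = 0; i < iterations; i++) {
                        encodeOnce(payload);
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // propagate any failure from the worker
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in payload; the real configuration is a much richer object graph.
        List<String> payload = new ArrayList<>();
        for (int i = 0; i < 5_000; i++) {
            payload.add("project-" + i);
        }
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("1 thread: " + timeEncoding(payload, 1, 20) + " ms");
        System.out.println(cpus + " threads: " + timeEncoding(payload, cpus, 20) + " ms");
    }
}
```

Each worker performs the same amount of work, so with perfect scaling the two timings would be roughly equal; a much larger multi-threaded figure would be consistent with the lock contention seen in the stack trace. Taking a jstack snapshot while it runs can confirm where the threads are blocked.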

Mar 28 '22 12:03 vladak

Another feature that a new serialization scheme could bring is wildcards. For instance, I'd like to be able to set project properties for a set of projects specified with wildcards (or even regexps), similar to what is done in the opengrok-mirror configuration:

projects:
  apache-httpd-.*:
    proxy: true
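On the Java side, resolving such a section could look roughly like the sketch below. It assumes the file has already been parsed into a map from pattern to properties, that patterns must match the whole project name, and that later patterns override earlier ones; these are illustrative assumptions, not a description of how opengrok-mirror actually behaves. The ProjectWildcards class and its resolve method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class ProjectWildcards {
    // Collect the properties of every rule whose pattern matches the whole
    // project name. Rules are applied in iteration order, so a later rule
    // overrides a property set by an earlier one.
    static Map<String, String> resolve(String project,
                                       Map<String, Map<String, String>> rules) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> rule : rules.entrySet()) {
            if (Pattern.matches(rule.getKey(), project)) {
                result.putAll(rule.getValue());
            }
        }
        return result;
    }
}
```

With a rule keyed by apache-httpd-.* that sets proxy to true, a project named apache-httpd-2.4 would pick up the property while a project named nginx would not. A real implementation would likely precompile the patterns instead of calling Pattern.matches for every lookup.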

Oct 19 '22 15:10 vladak

YAML is probably not so great, so perhaps something like TOML might be a better idea; however, we would still need to address the serialization of objects like Project and RepositoryInfo. It seems that some TOML Java implementations support serialization.

Dec 01 '22 13:12 vladak
