choose different serialization scheme for storing configuration
The XML encoder used for configuration serialization is not very robust (e.g. when the class hierarchy changes or configuration options are removed) and has some quirks (#2002). We should consider using something else (YAML or JSON?).
Also, this serialization is used not only for configuration but also elsewhere (IndexAnalysisSettings).
Yes, finally.
Looks like YAML would be the way to go.
Also, the configuration should be treated as data, not serialized objects, to avoid security vulnerabilities that might happen when de-serializing XML into Java objects.
The other reason for using something else is performance. Lately, I realized that XMLEncoder does not scale when retrieving the configuration using the RESTful API. When running a multithreaded program where each thread just retrieves the configuration in a loop, with the number of threads matching the number of CPUs, the times shoot up to almost 2 seconds, compared with 0.4 seconds for a single-threaded program. The XML file with the configuration is about 1.38 MB. A jstack snapshot revealed that many of the threads doing XMLEncoder processing (about 25 of the 32 threads I was using) were waiting on an internal synchronization object, with the top of the stack looking like this:
"http-nio-8080-exec-1427" #29360 daemon prio=5 os_prio=64 cpu=38052.59ms elapsed=2859934.54s tid=0x000000000531c000 nid=0x7981 waiting for monitor entry [0x00007fff808fa000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.sun.beans.util.Cache.get([email protected]/Cache.java:119)
- waiting to lock <0x00007ff387d7f320> (a java.lang.ref.ReferenceQueue)
at com.sun.beans.finder.MethodFinder.findMethod([email protected]/MethodFinder.java:81)
at java.beans.Statement.getMethod([email protected]/Statement.java:369)
at java.beans.Statement.invokeInternal([email protected]/Statement.java:273)
at java.beans.Statement$2.run([email protected]/Statement.java:187)
at java.security.AccessController.doPrivileged([email protected]/Native Method)
at java.beans.Statement.invoke([email protected]/Statement.java:184)
at java.beans.Expression.getValue([email protected]/Expression.java:155)
at java.beans.Encoder.getValue([email protected]/Encoder.java:105)
at java.beans.Encoder.get([email protected]/Encoder.java:252)
at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:112)
at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
I did this exercise to simulate the read timeout problems that occur right after running an all-project sync using the sync.py command. This command runs a number of reindex_project.py programs in parallel, and each reindex_project.py retrieves the configuration from the web app at the start. Using --api_timeout with an increased value for the Python tools works as a workaround; however, my expectation is that this should scale.
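For anyone who wants to poke at this without the web app, here is a rough in-process sketch of the same kind of experiment. It is not the program I used above: it calls XMLEncoder directly instead of going through the RESTful API, and `SampleConfig`, its properties, and the thread and iteration counts are all made up for illustration. Whether it shows the same contention will depend on the JDK version and the size of the encoded object.

```java
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class XmlEncoderContention {

    // Hypothetical stand-in for the real configuration bean (an assumption,
    // not an OpenGrok class). XMLEncoder needs a public class with a public
    // no-argument constructor and getter/setter pairs.
    public static class SampleConfig {
        private List<String> projects = new ArrayList<>();

        public List<String> getProjects() {
            return projects;
        }

        public void setProjects(List<String> projects) {
            this.projects = projects;
        }
    }

    // Encodes the bean to an in-memory buffer and returns the encoded size in bytes.
    static int encode(SampleConfig config) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(config);
        }
        return out.size();
    }

    // Runs `iterations` encodings in each of `threads` threads and returns
    // the wall-clock time for the whole batch in milliseconds.
    static long timeMillis(int threads, int iterations, SampleConfig config) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    for (int i = 0; i < iterations; i++) {
                        encode(config);
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // propagate any failure from the worker
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        SampleConfig config = new SampleConfig();
        List<String> projects = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            projects.add("project-" + i);
        }
        config.setProjects(projects);

        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("1 thread:  " + timeMillis(1, 20, config) + " ms");
        System.out.println(cpus + " threads: " + timeMillis(cpus, 20, config) + " ms");
    }
}
```

If encoding scaled well, the per-thread work is the same in both runs, so the two times should be close; a much larger second number would point at the kind of lock contention seen in the stack trace.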
Another feature that a new serialization scheme could bring is wildcards. For instance, I'd like to be able to set project properties for a set of projects specified with wildcards (regexps, even), similar to what is done in the opengrok-mirror configuration:
projects:
  apache-httpd-.*:
    proxy: true
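On the Java side, resolving such wildcard entries could look roughly like the sketch below. This is only an illustration: the class name `WildcardProjectProperties`, the `resolve` method, and the "later rule overrides earlier rule" merge order are my assumptions, not existing OpenGrok code, and the rules are built by hand here rather than parsed from a file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class WildcardProjectProperties {

    // Merges the properties of every rule whose regular expression matches
    // the whole project name. Rules are applied in insertion order, so a
    // later matching rule overrides an earlier one (an assumed policy).
    static Map<String, Object> resolve(Map<String, Map<String, Object>> rules,
                                       String projectName) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Object>> rule : rules.entrySet()) {
            if (Pattern.matches(rule.getKey(), projectName)) {
                result.putAll(rule.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Equivalent of the "projects" mapping shown above, built by hand.
        Map<String, Map<String, Object>> rules = new LinkedHashMap<>();
        rules.put("apache-httpd-.*", Map.of("proxy", true));

        System.out.println(resolve(rules, "apache-httpd-2.4")); // {proxy=true}
        System.out.println(resolve(rules, "linux"));            // {}
    }
}
```

Since this treats the file content as plain maps of data, it would work the same way whichever format (YAML, TOML, JSON) ends up being parsed into those maps.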
YAML is probably not so great, so something like TOML might be a better idea. However, we would still need to address the serialization of objects like Project and RepositoryInfo. It seems that some TOML Java implementations support serialization.