
OAK-11165 - Cache the Path field in the NodeDocument class.

Open nfsantos opened this issue 1 year ago • 5 comments

Creating a Path instance from a NodeDocument is an expensive operation, so it is worth caching it. The performance improvement is noticeable in the indexing job, which calls #getPath twice for each NodeDocument that it processes.

Similar idea to what is done here: https://github.com/apache/jackrabbit-oak/blob/35950bec988fffe34708f635f29f7d268d54cb85/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/Path.java#L297-L306

nfsantos avatar Oct 03 '24 06:10 nfsantos
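
For illustration, here is a minimal sketch of the kind of lazy caching being proposed, following the same compute-once pattern as the linked Path code. The field and helper names are assumptions for the sketch, not the actual NodeDocument implementation.

```java
import org.apache.jackrabbit.oak.plugins.document.Path;

// Sketch only: the field and helper below are illustrative, not the real NodeDocument code.
public class CachedPathSketch {

    // volatile so that a benign race at worst recomputes the value once more,
    // mirroring the compute-once caching in the linked Path code
    private volatile Path path;

    public Path getPath() {
        Path p = path;
        if (p == null) {
            p = buildPathFromId(); // the expensive part: deriving the path from the document id
            path = p;              // cache for subsequent calls, e.g. the second call in indexing
        }
        return p;
    }

    private Path buildPathFromId() {
        // stand-in for the existing parsing logic in NodeDocument#getPath
        throw new UnsupportedOperationException("sketch only");
    }
}
```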

Are there any concerns about increased memory usage by caching the Path object in memory in every case?

I understand it can improve performance for the indexing job, but it could increase memory consumption during normal operation of the repository. Have you evaluated this impact?

Maybe we could have a way to enable/disable caching of this object depending on the use case. For example, enable it as part of indexing but keep it disabled elsewhere, at least until we assess any possible impact.

Joscorbe avatar Oct 03 '24 09:10 Joscorbe

> Are there any concerns about increased memory usage by caching the Path object in memory in every case?
>
> I understand it can improve performance for the indexing job, but it could increase memory consumption during normal operation of the repository. Have you evaluated this impact?
>
> Maybe we could have a way to enable/disable caching of this object depending on the use case. For example, enable it as part of indexing but keep it disabled elsewhere, at least until we assess any possible impact.

I have not evaluated the potential increase in memory in other situations; I was hoping someone more knowledgeable in this module would be able to comment. The main place where I can think of a possible increase in memory usage is the NodeDocument cache. If getPath is called on the documents in the cache and the resulting Path instances would not otherwise be held anywhere else, then memory usage will increase. I don't know if this could be an issue.

I was hoping that in most cases the lifetime of the Path instance created by calling #getPath will be similar to that of the NodeDocument from which it was created, so there would be no increase in memory.

I'm ok with a way to enable/disable this feature; it could be a static flag set from an environment variable, disabled by default, and we would enable it only for indexing jobs. This optimization could also benefit other uses of Oak, but I'm not sure how to evaluate that.

nfsantos avatar Oct 03 '24 09:10 nfsantos
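
A minimal sketch of such an opt-in flag, assuming a hypothetical system property name (not an existing Oak option) and the same illustrative field/helper as the sketch above:

```java
import org.apache.jackrabbit.oak.plugins.document.Path;

// Sketch only: "oak.documentMK.cachePathInNodeDocument" is a hypothetical property name.
public class OptionalPathCacheSketch {

    // read once at class-load time; Boolean.getBoolean returns false when the property is unset,
    // so the cache stays disabled by default
    private static final boolean CACHE_PATH =
            Boolean.getBoolean("oak.documentMK.cachePathInNodeDocument");

    private volatile Path path;

    public Path getPath() {
        if (!CACHE_PATH) {
            return buildPathFromId();   // current behaviour: rebuild the path on every call
        }
        Path p = path;
        if (p == null) {
            p = buildPathFromId();
            path = p;                   // cache only when the flag is enabled
        }
        return p;
    }

    private Path buildPathFromId() {
        throw new UnsupportedOperationException("sketch only");
    }
}
```

The indexing job could then pass something like -Doak.documentMK.cachePathInNodeDocument=true (again, a hypothetical name) while regular repository instances keep the current behaviour.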

The reason calling getPath is so expensive is the calls to StringCache for each path segment. I am assuming this is done to save memory, as two different paths often have many path segments in common. On the other hand, it requires computing the hash of every path segment, which might not otherwise be needed in many situations. The cache might also be ineffective because it only has capacity for 1024 elements, and when traversing an Oak tree we will come across a very large number of different names. We can assume that the leaves will all have different names, so traversing 1024 nodes will completely thrash the cache, pushing out the common names that appear in the parent segments. I have not done a deep analysis, but I suspect this StringCache might not be reducing memory usage significantly while it is increasing CPU usage. In the indexing job, the calls to StringCache.get are responsible for more than 10% of the total CPU usage, which seems quite high.

nfsantos avatar Oct 03 '24 10:10 nfsantos
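
To make the eviction argument concrete, here is a toy stand-in for a 1024-entry LRU cache. This is not the real StringCache implementation, only an illustration of how a stream of distinct leaf names pushes out the common parent-segment names:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of cache thrashing; NOT the real StringCache.
public class StringCacheThrashingSketch {

    static final int CAPACITY = 1024;

    // access-ordered LinkedHashMap used as a minimal LRU cache
    static final Map<String, String> cache =
            new LinkedHashMap<String, String>(CAPACITY, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > CAPACITY;
                }
            };

    static String get(String s) {
        String cached = cache.get(s);   // counts as an access in access-order mode
        if (cached != null) {
            return cached;
        }
        cache.put(s, s);                // may evict the least recently used entry
        return s;
    }

    public static void main(String[] args) {
        get("content");                     // a common parent segment
        for (int i = 0; i < 5_000; i++) {   // many distinct leaf names, as when traversing a tree
            get("leaf-" + i);
        }
        // the common segment has been evicted, so the next lookup is a miss again
        System.out.println("'content' still cached: " + cache.containsKey("content"));
    }
}
```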

I think there is a very high likelihood that getPath has already been executed by the time the document gets put into the cache, since for example getNodeAtRevision, which is very likely to be called, already calls it. Perhaps having an idea of how much memory usage would increase in a realistic scenario would be useful.

stefan-egli avatar Oct 03 '24 12:10 stefan-egli

Compared to the "data" map, the memory usage of the path is small. To implement estimateMemoryUsage, the memory usage was estimated once, and the overheads (measured at that time, with an old JVM) were: 112 bytes of overhead for just an entry, and 64 bytes per map entry.

thomasmueller avatar Oct 24 '24 07:10 thomasmueller
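
As a back-of-envelope comparison based on those figures (only the 112 and 64 byte overheads come from the comment above; the entry count and the per-path estimate below are assumptions for illustration):

```java
// Back-of-envelope only: the 112 and 64 byte figures are quoted above,
// the other numbers are assumptions for illustration.
public class PathOverheadEstimate {
    public static void main(String[] args) {
        long entryOverhead = 112;          // bytes of overhead for just an entry (quoted above)
        long perMapEntry = 64;             // bytes per "data" map entry (quoted above)
        long assumedDataEntries = 30;      // assumed number of entries in the "data" map
        long dataMapBytes = entryOverhead + assumedDataEntries * perMapEntry;

        long assumedPathBytes = 100;       // assumed extra retained size of a cached Path,
                                           // whose segment strings are largely shared anyway

        System.out.printf("data map ~%d B, cached Path ~%d B (~%.0f%% extra)%n",
                dataMapBytes, assumedPathBytes,
                100.0 * assumedPathBytes / dataMapBytes);
    }
}
```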