OAK-10803 -- compress/uncompress property
I believe the main concern is that we don't yet know exactly why we are doing this. IMHO it would be better to first collect data on the size of the properties.
Looks like the original intention was that properties shouldn't or wouldn't be used as keys in hash maps... if that's the case then it should be fine.
WRT this PR, we should just ensure that compression does not affect how MongoDS currently behaves.
Why is this a MongoDB specific aspect? For MongoDB we know it is not supported (see this skip here) - but for RDB it is supported. So shouldn't this PR keep the support for RDB?
I believe we should come up with consistent behaviour for all types of persistence. My preference would be to disallow Strings with broken UTF-8 serializations, but that would be a change for segment persistence, and that's why I faced resistance when I proposed that back then.
For this PR, it will have to handle what the current system allows.
With the default set to disable compression, in theory this PR should do no harm and could be merged, as changing the default is a conscious decision that could be made based on the requirements of the system and which DocumentStore is used.
Having said that, I think the current state of testInterestingStrings isn't ideal, as it lists a broken surrogate in a test string, but then doesn't test it. So I'd say it should either use that broken surrogate in a test, or not list it at all. Except that, now that this problem with broken surrogates has been identified, we need to address it, so we probably should have it in the test indeed.
What we might do is have the test use different DocumentStore fixtures and test on memory, mongo and rdb with the broken surrogate property and verify that the behavior with or without compression is the same.
That could mean that for mongo the fact that compression doesn't support broken surrogates turns out to be a non-issue, as mongo already doesn't support them. It would just be good to have a test that verifies that.
For RDB it could be different: it could mean that if you want to use compression for RDB you'd have to accept that broken surrogates are not supported (unless we find a fast way to support it in compression).
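For context on why compression trips over broken surrogates: assuming the compression path encodes the value to UTF-8 bytes before compressing (an assumption about this PR, not something I verified), the round trip is already lossy before any compression happens, because Java's UTF-8 encoder replaces unpaired surrogates:

{code:java}
import java.nio.charset.StandardCharsets;

String broken = "a\uD800b";                            // unpaired high surrogate
byte[] utf8 = broken.getBytes(StandardCharsets.UTF_8); // encoder substitutes '?'
String back = new String(utf8, StandardCharsets.UTF_8);
System.out.println(broken.equals(back));               // false: the surrogate is lost
{code}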
@reschke I created methods to check the behavior with a broken surrogate before and after introducing compression. With compression, the test failed.
bq. For RDB it could be different: it could mean that if you want to use compression for RDB you'd have to accept that broken surrogates are not supported (unless we find a fast way to support it in compression).
The elephant in the room is segment persistence, which IMHO allows any Java string. I don't believe discussing the differences of document store backends is helpful here; we should come to an agreement on what the expectations for Oak in general are. Having document store and segment store behave differently IMHO is a problem we should solve (by fixing segment store, would be my preference).
@reschke I vote for your preference (fixing segment store).
Turns out that it might look slightly different altogether:
In the latest test from @ionutzpi, which passes a broken surrogate through NodeBuilder (except that it shouldn't also go via DocumentPropertyState there), it can be seen that when the property is applied to the commit, as part of persist / compareAgainstBaseState, it does go through JsonSerializer, which handles broken surrogates properly.
So in short: DocumentNodeStore does actually handle broken surrogates in property values fine.
What I think happened in that original testInterestingStrings is that it used the DocumentStore API directly - instead of the DocumentNodeStore API...
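To illustrate the NodeBuilder write path described above, a minimal sketch (the classes are the regular Oak ones, but the setup is simplified and error handling is omitted):

{code:java}
import org.apache.jackrabbit.oak.api.Type;
import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;
import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
import org.apache.jackrabbit.oak.spi.commit.EmptyHook;
import org.apache.jackrabbit.oak.spi.state.NodeBuilder;

void brokenSurrogateViaNodeBuilder() throws Exception {
    // in-memory DocumentNodeStore (backed by a MemoryDocumentStore)
    DocumentNodeStore ns = new DocumentMK.Builder().getNodeStore();
    try {
        NodeBuilder builder = ns.getRoot().builder();
        builder.setProperty("p", "a\uD800b");   // unpaired high surrogate
        // persisting goes through JsonSerializer, which escapes the surrogate
        ns.merge(builder, EmptyHook.INSTANCE, CommitInfo.EMPTY);
        String v = ns.getRoot().getProperty("p").getValue(Type.STRING);
        // round-trips fine on the DocumentNodeStore API
        System.out.println("a\uD800b".equals(v));
    } finally {
        ns.dispose();
    }
}
{code}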
Wrt this PR I think there are a few things still open though:
- beautify the newly introduced tests - they have a few issues, namely:
- getBrokenSurrogateAndInitializeDifferentStores calls createPropAndCheckValue 3 times depending on the DocumentStore used, except there's no difference between the 3 calls, so it can be done in 1 call.
- the way it uses the fixture is different from how it is classically done through a parameterized test. I'd suggest to either use a parameterized test (where the fixture is disposed at the end) or make sure to dispose the store/documentNodeStore that was created (see the sketch after this list). The way it is done now leaves the DocumentNodeStore undisposed, which means follow-up tests hit an active lease and wait for lease expiration, which extends the test run unnecessarily.
- createPropAndCheckValue, as mentioned, should not go via DocumentPropertyState when setting the property value; that is not how it happens in existing code. DocumentPropertyState is only used for reading. Writing happens via the builder, which uses things like ModifiedDocumentNodeState and goes via jcr ValueImpl, StringPropertyState etc.
- additionally, though, I believe there is a bigger issue: also based on the new test, when debugging through compression/decompression it can be seen that immediately after a DocumentPropertyState is created (and, with compression enabled, the value is compressed), it goes ahead and wants to put it into the cache, and for that wants to know its size, so it calls parsed() - several times - as part of which it decompresses - several times. I think this highlights a bigger problem: we should see if/how compression is performant. The way it is now it would add a lot of decompression overhead; this needs further improvement in my view (see the memoization sketch below).
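A rough sketch of the parameterized shape I have in mind (the fixture handling follows the usual DocumentStoreFixture pattern, but the details may differ):

{code:java}
import java.util.Arrays;
import java.util.Collection;

import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;
import org.apache.jackrabbit.oak.plugins.document.DocumentStoreFixture;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class BrokenSurrogateTest {

    @Parameterized.Parameters(name = "{0}")
    public static Collection<Object[]> fixtures() {
        return Arrays.asList(new Object[][] {
                { DocumentStoreFixture.MEMORY },
                { DocumentStoreFixture.MONGO },
                { DocumentStoreFixture.RDB_H2 }
        });
    }

    private final DocumentStoreFixture fixture;
    private DocumentNodeStore ns;

    public BrokenSurrogateTest(DocumentStoreFixture fixture) {
        this.fixture = fixture;
    }

    @Before
    public void setUp() {
        ns = new DocumentMK.Builder()
                .setDocumentStore(fixture.createDocumentStore())
                .getNodeStore();
    }

    @After
    public void tearDown() throws Exception {
        // disposing avoids the lease-expiry wait described above
        ns.dispose();
        fixture.dispose();
    }

    @Test
    public void brokenSurrogateRoundTrips() throws Exception {
        // set the property via NodeBuilder/merge, read it back,
        // and assert the value is identical with and without compression
    }
}
{code}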
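And for the parsed() overhead, one direction could be to memoize the decompressed value so the cost is paid once per instance - an illustrative sketch only, not the actual DocumentPropertyState code (field and helper names are hypothetical):

{code:java}
// inside the compressed property state class (illustrative only)
private final byte[] compressedValue;  // assumed field holding the compressed bytes
private volatile String decompressed;  // memoized plain value

private String uncompressedValue() {
    String v = decompressed;
    if (v == null) {
        v = decompress(compressedValue); // hypothetical helper doing the actual inflate
        decompressed = v;                // benign race: result is idempotent
    }
    return v;                            // parsed() would build on this instead of
                                         // decompressing on every call
}
{code}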
wdyt?
After @stefan-egli's insights, and looking at the performance implications of the whole property compression/decompression process, my opinion is that it doesn't bring much value to continue.