Improvement Request
There are a multitude of issues reported where the driver reports a variation of the following message:
    java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long)
As a hypothesis, we believe that some error - such as, but not necessarily, an OutOfMemoryError - causes the event loop to become out of sync.
See https://github.com/redis/lettuce/issues/3117#issuecomment-2575442585
Related issues:
- [ ] #2175
- [ ] #3117
- [x] #3121
- [x] #2864
- [x] #2012
Describe the solution you'd like
A suggested solution is to catch such errors, terminate the connection and complete any remaining commands exceptionally.
We then need to verify the solution against the scenarios shown above (where possible and where an MRE exists). A rough sketch of the idea follows.
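For illustration only, here is the shape of that idea sketched against plain Netty. The handler name, the `pendingCommands` queue, and the use of `CompletableFuture` are stand-ins for Lettuce's actual CommandHandler and command stack, not its real API; the point is simply: close the channel, then fail everything still in flight.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/**
 * Illustrative sketch: on any Throwable (including Errors such as OutOfMemoryError),
 * close the channel and complete every still-pending command exceptionally, instead of
 * letting later replies be matched against the wrong commands.
 */
public class FailFastOnErrorHandler extends ChannelInboundHandlerAdapter {

    /** Hypothetical stand-in for the driver's queue of commands awaiting replies. */
    private final Queue<CompletableFuture<Object>> pendingCommands = new ArrayDeque<>();

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // Terminate the connection so the decoder cannot keep consuming replies that
        // no longer line up with the command queue.
        ctx.close();

        // Callers get a clear failure instead of another command's reply.
        CompletableFuture<Object> pending;
        while ((pending = pendingCommands.poll()) != null) {
            pending.completeExceptionally(cause);
        }
    }
}
```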
Describe alternatives you've considered
N/A
@tishun
> As a hypothesis we believe that some error, such as - but not necessarily - OutOfMemory, causes the event loop to become out of sync.
I hope this can be helpful in some way:
The out-of-sync state can be reproduced in the existing test case subscriberCompletingWithExceptionShouldBeHandledSafely by substituting this RuntimeException with any kind of JVM fatal error, such as a LinkageError. It produces this failure:
    [ERROR] Failures:
    [ERROR] ReactiveConnectionIntegrationTests.subscriberCompletingWithExceptionShouldBeHandledSafely expectation "expectNext(valueB)" failed (expected value: valueB; actual value: valueA)
Alternatively, this more extensive test case can be used to illustrate the behaviour where GET(keyX) -> valueY:
    @Test
    @Inject
    void showcaseInconsistentResponsesOnJvmFatal(RedisClient client) {
        // A separate reactive connection is created to avoid interference with other tests
        RedisReactiveCommands<String, String> reactiveCommands = client.connect().reactive();

        // Prepare a state in Redis
        StepVerifier.create(Flux.concat(
                        reactiveCommands.set("keyA", "valueA"),
                        reactiveCommands.set("keyB", "valueB"),
                        reactiveCommands.set("keyC", "valueC"),
                        reactiveCommands.set("keyD", "valueD")))
                .expectNextCount(4)
                .verifyComplete();

        /*
         * Purposefully throw a JVM fatal error in the onNext callback.
         *
         * The error is not propagated to the subscriber.
         * Therefore, the StepVerifier cannot be used, as it would hang indefinitely.
         *
         * The throwable needs to be "jvmFatal" as defined by
         * reactor.core.Exceptions#isJvmFatal(Throwable).
         */
        reactiveCommands.get("keyA").doOnNext(foo -> { throw new LinkageError("jvmFatal1"); }).subscribe();

        // Unlike a normal test case, these assertions demonstrate the inconsistent behavior
        StepVerifier.create(reactiveCommands.get("keyB")).expectError(LinkageError.class).verify();
        StepVerifier.create(reactiveCommands.get("keyC")).expectNext("valueB").verifyComplete();
        StepVerifier.create(reactiveCommands.get("keyD")).expectNext("valueC").verifyComplete();
    }
Hey @henriklundstrom ,
thanks for the sample code! As soon as the team has prioritized this item we will go over it and try to provide a fix.
See also https://github.com/nordnet/lettuce-out-of-order
Looking at the related issues reported I am wondering if this is only a problem when using reactive API.
Is it possible to draw any conclusions that this is not a problem with async API?
I don't think any of the APIs is safe from this issue.
> Looking at the related issues reported I am wondering if this is only a problem when using reactive API.
> Is it possible to draw any conclusions that this is not a problem with async API?
We have this issue reproduced on the async API. So far I can see that the issue comes after some OutOfMemory error.
I must say it seems like other APIs can survive one-time OOMs, but Lettuce does not. So I am looking into catching the Errors and resetting the open connections somehow, e.g. `connection.close(); connection = client.connect();`, or recreating the RedisClient. Would that reset the channel synchronization?
To be honest I am currently not sure what the best approach to handling this is. I was hoping to spend some time working on this soon so I can prepare a PR, but we had higher-priority issues preventing that.
An OOM error could occur in many places, not just one; and it could also occur in different threads. Handling it in one central place is not as easy as it sounds.
Suggestions are always welcome.
I definitely cannot give any solid advice, but just to share what we have implemented as a CommandListener hook:
    public void commandFailed(CommandFailedEvent event) {
        Throwable cause = event.getCause();
        notifyCustomSystem(cause);
    }

    private void notifyCustomSystem(Throwable cause) {
        boolean realError = false;
        Throwable e = cause;
        // Walk the cause chain looking for an Error or the known out-of-sync symptom
        while (e != null) {
            if (e instanceof Error
                    || (e.getMessage() != null && e.getMessage().contains("does not support set"))) {
                realError = true;
                break;
            }
            e = e.getCause();
        }
        if (realError) {
            log.warn("Performing full reset of RedisClients due to", e);
            fullReset(e);
        }
    }
where fullReset closes all existing Redis clients (they are re-created on an as-needed basis); rough wiring is sketched below. So it's not about catching the OOM and other errors, but rather about restoring the system faster. Damage is still possible, but no worse than if the system were restarted globally. We also log and monitor these cases.
In our case the OOMs are mostly DEV and QA environment issues, so this "fix" is mostly a time saver for the various teams working on the same environment.
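For reference, the wiring around such a hook could look roughly like the sketch below. The holder class, the connection URI, and the strategy of shutting the client down and lazily re-creating it are assumptions for illustration; only the `addListener`/`CommandListener` registration comes from Lettuce's command listener API.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.event.command.CommandFailedEvent;
import io.lettuce.core.event.command.CommandListener;
import io.lettuce.core.event.command.CommandStartedEvent;
import io.lettuce.core.event.command.CommandSucceededEvent;

public class ResettingClientHolder {

    private RedisClient client;

    /** Returns the current client, creating a fresh one after a fullReset. */
    public synchronized RedisClient client() {
        if (client == null) {
            client = createClient();
        }
        return client;
    }

    /** Drop the current client; the next call to client() re-creates it. */
    public synchronized void fullReset(Throwable cause) {
        if (client != null) {
            client.shutdownAsync();
            client = null;
        }
    }

    private RedisClient createClient() {
        RedisClient created = RedisClient.create("redis://localhost:6379"); // assumed URI
        created.addListener(new CommandListener() {
            @Override
            public void commandStarted(CommandStartedEvent event) { }

            @Override
            public void commandSucceeded(CommandSucceededEvent event) { }

            @Override
            public void commandFailed(CommandFailedEvent event) {
                Throwable cause = event.getCause();
                if (cause instanceof Error) {
                    fullReset(cause);
                }
            }
        });
        return created;
    }
}
```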
Interestingly, pooled connections may also suffer from this if the supplier in ConnectionPoolSupport.createGenericObjectPool(Supplier<T> connectionSupplier, GenericObjectPoolConfig<T> config) reuses the same RedisClient, so anyone using pooling might want to take this into account; the supplier may need to reconnect (see the sketch below).
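A minimal sketch of what such a supplier could look like, assuming commons-pool2 is on the classpath; the `currentClient` supplier is a hypothetical hook (for example the holder sketched above) that hands out a fresh RedisClient after a reset:

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.support.ConnectionPoolSupport;

import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

import java.util.function.Supplier;

public final class PoolFactory {

    /**
     * Build a pool whose supplier asks for the *current* client every time a new pooled
     * connection is created, so connections created after a reset come from a fresh
     * RedisClient rather than one whose channels may already be out of sync.
     */
    public static GenericObjectPool<StatefulRedisConnection<String, String>> createPool(
            Supplier<RedisClient> currentClient) {
        GenericObjectPoolConfig<StatefulRedisConnection<String, String>> config =
                new GenericObjectPoolConfig<>();
        config.setMaxTotal(8); // arbitrary example sizing
        return ConnectionPoolSupport.createGenericObjectPool(
                () -> currentClient.get().connect(), config);
    }
}
```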
On a side note, we had to code around RedisLoadingException and some network exceptions to incorporate "retries with timeouts" and build slightly higher-level APIs on top of the Lettuce API. I am bringing it up here because it may be fairly common code and related to handling the OOMs and other errors internally, too (a minimal sketch follows).
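For what it is worth, a retry-with-timeout wrapper of that kind can be as small as the sketch below; the helper name, the attempt/backoff parameters, and the choice to retry only RedisLoadingException and RedisConnectionException are illustrative assumptions:

```java
import io.lettuce.core.RedisConnectionException;
import io.lettuce.core.RedisLoadingException;

import java.time.Duration;
import java.util.function.Supplier;

public final class RedisRetry {

    /**
     * Retry a Redis call on transient conditions (dataset still loading, broken
     * connection), sleeping between attempts; anything else is re-thrown immediately.
     */
    public static <T> T withRetry(Supplier<T> call, int maxAttempts, Duration backoff) {
        RuntimeException lastTransient = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RedisLoadingException | RedisConnectionException e) {
                lastTransient = e;
                try {
                    Thread.sleep(backoff.toMillis());
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw lastTransient != null ? lastTransient
                : new IllegalArgumentException("maxAttempts must be at least 1");
    }
}
```

Used against a synchronous RedisCommands handle this would read, e.g., `String value = RedisRetry.withRetry(() -> commands.get("key"), 5, Duration.ofMillis(200));`.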
Seems logical for your case.
However, there is one caveat: the thread in the event loop that handles the replies will continue working after the OOM and will parse the next replies (when memory is available again), which could put the driver in an invalid state again. So there is a timeframe - between the OOM happening and your code calling fullReset(e) - in which invalid data could be processed.
IMHO the solution needs to be tied into the Netty flow. I will try to dedicate some time to this soon.
@tishun any update on this?
We are considering this for the 7.0 release, but it is going to be touch and go whether or not it makes it.
@tishun Do we have a timeline to fix this issue?