lettuce icon indicating copy to clipboard operation
lettuce copied to clipboard

Responses getting out of sync with requests

Open jkern888 opened this issue 2 years ago • 17 comments

Current Behavior

We've recently updated lettuce to 6.1.1.RELEASE and after running it in our production environment for a week we ran into a troubling issue. At some point it looks like lettuce worked itself into a state where commands were receiving responses to different requests. This manifested as lots of deserialization errors (e.g. java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long)), but also more seriously as incorrect data being returned to the caller.

Restarting the service brought things back to a working state. Prior to this happening the only other thing we observed were some RedisCommandTimeoutException being thrown, but not at an anomalous level.

For the meantime we've added some more aggressive checking for this happening, so if it does reoccur I'd be happy to try and gather more information, but not sure what would be helpful?

Environment

  • Lettuce version(s): 6.1.1.RELEASE
  • Redis version: 3.2.10

jkern888 avatar Feb 15 '22 15:02 jkern888

Depending on how you use the library this might happen when using e.g. a single connection with transactions from multiple threads without proper synchronization.

mp911de avatar Feb 18 '22 12:02 mp911de

@mp911de Thanks for the comment. I don't think that's the case here as our usage is pretty uncomplicated, just using the synchronous api to do get, set, expire, del from a single connection. We do also use the compression codec to zip the data, along with jackson for de/serializtion. But all of those claim to be threadsafe.

The fact this could happen from incorrect usage like you said is interesting though. Is there a particular part of the library you could point me to? May help in debugging since possibly there's some other path into the same state?

jkern888 avatar Feb 18 '22 15:02 jkern888

Hum, when using Jackson as Codec, then encoding errors may bring the protocol out of sync. Please check whether you can see some exceptions prior to getting out of sync.

mp911de avatar Feb 21 '22 08:02 mp911de

@mp911de Interesting. Here's the code for the codec we're using. Should we be swallowing the exception and returning null?

public class JacksonRedisCodec<T> implements RedisCodec<String, T> {

    private final ObjectWriter writer;
    private final ObjectReader reader;

    // Encoding used for the key string.
    private static final Charset UTF_8 = Charsets.UTF_8;

    public JacksonRedisCodec(ObjectWriter writer, ObjectReader reader) {
        this.writer = writer;
        this.reader = reader;
    }

    @Override
    public String decodeKey(ByteBuffer byteBuffer) {
        ByteBufferBackedInputStream inputStream = new ByteBufferBackedInputStream(byteBuffer);
        StringWriter writer = new StringWriter();
        try {
            IOUtils.copy(inputStream, writer, UTF_8.name());
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return writer.toString();
    }

    @Override
    public T decodeValue(ByteBuffer byteBuffer) {
        ByteBufferBackedInputStream inputStream = new ByteBufferBackedInputStream(byteBuffer);
        try {
            return reader.readValue(inputStream);
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public ByteBuffer encodeKey(String string) {
        return ByteBuffer.wrap(string.getBytes(UTF_8));
    }

    @Override
    public ByteBuffer encodeValue(T t) {
        try {
            return ByteBuffer.wrap(writer.writeValueAsBytes(t));
        }
        catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }
    }
} 

jkern888 avatar Feb 23 '22 15:02 jkern888

Any exceptions that happen on the codec level affect directly what is being sent to the Redis server. If you have say a command sequence of SET -> SET -> GET and the second SET throws an exception (SET -> 💥 -> GET), then the client is no longer synchronized to the protocol as the second command was never sent to the Redis server.

We try to recover as much as possible upon encoding/decoding but there's always a fair chance that exceptions during encoding/decoding may impact the protocol synchronization.

I strongly suggest moving the encoding/decoding part into a higher level component so that you can handle exceptions properly.

mp911de avatar Feb 24 '22 14:02 mp911de

The same issue has just re-occurred. Digging into our logging I'm seeing no exceptions around the codecs. The only thing I do see is a command timeout:

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 second(s)
	at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:53)
	at io.lettuce.core.protocol.CommandExpiryWriter.lambda$potentiallyExpire$0(CommandExpiryWriter.java:172)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:66)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:832)

Immediately after that we start getting many errors due to responses being de-synced. We see that same timeout exception at other times with no ill effect, but maybe it's occasionally triggering this bad state?

jkern888 avatar Feb 25 '22 16:02 jkern888

Possible result of being out of sync https://github.com/lettuce-io/lettuce-core/issues/2175

bytestreme avatar Aug 10 '22 09:08 bytestreme

@jkern888 I have a question. I meet same problem but not related with codec. you said problem was occured when upgrade to 6.1.1. So, I Wondering which version is your last succeed version?

previously, I use lettuce v5.x and this version does not have this problem.

  • My environment
  • Spring Boot Webflux 2.7.2
  • ReactiveRedisTemplate
  • lettuce 6.1.9.RELEASE

rndmania avatar Aug 22 '22 08:08 rndmania

We have got the same problem. And I reproduced the problem by using below codes:

@SpringBootTest
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
public class RedisTest {

    @Autowired
    private ReactiveRedisTemplate redisTemplate;

    private ExecutorService executorService = Executors.newFixedThreadPool(20);

    @Test
    public void testRedis() throws Exception{
        Random random = new Random();
        AtomicInteger counter = new AtomicInteger(0);
        for (int i = 0; i < 100000; i++) {
            int finalI = i;
            executorService.submit(() -> {
                redisTemplate.opsForValue().set("testkey", "testvalue" + finalI)
                        //.publishOn(Schedulers.boundedElastic())
                        .flatMap(r1 -> {
                            System.out.println(Thread.currentThread().getName() + " : result1 = " + r1);
                            return redisTemplate.opsForHash().get("test_hash", "hash_field") //, "test_value" + finalI)
                                   // .publishOn(Schedulers.boundedElastic())
                                    .defaultIfEmpty("")
                                    .flatMap(r2 -> {
                                        if(random.nextInt() % 2 == 0 && counter.get() < 100){
                                            counter.incrementAndGet();
                                            throw new OutOfMemoryError("test out of memory error");
                                        }
                                        System.out.println(Thread.currentThread().getName() + " : result2 = " + r2);
                                        return redisTemplate.opsForHash().put("test_hash", "hash_field", "hash_value" + finalI)
                                                //.publishOn(Schedulers.boundedElastic())
                                                .flatMap(r3 -> {
                                                            System.out.println(Thread.currentThread().getName() + " : result3 = " + r3);
                                                            return redisTemplate.opsForValue()
                                                                    .get("testkey")
                                                                    //.publishOn(Schedulers.boundedElastic())
                                                                    .map(r4 -> {
                                                                        System.out.println(Thread.currentThread().getName() + " : result4 = " + r4);
                                                                        return r4;
                                                                    });
                                                        });

                                    });
                        })
//                        .subscribeOn(Schedulers.boundedElastic())
                        .subscribe();
            });
            TimeUnit.MILLISECONDS.sleep(10);

        }
        System.in.read();
    }

}

My environment is:

  • Spring Boot 2.6.8
  • ReactiveRedisTemplate
  • lettuce 6.1.8.RELEASE

I found if my business codes throw errors, it will be catched by the lettuce pool thread.While this will finally lead to java.lang.UnsupportedOperationException: io.lettuce.core.output.ValueOutput does not support set(long) exception. And If I indicate a thread pool to my business codes by publishOn, the error won't be caught by lettuce thread.This can void the exception happening.

qinchunabng avatar Nov 13 '22 04:11 qinchunabng

@qinchunabng Thanks for sharing. I have one question. You example code has two comments. one is 'publishOn' and other is 'subscribeOn' To avoid exception happening, needs uncomment both?

rndmania avatar Jan 02 '23 14:01 rndmania

@qinchunabng Thanks for sharing. I have one question. You example code has two comments. one is 'publishOn' and other is 'subscribeOn' To avoid exception happening, needs uncomment both?

Just uncomment 'publishOn'. 'subscribeOn' is testing code.

qinchunabng avatar Jan 03 '23 03:01 qinchunabng

@qinchunabng Thanks for sharing. I have one question. You example code has two comments. one is 'publishOn' and other is 'subscribeOn' To avoid exception happening, needs uncomment both?

Just uncomment 'publishOn'. 'subscribeOn' is testing code.

hi bro, how to resolve

KamToHung avatar Feb 15 '23 16:02 KamToHung

@qinchunabng Thanks for sharing. I have one question. You example code has two comments. one is 'publishOn' and other is 'subscribeOn' To avoid exception happening, needs uncomment both?

Just uncomment 'publishOn'. 'subscribeOn' is testing code.

hi bro, how to resolve Using publishOn indicates the business code in flatMap executing in other thread pool not in lettuce's thread pool. Through my test if your code run in lettuce's thread pool and it has some exceptions of Error occuring,there will probably result out of sync.

qinchunabng avatar Feb 16 '23 01:02 qinchunabng

Our team also encountered this sporadic issue, which occurs approximately once every 1-2 weeks. Through our testing, we found that this problem only occurs on aarch64 servers and is not observed on X86-64 servers. We were able to resolve this issue by adjusting the version of the JDK.

Here is the environmental information: Redis Server: 6.2.5 spring-data-redis: 2.1.20.RELEASE springboot: 2.1.17.RELEASE Lettuce: 6.0.7 Release

Test Result: Oracle JDK 8u131 ---> Exception Oracle JDK 8u201 ---> Exception Oracle JDK 8u251 ---> OK

Open JDK 8u352 ---> OK Open JDK 8u292 ---> OK

wqq0712 avatar Jun 27 '23 09:06 wqq0712

@wqq0712 Thanks for good information.

I have meet this problem with x86-64 server and Open JDK 8u222

So, I think aarch64 and x86-64 is not important condigion. I will try test with various Open JDK version.

Thank you.

rndmania avatar Jun 28 '23 13:06 rndmania

I do not think that the JDK version is involved. Rather, out of sync happens either if a codec fails encoding/decoding or an output fails (or any other exotic condition happens).

Happy to provide support if you can provide a reliable reproducer.

mp911de avatar Jun 29 '23 08:06 mp911de

@wqq0712 Thanks for good information.

I have meet this problem with x86-64 server and Open JDK 8u222

So, I think aarch64 and x86-64 is not important condigion. I will try test with various Open JDK version.

Thank you.

@wqq0712 Thanks for good information.

I have meet this problem with x86-64 server and Open JDK 8u222

So, I think aarch64 and x86-64 is not important condigion. I will try test with various Open JDK version.

Thank you.

@rndmania

I think maybe there is a bug in the 8u222 version, which I also use on the aarch64 platform. The problem stopped when I upgraded to 8u341.

FULaBUla avatar Feb 28 '24 14:02 FULaBUla