Timeout on filter requests on large topics
Version: 2.4.x
Tested browser: Google Chrome Version 123.0.6312.105 (Official Build) (64-bit)
We discovered something after updating to 2.4.x (any version labeled 2.4): filter queries on large topics suddenly stop working after the update.
After some investigation this seems to be caused by the removal of websockets in favor of a standard HTTP request. The HTTP requests time out after 1 second, while the websockets were streaming the data, which kept the connection alive. The timeout comes from the backend, which closes the connection after one minute.
This can easily be reproduced by creating a large topic with values like {"active": false}, where only a single message has active set to true.
If we now filter over the complete topic (offset earliest) and search for active being true, the request times out.
I think the related commit / series of commits is: dc8d1d13
Hey @flamestro, we use HTTP streaming instead of Websockets for better compatibility in most environments. I believe the timeout may be caused by something higher up in your stack (a cloud loadbalancer or NGINX). You should receive regular heartbeats via the HTTP stream and thus not run into idle timeouts. We definitely don't have a 1s timeout in place in Console.
I am sorry, 1 second was wrong. I meant 1 minute, i.e. 60 seconds.
I actually doubt that this is due to the stack around it, as it only happens after the switch from 2.3.x to 2.4.x. I will still try to reproduce this in a local docker compose setup with nothing in front of the console and the Redpanda backend.
@flamestro We switched from Websockets to HTTP streaming in the move from v2.3.x to v2.4.x, so different timeouts may apply. We are using this build in Redpanda Cloud and are able to stream messages with search requests running >1min. Please make sure to use one of the latest v2.4 releases. I think with v2.4.6 we added a heartbeat for specific requests, which may already help in your situation, since depending on the search request you would otherwise run into an idle timeout from a TCP loadbalancer.
For other search requests (start offset: oldest) we were already sending heartbeats in the form of progress messages. So if your 60s interruption is an idle timeout, the most recent v2.4.10 release should definitely help. I'm not entirely sure which v2.4 release you tried.
Hey @weeco, I tried every available 2.4.x version, including 2.4.6. Sadly none of them works for us.
We have quite a lot of entries, which might not help in this situation.
The loading indicator runs forever and the request then crashes. The same does not happen with 2.3.10.
The search filter is only key === "someId".
@flamestro Does this issue still exist? We implemented heartbeating to avoid idle timeouts in loadbalancers along the chain. I'll close this for now because I believe it is no longer an issue and we haven't received any other reports about it.
Hello @weeco, we are facing a similar issue with large topics: it works fine through port forwarding but not through the ingress. Tested version: v2.6.0. The issue started with v2.4.x (the switch from Websockets to HTTP streaming).
The progress got stuck at 110008 for several seconds and then a network error appeared.
@rabii17 Could you check your NGINX logs to see whether the reverse proxy is cancelling the connection? It could also be the TCP loadbalancer, if you use one. We run Console in an environment where requests can run longer than 60s, so I believe the problem is outside Console.
@weeco The only error reported by the ingress was "upstream prematurely closed connection while reading upstream".
But as I said, it was working in previous versions. It should be easy to reproduce with a redpanda-console instance published through an nginx ingress.
@rabii17 Depending on which previous versions you refer to, there was a difference: we replaced websockets with HTTP/1.1 streaming for message streaming. This may be what changed. We have Console deployments behind NGINX and are not experiencing such timeouts; I believe our deployment is fairly standard.
If you can provide a docker compose where this is reproducible, I'm happy to take a look.
I was finally able to sit down and continue my investigations here.
I was able to solve the problem by adding proxy_buffering off; to our nginx config. Another solution that works (without disabling buffering) is to increase nginx's default timeouts.
It turns out that if your buffer size is big enough for the complete response (which it most likely is), nginx will not send anything to the client until the response has been fully collected. That takes longer than 60 seconds, which is the default timeout on the nginx side.
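A minimal sketch of the kind of reverse-proxy block involved, assuming a plain nginx in front of Console; the server name, upstream address, and port are placeholders, not our actual setup:

```nginx
server {
    listen 80;
    server_name console.example.com;   # placeholder

    location / {
        # Upstream address and port are placeholders.
        proxy_pass http://redpanda-console:8080;
        proxy_http_version 1.1;

        # Option 1: forward the HTTP stream (and its heartbeats) to the
        # client immediately instead of collecting it in nginx buffers.
        proxy_buffering off;

        # Option 2 (alternative, keeps buffering on): raise the default
        # 60s timeouts so long-running filter requests are not cut off.
        # proxy_read_timeout 300s;
        # proxy_send_timeout 300s;
    }
}
```

With buffering disabled, the heartbeats reach the client as soon as Console sends them, so neither nginx nor the browser sees an idle connection.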