Inactive cluster members and failed select queries

Open dberardo-com opened this issue 2 years ago • 4 comments

I have been experimenting with ksqlDB clusters and have brought the same Docker service up and down a couple of times. Now I am left with a situation where only one ksqlDB container is up, but I get this error on some pull queries:

Error starting pull query: Unable to execute pull query. Partition 0 failed to find valid host. Hosts scanned: 8f0737fa57cb:8088 was not selected because Lag information is not present for host.

If I send a GET request to the /clusterStatus API endpoint, I get:

{
    "clusterStatus": {
        "8f0737fa57cb:8088": {
            "hostAlive": true,
            "lastStatusUpdateMs": 1653633878034,
            "activeStandbyPerQuery": {
                ...long list
            },
            "hostStoreLags": { ...long list }
        },
        "55a2557e4c22:8088": {
            "hostAlive": false,
            "lastStatusUpdateMs": 1653634468072,
            "activeStandbyPerQuery": {},
            "hostStoreLags": {
                "stateStoreLags": {},
                "updateTimeMs": 0
            }
        },
        "8745fb3168e7:8088": {
            "hostAlive": false,
            "lastStatusUpdateMs": 1653633880122,
            "activeStandbyPerQuery": {},
            "hostStoreLags": {
                "stateStoreLags": {},
                "updateTimeMs": 0
            }
        },
        "80c3d9954cc4:8088": {
            "hostAlive": false,
            "lastStatusUpdateMs": 1653634788672,
            "activeStandbyPerQuery": {},
            "hostStoreLags": {
                "stateStoreLags": {},
                "updateTimeMs": 0
            }
        }
    }
}

My guess is that the inactive hosts should be dropped from the cluster, but I cannot find a way to do that.

What could be causing the error?
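
For reference, a response shaped like the one above can be summarized per host with a small script. This is only a sketch: `summarize_cluster_status` is a hypothetical helper, not part of ksqlDB, and the `example-host:8088` entry with its `some_store` key is made up to show a host that does report lag. The two dead hosts are copied from the response above.

```python
import json


def summarize_cluster_status(response: dict) -> dict:
    """Map each host to (host_alive, has_lag_info) from a /clusterStatus response."""
    summary = {}
    for host, status in response.get("clusterStatus", {}).items():
        # A host counts as having lag info only if stateStoreLags is non-empty.
        lags = status.get("hostStoreLags", {}).get("stateStoreLags", {})
        summary[host] = (bool(status.get("hostAlive")), bool(lags))
    return summary


# Two dead hosts taken from the response above, plus one hypothetical live host.
sample = json.loads("""
{
  "clusterStatus": {
    "55a2557e4c22:8088": {
      "hostAlive": false,
      "lastStatusUpdateMs": 1653634468072,
      "activeStandbyPerQuery": {},
      "hostStoreLags": {"stateStoreLags": {}, "updateTimeMs": 0}
    },
    "8745fb3168e7:8088": {
      "hostAlive": false,
      "lastStatusUpdateMs": 1653633880122,
      "activeStandbyPerQuery": {},
      "hostStoreLags": {"stateStoreLags": {}, "updateTimeMs": 0}
    },
    "example-host:8088": {
      "hostAlive": true,
      "hostStoreLags": {"stateStoreLags": {"some_store": {}}, "updateTimeMs": 1}
    }
  }
}
""")

for host, (alive, has_lag) in summarize_cluster_status(sample).items():
    print(f"{host}: alive={alive}, lag info present={has_lag}")
```

A host that shows `alive=True` but `lag info present=False` would match the "Lag information is not present for host" part of the error.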

dberardo-com, May 27 '22 07:05

Also, I sometimes get this Java exception logged as a WARNING:

WARNING: Thread Thread[vert.x-eventloop-thread-9,5,main]=Thread[vert.x-eventloop-thread-9,5,main] has been blocked for 2170394 ms, time limit is 2000 ms io.vertx.core.VertxException: Thread blocked at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)

dberardo-com, May 27 '22 07:05

Hi @dberardo-com !

Are you including the max allowed lag parameter in the request? How have you configured ksql? Could you share your config file? Also, can you share the entire clusterStatus response, specifically the part of the "hostStoreLags": { ... lnglist}?
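
To illustrate what including the max allowed lag parameter in the request could look like, here is a minimal sketch that builds the JSON body for a pull query. It assumes the request goes to the REST `/query` endpoint and that the override is passed under `streamsProperties`. `build_pull_query_request`, `MY_TABLE`, and the key value are hypothetical placeholders.

```python
import json


def build_pull_query_request(sql: str, max_allowed_lag: int) -> str:
    """Build a JSON body for a pull query with a per-request lag override."""
    body = {
        "ksql": sql,
        # Assumption: per-request overrides are passed as streamsProperties.
        "streamsProperties": {
            "ksql.query.pull.max.allowed.offset.lag": str(max_allowed_lag),
        },
    }
    return json.dumps(body)


# MY_TABLE and 'k1' are placeholders for a real materialized table and key.
payload = build_pull_query_request("SELECT * FROM MY_TABLE WHERE ID = 'k1';", 100)
print(payload)
```

The resulting string would be sent as the POST body to the server's `/query` endpoint, for example against `http://localhost:8088` if the server listens there.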

From the clusterStatus response you shared, I am assuming you have configured a cluster of 4 nodes which means ksql has assigned partitions to each of these nodes. When a node goes down, the partitions of the failed node are moved to the other nodes. In this case, the partitions of all other nodes must move to the only node that is still alive. My guess is that you tried to issue the query while the rebalance was still in progress?

The fact that the other nodes still appear in the clusterStatus response is not an issue as they don't participate in pull query logic.

vpapavas, Jun 14 '22 16:06

Hi @vpapavas,

I have an issue similar to @dberardo-com's. Below is the error message we get when executing a query:

Error starting pull query: Unable to execute pull query. [Partition 0 failed to find valid host. Hosts scanned: [7f039edca983:8088 was not selected because Host is not alive as of time 1656708150041,Lag information is not present for host.]]

We set up precisely the high availability configuration indicated in your article:

ksql.streams.num.standby.replicas=1
ksql.query.pull.enable.standby.reads=true
ksql.heartbeat.enable=true
ksql.lag.reporting.enable=true
ksql.query.pull.max.allowed.offset.lag=100

The service runs in a container on AWS EC2. We have the following machine configuration:

listeners=http://0.0.0.0:8088
ksql.advertised.listener=http://OWNER_INTERNAL1_URL:8088,http://OWNER_INTERNAL2_URL:8088,http://OWNER_INTERNAL3_URL:8088,http://OWNER_INTERNAL4_URL:8088,http://OWNER_INTERNAL5_URL:8088

Before the above configuration, we also tried using an AWS ALB to route to the internal addresses, setting the ALB URL in ksql.advertised.listener, but that didn't work either. Increasing ksql.query.pull.max.allowed.offset.lag didn't help either.

The output of the clusterStatus endpoint always shows only one active node.

Any hints on what could be wrong? Thanks a lot for your time.

borbavanessa, Jul 01 '22 21:07

@borbavanessa, were you ever able to resolve this issue?

ntalcus, Dec 05 '23 18:12