trino icon indicating copy to clipboard operation
trino copied to clipboard

Reject task updates addressed to previous failed node

Open findepi opened this issue 1 year ago • 17 comments

In case of a worker node failure, a new node can be started with same IP address. Once the new node is done initializing and registers with discovery service, the coordinator will eventually be aware of it and use it for processing queries. Before then, and before coordinator is aware of node failure, it may try to reach the old failed node. Since the old and the new may have same IP address, coordinator may effective talk to uninitialized worker node as if it was ready to service queries. This may lead to failures such as "unknown catalog" or rather "unknown handle id", when decoding task update requests.

Prevent such hard to understand failures by passing node ID in task update requests. Provided that the new worker is started with a different node ID, these requests will be cleanly rejected without trying to service them.

Per https://github.com/trinodb/trino/issues/21735#issuecomment-2101821816 fixes https://github.com/trinodb/trino/issues/21735

findepi avatar Apr 29 '24 08:04 findepi

One of my favorite all time bug fix

wendigo avatar Apr 29 '24 08:04 wendigo

What happens when a node fails and restarts with the same IP and node id? This is a legitimate case, as the node id is expected to be stable for a given node across restarts (especially, when the node may have persistent data - raptor, etc)

martint avatar Apr 29 '24 16:04 martint

What happens when a node fails and restarts with the same IP and node id?

That's definitely not handled by this PR.

I was initially thinking about introducing something like "incarnation ID" (e.g. a UUID randomly chosen at a node start). Switched over to node ID on the assumption that (1) very quick node restarts with IP re-use are most likely to happen in kubernetes env and (2) node ID can be generated there (it's generated when not set, right?)

cc @elonazoulay

findepi avatar Apr 29 '24 20:04 findepi

What happens when a node fails and restarts with the same IP and node id?

That's definitely not handled by this PR.

I was initially thinking about introducing something like "incarnation ID" (e.g. a UUID randomly chosen at a node start). Switched over to node ID on the assumption that (1) very quick node restarts with IP re-use are most likely to happen in kubernetes env and (2) node ID can be generated there (it's generated when not set, right?)

cc @elonazoulay

That sounds like a good idea! It looks like 1) and 2) are correct. This is the behavior if node id is unset

elonazoulay avatar Apr 29 '24 20:04 elonazoulay

Related, I noticed that we have X-Trino-Task-Instance-Id which is created on the remote task and returned for task result responses. This was added as part of FTE. With this change, we should be able to remove that as this change seems to serve the same purpose. cc @dain @losipiuk

electrum avatar Apr 29 '24 21:04 electrum

I think the problem of node restarts is more complicated than it appears. We probably should track the instance ID and node IDs for remote tasks in the coordinator and proactively fail them once we detect that the node has restarted (same node ID with different instance ID, or different IDs for the same task host/port).

@losipiuk How does this affect FTE?

electrum avatar Apr 29 '24 21:04 electrum

We probably should track the instance ID and node IDs for remote tasks in the coordinator and proactively fail them once we detect that the node has restarted

This might be needed, but also it may turn out unnecessary. The task ID identifies the task a node was doing. If a node restarts, it won't know that task ID, so it won't update coordinator on its progress. This should result in task failure. What I think we can improve is the speed of failure recovery. This probably requires more deliberate approach, with some well chosen error codes to prevent retries from the coordinator.

we have X-Trino-Task-Instance-Id

Indeed. It seems it was added in https://github.com/trinodb/trino/commit/8d15b144ec8915eb2aadd37ada340adcd82eb261 it looks like aiming to address the "proactively fail fast" goal in some situations.

findepi avatar Apr 30 '24 07:04 findepi

Instead, this should use instance ID via NodeInfo.getInstanceId(), which will be unique on every restart. Unfortunately, we don't provide this in discovery announcements. I think the easiest approach is to modify the code in Server.updateConnectorIds() to add instance ID as a property.

Done

I also like @jklamer's idea of using an HTTP request filter. Instead of modifying the request paths, we can add a header X-Trino-Instance-Id that is validated in a filter. This makes it easy to add validations to more resources by adding the header in the request.

Not done at this point.

It is easy to inject validation when header is present, but it's harder to inject validation that requires the header to be present. For instance, I have problem finding io.trino.server.TaskResource#getAllTaskInfo usage and the usage would need to be updated to pass the header. Also, a really good filter should not be limited to TaskResource only.

findepi avatar Apr 30 '24 07:04 findepi

I think the problem of node restarts is more complicated than it appears. We probably should track the instance ID and node IDs for remote tasks in the coordinator and proactively fail them once we detect that the node has restarted (same node ID with different instance ID, or different IDs for the same task host/port).

@losipiuk How does this affect FTE?

As you already noticed there is a mechanism for handling this problem ~for FTE specifically~ already in the code. Having more general solution works too. We will just need to update isWorkerCrashAssociatedError method with relevant error code.

losipiuk avatar May 06 '24 07:05 losipiuk

( just rebased )

findepi avatar May 06 '24 08:05 findepi

PTAL

findepi avatar May 06 '24 10:05 findepi

Error:    TestDiscoveryNodeManager.testGetAllNodes:102 » NullPointer Cannot invoke "String.trim()" because "instanceId" is null
Error:    TestDiscoveryNodeManager.testGetCoordinators:172 » NullPointer Cannot invoke "String.trim()" because "instanceId" is null
Error:    TestDiscoveryNodeManager.testGetCurrentNode:152 » NullPointer Cannot invoke "String.trim()" because "instanceId" is null
Error:    TestDiscoveryNodeManager.testNodeChangeListener:211 » NullPointer Cannot invoke "String.trim()" because "instanceId" is null

wendigo avatar May 06 '24 11:05 wendigo

CI

findepi avatar May 06 '24 12:05 findepi

@losipiuk @electrum @wendigo ptal

findepi avatar May 07 '24 07:05 findepi

Requires https://github.com/airlift/discovery/pull/53

findepi avatar May 07 '24 08:05 findepi

Added a test.

findepi avatar May 07 '24 20:05 findepi

https://github.com/trinodb/trino/pull/21921 is a simpler alternative that avoids the problem described here: https://github.com/trinodb/trino/pull/21744#discussion_r1595040949

findepi avatar May 10 '24 14:05 findepi