
Test k8s ex job failure scenario

godber opened this issue 6 years ago • 5 comments

There is probably an issue with handling the case where the k8s job for the teraslice execution has 0 running pods due to external failure. I think this results in workers being abandoned. Maybe Mike can say more.

godber avatar Nov 14 '18 18:11 godber

Correct, once the execution controller job is marked as terminated, completed, etc., and the pod goes away, the deployment / workers remain but end up in a crash loop. In the case I experienced, the state store went offline, causing the orphan situation.

Possible ways to replicate:

  • Stop the stateful storage or drop egress via network policy and wait for the jobs to fail
  • Scale the worker deployments to 0 and I think the execution controller job may complete (see the sketch below)
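
A rough sketch of that second approach, purely as an assumption about how one might drive it by hand; the namespace and deployment names are placeholders, and the ts-wkr-* naming follows the worker deployments shown in the logs later in this thread:

    # find the worker deployment and record its labels / job id
    kubectl -n <teraslice namespace> get deployments --show-labels=true

    # scale the worker deployment to zero, then watch whether the
    # execution controller job completes or fails on its own
    kubectl -n <teraslice namespace> scale deployment <ts-wkr-deployment-name> --replicas=0
    kubectl -n <teraslice namespace> get jobs,pods -w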

When trying to clean up from my situation, where Elasticsearch became unavailable, I did the following (the full sequence is sketched after this list):

  1. The first thing I did was run kubectl -n <teraslice namespace> get deployments --show-labels=true and record all the job ids still in k8s.
  2. Tried to call /jobs/<jobid>/_stop to clean up the exec job plus worker deployment under k8s:
    • If it responded with stopped then I did nothing else since everything cleaned up normally
    • If it responded with anything else (like terminated) I manually deleted the execution controller (job) and workers (deployment)
  3. I then called /jobs/<jobid>/_start to redeploy the job after the original state issue was resolved
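
Put together, that cleanup pass might look roughly like the following; this is a sketch under the assumption that the job-control endpoints accept POST and that resource names follow the ts-exc-* / ts-wkr-* patterns seen in the logs further down (the master URL, namespace, job id, and resource names are all placeholders):

    # 1. record the job ids still present in k8s
    kubectl -n <teraslice namespace> get deployments --show-labels=true

    # 2. ask teraslice to stop the job; if it answers "stopped", nothing else to do
    curl -Ss -X POST "<teraslice master>/jobs/<jobid>/_stop"

    #    if it answered anything else (e.g. terminated), remove the leftovers by hand
    kubectl -n <teraslice namespace> delete job <ts-exc-job-name>
    kubectl -n <teraslice namespace> delete deployment <ts-wkr-deployment-name>

    # 3. once the state store is healthy again, redeploy the job
    curl -Ss -X POST "<teraslice master>/jobs/<jobid>/_start"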

I think ideally, once a job is completed, terminated, etc., the now-orphaned deployment either gets cleaned up or scaled down to 0 to avoid unnecessary crash loops or k8s cruft. I suspect deciding which way to go depends on how jobs are expected to be recovered in the event of execution controller completion / failure.
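
For reference, spotting such an orphan by hand today looks something like this, assuming the ts-exc-* job / ts-wkr-* deployment naming seen in the logs later in this thread: a ts-wkr-* worker deployment whose matching ts-exc-* job is gone (or Complete/Failed) is the orphan being described.

    # worker deployments should have a matching, still-running execution controller job;
    # a ts-wkr-* deployment without one is an orphan that will just crash loop
    kubectl -n <teraslice namespace> get jobs,deployments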

mkhpalm avatar Nov 14 '18 18:11 mkhpalm

The one time I managed to get the execution controller to shut down by scaling to zero (see #1018), it cleaned up all the other bits with it. I know there's something here though, so I am not taking this as really compelling evidence.

godber avatar Feb 12 '19 23:02 godber

I just tried to generate a job failure scenario by doing the following:

    Scale the worker deployments to 0 and I think the execution controller job may complete

The job ends up in the failed state with the following _failureReason:

curl -Ss $(minikube ip):30678/ex/2a934ff3-9b23-445f-ae35-9a880158c983 | jq -r ._failureReason
TSError: slicer for ex 2a934ff3-9b23-445f-ae35-9a880158c983 had an error, shutting down execution, caused by Error: All workers from workers from 2a934ff3-9b23-445f-ae35-9a880158c983 have disconnected
    at ExecutionController._terminalError (/app/source/packages/teraslice/lib/workers/execution-controller/index.js:322:23)
    at Timeout.<anonymous> (/app/source/packages/teraslice/lib/workers/execution-controller/index.js:983:18)
    at listOnTimeout (internal/timers.js:549:17)
    at processTimers (internal/timers.js:492:7)
Caused by: Error: All workers from workers from 2a934ff3-9b23-445f-ae35-9a880158c983 have disconnected
    at ExecutionController._startWorkerDisconnectWatchDog (/app/source/packages/teraslice/lib/workers/execution-controller/index.js:975:21)
    at /app/source/packages/teraslice/lib/workers/execution-controller/index.js:174:18
    at Server.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/server.js:179:13)
    at Server.emit (events.js:327:22)
    at Server.emit (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:108:15)
    at Server.updateClientState (/app/source/packages/teraslice-messaging/dist/src/messenger/server.js:279:18)
    at Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/server.js:317:22)
    at Socket.emit (events.js:315:20)
    at Socket.emit (/app/source/node_modules/socket.io/lib/socket.js:141:10)
    at Socket.onclose (/app/source/node_modules/socket.io/lib/socket.js:441:8)
    at Client.onclose (/app/source/node_modules/socket.io/lib/client.js:235:24)
    at Socket.emit (events.js:327:22)
    at Socket.onClose (/app/source/node_modules/engine.io/lib/socket.js:311:10)
    at Object.onceWrapper (events.js:421:28)
    at WebSocket.emit (events.js:315:20)
    at WebSocket.Transport.onClose (/app/source/node_modules/engine.io/lib/transport.js:127:8)

The logs in the Execution Controller pod after the scale down are:

[2020-08-19T20:36:46.473Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: client 172.17.0.6__6JW2Dpcy disconnected { reason: 'transport close' } (assignment=execution_controller, module=messaging:server, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:36:46.486Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: client 172.17.0.7__yyJqPfWe disconnected { reason: 'transport close' } (assignment=execution_controller, module=messaging:server, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.489Z] ERROR: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: slicer for ex 2a934ff3-9b23-445f-ae35-9a880158c983 had an error, shutting down execution, caused by Error: All workers from workers from 2a934ff3-9b23-445f-ae35-9a880158c983 have disconnected (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae, err.code=INTERNAL_SERVER_ERROR)
    TSError: slicer for ex 2a934ff3-9b23-445f-ae35-9a880158c983 had an error, shutting down execution, caused by Error: All workers from workers from 2a934ff3-9b23-445f-ae35-9a880158c983 have disconnected
        at ExecutionController._terminalError (/app/source/packages/teraslice/lib/workers/execution-controller/index.js:322:23)
        at Timeout.<anonymous> (/app/source/packages/teraslice/lib/workers/execution-controller/index.js:983:18)
        at listOnTimeout (internal/timers.js:549:17)
        at processTimers (internal/timers.js:492:7)
    Caused by: Error: All workers from workers from 2a934ff3-9b23-445f-ae35-9a880158c983 have disconnected
        at ExecutionController._startWorkerDisconnectWatchDog (/app/source/packages/teraslice/lib/workers/execution-controller/index.js:975:21)
        at /app/source/packages/teraslice/lib/workers/execution-controller/index.js:174:18
        at Server.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/server.js:179:13)
        at Server.emit (events.js:327:22)
        at Server.emit (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:108:15)
        at Server.updateClientState (/app/source/packages/teraslice-messaging/dist/src/messenger/server.js:279:18)
        at Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/server.js:317:22)
        at Socket.emit (events.js:315:20)
        at Socket.emit (/app/source/node_modules/socket.io/lib/socket.js:141:10)
        at Socket.onclose (/app/source/node_modules/socket.io/lib/socket.js:441:8)
        at Client.onclose (/app/source/node_modules/socket.io/lib/client.js:235:24)
        at Socket.emit (events.js:327:22)
        at Socket.onClose (/app/source/node_modules/engine.io/lib/socket.js:311:10)
        at Object.onceWrapper (events.js:421:28)
        at WebSocket.emit (events.js:315:20)
        at WebSocket.Transport.onClose (/app/source/node_modules/engine.io/lib/transport.js:127:8)
[2020-08-19T20:37:46.509Z] FATAL: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution 2a934ff3-9b23-445f-ae35-9a880158c983 is ended because of slice failure (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.509Z] DEBUG: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: stopping scheduler... (assignment=execution_controller, module=execution_scheduler, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.509Z] DEBUG: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution 2a934ff3-9b23-445f-ae35-9a880158c983 is finished scheduling, 7 remaining slices in the queue (assignment=execution_controller, module=execution_scheduler, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.511Z]  WARN: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: clients are all offline, but there are still 1 pending slices (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.512Z] DEBUG: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution 2a934ff3-9b23-445f-ae35-9a880158c983 did not finish (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.517Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: [START] "elasticsearch_data_generator" operation shutdown (assignment=execution_controller, module=slicer_context, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.517Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: [FINISH] "elasticsearch_data_generator" operation shutdown, took 1ms (assignment=execution_controller, module=slicer_context, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.520Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: calculating statistics (assignment=execution_controller, module=slice_analytics, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.520Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: (assignment=execution_controller, module=slice_analytics, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)

    operation elasticsearch_data_generator
    average completion time of: 673.63 ms, min: 473 ms, and max: 886 ms
    average size: 5000, min: 5000, and max: 5000
    average memory: 5183018, min: -7736312, and max: 8422968

[2020-08-19T20:37:46.520Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: (assignment=execution_controller, module=slice_analytics, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)

    operation example-op
    average completion time of: 0.13 ms, min: 0 ms, and max: 1 ms
    average size: 5000, min: 5000, and max: 5000
    average memory: 2640, min: 1544, and max: 4848

[2020-08-19T20:37:46.520Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: (assignment=execution_controller, module=slice_analytics, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)

    operation delay
    average completion time of: 30000.75 ms, min: 30000 ms, and max: 30001 ms
    average size: 5000, min: 5000, and max: 5000
    average memory: -3723653, min: -12046792, and max: 208104

[2020-08-19T20:37:46.520Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: (assignment=execution_controller, module=slice_analytics, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)

    operation elasticsearch_index_selector
    average completion time of: 24.5 ms, min: 18 ms, and max: 28 ms
    average size: 5000, min: 5000, and max: 5000
    average memory: 1633130, min: 1328184, and max: 1743240

[2020-08-19T20:37:46.520Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: (assignment=execution_controller, module=slice_analytics, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)

    operation elasticsearch_bulk
    average completion time of: 312.75 ms, min: 293 ms, and max: 348 ms
    average size: 5000, min: 5000, and max: 5000
    average memory: 6128912, min: -4279272, and max: 9625368

[2020-08-19T20:37:46.520Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution 2a934ff3-9b23-445f-ae35-9a880158c983 has finished in 216 seconds (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.528Z] DEBUG: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution 2a934ff3-9b23-445f-ae35-9a880158c983 is done (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.629Z] DEBUG: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution shutdown was called for ex 2a934ff3-9b23-445f-ae35-9a880158c983 (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.631Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: shutting down. (assignment=execution_controller, module=ex_storage, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.632Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: shutting down (assignment=execution_controller, module=state_storage, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.636Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: client 2a934ff3-9b23-445f-ae35-9a880158c983 disconnected { reason: 'io client disconnect' } (assignment=execution_controller, module=messaging:client, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:46.835Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution_controller received process:SIGTERM, already shutting down, remaining 30s (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:51.638Z]  WARN: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution controller 2a934ff3-9b23-445f-ae35-9a880158c983 is shutdown (assignment=execution_controller, module=execution_controller, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:51.639Z]  INFO: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: execution_controller shutdown took 5s (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)
[2020-08-19T20:37:52.640Z] DEBUG: teraslice/6 on ts-exc-example-data-generator-job-0c8eaaee-146e-spv6h: flushed logs successfully, will exit with code 0 (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=UevO1XrS, ex_id=2a934ff3-9b23-445f-ae35-9a880158c983, job_id=0c8eaaee-146e-4136-a1d7-acbe88951eae)

The logs in the master are:

[2020-08-19T20:37:46.530Z] DEBUG: teraslice/14 on teraslice-master-57b6b9b44d-4wzkp: execution 2a934ff3-9b23-445f-ae35-9a880158c983 finished, shutting down execution (assignment=cluster_master, module=execution_service, worker_id=XIZ1YZ4i)
[2020-08-19T20:37:46.538Z]  INFO: teraslice/14 on teraslice-master-57b6b9b44d-4wzkp: k8s._deleteObjByExId: 2a934ff3-9b23-445f-ae35-9a880158c983 execution_controller jobs deleting: ts-exc-example-data-generator-job-0c8eaaee-146e (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=XIZ1YZ4i)
[2020-08-19T20:37:46.571Z]  INFO: teraslice/14 on teraslice-master-57b6b9b44d-4wzkp: k8s._deleteObjByExId: 2a934ff3-9b23-445f-ae35-9a880158c983 worker deployments deleting: ts-wkr-example-data-generator-job-0c8eaaee-146e (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=XIZ1YZ4i)
[2020-08-19T20:37:46.638Z]  INFO: teraslice/14 on teraslice-master-57b6b9b44d-4wzkp: client 2a934ff3-9b23-445f-ae35-9a880158c983 disconnected { reason: 'client namespace disconnect' } (assignment=cluster_master, module=messaging:server, worker_id=XIZ1YZ4i)

I think this is the desired behavior. Perhaps the _failureReason could be improved, but really, the workers went away, and the execution controller timed out, exited, and failed the execution. This seems correct. It's possible we'd want more information about the k8s resource to understand an unexpected failure like this. But I think this is OK.
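
Gathering that extra k8s-side context currently means digging by hand; a sketch of what that might involve (resource names are illustrative, following the patterns in the logs above):

    # see why the execution controller job's pods went away
    kubectl -n <teraslice namespace> describe job <ts-exc-job-name>

    # recent cluster events often name the culprit (evictions, OOM kills, image pulls, ...)
    kubectl -n <teraslice namespace> get events --sort-by=.lastTimestamp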

godber avatar Aug 19 '20 21:08 godber

Here are the logs in the execution controller pod when the master pod is shut down and restarted.

[2023-11-16T18:04:56.067Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: worker 10.244.0.12__Ip8ilsTN has completed its slice 97a0a540-e56a-4e20-b9d9-392b515fa240 (assignment=execution_controller, module=execution_controller, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
[2023-11-16T18:04:56.075Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: dispatched slice d93c771c-5681-4e14-a486-be9a8a262b69 to worker 10.244.0.12__Ip8ilsTN (assignment=execution_controller, module=execution_controller, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
[2023-11-16T18:05:23.866Z]  WARN: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: cluster master did not record the cluster analytics (assignment=execution_controller, module=execution_analytics, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
[2023-11-16T18:05:26.145Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: worker 10.244.0.12__Ip8ilsTN has completed its slice d93c771c-5681-4e14-a486-be9a8a262b69 (assignment=execution_controller, module=execution_controller, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
[2023-11-16T18:05:26.157Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: dispatched slice 62aea529-f6f4-420f-89c9-4435444485de to worker 10.244.0.12__Ip8ilsTN (assignment=execution_controller, module=execution_controller, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
[2023-11-16T18:05:56.215Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: worker 10.244.0.12__Ip8ilsTN has completed its slice 62aea529-f6f4-420f-89c9-4435444485de (assignment=execution_controller, module=execution_controller, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
[2023-11-16T18:05:56.225Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: dispatched slice a67e02dc-b8cc-4e62-a5e0-642d9475fda7 to worker 10.244.0.12__Ip8ilsTN (assignment=execution_controller, module=execution_controller, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
[2023-11-16T18:06:14.351Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: Client ClusterMaster is not ready (assignment=execution_controller, module=messaging:client, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
    Error: Client ClusterMaster is not ready
        at Client.waitForClientReady (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:95:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:73:21)
[2023-11-16T18:06:16.364Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: Client ClusterMaster is not ready (assignment=execution_controller, module=messaging:client, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
    Error: Client ClusterMaster is not ready
        at Client.waitForClientReady (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:95:19)
        at async Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:73:21)
[2023-11-16T18:06:18.362Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-d9054c49-8c6e-k2v8q: Client ClusterMaster is not ready (assignment=execution_controller, module=messaging:client, worker_id=dPlbIJ6p, ex_id=d06dfac0-d859-413c-b566-7de9280b91eb, job_id=d9054c49-8c6e-4729-bdc8-5cb4d0f90377)
    Error: Client ClusterMaster is not ready
        at Client.waitForClientReady (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:95:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:73:21)

sotojn avatar Nov 16 '23 21:11 sotojn

We have made improvements to the K8s support; please check whether this:

https://github.com/terascope/teraslice/issues/893#issuecomment-676725895

still happens. If it does not, please indicate as much and close this issue. If it still happens, then consider a solution.

godber avatar Jul 23 '24 22:07 godber

This is no longer a problem; updating Node and library versions helped fix it.

jsnoble avatar Oct 09 '24 22:10 jsnoble