server-client connection error

Open hstojic opened this issue 2 years ago • 2 comments

🐛 Bug

We are running an Aim server on Kubernetes and tracking experiments from multiple VMs. Sometimes we see the following error, after which no metrics are submitted to Aim, even though the experiment continues.

E1109 19:56:44.231976010  299181 ssl_transport_security.cc:552] Corruption detected.
E1109 19:56:44.232044682  299181 ssl_transport_security.cc:528] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E1109 19:56:44.232062463  299181 secure_endpoint.cc:205]     Decryption error: TSI_DATA_CORRUPTED
Exception in thread Thread-13 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 85, in _try_exec_task
    raise e
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/client.py", line 299, in _run_write_instructions
    response = self.remote.run_write_instructions(message_stream_generator(), metadata=self._request_metadata)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1131, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Stream removed"
        debug_error_string = "{"created":"@1699559804.232117923","description":"Error received from peer ipv4:10.91.128.8:443","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Stream removed","grpc_status":2}"

To reproduce

I don't have an MWE; the error seems stochastic and only happens occasionally.

Expected behavior

Tracking is reliable, no metrics/experiments are lost.

Environment

  • Aim Version: 3.17.5
  • Python version: 3.10
  • pip version:
  • OS: Linux
  • Any other relevant information

hstojic avatar Nov 10 '23 09:11 hstojic

Hey @hstojic! Thanks for the report. From the logs above it seems that something goes wrong when using SSL with gRPC (we use gRPC for the tracking server). I've done some digging and found some similar issues reported on gRPC's repo.

  1. https://github.com/grpc/grpc/issues/28557

Setting the GRPC_POLL_STRATEGY environment variable for clients:

export GRPC_POLL_STRATEGY=poll

  2. https://github.com/grpc/grpc/issues/23144#issuecomment-645601687

Let me know if these suggestions help resolve the issue. If not, I would ask you to provide a few more details about the setup you're running.

mihran113 avatar Nov 10 '23 16:11 mihran113

Thank you for the quick response, I'll try that environment variable.

hstojic avatar Nov 10 '23 20:11 hstojic
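
For clients where exporting the variable in the shell is awkward (for example, jobs launched by a scheduler on each VM), the GRPC_POLL_STRATEGY suggestion above can also be applied from inside the tracking script. The snippet below is only a sketch: the helper name configure_grpc_polling is made up for illustration, and it assumes gRPC reads the variable when its core library initializes, so the call has to run before the first `import grpc` or `import aim` in the process. It has not been verified against this specific SSL error.

```python
import os


def configure_grpc_polling(strategy: str = "poll") -> str:
    """Set GRPC_POLL_STRATEGY for this process unless it is already set.

    Hypothetical helper, not part of Aim or gRPC. Returns the value that is
    in effect after the call.
    """
    # setdefault keeps a value that was already exported in the shell,
    # so an explicit `export GRPC_POLL_STRATEGY=...` still wins.
    return os.environ.setdefault("GRPC_POLL_STRATEGY", strategy)


# Run this at the very top of the training script, before importing
# grpc or aim (assumption: the variable is read at gRPC initialization).
active = configure_grpc_polling()
print(f"GRPC_POLL_STRATEGY={active}")

# import aim  # import only after the variable has been set
```

Using setdefault rather than a plain assignment is a deliberate choice here: it lets the same script run unchanged on machines where the variable is managed externally.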