server-client connection error
🐛 Bug
We are running an Aim server on Kubernetes and tracking experiments from multiple VMs. Occasionally we see the following error, after which no further metrics are submitted to Aim, even though the experiment itself continues to run.
E1109 19:56:44.231976010 299181 ssl_transport_security.cc:552] Corruption detected.
E1109 19:56:44.232044682 299181 ssl_transport_security.cc:528] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E1109 19:56:44.232062463 299181 secure_endpoint.cc:205] Decryption error: TSI_DATA_CORRUPTED
Exception in thread Thread-13 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 85, in _try_exec_task
    raise e
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/aim/ext/transport/client.py", line 299, in _run_write_instructions
    response = self.remote.run_write_instructions(message_stream_generator(), metadata=self._request_metadata)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 1131, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/hrvoje_stojic_secondmind_ai/hv/.venv/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Stream removed"
    debug_error_string = "{"created":"@1699559804.232117923","description":"Error received from peer ipv4:10.91.128.8:443","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Stream removed","grpc_status":2}"
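One plausible reading of the traceback is that the exception escaped the background worker thread, so the thread exited and nothing was left to drain the queue of pending writes, while the main experiment kept running. The sketch below only illustrates that general failure mode. It is not Aim's code: `process_tasks`, `keep_alive`, and the use of `ConnectionError` as a stand-in for the gRPC error are all hypothetical.

```python
import queue
import threading


def process_tasks(task_list, keep_alive):
    """Run callables on a single background worker and return their results.

    Hypothetical illustration, not Aim's implementation.
    keep_alive=False: the worker stops at the first failing task.
    keep_alive=True:  the worker skips the failing task and carries on.
    """
    tasks = queue.Queue()
    results = []
    for task in task_list:
        tasks.put(task)
    tasks.put(None)  # sentinel: no more work

    def worker():
        while True:
            task = tasks.get()
            if task is None:
                return
            try:
                results.append(task())
            except ConnectionError:
                if not keep_alive:
                    # The worker exits here, much like a thread ended by an
                    # unhandled exception; queued tasks are never processed.
                    return

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return results


def failing_write():
    # Stand-in for a write that fails with a transport error.
    raise ConnectionError("Stream removed")


writes = [lambda: 1, failing_write, lambda: 3]
lost = process_tasks(writes, keep_alive=False)   # [1]
kept = process_tasks(writes, keep_alive=True)    # [1, 3]
```

With `keep_alive=False` the write queued after the failure is silently dropped, which matches the symptom of metrics no longer arriving after the error.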
To reproduce
I don't have an MWE. The error appears to be stochastic and occurs only occasionally.
Expected behavior
Tracking is reliable: no metrics or experiments are lost.
Environment
- Aim Version: 3.17.5
- Python version: 3.10
- pip version:
- OS: Linux
- Any other relevant information
Hey @hstojic! Thanks for the report. From the logs above, it seems that something goes wrong when SSL is used with gRPC (we use gRPC for the tracking server). I did some digging and found similar issues reported in the gRPC repo:
- https://github.com/grpc/grpc/issues/28557
Suggested workaround: set the GRPC_POLL_STRATEGY environment variable on the clients:
export GRPC_POLL_STRATEGY=poll
- https://github.com/grpc/grpc/issues/23144#issuecomment-645601687
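If exporting the variable in the shell is inconvenient (for example, when the training script is launched by a scheduler), the same setting can probably be applied from Python. This is a minimal sketch under the assumption that gRPC reads GRPC_POLL_STRATEGY when the library is initialized, so the variable has to be set before `grpc` (and therefore `aim`) is first imported. The helper name `configure_grpc_polling` is made up for this example.

```python
import os


def configure_grpc_polling(strategy="poll"):
    """Set GRPC_POLL_STRATEGY unless the environment already defines it.

    Returns the value that is in effect afterwards. Hypothetical helper,
    not part of Aim or gRPC.
    """
    return os.environ.setdefault("GRPC_POLL_STRATEGY", strategy)


# Call this at the very top of the training script,
# before any `import grpc` or `import aim` statement.
effective_strategy = configure_grpc_polling()
```

`setdefault` is used so that a value exported in the shell still takes precedence over the default in the script.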
Let me know if these suggestions help to resolve the issue. If not, I would ask you to provide a few more details about the setup you're running.
Thank you for the quick response. I'll try setting that environment variable.