logfire
Sporadic ConnectionResetError on containers
Description
Hi, I'm running some microservices in containers on Azure Container Apps/Jobs. On almost every container run there are at least a few connection errors. I also log to the console, and Azure collects those logs as a backup, so I can see these:
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
ERROR:opentelemetry.sdk.metrics._internal.export:Exception while exporting metrics
On the Logfire dashboard, a span from a run like this just gets disconnected in the middle and stops receiving logs. Some runs show the connection reset error once and then manage to re-establish the connection and send the logs to Logfire; in other runs I see the error repeating throughout the container's lifespan. It happens mostly on the container jobs and less on the HTTP containers.
For reference, here's how I initialize most of the services:
logfire.configure(
    send_to_logfire="if-token-present",
    token=settings.LOGFIRE_TOKEN,
    service_name='<SERVICE NAME>',
    environment=settings.ENV_TYPE,
    console=logfire.ConsoleOptions(min_log_level="trace", show_project_link=False),
    advanced=logfire.AdvancedOptions(base_url="https://logfire-api.pydantic.dev"),
)
I recently added a call to logfire.shutdown() to see if it helps, but it didn't change much:
async def run_some_job():
    queue_client: QueueClient = await get_or_create_queue_client(settings.JOB_QUEUE)
    if message := await queue_client.receive_message():
        ret = await _run_message(message, queue_client)
    await queue_client.close()
    logfire.shutdown()

if __name__ == "__main__":
    asyncio.run(run_some_job())
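One thing worth noting about the snippet above: as written, the shutdown call only runs if nothing raises before it. A minimal sketch of wrapping the job in try/finally so the shutdown hook always runs is shown here. The helper `run_with_shutdown` and the stand-ins `failing_job` and `fake_shutdown` are hypothetical names used only for illustration; in the real service the hook would presumably be `logfire.shutdown`. This is not claimed to fix the connection resets, only to make sure the final flush is attempted.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_shutdown(
    job: Callable[[], Awaitable[object]],
    shutdown: Callable[[], object],
) -> object:
    """Await job() and always call shutdown() afterwards, even if job() raises."""
    try:
        return await job()
    finally:
        # Assumption: in the real service this would be logfire.shutdown,
        # called once right before the process exits.
        shutdown()


# Stand-ins so the sketch runs without Azure or Logfire (hypothetical).
calls: list[str] = []


async def failing_job() -> None:
    calls.append("job")
    raise RuntimeError("simulated job failure")


def fake_shutdown() -> None:
    calls.append("shutdown")


try:
    asyncio.run(run_with_shutdown(failing_job, fake_shutdown))
except RuntimeError:
    # The original exception still propagates after the shutdown hook ran.
    calls.append("error propagated")
```

In the job above, this would amount to something like `asyncio.run(run_with_shutdown(run_some_job, logfire.shutdown))`, with the `logfire.shutdown()` call removed from `run_some_job` itself.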
Python, Logfire & OS Versions, related packages (not required)
Logfire version: 3.15.0
OS Version: Debian Bookworm (Debian 12), using variations of the Python 3.12 base images
Hmm, we've seen something similar before with render.com, but never found the cause.
This is almost certainly not an issue with our backend, but rather with the container environment or the network somewhere between the Python process and Cloudflare (which all our traffic is proxied through). But I'm afraid I can't be much more help right now.
@samuelcolvin thanks :) I'll ask the OpenTelemetry SDK team, maybe they'll know a bit more.