
Sporadic ConnectionResetError on containers

Open · yovelcohen opened this issue 6 months ago

Description

Hi, I'm running some microservices in containers on Azure Container Apps/Jobs. From time to time, and on almost every container run, there are at least a few connection errors. I also log to the console, and Azure collects that output as a backup, so I can see these:

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
ERROR:opentelemetry.sdk.metrics._internal.export:Exception while exporting metrics

On the Logfire dashboard, a span for one of these runs just gets cut off in the middle and stops receiving logs. Some runs show the connection reset error once and then manage to re-establish the connection and keep sending logs to Logfire; in other runs the error repeats throughout the container's lifespan. It mostly happens on the container jobs and less on the HTTP containers.
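A minimal way to get more detail on each failure, before touching anything Logfire-specific, is to enable Python logging for the OpenTelemetry SDK's internal loggers ahead of logfire.configure(), so every export failure and retry the SDK reports ends up in the console output that Azure collects:

import logging

# Standard Python logging only: make the OpenTelemetry SDK's internal messages
# (including the "Exception while exporting metrics" error above) visible in
# the console output that Azure collects.
logging.basicConfig(level=logging.INFO)
logging.getLogger("opentelemetry").setLevel(logging.DEBUG)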

For reference, here's how I initialize most of the services:

logfire.configure(
    send_to_logfire="if-token-present",
    token=settings.LOGFIRE_TOKEN,
    service_name='<SERVICE NAME>',
    environment=settings.ENV_TYPE,
    console=logfire.ConsoleOptions(min_log_level="trace", show_project_link=False),
    advanced=logfire.AdvancedOptions(base_url="https://logfire-api.pydantic.dev"),
)
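One thing worth experimenting with (a sketch only, assuming Logfire's exporters honor the standard OpenTelemetry environment variables, which I haven't confirmed): raising the export timeout and shrinking batches before configure() runs, so a transient reset costs less data and has more room to retry:

import os

# Assumption: Logfire exports through the OpenTelemetry SDK, so the standard
# OTel environment variables should apply. Set these before logfire.configure().
os.environ.setdefault("OTEL_EXPORTER_OTLP_TIMEOUT", "30")       # export timeout; the Python exporter reads this in seconds
os.environ.setdefault("OTEL_BSP_SCHEDULE_DELAY", "2000")        # ms between span batch exports
os.environ.setdefault("OTEL_BSP_MAX_EXPORT_BATCH_SIZE", "256")  # smaller request bodies per export
os.environ.setdefault("OTEL_METRIC_EXPORT_INTERVAL", "30000")   # ms between metric exports, so less is lost when one fails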

I recently added a call to shutdown() to see if it helps, but it didn't change much:

import asyncio

import logfire


async def run_some_job():
    # Pull one message off the queue, process it, then close the client and
    # shut down telemetry before the process exits.
    queue_client: QueueClient = await get_or_create_queue_client(settings.JOB_QUEUE)
    if message := await queue_client.receive_message():
        await _run_message(message, queue_client)
    await queue_client.close()
    logfire.shutdown()


if __name__ == "__main__":
    asyncio.run(run_some_job())
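A variant worth trying (a sketch, assuming logfire.force_flush() is available in your Logfire version): wrap the job body in try/finally and flush explicitly before shutdown, so buffered spans and metrics get a chance to be exported even if _run_message raises:

async def run_some_job():
    queue_client = await get_or_create_queue_client(settings.JOB_QUEUE)
    try:
        if message := await queue_client.receive_message():
            await _run_message(message, queue_client)
    finally:
        await queue_client.close()
        # Explicitly block until pending exports finish (or time out),
        # then shut down; the finally block ensures this runs on errors too.
        logfire.force_flush()
        logfire.shutdown()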

Python, Logfire & OS Versions, related packages (not required)

Logfire version: 3.15.0
OS version: Debian 12 (Bookworm), running Python 3.12 base image variants

yovelcohen · May 19 '25 17:05

Hmm, we've seen something similar before with render.com, but never found the source.

This is almost certainly not an issue with our backend, but rather with the container environment or the network somewhere between the Python process and Cloudflare (which all our traffic is proxied through). But I'm afraid I can't be much more help right now.
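If it helps to narrow things down, a rough probe like this (just a sketch), run inside an affected container with plain requests against the same base URL, should show whether the resets also happen outside the SDK:

import time

import requests

BASE_URL = "https://logfire-api.pydantic.dev"  # same base_url as in configure()

# Open a fresh connection every few seconds and record whether it gets reset,
# independently of the OpenTelemetry exporters.
for i in range(60):
    try:
        response = requests.get(BASE_URL, timeout=10)
        print(i, response.status_code)
    except requests.exceptions.ConnectionError as exc:
        print(i, "connection error:", exc)
    time.sleep(10)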

samuelcolvin · May 26 '25 20:05

@samuelcolvin thanks :) I'll ask the OpenTelemetry SDK team, maybe they'll know a bit more.

yovelcohen · May 28 '25 12:05