opentelemetry-python

Sporadic Connection Errors on Azure Container Apps/Jobs

Open yovelcohen opened this issue 6 months ago • 3 comments

Describe your environment

OS: Various Linux distros (different base images used)
Python version: Python 3.11/3.12
SDK version: 1.32.1

What happened?

I'm using logfire as my logging library; it is a wrapper on top of the OpenTelemetry SDK. We run microservices on Azure Container Apps/Jobs.

Sometimes, especially with jobs, the logging process fails with the following error:

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
ERROR:opentelemetry.sdk.metrics._internal.export:Exception while exporting metrics

This is very sporadic: one job can log everything successfully, another won't be able to log at all, and some stop midway through the run, leaving partial spans. I've discussed this issue with the logfire team here, and they say it's not an issue with their backend. I'm still not sure that's 100% accurate, but I thought someone in this project might have an idea as to why it happens.

Steps to Reproduce

If it helps, here's how I set up my logfire configuration (it sets up OTel behind the scenes):

logfire.configure(
	send_to_logfire="if-token-present",
	token=settings.LOGFIRE_TOKEN,
	service_name='SomeJob',
	environment=settings.ENV_TYPE,
	console=logfire.ConsoleOptions(min_log_level="trace", show_project_link=False),
	advanced=logfire.AdvancedOptions(base_url="https://logfire-api.pydantic.dev")
)

Expected Result

Logging is consistent and the connection isn't interrupted midway.

Actual Result

Sporadic connection errors.

Additional context

No response

Would you like to implement a fix?

None

yovelcohen, Jun 11 '25 13:06

That error comes from the OpenTelemetry exporter used in logfire. The underlying error is a common one that appears when the server cannot handle requests because it is too busy, when the network is unstable, or when some firewall rules are in place. It should not cause any process to fail, only add extra logs. Is that what you are experiencing?
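One way to check whether failed exports only produce log lines is to count them inside the job itself. The sketch below uses only the standard `logging` module and attaches a handler to the logger named in the error output of the report (`opentelemetry.sdk.metrics._internal.export`). The `ExportFailureCounter` class is a hypothetical helper written for illustration; it is not part of OpenTelemetry or logfire.

```python
import logging

# Logger name copied from the error line in the report:
# "ERROR:opentelemetry.sdk.metrics._internal.export:Exception while exporting metrics"
EXPORT_LOGGER = "opentelemetry.sdk.metrics._internal.export"


class ExportFailureCounter(logging.Handler):
    """Hypothetical helper: counts ERROR records seen on a logger."""

    def __init__(self) -> None:
        # Only records at ERROR level or above reach emit().
        super().__init__(level=logging.ERROR)
        self.failures = 0
        self.last_message = None

    def emit(self, record: logging.LogRecord) -> None:
        self.failures += 1
        self.last_message = record.getMessage()


counter = ExportFailureCounter()
logging.getLogger(EXPORT_LOGGER).addHandler(counter)
```

Printing `counter.failures` at the end of a job run would show how many export attempts were reported as failed in that run. Trace export errors are logged under a different logger name that the thread does not show; attaching the same handler to the parent `opentelemetry` logger should catch those too, since log records propagate to ancestor loggers by default.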

hectorhdzg, Jun 12 '25 21:06

@hectorhdzg the logfire team says there's nothing wrong with their server/Cloudflare proxy, and other users running at larger scale on platforms other than Azure Container Apps are not reporting this issue at all. To your question: no, I don't see extra logs. I either see nothing, see partial spans, or it simply works. As I mentioned, it's completely sporadic.

yovelcohen, Jun 15 '25 08:06

@yovelcohen the root cause of this kind of issue is usually painful to track down. I'm not familiar with how logfire uses OpenTelemetry, including which exporter it uses, so more details may give more insight into what is happening here. It looks like you are getting the error in both metrics and traces, so some sampling in traces could help if the backend is receiving too many requests.
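For reference, head sampling with the plain OpenTelemetry Python SDK can be sketched as below. This is a minimal configuration under stated assumptions: the 10% ratio is an arbitrary illustrative value, and the provider is built directly with the SDK. logfire creates its own provider inside `logfire.configure()`, and the thread does not say how sampling would be wired in there, so this is not a drop-in change for the setup in the report.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces (illustrative value, not a recommendation).
# ParentBased makes child spans follow their parent's sampling decision,
# so a trace is kept or dropped as a whole rather than span by span.
sampler = ParentBased(root=TraceIdRatioBased(0.1))

# Provider built directly with the SDK (an assumption of this sketch).
provider = TracerProvider(sampler=sampler)
```

The same sampler can also be selected without code changes through the standard environment variables `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and `OTEL_TRACES_SAMPLER_ARG=0.1`. Note that this only reduces trace volume; it would not affect metric exports such as the one in the error output above.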

hectorhdzg, Jun 16 '25 17:06

I'm also getting this. It started this week for me, and I'm not sure whether it was caused by some update.

Edit: just to add, I'm not using Azure, but the error still keeps repeating for me...

NathanAP, Jul 03 '25 13:07