
Attaching OTLPLogExporter to the root logger will cause infinite recursive calls if the gRPC destination is down.

Status: Open · bigqi0812 opened this issue 3 years ago · 11 comments

Env: opentelemetry-exporter-otlp-proto-grpc 1.12.0rc1, Python 3.10

Steps to reproduce

  1. Follow the example code: https://github.com/open-telemetry/opentelemetry-python/blob/main/docs/examples/logs/example.py
import logging

from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import (
    LogEmitterProvider,
    LoggingHandler,
    set_log_emitter_provider,
)
from opentelemetry.sdk._logs.export import BatchLogProcessor
from opentelemetry.sdk.resources import Resource

log_emitter_provider = LogEmitterProvider(
    resource=Resource.create({"service.name": "shoppingcart"}),
)
set_log_emitter_provider(log_emitter_provider)

exporter = OTLPLogExporter(insecure=True)
log_emitter_provider.add_log_processor(BatchLogProcessor(exporter))
log_emitter = log_emitter_provider.get_log_emitter(__name__, "0.1")
handler = LoggingHandler(level=logging.NOTSET, log_emitter=log_emitter)

root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.warning("test it out")
  2. Shut down the gRPC destination service (a local OpenTelemetry Collector, in this case).
  3. Run the example code.

What is the expected behavior? Given that the OpenTelemetry Collector is down, the gRPC call should return StatusCode.UNAVAILABLE, and the exporter should keep retrying the log export every few seconds.

What is the actual behavior? The code goes into an infinite recursive loop.

Additional context: This happens because the OTel exporter is attached to the root logger. In opentelemetry/exporter/otlp/proto/grpc/exporter.py, when a gRPC error happens, it uses logger.warning() to emit another log record describing the error. That record triggers another gRPC call to export the error log, which encounters another error, which is also retried and logged again, so the code goes into an infinite recursive loop.
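To illustrate the mechanism with a minimal toy (the handler class and logger name below are made up, not SDK code), any handler on the root logger that itself logs on failure produces this feedback loop:

import logging

class FailingExportHandler(logging.Handler):
    """Toy stand-in for a LoggingHandler whose export always fails."""

    def emit(self, record):
        # The real exporter logs a warning through its module logger when the
        # gRPC export fails; because this handler sits on the root logger,
        # that warning is routed straight back into emit(), and so on.
        logging.getLogger("opentelemetry.exporter.demo").warning(
            "Transient error StatusCode.UNAVAILABLE encountered, retrying..."
        )

logging.getLogger().addHandler(FailingExportHandler())
logging.getLogger().warning("test it out")  # recurses until RecursionError

In the real SDK each cycle also goes through the exporter's retry back-off, so it shows up as a never-ending export loop rather than an immediate crash.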

bigqi0812 · May 19 '22

@bigqi0812 You should notice that there is a call to shutdown in the example, which ensures the SDK shuts down by dropping any pending telemetry. The example doesn't get into an infinite loop. It terminates, but the exporter has an exponential back-off retry mechanism that goes up to 64 seconds. It might feel like it is stuck in an infinite loop, but it shuts down after all the retries fail. Wait for a minute and you should see that it terminates.
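For intuition, a rough sketch of that capped back-off (an approximation from the description above, not the exporter's exact code); giving up once the delay reaches the cap lines up with the roughly 70 seconds mentioned later in the thread:

import itertools

def backoff_delays(max_value=64):
    """Yield 1, 2, 4, ... seconds, capped at max_value."""
    for exponent in itertools.count():
        yield min(2 ** exponent, max_value)

# If the exporter gives up when the delay reaches the cap, a fully failing
# export waits roughly 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds in total,
# plus the time spent on the export attempts themselves.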

srikanthccv · May 19 '22

@bigqi0812 You should notice that there is a call to shutdown in the example, which ensures the SDK shuts down by dropping any pending telemetry. The example doesn't get into an infinite loop. It terminates, but the exporter has an exponential back-off retry mechanism that goes up to 64 seconds. It might feel like it is stuck in an infinite loop, but it shuts down after all the retries fail. Wait for a minute and you should see that it terminates.

Thank you very much for the reply @srikanthccv. I did a test and the infinite loop still happens. The issue is not the back-off retry loop itself. It's that the opentelemetry.exporter.otlp.proto.grpc.exporter._export() function calls another round of logger.warning("Transient error xxxx") inside the retry loop (line 322):

except RpcError as error:
    if error.code() in [
        StatusCode.CANCELLED,
        StatusCode.DEADLINE_EXCEEDED,
        StatusCode.RESOURCE_EXHAUSTED,
        StatusCode.ABORTED,
        StatusCode.OUT_OF_RANGE,
        StatusCode.UNAVAILABLE,
        StatusCode.DATA_LOSS,
    ]:
        ......
        # line 322
        logger.warning(f"Transient error {error.code()} encountered while exporting span batch, retrying in {delay}s.")

Given that the OTel log exporter is attached to my root logger, this logger.warning("Transient error xxx") triggers an attempt to send the "Transient error xxxx" log through the OTel log exporter, which fails again, which comes back down to line 322 again and sends another "Transient error xxxx", and it never ends.

Technically speaking it's not an infinite loop, but infinite recursive calls.

bigqi0812 · May 19 '22

I temporarily changed the logger.warning in line 322 and the logger.error in line 328 to print() statements, and that works around the issue.
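A less invasive variant of the same idea, using only the standard library (a sketch; the filter class name is made up, and handler refers to the LoggingHandler from the example), is to keep the OTLP handler from ever seeing the exporter's own records:

import logging

class DropOtelExporterLogs(logging.Filter):
    """Drop records emitted by the exporter itself so they never re-enter the OTLP pipeline."""

    def filter(self, record):
        # record.name is the emitting logger's name, e.g.
        # "opentelemetry.exporter.otlp.proto.grpc.exporter"
        return not record.name.startswith("opentelemetry.exporter")

# `handler` is the LoggingHandler attached to the root logger in the example above
handler.addFilter(DropOtelExporterLogs())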

bigqi0812 · May 19 '22

I ran the example myself and confirmed that it exits. The logs generated by the exporter are received by the handler and processor, but then there is a call to provider.shutdown, which drops those exporter records and ends the pipeline. How long did you wait before deciding that it is infinite recursive calls?

srikanthccv · May 19 '22

I waited for about 1 minute and it was still running.

bigqi0812 · May 19 '22

It takes about 70 seconds before exiting. Are you running the same example, or do you have different code? I executed it again and I see it terminates.

srikanthccv · May 19 '22

I double-checked the code I ran. I changed the BatchLogProcessor to a SimpleLogProcessor, and it still does not terminate after 3+ minutes. This is my testing code:

import logging

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import (
    OTLPLogExporter,
)
from opentelemetry.sdk._logs import (
    LogEmitterProvider,
    LoggingHandler,
    set_log_emitter_provider,
)
from opentelemetry.sdk._logs.export import BatchLogProcessor, SimpleLogProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

log_emitter_provider = LogEmitterProvider(
    resource=Resource.create(
        {
            "service.name": "shoppingcart",
            "service.instance.id": "instance-12",
        }
    ),
)
set_log_emitter_provider(log_emitter_provider)

exporter = OTLPLogExporter(insecure=True)
# log_emitter_provider.add_log_processor(BatchLogProcessor(exporter))
log_emitter_provider.add_log_processor(SimpleLogProcessor(exporter))

log_emitter = log_emitter_provider.get_log_emitter(__name__, "0.1")
handler = LoggingHandler(level=logging.NOTSET, log_emitter=log_emitter)

# Attach OTLP handler to root logger
logging.getLogger().addHandler(handler)

# Log directly
logging.info("Jackdaws love my big sphinx of quartz.")

# Create different namespaced loggers
logger1 = logging.getLogger("myapp.area1")
logger2 = logging.getLogger("myapp.area2")

logger1.debug("Quick zephyrs blow, vexing daft Jim.")
logger1.info("How quickly daft jumping zebras vex.")
logger2.warning("Jail zesty vixen who grabbed pay from quack.")
logger2.error("The five boxing wizards jump quickly.")


# Trace context correlation
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("foo"):
    # Do something
    logger2.error("Hyderabad, we have a major problem.")

log_emitter_provider.shutdown()

bigqi0812 · May 20 '22

Sorry, my mistake. I accidentally clicked the close button.

bigqi0812 · May 20 '22

It terminates with BatchLogProcessor since the logs are not sent as soon as they arrive and the code is non-blocking, i.e. it reaches the provider.shutdown() call, which sends the shutdown signal down the pipeline and ensures that the program ends (I checked this multiple times and can confirm it exits). But if you use SimpleLogProcessor, which is blocking in nature and sends the data as it arrives, it gets into this never-ending state of exporting its own error logs. That is technically expected, since the shutdown call is never reached (you can verify this by adding a print statement right before the shutdown line).
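As a concrete version of that check (a sketch against the reproduction script above; the prints are the only additions):

# ... end of the reproduction script ...
print("about to call shutdown")  # with SimpleLogProcessor this line is never reached
log_emitter_provider.shutdown()
print("shutdown returned")       # with BatchLogProcessor both prints appear, the second
                                 # one only after the export retries give up (~70 s here)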

We can certainly handle this case by providing a way to set propagate=False for the SDK's own logger components.
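As a rough sketch of that idea (this is something an application can already do itself today; whether the SDK should set it for you is the open question):

import logging

# Stop the SDK's and exporter's own loggers from propagating to the root
# logger, so their records never reach the OTLP LoggingHandler attached there.
logging.getLogger("opentelemetry").propagate = False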

srikanthccv · May 20 '22

You should notice that there is a call to shutdown in the example, which ensures the SDK shuts down by dropping any pending telemetry. The example doesn't get into an infinite loop. It terminates, but the exporter has an exponential back-off retry mechanism that goes up to 64 seconds. It might feel like it is stuck in an infinite loop, but it shuts down after all the retries fail. Wait for a minute and you should see that it terminates.

@srikanthccv

Is there a way to reduce this timeout? I am calling shutdown on my TracerProvider.

I would have expected setting export_timeout_millis on my BatchSpanProcessor to do the trick, but it does not.

benjaminclauss · Jul 22 '22

export_timeout_millis applies only to the final export call, but the retry mechanism only kicks in when the original export fails. Currently there is no way to configure this. The only thing holding us back from making it configurable is staying compliant with the OpenTelemetry specification while adding our own env vars.
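To make the scope concrete, a small sketch (the 5-second value is arbitrary); per the explanation above, this bounds the processor's export call but leaves the exporter's internal back-off untouched:

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

processor = BatchSpanProcessor(
    OTLPSpanExporter(insecure=True),
    export_timeout_millis=5_000,  # per-batch export deadline only, not the retry back-off
)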

srikanthccv · Jul 23 '22

We can certainly handle this case by providing a way to set propagate=False for the SDK's own logger components.

That wouldn't work for me; I set up logging in my applications using logging.config.dictConfig with incremental=False, which would overwrite your propagation setting.
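For context, a hypothetical minimal config of the kind meant here (handler and formatter names are illustrative); a non-incremental dictConfig like this reconfigures the logging tree, which is why a propagate flag set earlier by the SDK may not survive it:

import logging.config

logging.config.dictConfig({
    "version": 1,
    "incremental": False,               # the default: full reconfiguration
    "disable_existing_loggers": True,   # also the default
    "formatters": {"plain": {"format": "%(levelname)s %(name)s: %(message)s"}},
    "handlers": {"console": {"class": "logging.StreamHandler", "formatter": "plain"}},
    "root": {"level": "INFO", "handlers": ["console"]},
})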

The standard library defines https://docs.python.org/3/library/logging.html#logging.Handler.handleError, which should be used by logging handlers if "something goes wrong during log emission". Following that pattern, every logging call within opentelemetry.exporter.otlp.proto.grpc.exporter.OTLPExporterMixin._export should be replaced with a raised exception, and opentelemetry.sdk._logs._internal.LoggingHandler.emit should become

if not isinstance(self._logger, NoOpLogger):
    try:
        self._logger.emit(self._translate(record))
    except Exception:
        self.handleError(record)

Would that be an option?
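For reference, a toy handler (names made up; this is not the proposed SDK change itself) showing how Handler.handleError breaks the feedback loop by reporting the failure to stderr instead of routing it back through logging:

import logging

class SafeExportHandler(logging.Handler):
    """Toy handler whose 'export' always fails but reports the failure safely."""

    def emit(self, record):
        try:
            raise ConnectionError("collector unreachable")  # stand-in for a failed gRPC export
        except Exception:
            # Prints a "--- Logging error ---" traceback to sys.stderr (when
            # logging.raiseExceptions is true) instead of re-entering logging.
            self.handleError(record)

logging.getLogger().addHandler(SafeExportHandler())
logging.getLogger().warning("demo")  # one stderr traceback, then execution continues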

a-recknagel · Mar 12 '24