opentelemetry-python
                        Attaching OTLPLogExporter to the root logger will cause infinite recursive calls if the gRPC destination is down.
Env: opentelemetry-exporter-otlp-proto-grpc 1.12.0rc1, Python 3.10
Steps to reproduce
- Follow the example code: https://github.com/open-telemetry/opentelemetry-python/blob/main/docs/examples/logs/example.py
import logging

from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
from opentelemetry.sdk._logs import LogEmitterProvider, LoggingHandler, set_log_emitter_provider
from opentelemetry.sdk._logs.export import BatchLogProcessor
from opentelemetry.sdk.resources import Resource

log_emitter_provider = LogEmitterProvider(
    resource=Resource.create(
        {
            "service.name": "shoppingcart"
        }
    ),
)
set_log_emitter_provider(log_emitter_provider)
exporter = OTLPLogExporter(insecure=True)
log_emitter_provider.add_log_processor(BatchLogProcessor(exporter))
log_emitter = log_emitter_provider.get_log_emitter(__name__, "0.1")
handler = LoggingHandler(level=logging.NOTSET, log_emitter=log_emitter)
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.warning("test it out")
- Shut down the gRPC destination service (a local OpenTelemetry Collector, in this case)
- Run the example code.
What is the expected behavior? Given that the OpenTelemetry Collector is down, the gRPC call should return StatusCode.UNAVAILABLE, and the exporter should keep retrying to send the log every few seconds.
What is the actual behavior? The code goes into an infinite chain of recursive calls.
Additional context This is because the OTel exporter is attached to the root logger. In opentelemetry/exporter/otlp/proto/grpc/exporter.py, when a gRPC error happens, it uses logger.warning() to emit another log record reporting the error, which triggers another gRPC call to export that error log, which encounters another error and is retried again, and so on into an infinite chain of recursive calls.
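For illustration only (a hypothetical handler, not the SDK's classes), a minimal sketch of that feedback loop: a handler whose emit path logs through the same logger tree it is attached to will recurse until Python raises RecursionError.

import logging

class NaiveExportHandler(logging.Handler):
    def emit(self, record):
        # Simulates the exporter reporting a transient failure via the
        # logging module. The new record propagates to the root logger,
        # reaches this handler again, and the cycle repeats.
        logging.getLogger("fake.exporter").warning("Transient error, retrying...")

logging.getLogger().addHandler(NaiveExportHandler())
# logging.warning("test it out")  # uncommenting this recurses until RecursionError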
@bigqi0812 You should notice that there is a call to shutdown in the example, which ensures the SDK shuts down by dropping any pending telemetry. The example doesn't get into an infinite loop. It terminates, but the exporter has an exponential back-off retry mechanism with a maximum value of 64 seconds. You might feel that it is stuck in an infinite loop, but it gets shut down after all the retries fail. Wait for a minute and you should see that it terminates.
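For reference, a generic sketch of an exponential back-off schedule capped at 64 seconds (an illustration of the mechanism described above, not the exporter's actual code): with doubling delays, the total wait before giving up is 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds, which lines up with the roughly 70 seconds of run time mentioned further down in this thread.

def backoff_delays(max_value=64):
    # Doubling delays: 1, 2, 4, 8, 16, 32 -- stop once the cap is reached.
    delay = 1
    while delay < max_value:
        yield delay
        delay *= 2

print(list(backoff_delays()))  # [1, 2, 4, 8, 16, 32]
print(sum(backoff_delays()))   # 63 seconds of waiting in total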
Thank you very much for the reply @srikanthccv. I did a test and the infinite loop still happens. The issue is not about the back-off retry loop itself. It's that opentelemetry.exporter.otlp.proto.grpc.exporter._export() calls another round of logger.warning("Transient error xxxx") inside the retry loop (line 322):
except RpcError as error:
    if error.code() in [
        StatusCode.CANCELLED,
        StatusCode.DEADLINE_EXCEEDED,
        StatusCode.RESOURCE_EXHAUSTED,
        StatusCode.ABORTED,
        StatusCode.OUT_OF_RANGE,
        StatusCode.UNAVAILABLE,
        StatusCode.DATA_LOSS,
    ]:
        ......
        # (Line 322)
        logger.warning(f"Transient error {error.code()} encountered while exporting span batch, retrying in {delay}s.")
Given that the OTel log exporter is attached to my root logger, this logger.warning("Transient error xxx") triggers an attempt to send the "Transient error xxxx" log through the OTel log exporter, which fails again, comes back down to line 322 again, and sends another "Transient error xxxx", and it never ends.
Technically speaking it's not an infinite loop, but an infinite chain of recursive calls.
I temporarily changed the logger.warning on line 322 and the logger.error on line 328 to print() statements, and that works around the issue.
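A user-side way to break the cycle without patching the exporter (a sketch, assuming the LoggingHandler instance from the reproduction above is named handler and that the SDK's internal loggers live under the opentelemetry namespace) is to filter out the SDK's own records before they are exported:

import logging

class DropOtelInternalRecords(logging.Filter):
    def filter(self, record):
        # Skip records emitted by the SDK/exporter itself so they are
        # never fed back into the OTLP export pipeline.
        return not record.name.startswith("opentelemetry.")

handler.addFilter(DropOtelInternalRecords())  # `handler` is the LoggingHandler above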
I ran the example myself and confirmed that it exits. The logs generated by the exporter are received by the handler and the processor, but then there is a call to provider.shutdown which drops these exporter records and ends the pipeline. How long did you wait before deciding that it is infinite recursive calls?
I waited for about a minute and it was still running.
It takes about 70 seconds before exiting. Are you running the same example? Or do you have some different code? I executed it again and I see it terminates.
I double-checked the code I ran. I changed the BatchLogProcessor to SimpleLogProcessor. It still hadn't terminated after 3+ minutes. This is my testing code:
import logging
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import (
    OTLPLogExporter,
)
from opentelemetry.sdk._logs import (
    LogEmitterProvider,
    LoggingHandler,
    set_log_emitter_provider,
)
from opentelemetry.sdk._logs.export import BatchLogProcessor, SimpleLogProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
log_emitter_provider = LogEmitterProvider(
    resource=Resource.create(
        {
            "service.name": "shoppingcart",
            "service.instance.id": "instance-12",
        }
    ),
)
set_log_emitter_provider(log_emitter_provider)
exporter = OTLPLogExporter(insecure=True)
# log_emitter_provider.add_log_processor(BatchLogProcessor(exporter))
log_emitter_provider.add_log_processor(SimpleLogProcessor(exporter))
log_emitter = log_emitter_provider.get_log_emitter(__name__, "0.1")
handler = LoggingHandler(level=logging.NOTSET, log_emitter=log_emitter)
# Attach OTLP handler to root logger
logging.getLogger().addHandler(handler)
# Log directly
logging.info("Jackdaws love my big sphinx of quartz.")
# Create different namespaced loggers
logger1 = logging.getLogger("myapp.area1")
logger2 = logging.getLogger("myapp.area2")
logger1.debug("Quick zephyrs blow, vexing daft Jim.")
logger1.info("How quickly daft jumping zebras vex.")
logger2.warning("Jail zesty vixen who grabbed pay from quack.")
logger2.error("The five boxing wizards jump quickly.")
# Trace context correlation
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("foo"):
    # Do something
    logger2.error("Hyderabad, we have a major problem.")
log_emitter_provider.shutdown()
Sorry, I made a mistake and accidentally clicked the close button.
It terminates with BatchLogProcessor since the logs are not sent as soon as they come in and the code is non-blocking, i.e. it reaches the provider.shutdown() call, which sends the shutdown signal down the pipeline and ensures that the program ends (I checked this multiple times and can confirm it exits). But if you have added SimpleLogProcessor, which is blocking in nature and sends the data as it arrives, it gets into this never-ending state of exporting its own error logs, which is technically expected since the shutdown call is never invoked (you can verify this by adding a print statement right before the shutdown line).
We can certainly handle this case by providing a way to set propagate=False for the SDK's own logger components.
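Roughly what that proposal amounts to, expressed as a user-side workaround (a sketch, assuming all SDK loggers are named under the opentelemetry prefix; note this also keeps SDK diagnostics away from any other root-level handlers unless one is attached directly):

import logging

otel_logger = logging.getLogger("opentelemetry")
otel_logger.propagate = False  # SDK records no longer reach the root logger,
                               # so they never hit the OTLP LoggingHandler
otel_logger.addHandler(logging.StreamHandler())  # keep SDK warnings visible on stderr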
> You should notice that there is a call to shutdown in the example, which ensures the SDK shuts down by dropping any pending telemetry. The example doesn't get into an infinite loop. It terminates, but the exporter has an exponential back-off retry mechanism with a maximum value of 64 seconds. You might feel that it is stuck in an infinite loop, but it gets shut down after all the retries fail. Wait for a minute and you should see that it terminates.
@srikanthccv
Is there a way to reduce this timeout? I am calling shutdown on my TracerProvider.
I would have expected setting export_timeout_millis on my BatchSpanProcessor to have done the trick but it does not.
export_timeout_millis applies only to the final export call, while the retry mechanism kicks in when the original export fails. Currently there is no way to configure this. The only thing holding us back from making it configurable is staying compliant with the OpenTelemetry specification rather than adding our own env vars.
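For reference, the knob being discussed (a sketch; export_timeout_millis bounds a single export() call on the processor side, and as noted above it does not shorten the exporter's internal retry back-off when the collector is unreachable):

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

processor = BatchSpanProcessor(
    OTLPSpanExporter(insecure=True),
    export_timeout_millis=5_000,  # bounds the export call, not the retry loop
)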
> We can certainly handle this case by providing a way to set propagate=False for the SDK's own logger components.
That wouldn't work for me; I set up logging in my applications using logging.config.dictConfig with incremental=False, which would overwrite your propagation setting.
The standard library defines https://docs.python.org/3/library/logging.html#logging.Handler.handleError, which should be used by logging handlers if "something goes wrong during log emission". Following that pattern, every logging call within opentelemetry.exporter.otlp.proto.grpc.exporter.OTLPExporterMixin._export should be replaced with a raised exception, and opentelemetry.sdk._logs._internal.LoggingHandler.emit should be:
if not isinstance(self._logger, NoOpLogger):
    try:
        self._logger.emit(self._translate(record))
    except Exception:
        self.handleError(record)
Would that be an option?
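For context, a minimal self-contained illustration of that pattern (a hypothetical handler, not the SDK class): when emit fails, handleError reports the problem on stderr, governed by logging.raiseExceptions, instead of routing another record back through the logging tree.

import logging

class FlakyExportHandler(logging.Handler):
    def emit(self, record):
        try:
            raise ConnectionError("collector unavailable")  # simulated export failure
        except Exception:
            # Prints a "--- Logging error ---" traceback to stderr instead of
            # logging, so nothing is fed back into this handler.
            self.handleError(record)

logging.getLogger().addHandler(FlakyExportHandler())
logging.warning("test it out")  # reports the failure once, then returns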