
Improve error reporting / debugging UX with the OTLP default/HTTP exporters

Open chen-anders opened this issue 5 months ago • 2 comments

addresses: https://github.com/open-telemetry/opentelemetry-ruby/issues/1931

This PR significantly enhances the debugging experience for OTLP exporters by:

  1. Adding rich context to export failure results
  2. Introducing comprehensive debug-level logging throughout the export pipeline
  3. Maintaining full backwards compatibility with existing exporter implementations

These changes helped me debug a really gnarly issue: a slightly outdated version of the sentry-ruby SDK was interfering with how the OpenTelemetry Ruby SDK bubbled up errors, due to incorrect IPv6 parsing. As a result, all my traces were dropped with a one-line error: Unable to export X spans.

Reviewer's Note

Significant AI assistance was used in the process of getting this PR working.

Motivation

Previously, when OTLP exports failed, developers had minimal information to diagnose the root cause. The exporters simply returned a FAILURE constant without any context about:

  • What type of error occurred
  • HTTP response codes and messages
  • Response bodies from the collector
  • Retry attempts and their outcomes
  • Exception details

This made troubleshooting production issues extremely difficult, especially for:

  • Network connectivity problems
  • SSL/TLS certificate issues
  • Collector endpoint configuration errors
  • HTTP timeout scenarios
  • HTTP error responses (4xx client errors and 5xx server errors)

Changes

1. Enhanced Export Result Type (sdk/lib/opentelemetry/sdk/trace/export.rb)

Introduced a new ExportResult class that wraps result codes with optional error context:

class ExportResult
  attr_reader :code, :error, :message

  # Factory methods (signatures only; bodies omitted here)
  def self.success; end
  def self.failure(error: nil, message: nil); end
  def self.timeout; end
end

Backwards Compatibility: The ExportResult class overrides the == operator and provides to_i, so existing code that compares results to the SUCCESS, FAILURE, or TIMEOUT constants continues to work unchanged.
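
A minimal sketch of how such a wrapper could stay comparable to the integer constants. The constant values, the constructor, and all method bodies below are assumptions made for illustration and are not taken from the PR's implementation.

```ruby
# Illustrative stand-ins for the SDK's result codes (values assumed here).
SUCCESS = 0
FAILURE = 1
TIMEOUT = 2

# Hypothetical sketch of a result wrapper that still compares equal to
# the plain integer constants.
class ExportResult
  attr_reader :code, :error, :message

  def initialize(code, error: nil, message: nil)
    @code = code
    @error = error
    @message = message
  end

  def self.success
    new(SUCCESS)
  end

  def self.failure(error: nil, message: nil)
    new(FAILURE, error: error, message: message)
  end

  def self.timeout
    new(TIMEOUT)
  end

  # Compare by code against another ExportResult or a bare Integer.
  def ==(other)
    case other
    when ExportResult then code == other.code
    when Integer then code == other
    else false
    end
  end

  def to_i
    code
  end
end

result = ExportResult.failure(message: 'connection refused')
puts result == FAILURE # true
puts result.to_i       # 1
```

With this shape, a caller that only checks `result == FAILURE` keeps working, while a newer caller can additionally read `result.error` and `result.message`.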

2. Comprehensive Debug Logging

Added detailed debug-level logging at key points in the export pipeline:

Entry/Exit Points

  • Function entry with parameters (span count, timeout values)
  • Function exit with return values
  • Byte sizes (compressed vs uncompressed)

HTTP Request Flow

  • Request preparation and compression
  • Timeout calculations and retry counts
  • HTTP response codes and messages
  • Response bodies for error cases

Exception Handling

  • Exception type and message for all caught exceptions
  • Retry attempt tracking
  • Max retry exceeded scenarios
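
As a rough illustration of the kind of debug output listed above, the following sketch logs payload sizes before and after gzip compression using only the Ruby standard library. The helper name and message format are hypothetical and do not reproduce the PR's actual log lines.

```ruby
require 'logger'
require 'stringio'
require 'zlib'

# Hypothetical helper: gzip a payload and log its size before and after
# compression at debug level. Name and message format are assumptions.
def compress_with_logging(payload, logger)
  logger.debug { "compress: uncompressed size=#{payload.bytesize} bytes" }
  compressed = Zlib.gzip(payload)
  logger.debug { "compress: compressed size=#{compressed.bytesize} bytes" }
  compressed
end

# Capture log output in memory so the example is self-contained.
log_output = StringIO.new
logger = Logger.new(log_output, level: Logger::DEBUG)

compressed = compress_with_logging('span-data ' * 100, logger)
puts log_output.string
```

The block form of `logger.debug` defers building the message string until the logger has confirmed that debug level is enabled, which keeps the cost low when debug logging is off.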

3. Rich Failure Context

All failure scenarios now return detailed context via Export.failure():

HTTP Error Responses

OpenTelemetry::SDK::Trace::Export.failure(
  message: "export failed with HTTP #{response.code} (#{response.message}) after #{retry_count} retries: #{body}"
)

Network Exceptions

OpenTelemetry::SDK::Trace::Export.failure(
  error: e,
  message: "export failed due to SocketError after #{retry_count} retries: #{e.message}"
)

Timeout Scenarios

OpenTelemetry::SDK::Trace::Export.failure(
  message: 'timeout exceeded before sending request'
)
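
The pattern behind these calls, carrying the caught exception and retry count out of the retry loop instead of discarding them, can be sketched as follows. The Result struct, MAX_RETRIES, and send_with_retries are hypothetical names for this illustration; the PR's exporters use their own types and retry logic.

```ruby
require 'socket'

# Hypothetical result type for this sketch only.
Result = Struct.new(:ok, :error, :message, keyword_init: true)

MAX_RETRIES = 2 # assumed limit for illustration

# Calls the block, retrying on SocketError. When retries are exhausted,
# the exception and a descriptive message are returned to the caller.
def send_with_retries
  retry_count = 0
  begin
    yield
    Result.new(ok: true)
  rescue SocketError => e
    if retry_count < MAX_RETRIES
      retry_count += 1
      retry
    end
    Result.new(
      ok: false,
      error: e,
      message: "export failed due to SocketError after #{retry_count} retries: #{e.message}"
    )
  end
end

# Simulate an endpoint that always refuses the connection.
attempts = 0
result = send_with_retries do
  attempts += 1
  raise SocketError, 'connection refused'
end
puts result.message
```

Because the exception object travels with the result, the caller can log both the error class and the original message rather than a generic failure.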

4. Enhanced BatchSpanProcessor Error Reporting

Updated BatchSpanProcessor to extract and log error context:

def report_result(result_code, span_array, error: nil, message: nil)
  if result_code == SUCCESS
    # ... metrics ...
  else
    error_message = if error
                      "BatchSpanProcessor: export failed due to #{error.class}: #{error.message}"
                    elsif message
                      "BatchSpanProcessor: export failed: #{message}"
                    else
                      "BatchSpanProcessor: export failed (no error details available) \n Call stack: #{caller.join("\n")}"
                    end

    OpenTelemetry.handle_error(exception: ExportError.new(span_array), message: error_message)
  end
end

5. Updated Exporters

Applied consistent changes to both:

  • OTLP default Exporter (exporter/otlp/lib/opentelemetry/exporter/otlp/exporter.rb)
  • OTLP HTTP Exporter (exporter/otlp-http/lib/opentelemetry/exporter/otlp/http/trace_exporter.rb)

Both now capture exception objects and maintain the error context through the entire export pipeline.

Example Scenarios

Before

ERROR -- : OpenTelemetry error: Unable to export 10 spans

After (with debug logging enabled)

DEBUG -- : OTLP::Exporter#export: Called with 10 spans, timeout=30.0
DEBUG -- : OTLP::Exporter#export: Calling encode for 10 spans
DEBUG -- : OTLP::Exporter#send_bytes: Sending HTTP request
DEBUG -- : OTLP::Exporter#send_bytes: Caught SocketError: Connection refused, retry_count=1
DEBUG -- : OTLP::Exporter#send_bytes: Max retries exceeded for SocketError
ERROR -- : BatchSpanProcessor: export failed due to SocketError: Connection refused - connect(2) for "localhost" port 4318
ERROR -- : OpenTelemetry error: Unable to export 10 spans

chen-anders • Oct 12 '25 13:10

Random passerby here ~ just want to say thank you @chen-anders! I am knee-deep debugging errors between my ruby app and my OTLP collector and the improvements in this PR would vastly help my efforts.

tkling • Oct 16 '25 13:10

👋 This pull request has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this pull request will be closed eventually by the stale bot

github-actions[bot] • Nov 30 '25 02:11