
OTLPExporterError: Request Timeout at ClientRequest.

Open pcace opened this issue 1 year ago • 4 comments

Hi there, since I am very new to OpenTelemetry I hope I can get some feedback here on this problem:

I am trying to use the OTLPTraceExporter from @opentelemetry/exporter-trace-otlp-http to send traces to jaeger-collector or Grafana. I have two endpoints to which I am sending the traces: a locally installed Jaeger instance and a Grafana Cloud instance. The problem I am having is that only a few traces come across and actually make it to the receiving end. Most of the traces get stuck already in the exporter with this error message:

api | {"stack":"OTLPExporterError: Request Timeout
 at ClientRequest. (/home/user/git/project/node_modules/@opentelemetry/otlp-exporter-base/src/platform/node/util.ts:147:21)
 at (/home/user/git/project/node_modules/@opentelemetry/context-async-hooks/src/AbstractAsyncHooksContextManager.ts:75:49)
 at AsyncLocalStorage.run (node:async_hooks:338:14)
 at AsyncLocalStorageContextManager.with (/home/user/git/project/node_modules/@opentelemetry/context-async-hooks/src/AsyncLocalStorageContextManager.ts:40:36)
 at ClientRequest.contextWrapper (/home/user/git/project/node_modules/@opentelemetry/context-async-hooks/src/AbstractAsyncHooksContextManager.ts:75:26)
 at ClientRequest.emit (node:events:514:28)
 at TLSSocket.socketCloseListener (node:_http_client:474:11)
 at TLSSocket.emit (node:events:526:35)
 at node:net:323:12
 at TCP.done (node:_tls_wrap:588:7)","message":"Request Timeout","name":"OTLPExporterError","code":"ECONNRESET"}

The implementation looks like this (excuse the ridiculously high limits... I was just experimenting ;) ):

// Imports are not shown in the original snippet; the package names below are the
// usual ones for these classes and are an assumption about the actual setup.
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { NodeTracerProvider, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-node'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { FastifyInstrumentation } from '@opentelemetry/instrumentation-fastify'
import { PrismaInstrumentation } from '@prisma/instrumentation'

const exporter = new OTLPTraceExporter({
  // Update this URL to point to where your OTLP compatible collector is listening
  // The redwood development studio (`yarn rw exp studio`) can collect your telemetry at `http://127.0.0.1:4318/v1/traces`
  url: process.env.OTEL_HOST,
  headers: {
    Authorization:
      'Basic ' +
      Buffer.from(`${process.env.OTEL_USER}:${process.env.OTEL_PASSWORD}`).toString('base64'),
  },
  concurrencyLimit: 100000, // an optional limit on pending requests
})

// You may wish to switch to BatchSpanProcessor in production as it is the recommended choice for performance reasons
const processor = new SimpleSpanProcessor(exporter)

const provider = new NodeTracerProvider({
  resource: resource, // `resource` is created elsewhere in the setup (not shown here)
})
// Note: addSpanProcessor() only takes the processor; the options object below is
// silently ignored here (these are BatchSpanProcessor constructor options).
provider.addSpanProcessor(processor, {
  // The maximum queue size. After the size is reached spans are dropped.
  maxQueueSize: 200000,
  // The maximum batch size of every export. It must be smaller or equal to maxQueueSize.
  maxExportBatchSize: 100000,
  // The interval between two consecutive exports
  scheduledDelayMillis: 50000,
  // How long the export can run before it is cancelled
  exportTimeoutMillis: 300000,
})

// Optionally register instrumentation libraries here
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [
    new HttpInstrumentation(),
    new FastifyInstrumentation(),
    new PrismaInstrumentation({
      middleware: true,
    }),
  ],
})

provider.register()

If this is a known issue, I would really appreciate any help! If you need any more information, please let me know!

pcace avatar Feb 01 '24 14:02 pcace

The problem I am having is that only a few traces come across and actually make it to the receiving end. Most of the traces get stuck already in the exporter with this error message:

Hmm, I think more info is needed here. :thinking:

Where do a few of those traces show up? On Grafana or on the local Jaeger instance (or both)? If they show up on one but not the other: did you check if the URL is set correctly?

pichlermarc avatar Feb 01 '24 16:02 pichlermarc

Hi, Grafana and Jaeger show the same results, but I got a little bit further by not using the SimpleSpanProcessor but the BatchSpanProcessor. Also, scheduledDelayMillis needs to be as low as possible. It is way better now, but still: the error message is a little different now, even if I crank maxQueueSize up to 50000 ;) :

api | Dropped 497 spans because maxQueueSize reached
api | Dropped 2760 spans because maxQueueSize reached
api | Dropped 6968 spans because maxQueueSize reached
api | Dropped 4358 spans because maxQueueSize reached
api | Dropped 5408 spans because maxQueueSize reached
api | Dropped 3488 spans because maxQueueSize reached

Is there a standard way to solve this?
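
(For reference, a minimal sketch of where these tuning options go, assuming @opentelemetry/sdk-trace-base and the exporter/provider from the snippet above; note that they are constructor options of BatchSpanProcessor, not a second argument to addSpanProcessor:)

import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'

const processor = new BatchSpanProcessor(exporter, {
  maxQueueSize: 50000, // spans beyond this are dropped ("Dropped ... because maxQueueSize reached")
  maxExportBatchSize: 512, // how many spans go into one OTLP request (512 is the default)
  scheduledDelayMillis: 1000, // how often a batch is flushed
  exportTimeoutMillis: 30000, // how long a single export may take (30000 is the default)
})
provider.addSpanProcessor(processor)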

pcace avatar Feb 01 '24 16:02 pcace

There are a few things to consider:

  • a SimpleSpanProcessor will export as soon as each span ends; it could be that you were simply sending too many requests for the local instance to handle.
  • setting scheduledDelayMillis to a low value will export often; be sure that you're not sending too many requests
  • setting maxExportBatchSize to 100000 is a lot of spans; consider using the defaults. The message that you send to Jaeger/Grafana will be very large. Many OTLP server implementations also limit the size of a message that you can send, because parsing and processing messages that are multiple megabytes large needs a lot of memory (if every span is 50 bytes, 50 bytes × 100000 is a 5 MB OTLP message in the best case, and with JSON it's likely going to be much, much bigger than that).

Could you try the BatchSpanProcessor with the default options?
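
For illustration, a minimal sketch of that, reusing the exporter and provider from the snippet above (assuming @opentelemetry/sdk-trace-base):

import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'

// No config object, so the processor uses the defaults:
// maxQueueSize 2048, maxExportBatchSize 512, scheduledDelayMillis 5000, exportTimeoutMillis 30000.
provider.addSpanProcessor(new BatchSpanProcessor(exporter))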

Also, looking at the Grafana docs, it looks like they recommend using the OTLP/protobuf exporter (you're using `@opentelemetry/exporter-trace-otlp-http`; the protobuf exporter is `@opentelemetry/exporter-trace-otlp-proto`).
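
For example, the swap would essentially be a one-line change (a sketch reusing the url and headers from the original snippet):

import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto'

// Same constructor options as the HTTP/JSON exporter, but the request body is binary
// protobuf, which is considerably smaller on the wire than the JSON encoding.
const exporter = new OTLPTraceExporter({
  url: process.env.OTEL_HOST,
  headers: {
    Authorization:
      'Basic ' +
      Buffer.from(`${process.env.OTEL_USER}:${process.env.OTEL_PASSWORD}`).toString('base64'),
  },
})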

Another thing to consider: if your service is sending a lot of spans, consider setting up all your OTel-traced services with sampling. This will choose a subset of your traces to send so that you don't export every span, which can also help reduce load (and often saves money).
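
For instance, a parent-based ratio sampler can be set on the provider (a sketch, assuming @opentelemetry/sdk-trace-base; the 0.1 ratio is just an example value):

import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

const provider = new NodeTracerProvider({
  resource: resource,
  // Sample roughly 10% of new traces; child spans follow their parent's sampling
  // decision, so sampled traces stay complete instead of being partially dropped.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
})

The same can also be configured without code changes via the OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG environment variables (e.g. parentbased_traceidratio and 0.1).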

pichlermarc avatar Feb 02 '24 08:02 pichlermarc

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar May 13 '24 06:05 github-actions[bot]

This issue was closed because it has been stale for 14 days with no activity.

github-actions[bot] avatar Jun 03 '24 06:06 github-actions[bot]