apm-server

Confusing log events for invalid requests

Open up2neck opened this issue 1 year ago • 4 comments

The APM Server log event and the correlating trace (captured with self-instrumentation enabled) contain two different HTTP status codes:

The log event reports a 503 error with the message "request timed out":

Log entry
{
  "message": "request timed out",
  "url.original": "/intake/v2/events",
  "http.request.body.bytes": 0,
  "log.level": "error",
  "transaction.id": "d91bb495a6ab25bb",
  "http.request.method": "POST",
  "ecs.version": "1.6.0",
  "http.response.status_code": 503,
  "error.message": "request timed out",
  "log.logger": "request",
  "@timestamp": "2024-10-02T07:22:04.439Z",
  "service.name": "apm-server",
  "user_agent.original": "apm-agent-java/1.45.0 (autocode-app 2.0.0)",
  "log.origin": {
    "function": "github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1",
    "file.name": "middleware/log_middleware.go",
    "file.line": 59
  },
  "event.duration": 3499790781,
  "http.request.id": "d91bb495a6ab25bb",
  "source.address": "000.000.000.000",
  "trace.id": "d91bb495a6ab25bbdede8859c7cf8c8a"
}

The APM UI shows that the actual error was:

Image

Trace body
{
  "_index": ".ds-traces-apm-dummy_sub2-2024.09.20-000009",
  "_id": "urkbTJIBO4jYgZRmiL6k",
  "_version": 1,
  "_score": 0,
  "_source": {
    "agent": {
      "name": "go",
      "version": "2.6.0"
    },
    "process": {
      "args": [
        "apm-server",
        "run",
        "-e",
        "-c",
        "config/config-secret/apm-server.yml"
      ],
      "pid": 1,
      "title": "apm-server"
    },
    "source": {
      "nat": {
        "ip": "00.000.00.000"
      },
      "ip": "000.000.000.000"
    },
    "processor": {
      "event": "transaction"
    },
    "url": {
      "path": "/intake/v2/events",
      "scheme": "https",
      "domain": "mydomain",
      "full": "https://mydomain/intake/v2/events"
    },
    "labels": {
      "project": "dummy-sub2"
    },
    "cloud": {
      "availability_zone": "europe-west3-a",
      "instance": {
        "name": "gke-dummy-sub-elastic-europe-w-generic-c4e5a328-8ipl",
        "id": "6772592666386103052"
      },
      "provider": "gcp",
      "project": {
        "id": "or2-ms-dummy-sub-elastic-t1iylu"
      },
      "region": "europe-west3"
    },
    "edp": {
      "domain": "mydomain"
    },
    "observer": {
      "hostname": "apm-server-v2-apm-server-6f6db466bd-22ccb",
      "type": "apm-server",
      "version": "8.14.3"
    },
    "trace": {
      "id": "d91bb495a6ab25bbdede8859c7cf8c8a"
    },
    "@timestamp": "2024-10-02T07:22:00.940Z",
    "data_stream": {
      "namespace": "dummy_sub2",
      "type": "traces",
      "dataset": "apm"
    },
    "service": {
      "node": {
        "name": "apm-server-v2-apm-server-6f6db466bd-22ccb"
      },
      "environment": "sandbox",
      "name": "apm-server",
      "runtime": {
        "name": "gc",
        "version": "go1.22.7"
      },
      "language": {
        "name": "go",
        "version": "go1.22.7"
      },
      "version": "8.14.3"
    },
    "host": {
      "hostname": "apm-server-v2-apm-server-6f6db466bd-22ccb",
      "os": {
        "platform": "linux"
      },
      "name": "apm-server-v2-apm-server-6f6db466bd-22ccb",
      "architecture": "amd64"
    },
    "client": {
      "geo": {
        "region_iso_code": "BE-BRU",
        "continent_name": "Europe",
        "city_name": "Brussels",
        "country_iso_code": "BE",
        "country_name": "Belgium",
        "location": {
          "lon": 4.347,
          "lat": 50.8534
        },
        "region_name": "Brussels Capital"
      },
      "ip": "000.000.000.000"
    },
    "http": {
      "request": {
        "headers": {
          "X-Cloud-Trace-Context": [
            "61ce7649b7546eb12e48bd1b05dc5acf/7453092485412124311"
          ],
          "Cache-Control": [
            "no-cache"
          ],
          "Accept": [
            "*/*"
          ],
          "Content-Encoding": [
            "deflate"
          ],
          "X-Forwarded-Proto": [
            "https"
          ],
          "User-Agent": [
            "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
          ],
          "X-Forwarded-For": [
            "000.000.000.000,00.00.00.000"
          ],
          "Pragma": [
            "no-cache"
          ],
          "Via": [
            "1.1 google"
          ],
          "Content-Type": [
            "application/x-ndjson"
          ]
        },
        "method": "POST",
        "cookies_list": [],
        "cookies": {}
      },
      "response": {
        "headers": {
          "X-Content-Type-Options": [
            "nosniff"
          ],
          "Connection": [
            "Close"
          ],
          "Content-Type": [
            "application/json"
          ]
        },
        "status_code": 400
      },
      "version": "1.1"
    },
    "event": {
      "agent_id_status": "missing",
      "ingested": "2024-10-02T07:22:05Z",
      "success_count": 1,
      "outcome": "success"
    },
    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },
      "representative_count": 1,
      "name": "POST /intake/v2/events",
      "id": "d91bb495a6ab25bb",
      "span_count": {
        "dropped": 0,
        "started": 0
      },
      "type": "request",
      "sampled": true
    },
    "user_agent": {
      "original": "apm-agent-java/1.45.0 (autocode-app 2.0.0)",
      "name": "Other",
      "device": {
        "name": "Other"
      }
    },
    "span": {
      "id": "d91bb495a6ab25bb"
    },
    "timestamp": {
      "us": 1727853720940194
    }
  },
  "fields": {
    "transaction.name.text": [
      "POST /intake/v2/events"
    ],
    "http.request.headers.Accept": [
      "*/*"
    ],
    "http.response.headers.Connection": [
      "Close"
    ],
    "transaction.representative_count": [
      1
    ],
    "user_agent.original.text": [
      "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
    ],
    "client.geo.country_iso_code": [
      "BE"
    ],
    "host.hostname": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "process.pid": [
      1
    ],
    "service.language.name": [
      "go"
    ],
    "transaction.result": [
      "HTTP 4xx"
    ],
    "cloud.availability_zone": [
      "europe-west3-a"
    ],
    "process.title.text": [
      "apm-server"
    ],
    "transaction.id": [
      "d91bb495a6ab25bb"
    ],
    "http.request.headers.Via": [
      "1.1 google"
    ],
    "http.request.method": [
      "POST"
    ],
    "processor.event": [
      "transaction"
    ],
    "labels.project": [
      "dummy-sub2"
    ],
    "source.ip": [
      "000.000.000.000"
    ],
    "agent.name": [
      "go"
    ],
    "host.name": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "client.geo.country_name": [
      "Belgium"
    ],
    "event.agent_id_status": [
      "missing"
    ],
    "http.response.status_code": [
      400
    ],
    "http.version": [
      "1.1"
    ],
    "event.outcome": [
      "success"
    ],
    "user_agent.original": [
      "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
    ],
    "cloud.region": [
      "europe-west3"
    ],
    "transaction.duration.us": [
      3499844
    ],
    "service.runtime.version": [
      "go1.22.7"
    ],
    "span.id": [
      "d91bb495a6ab25bb"
    ],
    "edp.domain": [
      "mydomain"
    ],
    "client.ip": [
      "000.000.000.000"
    ],
    "user_agent.name": [
      "Other"
    ],
    "data_stream.type": [
      "traces"
    ],
    "host.architecture": [
      "amd64"
    ],
    "http.request.headers.X-Forwarded-For": [
      "000.000.000.000,00.00.00.000"
    ],
    "cloud.provider": [
      "gcp"
    ],
    "timestamp.us": [
      1727853720940194
    ],
    "url.path": [
      "/intake/v2/events"
    ],
    "observer.type": [
      "apm-server"
    ],
    "observer.version": [
      "8.14.3"
    ],
    "agent.version": [
      "2.6.0"
    ],
    "transaction.name": [
      "POST /intake/v2/events"
    ],
    "process.title": [
      "apm-server"
    ],
    "client.geo.continent_name": [
      "Europe"
    ],
    "http.request.headers.Cache-Control": [
      "no-cache"
    ],
    "http.request.headers.Content-Encoding": [
      "deflate"
    ],
    "service.node.name": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "http.request.headers.X-Forwarded-Proto": [
      "https"
    ],
    "url.scheme": [
      "https"
    ],
    "source.nat.ip": [
      "00.000.00.000"
    ],
    "transaction.sampled": [
      true
    ],
    "cloud.instance.id": [
      "6772592666386103052"
    ],
    "trace.id": [
      "d91bb495a6ab25bbdede8859c7cf8c8a"
    ],
    "client.geo.region_name": [
      "Brussels Capital"
    ],
    "client.geo.location": [
      {
        "coordinates": [
          4.347,
          50.8534
        ],
        "type": "Point"
      }
    ],
    "event.success_count": [
      1
    ],
    "transaction.span_count.dropped": [
      0
    ],
    "http.request.headers.Content-Type": [
      "application/x-ndjson"
    ],
    "url.full": [
      "https://mydomain/intake/v2/events"
    ],
    "http.request.headers.User-Agent": [
      "apm-agent-java/1.45.0 (autocode-app 2.0.0)"
    ],
    "service.environment": [
      "sandbox"
    ],
    "client.geo.city_name": [
      "Brussels"
    ],
    "client.geo.region_iso_code": [
      "BE-BRU"
    ],
    "service.name": [
      "apm-server"
    ],
    "data_stream.namespace": [
      "dummy_sub2"
    ],
    "service.runtime.name": [
      "gc"
    ],
    "process.args": [
      "apm-server",
      "run",
      "-e",
      "-c",
      "config/config-secret/apm-server.yml"
    ],
    "observer.hostname": [
      "apm-server-v2-apm-server-6f6db466bd-22ccb"
    ],
    "http.request.headers.X-Cloud-Trace-Context": [
      "61ce7649b7546eb12e48bd1b05dc5acf/7453092485412124311"
    ],
    "url.full.text": [
      "https://mydomain/intake/v2/events"
    ],
    "transaction.type": [
      "request"
    ],
    "transaction.span_count.started": [
      0
    ],
    "event.ingested": [
      "2024-10-02T07:22:05.000Z"
    ],
    "http.request.headers.Pragma": [
      "no-cache"
    ],
    "@timestamp": [
      "2024-10-02T07:22:00.940Z"
    ],
    "service.version": [
      "8.14.3"
    ],
    "host.os.platform": [
      "linux"
    ],
    "data_stream.dataset": [
      "apm"
    ],
    "http.response.headers.Content-Type": [
      "application/json"
    ],
    "http.response.headers.X-Content-Type-Options": [
      "nosniff"
    ],
    "service.language.version": [
      "go1.22.7"
    ],
    "url.domain": [
      "mydomain"
    ],
    "user_agent.device.name": [
      "Other"
    ],
    "cloud.instance.name": [
      "gke-dummy-sub-elastic-europe-w-generic-c4e5a328-8ipl"
    ],
    "cloud.project.id": [
      "or2-ms-dummy-sub-elastic-t1iylu"
    ]
  }
}

up2neck avatar Oct 02 '24 08:10 up2neck

hello @up2neck can you please give a bit more information on the environment? I see from the trace you're running APM Server v8.14.3; is Elasticsearch deployed at the same version? Are the trace body and log body you posted from documents stored in Elasticsearch?

inge4pres avatar Oct 02 '24 10:10 inge4pres

@inge4pres

The log entry was taken from APM Server logs (stdout, collected by GCP Logging, if it matters). The trace was taken from Elasticsearch itself.

Our team has two environments (both facing the issue):

GCP Load balancer -> APM Server (GKE, ECK) -> Elasticsearch cluster (GKE, ECK)
GCP Load balancer -> APM Server (GKE, ECK) -> Elasticsearch cluster (Elastic Cloud)

My suspicion is that some middleware probably rewrites the exact error, which leads to the incorrect log entry.

up2neck avatar Oct 02 '24 10:10 up2neck

Thanks for the additional info 🙏🏼

The reason I was asking is that you may have transforms or ingest pipelines in Elasticsearch that remap fields, in this case the http.response.status_code field. Do you know if there is special handling of this field?

My suspicion is that some middleware probably rewrites the exact error, which leads to the incorrect log entry.

I am not aware of anything inside the apm-server binary modifying the value of this field, hence my confusion as to why it is different in the logs than in what's stored in ES as a trace. I'll verify whether this is a bug and report back.

inge4pres avatar Oct 02 '24 14:10 inge4pres

@inge4pres

We have a few additional steps on top of the default "APM" integration-provided ingest pipelines, but none of them interact with the http.response.status_code field; it is kept as-is.

up2neck avatar Oct 03 '24 05:10 up2neck

@up2neck I am back looking into the code regarding the reported discrepancy between the log entry and the trace entry. What I see is that the logger is fed with a response.errors.timeout keyword, and then we log the result as recorded by the HTTP request:

https://github.com/inge4pres/apm-server/blob/9c7439f30167dcd7d4bd44e0473b0f249ce5fa8d/internal/beater/middleware/log_middleware.go#L101

So the message "request timed out" and the corresponding 503 HTTP status code come from the HTTP request processing itself; there is no rewrite in the logger.
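
To illustrate the pattern described above, here is a minimal, hypothetical sketch (simplified types, not the actual apm-server internals): the log middleware only reports whatever result the inner request handling already recorded on the context, so a 503 / "request timed out" in the log reflects what request processing set, not a rewrite by the logger.

package main

import "log"

// Hypothetical, simplified types -- not the real apm-server request package.
type Result struct {
	StatusCode int
	Keyword    string // e.g. "request timed out"
	Err        error
}

type Context struct {
	Result Result
}

type Handler func(*Context)

// logMiddleware logs whatever the wrapped handler recorded in c.Result;
// it never decides or changes the status code itself.
func logMiddleware(next Handler) Handler {
	return func(c *Context) {
		next(c)
		if c.Result.Err != nil || c.Result.StatusCode >= 500 {
			log.Printf("error: %q http.response.status_code=%d", c.Result.Keyword, c.Result.StatusCode)
			return
		}
		log.Printf("request ok: http.response.status_code=%d", c.Result.StatusCode)
	}
}

func main() {
	// A handler whose processing timed out and recorded 503 on the context.
	timedOut := func(c *Context) {
		c.Result = Result{StatusCode: 503, Keyword: "request timed out"}
	}
	logMiddleware(timedOut)(&Context{})
}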

I also noticed that the request did time out, matching the event.duration and the trace content.

log event

  "event.duration": 3499790781,

trace event

    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },

Since I noticed you have Tail-Based Sampling enabled, would you please share the TBS configuration? I am leaning towards the trace event being mixed up, rather than the log event being wrong.

inge4pres avatar Nov 04 '24 11:11 inge4pres

@inge4pres One interesting thing is that we don't use any TBS configuration in our manifests. APM Server is deployed with the ECK operator, with only the following config customization:

    config:
      apm-server:
        auth:
          anonymous:
            allow_agent:
            - rum-js
            - rum-js-dpeo
            - js-base
            - java
            - dotnet
            - php
            - opentelemetry/cpp
            - python
            - otlp
            - go
            - opentelemetry
            - opentelemetry/webjs
            - opentelemetry/js
            - opentelemetry/go
            - opentelemetry/java
            - opentelemetry/nodejs
            - opentelemetry/dotnet
            - nodejs
            - '@microlabs/otel-workers-sdk/js'
            enabled: true
            rate_limit:
              event_limit: 8000
              ip_limit: 1000
          api_key:
            enabled: false
            limit: 100
        capture_personal_data: true
        default_service_environment: undefined
        expvar.enabled: false
        host: 0.0.0.0:8200
        idle_timeout: 45s
        max_connections: 0
        max_event_size: 307200
        max_header_size: 1048576
        pprof.enabled: false
        read_timeout: 600s
        rum:
          allow_headers:
          - x-requested-with
          - access-control-request-private-network
          - access-control-allow-origin
          - xmlhttprequest
          - request-origin
          allow_origins:
          - '*'
          enabled: true
          exclude_from_grouping: ^/webpack
          library_pattern: node_modules|bower_components|~
        shutdown_timeout: 30s
        ssl:
          supported_protocols:
          - TLSv1.2
          - TLSv1.3
        write_timeout: 30s
      logging.level: warning
      monitoring.elasticsearch: {}

up2neck avatar Nov 06 '24 14:11 up2neck

mhmhm ok... I thought TBS was on because I saw this in the trace body

    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },
      "representative_count": 1,
      "name": "POST /intake/v2/events",
      "id": "d91bb495a6ab25bb",
      "span_count": {
        "dropped": 0,
        "started": 0
      },
      "type": "request",
      "sampled": true
    },

that "sampled": true should be signaling sampling is on, but honestly I am not sure.

inge4pres avatar Nov 06 '24 16:11 inge4pres

@inge4pres that's also set for head-based sampled transactions.

axw avatar Nov 07 '24 02:11 axw

@inge4pres Got it, I will disable sampling and check the log and trace correlation.

up2neck avatar Nov 18 '24 07:11 up2neck

@up2neck let's recap what we have here.

Since I am unable to replicate the reported behavior in an Elastic Cloud ESS instance nor locally, please confirm my understanding below is correct.

A Java client app (client in the sense of an APM client agent) sends an event to APM Server with an HTTP request that times out at ~3.5 seconds. The timeout happens while APM Server is processing the request inside the /intake/v2/events handler, but we still get a valid trace stored in Elasticsearch. This could be due to the request context expiring or being canceled while Elasticsearch is processing the document: APM Server reports a failure, but the document is actually stored by ES at a later time. The trace ID reported in the trace event (generated by the Java client) is persisted in Elasticsearch, together with the original HTTP 4xx status code. Normally, this event would not be logged as an error; it is logged because an error occurred while processing the request. The logged event reports a different status code than the original trace event (503 in the log event vs 400 in the trace event).

Is it plausible that the connection from APM Server to Elasticsearch being interrupted triggers this behavior? My assumption is that the trace represents the reply from the APM Server HTTP server to the Java client, whereas the log message represents the processing of the event in the handler, which failed due to the timeout.

To be honest, I am still puzzled as to what's actually happening, because the log message reports a duration smaller than the trace's by roughly 50 microseconds.

log event

  "event.duration": 3499790781,

trace event

    "event": {
      "agent_id_status": "missing",
      "ingested": "2024-10-02T07:22:05Z",
      "success_count": 1,
      "outcome": "success"
    },
    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 3499844
      },

The event.outcome field is set to "success", and event.ingested is the time the event was flushed to Elasticsearch ("ingested": "2024-10-02T07:22:05Z"), but the log message is prior to that ("@timestamp": "2024-10-02T07:22:04.439Z"). The trace event correctly reports its own timestamp ("@timestamp": "2024-10-02T07:22:00.940Z").

You said you fetched the trace document directly from ES, which makes sense in the scenario I am imagining: the trace event is processed by APM Server in 3.5 seconds, a 4xx is returned to the client, ingestion into Elasticsearch times out or the connection is truncated, a log message is produced for the same trace, and the event lands in ES anyway.

Do you have telemetry in your Kubernetes environment that is able to spot network connections being reset as part of pod-to-pod communication? I honestly don't think we have an issue in apm-server, as the logged message and the trace body are for 2 different events, AFAICT.

inge4pres avatar Dec 02 '24 18:12 inge4pres

@up2neck thanks for the bug report. I managed to reproduce it.

apm-server request log

{"log.level":"error","@timestamp":"2024-12-07T00:18:06.391Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"curl/8.5.0","source.address":"127.0.0.1","trace.id":"7d605e1eeeed169c7580ea71d20cfa3c","transaction.id":"7d605e1eeeed169c","http.request.id":"7d605e1eeeed169c","event.duration":266192314,"http.request.body.bytes":303694,"http.response.status_code":503,"error.message":"request timed out","ecs.version":"1.6.0"}

apm-server self-instrumentation trace document

{
  "_index": ".ds-traces-apm-default-2024.12.06-000001",
  "_id": "qU57npMBrE4DWVdURCBD",
  "_version": 1,
  "_score": 0,
  "_source": {
    "agent": {
      "name": "go",
      "version": "2.6.0"
    },
    "process": {
      "args": [
        "/home/carson/.cache/JetBrains/GoLand2024.3/tmp/GoLand/___main_local_issue_14232",
        "-e",
        "-c",
        "apm-server-issue-14232.yml"
      ],
      "parent": {
        "pid": 21362
      },
      "pid": 1096402,
      "title": "___main_local_i"
    },
    "source": {
      "ip": "127.0.0.1"
    },
    "processor": {
      "event": "transaction"
    },
    "url": {
      "path": "/intake/v2/events",
      "scheme": "http",
      "port": 8200,
      "domain": "localhost",
      "full": "http://localhost:8200/intake/v2/events"
    },
    "observer": {
      "hostname": "carson-elastic",
      "type": "apm-server",
      "version": "8.14.3"
    },
    "trace": {
      "id": "7d605e1eeeed169c7580ea71d20cfa3c"
    },
    "@timestamp": "2024-12-07T00:18:06.125Z",
    "data_stream": {
      "namespace": "default",
      "type": "traces",
      "dataset": "apm"
    },
    "service": {
      "node": {
        "name": "carson-elastic"
      },
      "name": "apm-server",
      "runtime": {
        "name": "gc",
        "version": "go1.23.0"
      },
      "language": {
        "name": "go",
        "version": "go1.23.0"
      },
      "version": "8.14.3"
    },
    "host": {
      "hostname": "carson-elastic",
      "os": {
        "platform": "linux"
      },
      "name": "carson-elastic",
      "architecture": "amd64"
    },
    "client": {
      "ip": "127.0.0.1"
    },
    "http": {
      "request": {
        "headers": {
          "Accept": [
            "*/*"
          ],
          "User-Agent": [
            "curl/8.5.0"
          ],
          "Content-Length": [
            "303694"
          ],
          "Content-Type": [
            "application/x-ndjson"
          ]
        },
        "method": "POST"
      },
      "response": {
        "headers": {
          "X-Content-Type-Options": [
            "nosniff"
          ],
          "Connection": [
            "Close"
          ],
          "Content-Type": [
            "application/json"
          ]
        },
        "status_code": 400
      },
      "version": "1.1"
    },
    "event": {
      "success_count": 1,
      "outcome": "success"
    },
    "transaction": {
      "result": "HTTP 4xx",
      "duration": {
        "us": 266341
      },
      "representative_count": 1,
      "name": "POST /intake/v2/events",
      "id": "7d605e1eeeed169c",
      "span_count": {
        "dropped": 0,
        "started": 11
      },
      "type": "request",
      "sampled": true
    },
    "user_agent": {
      "original": "curl/8.5.0",
      "name": "curl",
      "device": {
        "name": "Other"
      },
      "version": "8.5.0"
    },
    "span": {
      "id": "7d605e1eeeed169c"
    },
    "timestamp": {
      "us": 1733530686125111
    }
  },
  "fields": {
    "http.request.headers.Content-Length": [
      "303694"
    ],
    "transaction.name.text": [
      "POST /intake/v2/events"
    ],
    "http.request.headers.Accept": [
      "*/*"
    ],
    "http.response.headers.Connection": [
      "Close"
    ],
    "transaction.representative_count": [
      1
    ],
    "user_agent.original.text": [
      "curl/8.5.0"
    ],
    "process.parent.pid": [
      21362
    ],
    "host.hostname": [
      "carson-elastic"
    ],
    "process.pid": [
      1096402
    ],
    "service.language.name": [
      "go"
    ],
    "transaction.result": [
      "HTTP 4xx"
    ],
    "process.title.text": [
      "___main_local_i"
    ],
    "transaction.id": [
      "7d605e1eeeed169c"
    ],
    "http.request.method": [
      "POST"
    ],
    "processor.event": [
      "transaction"
    ],
    "source.ip": [
      "127.0.0.1"
    ],
    "agent.name": [
      "go"
    ],
    "host.name": [
      "carson-elastic"
    ],
    "user_agent.version": [
      "8.5.0"
    ],
    "http.response.status_code": [
      400
    ],
    "http.version": [
      "1.1"
    ],
    "event.outcome": [
      "success"
    ],
    "user_agent.original": [
      "curl/8.5.0"
    ],
    "transaction.duration.us": [
      266341
    ],
    "service.runtime.version": [
      "go1.23.0"
    ],
    "span.id": [
      "7d605e1eeeed169c"
    ],
    "client.ip": [
      "127.0.0.1"
    ],
    "user_agent.name": [
      "curl"
    ],
    "data_stream.type": [
      "traces"
    ],
    "host.architecture": [
      "amd64"
    ],
    "timestamp.us": [
      1733530686125111
    ],
    "url.path": [
      "/intake/v2/events"
    ],
    "observer.type": [
      "apm-server"
    ],
    "observer.version": [
      "8.14.3"
    ],
    "agent.version": [
      "2.6.0"
    ],
    "transaction.name": [
      "POST /intake/v2/events"
    ],
    "process.title": [
      "___main_local_i"
    ],
    "service.node.name": [
      "carson-elastic"
    ],
    "url.scheme": [
      "http"
    ],
    "transaction.sampled": [
      true
    ],
    "trace.id": [
      "7d605e1eeeed169c7580ea71d20cfa3c"
    ],
    "event.success_count": [
      1
    ],
    "transaction.span_count.dropped": [
      0
    ],
    "url.port": [
      8200
    ],
    "http.request.headers.Content-Type": [
      "application/x-ndjson"
    ],
    "url.full": [
      "http://localhost:8200/intake/v2/events"
    ],
    "http.request.headers.User-Agent": [
      "curl/8.5.0"
    ],
    "service.name": [
      "apm-server"
    ],
    "data_stream.namespace": [
      "default"
    ],
    "service.runtime.name": [
      "gc"
    ],
    "process.args": [
      "/home/carson/.cache/JetBrains/GoLand2024.3/tmp/GoLand/___main_local_issue_14232",
      "-e",
      "-c",
      "apm-server-issue-14232.yml"
    ],
    "observer.hostname": [
      "carson-elastic"
    ],
    "url.full.text": [
      "http://localhost:8200/intake/v2/events"
    ],
    "transaction.type": [
      "request"
    ],
    "transaction.span_count.started": [
      11
    ],
    "@timestamp": [
      "2024-12-07T00:18:06.125Z"
    ],
    "service.version": [
      "8.14.3"
    ],
    "host.os.platform": [
      "linux"
    ],
    "data_stream.dataset": [
      "apm"
    ],
    "http.response.headers.Content-Type": [
      "application/json"
    ],
    "http.response.headers.X-Content-Type-Options": [
      "nosniff"
    ],
    "service.language.version": [
      "go1.23.0"
    ],
    "url.domain": [
      "localhost"
    ],
    "user_agent.device.name": [
      "Other"
    ]
  }
}

Notice the same trace id 7d605e1eeeed169c7580ea71d20cfa3c.

The setup I use to reproduce this is

  • local elasticsearch 8.14.3
    • with 1g heap, to slow indexing
  • local apm-server 8.14.3
    • self instrumentation enabled
    • create an indexing bottleneck in apm-server by setting output.elasticsearch.flush_bytes to a low value, e.g. 1b, where a min of 24kb will kick in
    • low max_event_size e.g. 10000
  • a lot of go agents sending high traffic to apm-server
  • a curl command to the intake v2 endpoint with a low --max-time timeout and a payload containing an oversized event in the middle, e.g.
curl --location 'http://localhost:8200/intake/v2/events' --header 'Content-Type: application/x-ndjson' --data-binary "@events2.ndjson" -v --max-time 1

The effect of this is that, while the request from curl is seen and logged as a 503 by apm-server because the client (curl) terminated the connection, the actual response, had the client still been connected, would be a 400, because a line exceeds the max_event_size limit.
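
For completeness, a hypothetical sketch of how such a payload could be generated (the file name events2.ndjson matches the curl command above; the event fields are placeholders, and the exact intake v2 schema may need adjusting), with one line deliberately larger than the max_event_size of 10000 used in the repro config:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// The first line of an intake v2 payload must be a metadata object.
	meta := `{"metadata":{"service":{"name":"repro","agent":{"name":"test","version":"0.0.1"}}}}`
	// A small, ordinary transaction event (field values are placeholders).
	small := `{"transaction":{"id":"0123456789abcdef","trace_id":"0123456789abcdef0123456789abcdef","name":"GET /","type":"request","duration":1,"span_count":{"started":0}}}`
	// One oversized line in the middle, well above max_event_size=10000.
	big := `{"transaction":{"id":"fedcba9876543210","trace_id":"fedcba9876543210fedcba9876543210","name":"` +
		strings.Repeat("x", 20000) +
		`","type":"request","duration":1,"span_count":{"started":0}}}`

	payload := strings.Join([]string{meta, small, big, small}, "\n") + "\n"
	if err := os.WriteFile("events2.ndjson", []byte(payload), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}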

As suspected in the initial bug report, this is caused by intake v2 endpoint middleware order and response code manipulation.

As for the actual scenario that caused what @up2neck observed, we'll need to dig deeper for a more plausible root cause. I have only tried with max_event_size, which may not be realistic, as the Java agent might not send such a large event and apm-server has a high enough default max_event_size.

carsonip avatar Dec 07 '24 00:12 carsonip

I've updated the above comment to avoid some confusion.

~~TLDR of the issue is that because of ES backpressure, APM agent request times out and terminates the connection. apm-server TimeoutMiddleware thinks that ctx is canceled, LogMiddleware logs it as 503, writeStreamResult changes it to 400 because of an invalid line. Therefore, self instrumentation tracer thinks the response is 400, while it is logged as 503 and nothing is actually returned to the agent.~~

The issue is now clear and should be ready for the team to move forward to fix the bug.

carsonip avatar Dec 09 '24 12:12 carsonip

@carsonip @inge4pres I collected some more examples of inconsistent logs and traces. Unfortunately, most of the payload is either obfuscated or removed due to the company privacy policy, but I hope it is useful even so.

evidence_sanitized.zip

up2neck avatar Dec 13 '24 12:12 up2neck

I collected some more examples of non-consistent logs and traces.

@up2neck thanks a lot for that 🙏🏼

inge4pres avatar Dec 23 '24 10:12 inge4pres

@inge4pres raised some questions about my explanation and we managed to get to the bottom of it. I've crossed out the incorrect parts in my previous comment. I had a wrong assumption about the actual middleware order, although I managed to reproduce the problem correctly.

The real issue here is that the 4xx is set (by setting c.Result) and sent to the client (which may or may not be disconnected) using c.WriteResult() inside writeResult, and is therefore intercepted by apmhttp (the Elastic APM agent used for self-instrumentation). On the other hand, the timeout and log middlewares read and write c.Result but never use c.WriteResult() to send out the modified result. These add up to the discrepancy in status code, where the status code apparently manages to "skip" some middlewares.

In other words, the timeout middleware is missing a c.WriteResult call, but even if it had one, it would conflict with the c.WriteResult inside the intake v2 handler's writeStreamResult.
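
To make the set-vs-send distinction concrete, here is a hypothetical, heavily simplified sketch (not the real apm-server types): the inner handler writes a 400 to the wire, which is what the self-instrumentation sees, while an outer middleware later overwrites the shared result to 503 without ever sending it, which is what gets logged.

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Context is a stand-in for the per-request context shared by middlewares.
type Context struct {
	StatusCode int                 // the result middlewares read and update
	w          http.ResponseWriter // what the client (and apmhttp) actually sees
	written    bool
}

// WriteResult sends the currently recorded status code to the client, once.
func (c *Context) WriteResult() {
	if !c.written {
		c.w.WriteHeader(c.StatusCode)
		c.written = true
	}
}

type Handler func(*Context)

// intakeHandler plays the role of the intake v2 handler: it decides on a 400
// and sends it immediately.
func intakeHandler(c *Context) {
	c.StatusCode = http.StatusBadRequest
	c.WriteResult() // 400 goes out on the wire
}

// timeoutMiddleware mutates the shared result afterwards but never calls
// WriteResult, so the client response and the recorded (logged) result diverge.
func timeoutMiddleware(next Handler) Handler {
	return func(c *Context) {
		next(c)
		c.StatusCode = http.StatusServiceUnavailable // pretend the ctx expired
	}
}

func main() {
	rec := httptest.NewRecorder()
	c := &Context{w: rec}
	timeoutMiddleware(intakeHandler)(c)
	fmt.Println("sent to client:", rec.Code)      // 400
	fmt.Println("recorded result:", c.StatusCode) // 503 -> this is what gets logged
}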

Image

@inge4pres and I are working on a fix.

carsonip avatar Dec 30 '24 14:12 carsonip

@carsonip

Great news! Thank you for the detailed explanation; it's very useful for understanding some parts of APM Server and how they work.

up2neck avatar Jan 01 '25 19:01 up2neck

@carsonip A thought crossed my mind: could such behavior also affect APM Server metrics? That could potentially have worse consequences for end users, if the metrics aren't accurate.

up2neck avatar Jan 06 '25 17:01 up2neck

Hello @up2neck, for your reference, the fix for this issue has been merged in https://github.com/elastic/apm-server/pull/15123 and should be distributed as part of the next patch release.

1pkg avatar Jan 07 '25 19:01 1pkg

@1pkg Hi! Thank you, I'm looking forward to the next release.

up2neck avatar Jan 07 '25 19:01 up2neck