apm-server icon indicating copy to clipboard operation
apm-server copied to clipboard

apm-server exits for unknown reason, logs can see otpl related stack

Open leehow1988 opened this issue 1 year ago • 3 comments

APM Server version (8.8.1 and 8.10.0 and 8.14.3):

apm-server exits for unknown reason, logs can see otpl related stack

Steps to reproduce:

We have a lot of services in our test environment, some of them use OpenTelemetry, some of them use Elastic APM Agent We found today that apm-server exits shortly after starting, as shown in the attached file

I started ELK using docker-compose,yaml, and the part about apm-server is as follows:

  apm-server:
    image: docker.elastic.co/apm/apm-server:8.14.3
    cap_add: ["CHOWN", "DAC_OVERRIDE", "SETGID", "SETUID"]
    cap_drop: ["ALL"]
    ports:
      - 8200:8200
    depends_on:
      - elasticsearch
      - kibana
    networks:
      - elk
    volumes:
      - ./apm-server/config/apm-server.yml:/usr/share/apm-server/apm-server.yml
      - /home/elk_data/apm-server/logs:/usr/share/apm-server/logs
      - /home/elk_data/apm-server/data:/usr/share/apm-server/data
    healthcheck:
      interval: 10s
      retries: 12
      test: curl --write-out 'HTTP %{http_code}' --fail --silent --output /dev/null http://localhost:8200/

apm-server.yaml content

apm-server:
  host: "0.0.0.0:8200"

  kibana:
    enabled: true
    host: "192.168.3.155:5601"
    username: "elastic"
    password: "***"

#-------------------------- Elasticsearch output --------------------------
output.elasticsearch:
  hosts: ["192.168.3.155:9200"]
  enabled: true
  username: "elastic"
  password: "***"

docker inspect elk-apm-server-1 |grep ExitCode -C 10

"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 2,
    "Error": "",
    "StartedAt": "2024-07-16T08:08:36.738792834Z",
    "FinishedAt": "2024-07-16T08:11:38.74994692Z",
    "Health": {
        "Status": "unhealthy",
        "FailingStreak": 17,
        "Log": [
            {
                "Start": "2024-07-16T16:10:52.268632964+08:00",
                "End": "2024-07-16T16:10:52.7597741+08:00",
                "ExitCode": 127,
                "Output": "/bin/sh: 1: curl: not found\n"
            },
            {
                "Start": "2024-07-16T16:11:02.763336696+08:00",
                "End": "2024-07-16T16:11:02.860034212+08:00",
                "ExitCode": 127,
                "Output": "/bin/sh: 1: curl: not found\n"
            },
            {
                "Start": "2024-07-16T16:11:12.891897649+08:00",
                "End": "2024-07-16T16:11:12.998785585+08:00",
                "ExitCode": 127,
                "Output": "/bin/sh: 1: curl: not found\n"
            },
            {
                "Start": "2024-07-16T16:11:23.005486378+08:00",
                "End": "2024-07-16T16:11:23.096787177+08:00",
                "ExitCode": 127,
                "Output": "/bin/sh: 1: curl: not found\n"
            },
            {
                "Start": "2024-07-16T16:11:33.098991014+08:00",
                "End": "2024-07-16T16:11:33.258413466+08:00",
                "ExitCode": 127,
                "Output": "/bin/sh: 1: curl: not found\n"
            }
        ]
    }
}

Provide logs (if relevant): see attachment file

error.log

leehow1988 avatar Jul 16 '24 09:07 leehow1988

The core of the issue is shown in your error log:

panic: interface conversion: interface {} is nil, not string
goroutine 3684 [running]:
github.com/elastic/apm-data/input/otlp.setLabel({0xc00502e960, 0x1b}, 0xc003ccef20, {0x1c9da00?, 0xc000ca7710?})
        github.com/elastic/[email protected]/input/otlp/metadata.go:508 +0x5f1
github.com/elastic/apm-data/input/otlp.TranslateTransaction.func1({0xc004fcc600, 0x1b}, {0xc0012a8e50?, 0xc0050564dc?})
        github.com/elastic/[email protected]/input/otlp/traces.go:290 +0x8be
go.opentelemetry.io/collector/pdata/pcommon.Map.Range({0xc0041ebaa0?, 0xc0050564dc?}, 0xc000b7a978)
        go.opentelemetry.io/collector/[email protected]/pcommon/map.go:222 +0x97

I managed to reproduce the issue by crafting an invalid otlp payload. It is invalid because according to OTel spec, arrays MUST be homogeneous, and nulls SHOULD NOT be allowed in arrays.

Please double check your OTel collectors and understand why they are sending attributes that do not comply to OTel spec. At the same time, we will improve handling in apm-server to handle this kind of invalid payload.

carsonip avatar Jul 19 '24 11:07 carsonip

thanks for reply. Hope that later versions will ignore this invalid payload and let the apm-server run normally.

leehow1988 avatar Jul 25 '24 06:07 leehow1988

Released v1.11.0 (https://github.com/elastic/apm-data/releases/tag/v1.11.0)

@rubvs let's wait for dependabot to pick up the new version and update it in apm-server 🙂

kruskall avatar Aug 28 '24 16:08 kruskall

Dependabot PR Merged: https://github.com/elastic/apm-server/pull/13950

kruskall avatar Aug 30 '24 11:08 kruskall

When can the new release be expected for apm-server?

matglas avatar Sep 02 '24 06:09 matglas

When can the new release be expected for apm-server?

@matglas, the fix is scheduled to be shipped in 8.15.2

rubvs avatar Sep 02 '24 17:09 rubvs

H I updated to 8.15.2 and I'm sill getting some "panic: interface conversion: interface {} is nil, not string" issues Is it possible the issue isn't entirely fixed ?

kikmon avatar Oct 09 '24 01:10 kikmon

Hi @kikmon, can you give some more details please; setup, input, output expected, error logs, etc. Thanks

rubvs avatar Oct 09 '24 12:10 rubvs

Setup is quite basic. Just running 8.15.2 windows version of ES and APM. Input is a Jenkins server using the opentelemetry plugin apm_server works fine with an empty test instance of Jenkins, but when fed with the real instance, apm_server panics after a few seconds. in the latest log, there seem to be some tags with empty value. Maybe that's what's making apm panic (would be nice to know what stat exactly made it panic)

Here's the full callstack if it helps panic: interface conversion: interface {} is nil, not string

goroutine 172 [running]: github.com/elastic/apm-data/input/otlp.setLabel({0xc0003a1600, 0x1b}, 0xc000a98000, {0x20b95c0?, 0xc000586a98?}) github.com/elastic/[email protected]/input/otlp/metadata.go:532 +0x5f1 github.com/elastic/apm-data/input/otlp.TranslateTransaction.func1({0xc00011b1a0, 0x1b}, {0xc00044a570?, 0xc0019b6b08?}) github.com/elastic/[email protected]/input/otlp/traces.go:295 +0x90a go.opentelemetry.io/collector/pdata/pcommon.Map.Range({0xc0003118e0?, 0xc0019b6b08?}, 0xc0006da968) go.opentelemetry.io/collector/[email protected]/pcommon/map.go:222 +0x97 github.com/elastic/apm-data/input/otlp.TranslateTransaction({0xc0003118e0?, 0xc0019b6b08?}, {0xc000311940?, 0xc0019b6b08?}, {0xc000344070?, 0xc0019b6b08?}, 0xc000a98000) github.com/elastic/[email protected]/input/otlp/traces.go:269 +0x2b7 github.com/elastic/apm-data/input/otlp.(*Consumer).convertSpan(0xc0002abb60, {0xc000311880?, 0xc0019b6b08?}, {0xc000344070?, 0xc0019b6b08?}, 0xc0004d6840, 0x0, 0xc0005869c0) github.com/elastic/[email protected]/input/otlp/traces.go:209 +0x8da github.com/elastic/apm-data/input/otlp.(*Consumer).convertScopeSpans(0xc0002abb60, {0xc000344070?, 0xc0019b6b08?}, 0xc0004d6840, 0x0, 0xc0005869c0) github.com/elastic/[email protected]/input/otlp/traces.go:151 +0x70 github.com/elastic/apm-data/input/otlp.(*Consumer).convertResourceSpans(0xc0002abb60, {0xc0004261e0?, 0xc0019b6b08?}, {0xc000a84e00?, 0x1?, 0x3cb9340?}, 0xc0005869c0) github.com/elastic/[email protected]/input/otlp/traces.go:139 +0x1ac github.com/elastic/apm-data/input/otlp.(*Consumer).ConsumeTracesWithResult(0xc0002abb60, {0x28dcfb8, 0xc00132c3f0}, {0xc00052e180?, 0xc0019b6b08?}) github.com/elastic/[email protected]/input/otlp/traces.go:115 +0x332 github.com/elastic/apm-server/internal/beater/otlp.(*tracesService).Export(0xc0004fe238, {0x28dcfb8, 0xc00132c3f0}, {0xc00052e180?, 0xc0019b6b08?}) github.com/elastic/apm-server/internal/beater/otlp/grpc.go:100 +0x9d go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp.rawTracesServer.Export({{0x28d3690?, 0xc0004fe238?}}, {0x28dcfb8, 0xc00132c3f0}, 0xc00052e180) go.opentelemetry.io/collector/[email protected]/ptrace/ptraceotlp/grpc.go:89 +0xea go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler.func1({0x28dcfb8?, 0xc00132c3f0?}, {0x24309c0?, 0xc00052e180?}) go.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/trace/v1/trace_service.pb.go:310 +0xcb github.com/elastic/apm-server/internal/beater.(*Runner).Run.AnonymousRateLimit.func22({0x28dcfb8, 0xc00132c3f0}, {0x24309c0, 0xc00052e180}, 0x18d0165?, 0xc00052e300) github.com/elastic/apm-server/internal/beater/interceptors/ratelimit.go:59 +0x174 google.golang.org/grpc.getChainUnaryHandler.func1({0x28dcfb8, 0xc00132c3f0}, {0x24309c0, 0xc00052e180}) google.golang.org/[email protected]/server.go:1196 +0xb2 github.com/elastic/apm-server/internal/beater.(*Runner).Run.Auth.func21({0x28dcfb8, 0xc00132c330}, {0x24309c0, 0xc00052e180}, 0xc000f14560?, 0xc000a84dc0) github.com/elastic/apm-server/internal/beater/interceptors/auth.go:76 +0x16a google.golang.org/grpc.getChainUnaryHandler.func1({0x28dcfb8, 0xc00132c330}, {0x24309c0, 0xc00052e180}) google.golang.org/[email protected]/server.go:1196 +0xb2 github.com/elastic/apm-server/internal/beater.(*Runner).Run.Timeout.func20({0x28dcfb8?, 0xc00132c330?}, {0x24309c0?, 0xc00052e180?}, 0xc000f14560?, 0xc00052e300?) github.com/elastic/apm-server/internal/beater/interceptors/timeout.go:40 +0x2e google.golang.org/grpc.getChainUnaryHandler.func1({0x28dcfb8, 0xc00132c330}, {0x24309c0, 0xc00052e180}) google.golang.org/[email protected]/server.go:1196 +0xb2 github.com/elastic/apm-server/internal/beater/interceptors.Metrics.(*metricsInterceptor).Interceptor.func1({0x28dcfb8, 0xc00132c330}, {0x24309c0, 0xc00052e180}, 0xc000f14560, 0xc000a84d40) github.com/elastic/apm-server/internal/beater/interceptors/metrics.go:92 +0x26c google.golang.org/grpc.getChainUnaryHandler.func1({0x28dcfb8, 0xc00132c330}, {0x24309c0, 0xc00052e180}) google.golang.org/[email protected]/server.go:1196 +0xb2 github.com/elastic/apm-server/internal/beater.(*Runner).Run.Logging.func19({0x28dcfb8, 0xc00132c330}, {0x24309c0, 0xc00052e180}, 0xc000f14560, 0xc000a84c80) github.com/elastic/apm-server/internal/beater/interceptors/logging.go:54 +0x1ce google.golang.org/grpc.getChainUnaryHandler.func1({0x28dcfb8, 0xc00132c330}, {0x24309c0, 0xc00052e180}) google.golang.org/[email protected]/server.go:1196 +0xb2 github.com/elastic/apm-server/internal/beater.(*Runner).Run.ClientMetadata.func18({0x28dcfb8, 0xc00132c2d0}, {0x24309c0, 0xc00052e180}, 0xc000f14560?, 0xc000a84c40) github.com/elastic/apm-server/internal/beater/interceptors/metadata.go:66 +0x276 google.golang.org/grpc.getChainUnaryHandler.func1({0x28dcfb8, 0xc00132c2d0}, {0x24309c0, 0xc00052e180}) google.golang.org/[email protected]/server.go:1196 +0xb2 go.elastic.co/apm/module/apmgrpc/v2.NewUnaryServerInterceptor.func1({0x28dcfb8?, 0xc00132c2d0?}, {0x24309c0?, 0xc00052e180?}, 0xc000f14560?, 0xc00052e300?) go.elastic.co/apm/module/apmgrpc/[email protected]/server.go:71 +0xf6 google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1({0x28dcfb8, 0xc00132c2d0}, {0x24309c0, 0xc00052e180}, 0xc000f14560, 0x80?) google.golang.org/[email protected]/server.go:1187 +0x85 go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler({0x21dd100, 0xc0003af190}, {0x28dcfb8, 0xc00132c2d0}, 0xc000286680, 0xc0003a86c0) go.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/trace/v1/trace_service.pb.go:312 +0x143 google.golang.org/grpc.(*Server).processUnaryRPC(0xc000314200, {0x28dcfb8, 0xc00132c210}, {0x28e7ba0, 0xc000638600}, 0xc000352360, 0xc0002eaf90, 0x3c7dba0, 0x0) google.golang.org/[email protected]/server.go:1379 +0xdf8 google.golang.org/grpc.(*Server).handleStream(0xc000314200, {0x28e7ba0, 0xc000638600}, 0xc000352360) google.golang.org/[email protected]/server.go:1790 +0xe8b google.golang.org/grpc.(*Server).serveStreams.func2.1() google.golang.org/[email protected]/server.go:1029 +0x8b created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 157 google.golang.org/[email protected]/server.go:1040 +0x125

kikmon avatar Oct 10 '24 06:10 kikmon

Hi @kikmon , sorry for the late response. The fix was not introduced in 8.15.2 and will be implemented in an upcoming release.

rubvs avatar Oct 15 '24 13:10 rubvs

Thanks, it explains a lot :)

kikmon avatar Oct 16 '24 08:10 kikmon