
[BUG] serverless-init agent swallows any errors thrown by underlying code

Open ajmayes opened this issue 1 year ago • 9 comments

Agent Environment serverless-init:1

Describe what happened: The JAR kicked off by the bash startup script exits with code 1, but serverless-init exits with code 0.

Describe what you expected: I would expect serverless-init to exit with code 1, to indicate that the application run was unsuccessful. This is important because Google Cloud Run uses the exit code to decide whether the Job in this specific instance should be re-run.

Steps to reproduce the issue:

  1. Make a simple Spring Boot application that will fail on startup.
  2. Create a startup script like this one:

#!/bin/bash

exec java ${JAVA_OPTS:-} -jar /opt/application/app.jar

  3. Use a Dockerfile like the following to initialize:

FROM azul/zulu-openjdk:17-latest

RUN groupadd application && useradd -g application application

COPY --from=gcr.io/datadoghq/serverless-init:1 /datadog-init /app/datadog-init
ADD https://dtdg.co/latest-java-tracer /dd_tracer/java/dd-java-agent.jar

RUN chown -R application /app/datadog-init /dd_tracer/java/dd-java-agent.jar

USER application

COPY ./build/libs/cloud-run-task-example-0.0.1-SNAPSHOT.jar /opt/application/app.jar
COPY startApp.sh /opt/application/startApp.sh

ENTRYPOINT ["/app/datadog-init"]
CMD ["/opt/application/startApp.sh"]

  4. Notice that the application exits with exit code 1, but the Docker container exits with exit code 0. That is, serverless-init captures the exit code but does nothing with it:

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Error exiting: exit status 1

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | INFO | Triggering a flush in the logs-agent

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Flush in the logs-agent done.

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | finished flushing

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Received a Flush trigger

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Demultiplexer: sendIterableSeries: start sending iterable series to the serializer

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | The payload was not too big, returning the full payload

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | SyncForwarder has flushed 1 transactions

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Demultiplexer: sendIterableSeries: stop routine

This can be seen in the code: initcontainer.Run()
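One way to confirm the mismatch locally is to print the exit status that the shell sees after the container stops. The following is a minimal sketch: `show_exit_code` is a hypothetical helper name, and the image tag in the comment is an assumed placeholder, not something taken from this report.

```shell
# Run a command and print the exit status it reported.
# show_exit_code is a hypothetical helper used for illustration only.
show_exit_code() {
    rc=0
    "$@" || rc=$?          # keep the status of the command that was run
    echo "exit code: $rc"
}

# Against an image built from the Dockerfile above (the tag is an assumed placeholder):
#   show_exit_code docker run --rm cloud-run-task-example
# Stand-in for a failing application, runnable without Docker:
show_exit_code sh -c 'exit 1'
```

With the behavior described in this issue, the Docker invocation would print 0 even though the application itself exited with 1.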

Additional environment details (Operating System, Cloud provider, etc): Google Cloud Run, Java (Spring Boot)

ajmayes avatar Jan 05 '24 02:01 ajmayes

Same problem, exactly as described. It means that Cloud Run jobs always report success even when the task failed.

tjosgood avatar Feb 18 '24 13:02 tjosgood

same problem, exactly as described

sjmunoz avatar Mar 26 '24 15:03 sjmunoz

Same issue, looking for workarounds

ilkerc avatar Mar 28 '24 10:03 ilkerc

A similar issue was addressed by a bug fix in version 1.1.2:

Fixes propagation of OS signals

https://hub.docker.com/r/datadog/serverless-init

Can you try updating to that version or later and check whether it still happens?

alexgallotta avatar Mar 28 '24 17:03 alexgallotta

It looks like that fix went out 5 months ago. I've been using latest this entire time and the issue is still there.

ajmayes avatar Mar 28 '24 17:03 ajmayes

Thanks for confirming, we will add to our issue list and look into that as soon as possible!

alexgallotta avatar Mar 28 '24 18:03 alexgallotta

Maybe this can motivate:

Waking up everyday, checking that issue's state,
One of these days, D'dog will fix my fate.
Hoping this verse I penned will accelerate,
Before my code turns into something we all hate.

ilkerc avatar May 04 '24 10:05 ilkerc

I reached out to Datadog support and created a ticket for this. The response was that they didn't officially support Cloud Run Jobs, which is a bit confusing to me: in GCP the two are grouped together, so you would assume it's more or less the same, and you do get some Cloud Run Jobs-specific metrics. I don't know how this affects "normal" Cloud Run applications. Anyway, they had an open feature request to support this, so they added me to the list of interested users in the hope of getting it prioritized.

crea1 avatar Jun 17 '24 13:06 crea1

Also running into this issue using Cloud Run Jobs and datadog/serverless-init.

Would be great if datadog-init could preserve the status code of the subprocess it wraps.

jfgreen-liberis avatar Jul 02 '24 13:07 jfgreen-liberis

Hi all! I wanted to share that we currently have a fix in progress to return an exit code of 1 if there is an error during the application run. I'll share here once we've released a version of serverless-init that does not swallow errors.

https://github.com/DataDog/datadog-agent/pull/27259

Update: instead of always returning exit code 1 on an error, serverless-init will attempt to propagate the application's exit code if one is available.
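For readers unfamiliar with the pattern, here is a minimal shell sketch of what "propagating an exit code" means for a wrapper process. It is not the code from the pull request above; `run_and_propagate` is a hypothetical name and the flush step is only a placeholder.

```shell
# Run a wrapped command, do some shutdown work, then return the command's own status.
# run_and_propagate is a hypothetical name; it is not part of serverless-init.
run_and_propagate() {
    rc=0
    "$@" || rc=$?                                  # remember a non-zero status from the child
    echo "flushing telemetry (placeholder)" >&2    # stand-in for the wrapper's shutdown work
    return "$rc"                                   # hand the child's status back to the caller
}

# Usage in an entrypoint script: exit with whatever the application returned.
#   run_and_propagate /opt/application/startApp.sh
#   exit $?
```

The key point is that the status is captured immediately after the child finishes, before any later command can overwrite it, and is then used as the wrapper's own result.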

duncanpharvey avatar Jul 09 '24 20:07 duncanpharvey

serverless-init v1.2.5 is released as of today! With this version moving forward, exit codes from an instrumented application will be propagated by serverless-init. Please feel free to reopen this issue if anyone encounters unexpected behavior related to this feature.

https://hub.docker.com/r/datadog/serverless-init/tags

duncanpharvey avatar Jul 22 '24 15:07 duncanpharvey

Just tested it, it's working. Thanks Duncan!

ajmayes avatar Jul 22 '24 16:07 ajmayes