Trace-Agent Fails to Start with Permission Denied Error After Upgrading to Datadog Agent `7.57.0`
Description
After upgrading to Datadog Agent 7.57.0 from 7.56.2, the trace-agent fails to start due to a permission error with the UDS listener, despite having datadog.apm.socketEnabled set to false.
Configuration
Here is the relevant portion of my values.yaml configuration:
targetSystem: linux
providers:
  aks:
    enabled: true
clusterAgent:
  image:
    doNotCheckTag: true
    tag: 7.57.0
  admissionController:
    configMode: service
    enabled: true
    mutateUnlabelled: true
  env:
    - name: DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_SECURITY_CONTEXT
      value: "{\"capabilities\":{\"drop\":[\"ALL\"]},\"runAsNonRoot\":true,\"runAsUser\":10000,\"readOnlyRootFilesystem\":true,\"allowPrivilegeEscalation\":false,\"seccompProfile\":{\"type\":\"RuntimeDefault\"}}"
datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.com
  apm:
    portEnabled: true
    instrumentation:
      enabled: false
    # I was thinking that this will disable socket and use hostip if socket disabled for trace-agent
    socketEnabled: false
agents:
  containers:
    traceAgent:
      securityContext:
        runAsUser: 100
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
  image:
    doNotCheckTag: true
    tag: 7.57.0
Error Message
The trace-agent logs show the following error:
2024-09-09 17:29:59 UTC | TRACE | CRITICAL | (pkg/trace/api/api.go:712 in func2) | Error creating UDS listener: listen unix /var/run/datadog/apm.socket: bind: permission denied
Steps to Reproduce
- Upgrade Datadog Agent to version 7.57.0.
- Apply the above configuration.
- Observe that the trace-agent fails to start with a permission denied error.
Additional Information
- Previous Version: 7.56.2 (worked fine)
- Environment: Kubernetes on AKS
- Relevant Environment Variables: DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_SECURITY_CONTEXT
- I can see this new config parameter in the new version: trace_agent_socket
Request
Could you please help in diagnosing this issue or provide guidance on how to resolve the permission issue with the UDS listener for the trace-agent?
Not only on k8s, also on a RHEL 9 system here....
To expand on the issue a bit: The crash seems to be caused by the trace-agent process lacking permissions (likely due to the explicit securityContext restrictions in the configuration), combined with a new feature in 7.57 that defines a default UDS listener on /var/run/datadog/apm.socket. When that directory exists, the trace-agent attempts to create the listener at startup. The attempt fails with the permission error, and that failure is what led to the crash.
The fix we have put together in https://github.com/DataDog/datadog-agent/pull/29218 will make sure we don't crash in these circumstances: the trace-agent will just log the error and continue starting up.
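For readers who want to see the behavior described above in isolation, here is a minimal Python sketch of the "log the error and keep going" pattern. It is not the code from the linked PR (the agent itself is written in Go); the function name try_listen_uds and the logger name are made up for illustration.

```python
import logging
import os
import socket
import tempfile

log = logging.getLogger("trace-agent-sketch")  # hypothetical logger name


def try_listen_uds(path):
    """Bind a Unix domain socket at `path`; return the socket, or None on failure."""
    if not os.path.isdir(os.path.dirname(path)):
        # No socket directory present: do not attempt a UDS listener at all.
        return None
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        sock.listen()
    except OSError as err:
        # PermissionError (EACCES) is a subclass of OSError; log it and move on
        # instead of aborting the whole process.
        sock.close()
        log.error("Error creating UDS listener: %s", err)
        return None
    return sock


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with tempfile.TemporaryDirectory() as tmp:
        listener = try_listen_uds(os.path.join(tmp, "apm.socket"))
        if listener is None:
            log.info("UDS listener unavailable; other receivers keep running")
        else:
            log.info("UDS listener ready")
            listener.close()
```

Pointing the function at a directory the current user cannot write to (for example a root-owned /var/run/datadog when running as UID 100) should take the error branch and return None rather than raise.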
Thanks @ichinaski / @FlorentClarret for the responses! I resolved the issue by updating the Helm chart with:
agents:
  env:
    # This works
    - name: DD_APM_RECEIVER_SOCKET
      value: "unix:///var/run/datadog/apm.socket"
    # This does not work
    # - name: DD_APM_RECEIVER_SOCKET
    #   value: "/var/run/datadog/apm.socket"
Even though /var/run/datadog/apm.socket is the default, specifying it without the unix:// prefix caused issues.
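The thread does not say why the prefix matters, so the following is only an illustration of how the two values differ as strings: the first carries a URL scheme, the second is a bare path. The helper socket_path is hypothetical and is not the agent's actual parsing logic; it shows how a strict parser that requires a unix scheme would treat each value.

```python
from urllib.parse import urlsplit


def socket_path(value):
    """Return the filesystem path from a unix:// URL, or None if the scheme is missing."""
    parts = urlsplit(value)
    if parts.scheme != "unix":
        # A bare path such as /var/run/datadog/apm.socket has an empty scheme.
        return None
    return parts.path


if __name__ == "__main__":
    for value in ("unix:///var/run/datadog/apm.socket", "/var/run/datadog/apm.socket"):
        print(repr(value), "->", repr(socket_path(value)))
```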
I also faced issues related to the new log launcher feature, which uses a JSON file under /opt/.... To fix it, I used different versions:
- Cluster-agent: 7.57.0 (for DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_SECURITY_CONTEXT)
- Agent: 7.56.2 (to avoid permission errors).
For future CI stability, I recommend testing with unprivileged user IDs. I’ve raised issue #29286.
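As a rough aid for that kind of testing, here is a small Python sketch that approximates whether a given unprivileged UID could create a socket file in a directory, based only on the classic owner/group/other mode bits. The function name can_create_in_dir is made up for illustration, and the check deliberately ignores ACLs, Linux capabilities, and read-only mounts (such as readOnlyRootFilesystem), so treat a True result as necessary rather than sufficient.

```python
import os
import stat


def can_create_in_dir(mode, owner_uid, owner_gid, uid, gids=()):
    """Approximate POSIX check: may `uid` (with groups `gids`) create entries in the directory?"""
    if uid == 0:
        # root bypasses the mode-bit check
        return True
    if uid == owner_uid:
        write_bit, search_bit = stat.S_IWUSR, stat.S_IXUSR
    elif owner_gid in gids:
        write_bit, search_bit = stat.S_IWGRP, stat.S_IXGRP
    else:
        write_bit, search_bit = stat.S_IWOTH, stat.S_IXOTH
    # Creating a file needs both write and search (execute) permission on the directory.
    return bool(mode & write_bit) and bool(mode & search_bit)


if __name__ == "__main__":
    directory = "/var/run/datadog"  # socket directory from the error message
    if os.path.isdir(directory):
        st = os.stat(directory)
        # UID 100 matches runAsUser in the values.yaml above.
        print(can_create_in_dir(st.st_mode, st.st_uid, st.st_gid, uid=100))
```

For example, a root-owned directory with mode 0755 is not writable by UID 100 under this check, which is consistent with the bind: permission denied error reported above.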
Closing this issue given the fix is now released.