OPAL Client memory consumption increases over time
Describe the bug
We run the OPAL client as a sidecar that fetches policies every 30 seconds and has 1 datasource returning roughly 20 MB of JSON, fetched once an hour. When the pod starts everything looks fine, but over time (possibly with each data fetch) OPAL consumes more and more memory. OPAL starts at 200-300 MB and keeps growing. When the pod is not queried at all, it sits around 300-350 MB. When we start querying OPA it rises to around 420 MB, presumably because OPA caches data; OPA itself then stays stable over time, but overall pod memory consumption continues to grow. For some pods that have 2 datasources (a 20 MB JSON and an 8 MB JSON), the OPAL client pod consumes up to 1.5 GB of RAM under a 2-4 RPS load, which seems very strange. And again, OPA memory consumption remains stable. OPA metrics are exported to Prometheus, and even checking the /proc filesystem shows that it is OPAL that consumes most of the RAM.
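For reference, this is roughly how we compare per-process memory inside the pod; it is only a sketch, assuming a Linux /proc filesystem, and the sampling interval and "top 5" cut-off are arbitrary illustrative choices:

```python
#!/usr/bin/env python3
# Minimal sketch: periodically log per-process RSS inside the pod to confirm
# which process is growing. Assumes a Linux /proc filesystem; the one-minute
# interval and the "top 5" cut-off are arbitrary choices for illustration.
import os
import time

def rss_snapshot():
    """Return a list of (rss_kb, pid, cmdline) for all processes visible in /proc."""
    rows = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                rss_kb = next((int(l.split()[1]) for l in f if l.startswith("VmRSS:")), 0)
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\x00", b" ").decode(errors="replace").strip()
            rows.append((rss_kb, pid, cmdline or f"[pid {pid}]"))
        except (FileNotFoundError, ProcessLookupError):
            continue  # process exited while we were scanning
    return sorted(rows, reverse=True)

if __name__ == "__main__":
    while True:
        for rss_kb, pid, cmdline in rss_snapshot()[:5]:
            print(f"{time.strftime('%H:%M:%S')} {rss_kb // 1024:>6} MiB  pid={pid}  {cmdline[:80]}")
        time.sleep(60)  # sample once per minute
```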
The screenshots were taken from different pods, but they all show the same trend.
To Reproduce
Run OPAL subscribed to a datasource and policy fetching. The datasource should return around 20 MB of data, alongside a fairly small policy. Configure the OPAL server's datasource to refetch the whole data bundle every hour, and policies to be fetched every 30 seconds. Leave the pod running for about 24 hours and observe the increased memory usage. The logs are completely empty and do not show anything interesting.
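To make the reproduction self-contained, a small synthetic datasource that returns ~20 MB of JSON can stand in for the real one. This is only a sketch under our assumptions: FastAPI, the /data path, and the payload shape are illustrative, not the actual service behind our setup.

```python
# Sketch of a stand-in datasource for reproducing the issue: an HTTP endpoint
# that returns roughly 20 MB of JSON. FastAPI, the /data path, and the payload
# shape are illustrative assumptions, not the real service we use in production.
import json

from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()

# Build a ~20 MB payload once at startup so every fetch returns the same body.
_ITEMS = [{"id": i, "attributes": {"role": "viewer", "padding": "x" * 200}}
          for i in range(80_000)]  # roughly 20 MB when serialized
_PAYLOAD = json.dumps({"items": _ITEMS}).encode()

@app.get("/data")
def data() -> Response:
    # OPAL's HTTP fetch provider only needs a JSON body it can load into OPA.
    return Response(content=_PAYLOAD, media_type="application/json")

# Run with: uvicorn datasource:app --port 9000
# Then point the OPAL server's data source entry at http://<host>:9000/data
# and schedule/trigger updates on the "ad-datasource" topic once an hour.
```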
OPAL Client configuration
opal-client:
Ports: 7000/TCP, 8777/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Mon, 24 Feb 2025 10:58:17 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 1500m
memory: 3Gi
Requests:
cpu: 1500m
memory: 1Gi
Liveness: exec [/bin/bash -c response=$(curl -X GET http://localhost:7000/healthcheck | jq -r '.status'); if [ "$response" != "ok" ]; then
echo "Liveness probe. Ping Opal Client failed 📛";
exit 1;
else
echo "Liveness probe. Ping Opal Client success 🚀";
exit 0;
fi
] delay=0s timeout=10s period=30s #success=1 #failure=5
Readiness: exec [/bin/bash -c response=$(curl -X GET http://localhost:7000/healthcheck | jq -r '.status'); if [ "$response" != "ok" ]; then
echo "Readiness probe. Ping Opal Client failed 📛";
exit 1;
else
echo "Readiness probe. Ping Opal Client success 🚀";
exit 0;
fi
] delay=5s timeout=10s period=15s #success=1 #failure=5
Environment:
GOMEMLIMIT: 1073741824 (requests.memory)
GOMAXPROCS: 2 (requests.cpu)
OPAL_POLICY_STORE_POLICY_PATHS_TO_IGNORE: backups/**
OPAL_ENABLE_OPENTELEMETRY_TRACING: true
OPAL_ENABLE_OPENTELEMETRY_METRICS: true
OPAL_OPENTELEMETRY_OTLP_ENDPOINT: jaeger-collector.jaeger.svc.cluster.local:4317
OPAL_LOG_FORMAT: {time:YYYY-MM-DD HH:mm:ss.SSS}|<level>{level:^6} | {process} | {name: <40} | {message}</level>
{exception}
OPAL_SERVER_URL: http://opal-server.namespace.svc.cluster.local:7002
OPAL_DATA_UPDATER_ENABLED: True
OPAL_LOG_FORMAT_INCLUDE_PID: true
OPAL_LOG_LEVEL: INFO
OPAL_INLINE_OPA_LOG_FORMAT: full
OPAL_STATISTICS_ENABLED: true
OPAL_INLINE_OPA_CONFIG: {
"addr": ":8777",
"authentication": "off",
"authorization": "off",
"log_level": "info",
"files": [],
"config_file": "/opa-config/config.yaml"
}
OPAL_INLINE_CEDAR_CONFIG: {
"addr": ":8777",
"authentication": "off",
"authentication_token": "None",
"files": []
}
OPAL_LOG_COLORIZE: false
OPAL_SHOULD_REPORT_ON_DATA_UPDATES: False
OPAL_OPA_HEALTH_CHECK_POLICY_ENABLED: True
OPAL_OFFLINE_MODE_ENABLED: True
OPAL_POLICY_STORE_URL: http://localhost:8777
OPAL_POLICY_UPDATER_ENABLED: True
OPAL_POLICY_SUBSCRIPTION_DIRS: tenants/common-libs/policies:tenants/domain/policies
OPAL_OPAL_CLIENT_STAT_ID: 3718139b-c462-4b0d-bec7-02f615143a27
OPAL_LOG_TRACEBACK: FALSE
OPAL_CLIENT_TOKEN: <client_token>
OPAL_DEFAULT_UPDATE_CALLBACK_CONFIG: {
"method": "post",
"headers": {
"Authorization": "Bearer <token>",
"content-type": "application/json"
},
"process_data": "false"
}
OPAL_DATA_TOPICS: ad-datasource
Expected behavior
Memory consumption should be stable.
OPAL version
OPAL Server version - 0.7.16
OPAL Client version - 0.8.0
Thanks for sharing this. We will create a ticket and send it to prioritization. If you have more data, such as suspected functions, or even have a lead and want to open a PR and contribute to OSS, that is always appreciated.
Ticket id 11896
Update.
We were checking pod memory consumption during the day and, more recently, overnight. As you can see, consumption slowly grows.
The pod was running for 22 hours without restarts.
OPA consumption remains a flat line.
Attaching the available logs.
If we spin up the same container without any datasources subscribed, memory consumption does not increase; the deviations are mostly due to OPA memory usage.
And again, there is no traffic on either pod.
@obsd the only thing we can think of that could cause this problem is the fetching/updating or backup mechanism.
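In case it helps narrow things down, here is a rough way to exercise just the client-side fetch path outside of OPAL: repeatedly download and parse a large JSON document with aiohttp (which, as far as we can tell, is what OPAL's HTTP fetch provider uses) and log the process RSS after each iteration. If RSS keeps climbing here too, the leak is more likely in the fetch/parse path than in the pub/sub or backup logic. The URL and iteration count are placeholders.

```python
# Rough sketch to isolate the client-side fetch path: repeatedly download and
# parse a large JSON document and log current RSS after each iteration.
# aiohttp mirrors what OPAL's HTTP fetch provider uses (to the best of our
# knowledge); the URL and iteration count below are placeholders.
import asyncio

import aiohttp

DATA_URL = "http://localhost:9000/data"  # placeholder: any ~20 MB JSON endpoint

def rss_mb() -> float:
    """Current RSS of this process, read from /proc/self/status (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024
    return 0.0

async def main(iterations: int = 50) -> None:
    async with aiohttp.ClientSession() as session:
        for i in range(iterations):
            async with session.get(DATA_URL) as resp:
                payload = await resp.json()  # parse, as the fetch provider would
            del payload  # release the parsed document before measuring
            print(f"iteration {i:>3}: current RSS = {rss_mb():.0f} MB")

if __name__ == "__main__":
    asyncio.run(main())
```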
Is there any progress on this? It is quite urgent for us, as we already have this issue in our production environment.
Hi, any update on this one?