Possible memory leak with OPAL client
I'm running an OPAL client container in Kubernetes. I recently looked at the memory usage metrics for this container and noticed that memory climbs steadily until it hits 100%. This suggests a possible memory leak in the OPAL client.
![Screen Shot 2021-09-07 at 3 37 39 PM](https://user-images.githubusercontent.com/4023140/132407783-3ab90eda-39ef-4ffa-acb0-3568a7d3024f.png)
Here is some more info:
I'm running the image `authorizon/opal-client-standalone:0.1.12`.
Here's my env var config:
```yaml
- name: OPAL_POLICY_STORE_URL
  value: http://localhost:8181/v1
- name: OPAL_SERVER_URL
  value: http://localhost:7002
- name: OPAL_DATA_UPDATER_ENABLED
  value: "0"
- name: OPAL_KEEP_ALIVE_INTERVAL
  value: "10"
```
And finally the command to run the container:
```yaml
command:
  - "sh"
  - "-c"
  - "/usr/wait-for.sh http://localhost:7002 --timeout=20 -- /start.sh"
```
Hi Grant! :) Thanks for reporting, we'll take a look.
A few things that could help:
- Can you share if and how much data you loaded via OPAL? (perhaps by looking at the OPAL logs)
- Can you glance inside the container and see which process is using most of the memory (Python [OPAL] or OPA)?
CC: @asafc
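One way to answer the "which process is using the memory" question is to scan `/proc` inside the container (e.g. via `kubectl exec`). This is a sketch, not part of OPAL; it is Linux-only and assumes `/proc` is readable:

```python
import os

def rss_by_process():
    """Return (rss_kb, pid, command) tuples sorted by resident memory, descending."""
    results = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
            # Kernel threads have no VmRSS line; treat them as 0 kB.
            rss_kb = int(fields.get("VmRSS", "0 kB").split()[0])
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            results.append((rss_kb, int(pid), comm))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited (or is inaccessible) while we were reading
    return sorted(results, reverse=True)

if __name__ == "__main__":
    for rss_kb, pid, comm in rss_by_process()[:5]:
        print(f"{rss_kb:>10} kB  pid={pid}  {comm}")
```

In a standalone setup the OPA and OPAL client processes live in separate containers, so each container's top entry should make the culprit obvious.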
Hi Grant :) I would also like to ask if you could send us the complete logs for:
- opal server
- opal client
- OPA agent
You bet, @orweis!
I have data loading disabled. The client has only loaded 2 policies from the server.
Also, this is using the `authorizon/opal-client-standalone` image, so OPA is not running in that container; only the OPAL client is.
@grantmcconnaughey just making sure you saw my comment - we need the logs to be able to help you out :)
Yes, understood. I ran into several hours of meetings but will work on getting those logs together.
Here ya go, @asafc.
The client just has a bunch of healthcheck probes:
opal-client logs:

```
[2021-09-07 21:12:06 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-09-07 21:12:06 +0000] [1] [INFO] Listening at: http://0.0.0.0:7000 (1)
[2021-09-07 21:12:06 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-09-07 21:12:06 +0000] [13] [INFO] Booting worker with pid: 13
2021-09-07T21:12:06.618925+0000 |opal_client.policy.updater | INFO | Launching policy updater
2021-09-07T21:12:06.619150+0000 |opal_client.policy.updater | INFO | Subscribing to topics: ['policy:.']
2021-09-07T21:12:06.619339+0000 |fastapi_websocket_pubsub.pub_sub_client | INFO | Trying to connect to Pub/Sub server - ws://localhost:7002/ws
2021-09-07T21:12:06.619547+0000 |fastapi_websocket_rpc.websocket_rpc_c...| INFO | Trying server - ws://localhost:7002/ws
2021-09-07T21:12:06.633847+0000 |opal_client.policy.updater | INFO | Connected to server
2021-09-07T21:12:06.633997+0000 |opal_client.policy.updater | INFO | Refetching policy code (full bundle)
2021-09-07T21:12:06.649489+0000 |opal_client.policy.updater | INFO | got policy bundle, commit hash: '63e34db42659293f925ae81dcd5b892a3b3c9715'
2021-09-07T21:12:06.661916+0000 |fastapi_websocket_pubsub.pub_sub_client | INFO | Connected to PubSub server ws://localhost:7002/ws
2021-09-07T21:12:19.842831+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:40024 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:24.842649+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:40252 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:25.858691+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:40292 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:29.842395+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:40512 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:30.858768+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:40572 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:34.842403+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:40770 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:35.858835+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:40808 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:39.842480+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:41032 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:40.858993+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:41100 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:44.842416+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:41304 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:45.859095+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:41348 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:49.842426+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:41598 - "GET /healthcheck HTTP/1.1" 200
```
The opal-server is in a similar situation:
opal-server logs:

```
[2021-09-07 21:12:05 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-09-07 21:12:05 +0000] [1] [INFO] Listening at: http://0.0.0.0:7002 (1)
[2021-09-07 21:12:05 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-09-07 21:12:05 +0000] [7] [INFO] Booting worker with pid: 7
2021-09-07T21:12:05.868385+0000 |opal_common.authentication.signer | INFO | OPAL was not provided with JWT encryption keys, cannot verify api requests!
2021-09-07T21:12:05.903231+0000 |opal_server.server | INFO | triggered startup event
2021-09-07T21:12:05.903386+0000 |opal_common.topics.publisher | INFO | started topic publisher
2021-09-07T21:12:05.903564+0000 |opal_server.server | INFO | leadership lock acquired, leader pid: 7
2021-09-07T21:12:05.903635+0000 |opal_server.server | INFO | listening on webhook topic: 'webhook'
2021-09-07T21:12:05.904117+0000 |fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': 'a41035dbaa8d4d82a6ed851e04bccc19', 'subscriber_id': 'ea9cca504c5d4818936d1d4cc3aaee47', 'topic': 'webhook', 'callback': functools.partial(<function trigger_repo_watcher_pull at 0x7fda5ddb43a0>, <opal_server.policy.watcher.task.RepoWatcherTask object at 0x7fda5ddc30a0>), 'notifier_id': None}
2021-09-07T21:12:05.904206+0000 |opal_server.policy.watcher.task | INFO | Launching repo watcher
2021-09-07T21:12:05.904587+0000 |opal_common.git.repo_cloner | INFO | Cloning repo from 'https://oauth2:[email protected]/foo/bar.git' to '/regoclone'
2021-09-07T21:12:05.909174+0000 |uvicorn.protocols.http.httptools_impl | INFO | 127.0.0.1:38294 - "GET / HTTP/1.1" 200
2021-09-07T21:12:06.494139+0000 |opal_common.git.repo_cloner | INFO | Clone succeeded
2021-09-07T21:12:06.498029+0000 |opal_common.git.repo_watcher | INFO | Polling task is off
2021-09-07T21:12:06.624958+0000 |fastapi_websocket_rpc.websocket_rpc_e...| INFO | Client connected
2021-09-07T21:12:06.631769+0000 |fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': 'c940050ded3f4d50a5884078588f15ce', 'subscriber_id': 'ec6c4eb590e249b594e4cd33bd5b5e90', 'topic': 'policy:.', 'callback': <function RpcEventServerMethods.subscribe.<locals>.callback at 0x7fda5db3a700>, 'notifier_id': None}
2021-09-07T21:12:06.648527+0000 |uvicorn.protocols.http.httptools_impl | INFO | 127.0.0.1:38404 - "GET /policy?path=. HTTP/1.1" 200
2021-09-07T21:12:24.214110+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:60590 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:28.734494+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:60820 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:29.213933+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:60846 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:33.734618+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:32848 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:34.213902+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:32870 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:38.734621+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33102 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:39.213910+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33130 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:43.734557+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33384 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:44.213884+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33410 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:48.734498+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33680 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:49.213854+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33704 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:53.734488+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33924 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:54.213895+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:33954 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:58.734434+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:34228 - "GET /healthcheck HTTP/1.1" 200
2021-09-07T21:12:59.213908+0000 |uvicorn.protocols.http.httptools_impl | INFO | 10.81.69.3:34266 - "GET /healthcheck HTTP/1.1" 200
```
And finally the OPA container itself:
OPA logs:

```
{"client_addr":"10.81.69.3:43524","level":"info","msg":"Received request.","req_id":17572,"req_method":"GET","req_path":"/health","time":"2021-09-08T09:24:00Z"}
{"client_addr":"10.81.69.3:43524","level":"info","msg":"Sent response.","req_id":17572,"req_method":"GET","req_path":"/health","resp_bytes":2,"resp_duration":0.235926,"resp_status":200,"time":"2021-09-08T09:24:00Z"}
{"client_addr":"10.81.69.3:43594","level":"info","msg":"Received request.","req_id":17573,"req_method":"GET","req_path":"/","time":"2021-09-08T09:24:01Z"}
{"client_addr":"10.81.69.3:43594","level":"info","msg":"Sent response.","req_id":17573,"req_method":"GET","req_path":"/","resp_bytes":1391,"resp_duration":0.111187,"resp_status":200,"time":"2021-09-08T09:24:01Z"}
{"client_addr":"10.81.69.3:43850","level":"info","msg":"Received request.","req_id":17574,"req_method":"GET","req_path":"/health","time":"2021-09-08T09:24:05Z"}
{"client_addr":"10.81.69.3:43850","level":"info","msg":"Sent response.","req_id":17574,"req_method":"GET","req_path":"/health","resp_bytes":2,"resp_duration":0.247038,"resp_status":200,"time":"2021-09-08T09:24:05Z"}
{"client_addr":"10.81.69.3:43924","level":"info","msg":"Received request.","req_id":17575,"req_method":"GET","req_path":"/","time":"2021-09-08T09:24:06Z"}
{"client_addr":"10.81.69.3:43924","level":"info","msg":"Sent response.","req_id":17575,"req_method":"GET","req_path":"/","resp_bytes":1391,"resp_duration":0.117939,"resp_status":200,"time":"2021-09-08T09:24:06Z"}
{"client_addr":"10.81.69.3:44218","level":"info","msg":"Received request.","req_id":17576,"req_method":"GET","req_path":"/health","time":"2021-09-08T09:24:10Z"}
{"client_addr":"10.81.69.3:44218","level":"info","msg":"Sent response.","req_id":17576,"req_method":"GET","req_path":"/health","resp_bytes":2,"resp_duration":0.247871,"resp_status":200,"time":"2021-09-08T09:24:10Z"}
{"client_addr":"10.81.69.3:44266","level":"info","msg":"Received request.","req_id":17577,"req_method":"GET","req_path":"/","time":"2021-09-08T09:24:11Z"}
{"client_addr":"10.81.69.3:44266","level":"info","msg":"Sent response.","req_id":17577,"req_method":"GET","req_path":"/","resp_bytes":1391,"resp_duration":0.125741,"resp_status":200,"time":"2021-09-08T09:24:11Z"}
```
Unfortunately the logs don't tell us much. Maybe you can set `OPAL_LOG_LEVEL=DEBUG` and try again?
Let me try to reproduce this myself in my own k8s setup. I'll try to get to it tomorrow or early next week :)
Sounds good! I'll throw a `DEBUG` in there on my end and see if anything else turns up.
The logs aren't showing anything useful, but I do want to report that we're still seeing spiky memory for OPAL client containers. This is maximum memory for the past 2 weeks. So far we're just syncing one test policy in a small repo. We're only syncing policies with OPAL, not data.
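Since the standalone client is a Python (gunicorn/uvicorn) process, one cheap way to see where its heap is growing between two points in time is the stdlib `tracemalloc` module. This is a generic sketch, not OPAL code; the simulated allocation stands in for whatever the client does between snapshots:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# Simulate the suspected growth (in the real client this would just be
# normal operation for a few minutes between snapshots):
retained = [bytearray(1024) for _ in range(1000)]

current = tracemalloc.take_snapshot()
# Diff against the baseline, grouped by allocating source line:
top_stats = current.compare_to(baseline, "lineno")
for stat in top_stats[:5]:
    print(stat)
```

In a leaking process, re-running the snapshot diff periodically should show the same source lines climbing in `size_diff` each time.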
Hi @grantmcconnaughey - we really want to investigate deeper and solve this properly. Would you mind sharing with us a redacted / minified version of your kubernetes config with OPAL so we can run it locally and investigate? You can leave anything non-OPAL outside obviously :)
Our production deployment is currently ECS on AWS, so any help reproducing this will be greatly appreciated! :)
You bet, @asafc! I'm working on getting that info for you.
@asafc Here is some redacted Kubernetes config. Hopefully that helps!
App K8s config:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: '2021-10-18T18:36:20Z'
  generateName: appname-web-12345abc-
  labels:
    app: web
    pod-template-hash: 12345abc
    release: appname
  name: appname-web-12345abc-6gfqb
  namespace: appname
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: appname-web-12345abc
      uid: 3a9aa7e6-dd81-4ba7-8911-74fc0868609a
  resourceVersion: '12345'
  selfLink: /api/v1/namespaces/appname/pods/appname-web-12345-6gfqb
  uid: 0e0c4aee-0136-4f49-b2fd-de899ab78798
spec:
  containers:
    - env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: COMMIT_SHA
          value: 12345
        - name: ENV
          value: production
        - name: PORT
          value: '8000'
        - name: DOMAIN
          value: foo.com
        - name: CDN_DOMAIN
          value: cdn.foo.com
        - name: APPNAME_IDENTITY_URL
          value: 'https://foo.com'
        - name: APPNAME_OPA_BASE_URL
          value: 'http://localhost:8181'
      envFrom:
        - secretRef:
            name: secretname
      image: '12345.dkr.ecr.us-east-1.amazonaws.com/appname:12345'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /liveness
          port: 8000
          scheme: HTTP
        initialDelaySeconds: 15
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      name: web
      ports:
        - containerPort: 8000
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /readiness
          port: 8000
          scheme: HTTP
        initialDelaySeconds: 20
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 5
      resources:
        limits:
          memory: 3Gi
        requests:
          cpu: 300m
          memory: 3Gi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
    - args:
        - run
        - '--server'
        - '--addr'
        - ':8181'
      image: 'openpolicyagent/opa:0.31.0'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /
          port: 8181
          scheme: HTTP
        initialDelaySeconds: 5
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      name: opa
      ports:
        - containerPort: 8181
          name: opa
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /health?bundle=true
          port: 8181
          scheme: HTTP
        initialDelaySeconds: 5
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 128Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
    - command:
        - sh
        - '-c'
        - '/usr/wait-for.sh http://localhost:7002 --timeout=20 -- /start.sh'
      env:
        - name: OPAL_POLICY_STORE_URL
          value: 'http://localhost:8181/v1'
        - name: OPAL_SERVER_URL
          value: 'http://localhost:7002'
        - name: OPAL_DATA_UPDATER_ENABLED
          value: '0'
        - name: OPAL_KEEP_ALIVE_INTERVAL
          value: '10'
        - name: OPAL_LOG_LEVEL
          value: DEBUG
        - name: OPAL_POLICY_SUBSCRIPTION_DIRS
          value: policies
      image: 'authorizon/opal-client-standalone:0.1.12'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /healthcheck
          port: 7000
          scheme: HTTP
        initialDelaySeconds: 15
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      name: opal-client
      ports:
        - containerPort: 7000
          name: opal-client
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /healthcheck
          port: 7000
          scheme: HTTP
        initialDelaySeconds: 20
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          memory: 256Mi
        requests:
          cpu: 100m
          memory: 256Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
    - env:
        - name: OPAL_POLICY_REPO_MAIN_BRANCH
          value: main
        - name: OPAL_LOG_LEVEL
          value: DEBUG
        - name: STAKATER_OPAL_COMMON_SECRET
          value: b2622a1c376c44d6cb8d141ffa29e03437d1a49f
      envFrom:
        - secretRef:
            name: secretname
      image: 'authorizon/opal-server:0.1.12'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /healthcheck
          port: 7002
          scheme: HTTP
        initialDelaySeconds: 15
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      name: opal-server
      ports:
        - containerPort: 7002
          name: opal-server
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /healthcheck
          port: 7002
          scheme: HTTP
        initialDelaySeconds: 20
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 128Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
    - args:
        - '--app.mapping-config=/tmp/app_mapping.yml'
        - '--web.listen-address=:9102'
      image: 'prom/app-exporter:v0.18.0'
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /metrics
          port: exporter
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      name: app-exporter
      ports:
        - containerPort: 9125
          name: importer
          protocol: TCP
        - containerPort: 9102
          name: exporter
          protocol: TCP
      readinessProbe:
        failureThreshold: 3
        httpGet:
          path: /metrics
          port: exporter
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          cpu: 100m
          memory: 72Mi
        requests:
          cpu: 20m
          memory: 72Mi
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /tmp/app_mapping.yml
          name: app-mapping-yml
          subPath: app_mapping.yml
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-q6jn6
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-1-1-1-1.ec2.internal
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - configMap:
        defaultMode: 420
        items:
          - key: app_mapping.yml
            path: app_mapping.yml
        name: appname-web-app-exporter
      name: app-mapping-yml
```
Hey @grantmcconnaughey, this is super helpful! We are on it! :)
Hey @grantmcconnaughey, just letting you know we did not forget about this issue :)
We are working on a few time-critical bugs first to enable one of our users to go to production with OPAL this week. They indicated some urgency with their timeline so we are prioritizing a bit.
If you want, you can reach out to me here or on Slack to communicate your time constraints and I'll do my best to accommodate. In any case, I will let you know how this issue progresses :)
Thanks for the heads up!
Sure @grantmcconnaughey :)
I have started to investigate. I have a few leads and some follow-up questions about your setup:
- How often is the policy repo being updated? Are you syncing with a webhook? Polling? No syncing at all?
- You are running OPAL standalone: is OPA already running when you start the OPAL client?
- Is OPA ever restarted during the OPAL client's runtime?
- Are you still running 0.1.12? Did you try running the latest version?
- Did you try running OPAL client non-standalone (with built-in OPA) and see if the issue persists?
In any case, I am trying to zero in with a profiler and will update as soon as I have something concrete :)
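As a general aside on profiling long-lived asyncio services like the OPAL client: a steadily climbing task count (e.g. subscriptions or reconnect handlers that are spawned but never awaited or cancelled) is a common leak pattern, and it is cheap to check. A minimal illustration, not OPAL code:

```python
import asyncio

async def report_task_count():
    """Count live asyncio tasks. In a long-lived pub/sub client, a number
    that climbs between healthchecks often points at leaked tasks."""
    return len(asyncio.all_tasks())

async def main():
    # Simulate a client that spawns background work and never cleans it up:
    sleepers = [asyncio.create_task(asyncio.sleep(10)) for _ in range(5)]
    count = await report_task_count()
    for task in sleepers:
        task.cancel()  # clean up so this demo exits promptly
    return count

print(asyncio.run(main()))
```

Exposing such a counter on a debug endpoint makes it easy to correlate task growth with the memory curve.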
Closing as stale; possibly fixed by https://github.com/permitio/opal/issues/274.