opal icon indicating copy to clipboard operation
opal copied to clipboard

Wait longer for inline OPA readiness

Open michivi opened this issue 1 year ago • 2 comments

(not really a bug, but I didn't how to label this)

Describe the bug On some occasions, the inline OPA agent the OPAL client is spawning needs more than 4s to start. Such occasions are for example the initial startup of a VM in which we've configured the OPAL client to start automatically at boot, among other demanding services. This causes the parent OPAL client to consider communications with OPA a failure after 2 attempts in a 4 s interval and does not retry until it actually needs to. This results in the OPA instance not being initialized with the initial policy / data in the initial startup of the VM.

To Reproduce This phenomenon is rather random and I guess very specific to our configuration so I'll try to describe it as best I can.

We have a VM with the following characteristics:

  • Several of our applications are deployed in it and configured to run on startup
  • The OPAL Client and OPA are deployed on the machine.
  • The OPAL client is configured to run at startup and to run the inline OPA.
  • The VM is configured to do various thing during initial startup (so only once during the VM lifecycle).

On the very first startup of the machine, we have the following logs. Some lines were removed for conciseness, and some sensitive info were changed such as server names. It shouldn't change the comprehension of the core issue, but do let me know if it does.

-- Starting OPAL client (with uvicron) --
2022-08-04T12:31:05.093437+0000 | opal_common.fetcher.fetcher_register    | INFO  | Loading FetcherProvider 'FastApiRpcFetchProvider' found at: <class 'opal_common.fetcher.providers.fastapi_rpc_fetch_provider.FastApiRpcFetchProvider'>
2022-08-04T12:31:05.093751+0000 | opal_common.fetcher.fetcher_register    | INFO  | Loading FetcherProvider 'HttpFetchProvider' found at: <class 'opal_common.fetcher.providers.http_fetch_provider.HttpFetchProvider'>
2022-08-04T12:31:05.093884+0000 | opal_common.fetcher.fetcher_register    | INFO  | Fetcher Register loaded
2022-08-04T12:31:05.095514+0000 | opal_client.callbacks.register          | INFO  | Callbacks register loaded
2022-08-04T12:31:05.095826+0000 | opal_client.client                      | INFO  | API authentication disabled (public encryption key was not provided)
2022-08-04T12:31:05.123967+0000 | opal_client.opa.runner                  | INFO  | Launching opa runner
2022-08-04T12:31:05.125077+0000 | opal_client.opa.runner                  | INFO  | Running OPA inline: opa run --server --addr=:8181 --authentication=off --authorization=off --log-level=info
2022-08-04T12:31:06.130890+0000 | opal_client.opa.runner                  | INFO  | Running OPA initial start callbacks
2022-08-04T12:31:06.131466+0000 | opal_client.data.updater                | INFO  | Launching data updater
2022-08-04T12:31:06.131766+0000 | opal_client.policy.updater              | INFO  | Launching policy updater
2022-08-04T12:31:06.131889+0000 | opal_client.data.updater                | INFO  | Subscribing to topics: ['policy_data/test']
2022-08-04T12:31:06.132246+0000 | opal_client.policy.updater              | INFO  | Subscribing to topics: ['policy:test/policies']
2022-08-04T12:31:06.132581+0000 | fastapi_websocket_pubsub.pub_sub_client | INFO  | Trying to connect to Pub/Sub server - wss://authz.server.local/ws
2022-08-04T12:31:06.133197+0000 | fastapi_websocket_rpc.websocket_rpc_c...| INFO  | Trying server - wss://authz.server.local/ws
2022-08-04T12:31:06.136757+0000 | fastapi_websocket_pubsub.pub_sub_client | INFO  | Trying to connect to Pub/Sub server - wss://authz.server.local/ws
2022-08-04T12:31:06.137473+0000 | fastapi_websocket_rpc.websocket_rpc_c...| INFO  | Trying server - wss://authz.server.local/ws
2022-08-04T12:31:06.176848+0000 | opal_client.policy.updater              | INFO  | Connected to server
2022-08-04T12:31:06.177363+0000 | opal_client.policy.updater              | INFO  | Refetching policy code (full bundle)
2022-08-04T12:31:06.181224+0000 | opal_client.data.updater                | INFO  | Connected to server
2022-08-04T12:31:06.181394+0000 | opal_client.data.updater                | INFO  | Performing data configuration, reason: Initial load
2022-08-04T12:31:06.181481+0000 | opal_client.data.updater                | INFO  | Getting data-sources configuration from 'https://authz.server.local/data/config'
2022-08-04T12:31:06.226193+0000 | opal_client.data.updater                | INFO  | Triggering data update with id: af256f421d114ca3a0041612abd1b93c
2022-08-04T12:31:06.226811+0000 | opal_client.data.updater                | INFO  | Fetching policy data
2022-08-04T12:31:06.227474+0000 | opal_client.policy.updater              | INFO  | Got policy bundle with 1 rego files, 0 data files, commit hash: '[...]'
2022-08-04T12:31:06.228748+0000 | opal_client.data.fetcher                | INFO  | Fetching data from url: http://localhost:8080/data1
2022-08-04T12:31:06.228989+0000 | opal_client.data.fetcher                | INFO  | Fetching data from url: http://localhost:8080/data2
2022-08-04T12:31:06.232986+0000 | fastapi_websocket_pubsub.pub_sub_client | INFO  | Connected to PubSub server wss://authz.server.local/ws
2022-08-04T12:31:06.235244+0000 | opal_client.policy_store.opa_client     |WARNING | Opa connection error: ClientConnectorError(ConnectionKey(host='localhost', port=8181, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), ConnectionRefusedError(111, 'Connection refused'))
2022-08-04T12:31:06.312655+0000 | opal_client.data.updater                | INFO  | Saving fetched data to policy-store: source url='http://localhost:8080/data1', destination path='/data'
2022-08-04T12:31:06.314126+0000 | opal_client.policy_store.opa_client     |WARNING | Opa connection error: ClientConnectorError(ConnectionKey(host='localhost', port=8181, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), ConnectionRefusedError(111, 'Connection refused'))
2022-08-04T12:31:08.237541+0000 | opal_client.policy_store.opa_client     |WARNING | Opa connection error: ClientConnectorError(ConnectionKey(host='localhost', port=8181, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), ConnectionRefusedError(111, 'Connection refused'))
2022-08-04T12:31:08.238108+0000 | opal_client...base_policy_store_client  |ERROR  | OPA transaction failed, transaction id=[...], actions=['set_policies'], error=RetryError(<Future at 0x7fde641d0cd0 state=finished raised ClientConnectorError>)
2022-08-04T12:31:08.238549+0000 | fastapi_websocket_rpc.websocket_rpc_c...|ERROR  | RPC Error
Traceback (most recent call last):
[...]
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
[...]
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:8181 ssl:default [Connection refused]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
[...]
tenacity.RetryError: RetryError[<Future at 0x7fde641d0cd0 state=finished raised ClientConnectorError>]
2022-08-04T12:31:08.787553+0000 | opal_client.policy_store.opa_client     |WARNING | Opa connection error: ClientConnectorError(ConnectionKey(host='localhost', port=8181, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), ConnectionRefusedError(111, 'Connection refused'))
2022-08-04T12:31:08.787887+0000 | opal_client.data.updater                |ERROR  | Failed to save data update to policy-store
Traceback (most recent call last):
[...]
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
[...]
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:8181 ssl:default [Connection refused]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
[...]
tenacity.RetryError: RetryError[<Future at 0x7fde641d11e0 state=finished raised ClientConnectorError>]
2022-08-04T12:31:09.084300+0000 | opal_client...base_policy_store_client  |ERROR  | OPA transaction failed, transaction id=af256f421d114ca3a0041612abd1b93c, actions=['set_policy_data'], error=RetryError(<Future at 0x7fde641d11e0 state=finished raised ClientConnectorError>)
2022-08-04T12:31:09.084546+0000 | asyncio.runners                         |ERROR  | Task exception was never retrieved
future: <Task finished name='Task-51' coro=<DataUpdater.update_policy_data() done, defined at /home/opal/.local/lib/python3.10/site-packages/opal_client/data/updater.py:253> exception=RetryError(<Future at 0x7fde641d11e0 state=finished raised ClientConnectorError>)>
Traceback (most recent call last):
[...]
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
[...]
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:8181 ssl:default [Connection refused]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
[...]
tenacity.RetryError: RetryError[<Future at 0x7fde641d11e0 state=finished raised ClientConnectorError>]
2022-08-04T12:31:09.179509+0000 | fastapi_websocket_rpc.websocket_rpc_c...| INFO  | Trying server - wss://authz.server.local/ws
2022-08-04T12:31:09.204295+0000 | opal_client.policy.updater              | INFO  | Connected to server
2022-08-04T12:31:09.204807+0000 | opal_client.policy.updater              | INFO  | Refetching policy code (full bundle)
2022-08-04T12:31:09.256880+0000 | opal_client.policy.updater              | INFO  | Got policy bundle with 1 rego files, 0 data files, commit hash: '[...]'
2022-08-04T12:31:09.258653+0000 | opal_client.policy_store.opa_client     |WARNING | Opa connection error: ClientConnectorError(ConnectionKey(host='localhost', port=8181, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None), ConnectionRefusedError(111, 'Connection refused'))
2022-08-04T12:31:09.755075+0000 | opal_client.opa.logger                  | INFO  | Initializing server. {"addrs": [":8181"], "diagnostic-addrs": [], "time": "2022-08-04T12:31:09Z"}
2022-08-04T12:31:11.260334+0000 | opal_client.opa.logger                  | INFO  | Received request.    {"client_addr": "127.0.0.1:57216", "req_id": 1, "req_method": "GET", "req_path": "/v1/policies", "time": "2022-08-04T12:31:11Z"}
2022-08-04T12:31:11.261571+0000 | opal_client.opa.logger                  | INFO  | Sent response.       {"client_addr": "127.0.0.1:57216", "req_id": 1, "req_method": "GET", "req_path": "/v1/policies", "resp_bytes": 13, "resp_duration": 1.181539, "resp_status": 200, "time": "2022-08-04T12:31:11Z"}
2022-08-04T12:31:11.264395+0000 | opal_client.opa.logger                  | INFO  | Received request.    {"client_addr": "127.0.0.1:57218", "req_id": 2, "req_method": "PUT", "req_path": "/v1/policies/test/policies/test.rego", "time": "2022-08-04T12:31:11Z"}
2022-08-04T12:31:11.269712+0000 | opal_client.opa.logger                  | INFO  | Sent response.       {"client_addr": "127.0.0.1:57218", "req_id": 2, "req_method": "PUT", "req_path": "/v1/policies/test/policies/test.rego", "resp_bytes": 2, "resp_duration": 5.073848, "resp_status": 200, "time": "2022-08-04T12:31:11Z"}
2022-08-04T12:31:11.270472+0000 | fastapi_websocket_pubsub.pub_sub_client | INFO  | Connected to PubSub server wss://authz.server.local/ws
2022-08-04T12:43:12.093722+0000 | opal_client.opa.logger                  | INFO  | Received request.    {"client_addr": "127.0.0.1:57220", "req_id": 3, "req_method": "POST", "req_path": "/v1/data/test/test/allow", "time": "2022-08-04T12:43:12Z"}
2022-08-04T12:43:12.095149+0000 | opal_client.opa.logger                  | INFO  | Sent response.       {"client_addr": "127.0.0.1:57220", "req_id": 3, "req_method": "POST", "req_path": "/v1/data/test/test/allow", "resp_bytes": 16, "resp_duration": 1.446647, "resp_status": 200, "time": "2022-08-04T12:43:12Z"}

The surprising part is that OPA is ready at 12:31:09 but was spawned at 12:31:05 so 4s later. In this scenario though, we can see that in the end OPAL has successfully sent to OPA the policy, but it did not retry to send the data.

Expected behavior We would have expected OPAL to either keep retrying more times or after a longer time (at least 3 times with exponential back-off of 5 s for example), or to suspend its initialization until it received confirmation from OPA (in inline mode) that it is fully started (Initialized server. out in the log).

I believe the current retry configuration is here: https://github.com/permitio/opal/blob/0.1.21/opal_client/policy_store/opa_client.py#L25

So 2 attempts with a fixed 2s interval.

In our use case, when waiting a few seconds before starting the OPAL client, we no longer have the issue:

2022-08-04T13:24:26.472802+0000 | opal_common.fetcher.fetcher_register    | INFO  | Loading FetcherProvider 'FastApiRpcFetchProvider' found at: <class 'opal_common.fetcher.providers.fastapi_rpc_fetch_provider.FastApiRpcFetchProvider'>
2022-08-04T13:24:26.473293+0000 | opal_common.fetcher.fetcher_register    | INFO  | Loading FetcherProvider 'HttpFetchProvider' found at: <class 'opal_common.fetcher.providers.http_fetch_provider.HttpFetchProvider'>
2022-08-04T13:24:26.473523+0000 | opal_common.fetcher.fetcher_register    | INFO  | Fetcher Register loaded
2022-08-04T13:24:26.475080+0000 | opal_client.callbacks.register          | INFO  | Callbacks register loaded
2022-08-04T13:24:26.475472+0000 | opal_client.client                      | INFO  | API authentication disabled (public encryption key was not provided)
2022-08-04T13:24:26.499465+0000 | opal_client.opa.runner                  | INFO  | Launching opa runner
2022-08-04T13:24:26.500735+0000 | opal_client.opa.runner                  | INFO  | Running OPA inline: opa run --server --addr=:8181 --authentication=off --authorization=off --log-level=info
2022-08-04T13:24:26.520453+0000 | opal_client.opa.logger                  | INFO  | Initializing server. {"addrs": [":8181"], "diagnostic-addrs": [], "time": "2022-08-04T13:24:26Z"}
2022-08-04T13:24:27.507117+0000 | opal_client.opa.runner                  | INFO  | Running OPA initial start callbacks
2022-08-04T13:24:27.507994+0000 | opal_client.data.updater                | INFO  | Launching data updater
2022-08-04T13:24:27.508324+0000 | opal_client.policy.updater              | INFO  | Launching policy updater

OPA is almost instantaneously available.

It's probable that the waiting time we added allow the other services to finish up initializing, allowing OPA to start up in time. Having the possibility to change the retry configuration may allow us to work around the issue. Waiting for OPA to be fully available might be another way.

What do you think about this? Not really an issue, but still annoying in some rare occasions. Do you see any way around this? Thanks for your help 🙂

Versions

  • Ubuntu 22.04.1
  • Python 3.10.4 as packaged by the distribution
  • OPAL client version 0.1.21 installed via pip
  • OPA version 0.43.0

michivi avatar Aug 04 '22 17:08 michivi

Hi @michivi ! Thanks for the detailed report- your desired outcome here makes total sense to me, and I think you pointed out the right spot for going about fixing this (the retry config).

I think increasing the amount of default retries makes sense (@asafc correct me if I'm missing something),

But better yet we can expose this into configurable value here e.g. POLICY_STORE_CONN_RETRY and use confi.model() (kinda like here ) to define the retry modules expected schema. That way everyone can tweak this to their needs .

@michivi - would you be interested in implement this with a PR ?

orweis avatar Aug 04 '22 17:08 orweis

Hi @orweis #282 is a basic implementation attempt. Python not being my primary language, I guess it could use some refinement 🙂

michivi avatar Aug 07 '22 18:08 michivi

It seems that #282 fixes the issue (through OPAL 0.2.1): we still have one or two initial "Connection refused" messages, but the retry policy kicks in and the client eventually connects to OPA and updates the policy successfully. Never had any issue during our 3 or 4 tests. Thanks!

michivi avatar Sep 06 '22 09:09 michivi

@michivi Thank you for your contribution :)

orweis avatar Sep 06 '22 11:09 orweis