opal opal client does not always update policy data

Describe the bug opal client does not always update policy data in a multiple instance setup.

2 opal servers with postgres as broadcast channel
2 opal clients
both clients and servers are deployed in k8s

To Reproduce

deploy opal server and client to k8s
post policy data update to opal server's /config/data endpoint
observe one of the client instances did not receive policy data update

Please also include:

server 1 log

2022-05-09T18:45:01.334815+0000 | opal_server.data.data_update_publisher | INFO | [11] Publishing data update to topics: ['policy_data'], reason: None, entries: [('REDACTED', 'PUT', '/')] 2022-05-09T18:45:01.335203+0000 | uvicorn.protocols.http.httptools_impl | INFO | 10.188.120.123:50826 - "POST /data/config HTTP/1.1" 200 2022-05-09T18:45:01.336200+0000 | fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': '3aac3d2d6145408dbd0193ed891eba14', 'subscriber_id': '22144c1cfb224f33a121b1b165af6256', 'topic': 'EventNotifier_ALL_TOPICS', 'callback': <bound method EventBroadcaster.broadcast_notifications of <fastapi_websocket_pubsub.event_broadcaster.EventBroadcaster object at 0x7f62ffaf99a0>>, 'notifier_id': None} 2022-05-09T18:45:01.336509+0000 | fastapi_websocket_pubsub.event_notifier | INFO | calling subscription callbacks: topic=policy_data (ALL_TOPICS), subscription_id=3aac3d2d6145408dbd0193ed891eba14, subscriber_id=22144c1cfb224f33a121b1b165af6256 2022-05-09T18:45:01.336668+0000 | fastapi_websocket_pubsub.event_broadc...| INFO | Broadcasting incoming event: {'topic': 'policy_data', 'notifier_id': '22144c1cfb224f33a121b1b165af6256'} 2022-05-09T18:45:01.604053+0000 | fastapi_websocket_pubsub.event_notifier | INFO | Removing Subscription of topic='EventNotifier_ALL_TOPICS' for subscriber=22144c1cfb224f33a121b1b165af6256

server 2 log

2022-05-09T18:45:01.602723+0000 | fastapi_websocket_pubsub.event_broadc...| INFO | Handling incoming broadcast event: {'topics': ['policy_data'], 'src': '22144c1cfb224f33a121b1b165af6256'} 2022-05-09T18:45:01.603099+0000 | fastapi_websocket_pubsub.event_notifier | INFO | calling subscription callbacks: topic=policy_data, subscription_id=5240d9ad555e4f06900c1107ac85a885, subscriber_id=fa512a4afe2741a494067e4bdddad3a4 2022-05-09T18:45:01.603381+0000 | fastapi_websocket_pubsub.rpc_event_me...| INFO | Notifying other side: {'subscriber_id': 'fa512a4afe2741a494067e4bdddad3a4', 'subscription_id': '5240d9ad555e4f06900c1107ac85a885', 'topic': 'policy_data'}

client 1 log

2022-05-09T18:45:01.605106+0000 | opal_client.data.rpc | INFO | Received notification of event: policy_data 2022-05-09T18:45:01.605271+0000 | opal_client.data.updater | INFO | Updating policy data, reason: None 2022-05-09T18:45:01.605461+0000 | opal_client.data.updater | INFO | Triggering data update with id: 980b954b0e054d0f98bc5a12a147ade8 2022-05-09T18:45:01.605552+0000 | opal_client.data.updater | INFO | Fetching policy data 2022-05-09T18:45:01.605663+0000 | opal_client.data.fetcher | INFO | Fetching data from url: REDACTED 2022-05-09T18:45:01.693665+0000 | opal_client.data.updater | INFO | Saving fetched data to policy-store: source url='REDACTED', destination path='/' 2022-05-09T18:45:01.695160+0000 | opal_client.opa.logger | INFO | Received request. PUT /v1/data 2022-05-09T18:45:01.695783+0000 | opal_client.opa.logger | INFO | Sent response. PUT /v1/data -> 204 2022-05-09T18:45:01.696268+0000 | opal_client.policy_store.opa_client | INFO | processing store transaction: {'id': '980b954b0e054d0f98bc5a12a147ade8', 'actions': ['set_policy_data'], 'transaction_type': <TransactionType.data: 'data'>, 'success': True, 'error': '', 'creation_time': '2022-05-09T18:45:01.693537', 'end_time': '2022-05-09T18:45:01.696087', 'remotes_status': [{'remote_url': 'REDACTED', 'succeed': True, 'error': 'None'}]} 2022-05-09T18:45:01.696403+0000 | opal_client.policy_store.opa_client | INFO | persisting health check policy: ready=true, healthy=true 2022-05-09T18:45:01.696463+0000 | opal_client.policy_store.opa_client | INFO | Policy and data statistics: policy: (successful 1, failed 0); data: (successful 367, failed 1)

there's nothing in client 2
server config

OPAL_POLICY_REPO_URL=REDACTED OPAL_POLICY_REPO_MAIN_BRANCH=REDACTED OPAL_POLICY_REPO_SSH_KEY=REDACTED OPAL_POLICY_REPO_POLLING_INTERVAL=30 UVICORN_NUM_WORKERS=4 OPAL_DATA_CONFIG_SOURCES="{"config":{"entries":[{"url":"REDACTED/policy-data","topics":["policy_data"]}]}}" OPAL_BROADCAST_URI=REDACTED

client config

OPAL_SERVER_URL=REDACTED OPAL_INLINE_OPA_LOG_FORMAT=http OPAL_OPA_HEALTH_CHECK_POLICY_ENABLED=true

Expected behavior both opal clients pick up the policy data update on every request

Screenshots N/A

OPAL version

Version: 0.1.21

May 09 '22 19:05 tyowen

Hi @tyowen! welcome to the OPAL community :)

I am missing some data in your report, please provide the following details:

full OPAL server config (you can redact secrets but there could be many things that went wrong: two of the most common issues are not configuring the broadcast channel correctly or not configuring OPAL_DATA_CONFIG_SOURCES correctly)
full OPAL server logs (except secrets)
did you immediately sent the data update after OPAL server was up? (did you wait for it to stabilize - i.e: finish cloning the policy).
did you wait a long time to send the data update? the broadcaster might fail if your postgres is configured to terminate idle sessions quickly.
does it work consistently with 1 worker and 1 server pod?

May 09 '22 20:05 asafc

hi @asafc thanks for the reply. i need to clarify something before answering your questions. the behaviour is not consistent. when i post data update, sometimes both clients pick up the update, other times only one does.

both server and client's configs were posted in the report. was there anything else you were looking for?
the full logs were pretty large. was there any particular section you were looking for? like when the application started?
i did not immediately send the data update. and the frequency for data updates is pretty low ~15min
broadcaster succeeded based on the logs i posted
1 client and 1 server works

hope these help. thanks

May 10 '22 01:05 tyowen

Hi @tyowen, i cannot be sure without the logs (you can send them to me privately on slack if that is an issue) but i have a strong suspicion.

We have observed a bug with the broadcaster library that happens only when you are running with multiple workers and with postgresql as a backbone, where the DB session used for LISTEN/NOTIFY terminates on the postgres side and does not terminate on the library side. It happens rarely, only with a certain configuration of Postgres and usually after a certain time passes from booting the OPAL server.

We are currently working on a fix to that issue :)

May 10 '22 06:05 asafc

Hello, Any update on this bug? I have the exact same issue. Using your helmchart, i am deploying 3 clients (having all the same topic), 3 postgresql and 2 servers on a kubernetes infra. When i trigger the POST /data/config, sometime 1-2 or 3 of the clients are notify. Until now i was working on staging with 1 replicate on each component, but for a prod deployment it's a NOGO if the cluster mode isn't working properly. I will investigate. Do you have more clue/suspicion to share @asafc ?

Oct 03 '22 13:10 loan75

HI @loan75 - I believe this issue was fixed on most recent version of OPAL (> 0.2.0), and this ticket shoudl actually be closed. @asafc correct me if I'm wrong

Are you setting up a broadcaster / Backbone pub/sub, and if so which one (Kafka, Postgres, Redis) ?

Oct 03 '22 19:10 orweis

Hi @loan75,

this should be fixed in the latest version of OPAL, can you please add this config to OPAL server and let us know if it solved your issue:

OPAL_BROADCAST_CONN_LOSS_BUGFIX_EXPERIMENT_ENABLED=true

are you using postgresql as broadcast backbone?

Oct 03 '22 20:10 asafc

Hello asafc, At the moment its using postgresql as broadcast backbone (default setup in the helmchart, so i expect it to be the most commun setup). Ok will add this ENV and see how it behave.

Oct 03 '22 21:10 loan75

Hi, i added this Env but i got the same behavior. Opal is in 0.2.1, i check and the env/fix is present in the opal server. I will put all that in debug and investigate during the next few days.

Oct 04 '22 15:10 loan75

Hi @asafc was this issue ever resolved? The websocket connections keep dropping on v0.18. Want to upgrade to the right version if so.

Feb 12 '24 19:02 yadunandan004

This is almost definitely resolved by @roekatz , i am not sure if he merged it yet to the main branch. The fix is either already merged, or exists in this PR that is about to get merged: https://github.com/permitio/opal/pull/546

cc @roekatz

Feb 13 '24 08:02 asafc

There should be no issues with broadcasting updates across instances in recent OPAL releases.

@asafc The fixes introduced by that PR are of a more nuanced nature. @yadunandan004 Yes I would say you should definitely upgrade :) For now I'm closing the ticket - If you continue to experience issues with WebSocket connections, we would greatly appreciate it if you could report a new issue.

Feb 13 '24 08:02 roekatz