opal icon indicating copy to clipboard operation
opal copied to clipboard

opal client does not always update policy data

Open tyowen opened this issue 3 years ago • 8 comments

Describe the bug opal client does not always update policy data in a multiple instance setup.

  • 2 opal servers with postgres as broadcast channel
  • 2 opal clients
  • both clients and servers are deployed in k8s

To Reproduce

  • deploy opal server and client to k8s
  • post policy data update to opal server's /config/data endpoint
  • observe one of the client instances did not receive policy data update

Please also include:

  • server 1 log

2022-05-09T18:45:01.334815+0000 | opal_server.data.data_update_publisher | INFO | [11] Publishing data update to topics: ['policy_data'], reason: None, entries: [('REDACTED', 'PUT', '/')] 2022-05-09T18:45:01.335203+0000 | uvicorn.protocols.http.httptools_impl | INFO | 10.188.120.123:50826 - "POST /data/config HTTP/1.1" 200 2022-05-09T18:45:01.336200+0000 | fastapi_websocket_pubsub.event_notifier | INFO | New subscription {'id': '3aac3d2d6145408dbd0193ed891eba14', 'subscriber_id': '22144c1cfb224f33a121b1b165af6256', 'topic': 'EventNotifier_ALL_TOPICS', 'callback': <bound method EventBroadcaster.broadcast_notifications of <fastapi_websocket_pubsub.event_broadcaster.EventBroadcaster object at 0x7f62ffaf99a0>>, 'notifier_id': None} 2022-05-09T18:45:01.336509+0000 | fastapi_websocket_pubsub.event_notifier | INFO | calling subscription callbacks: topic=policy_data (ALL_TOPICS), subscription_id=3aac3d2d6145408dbd0193ed891eba14, subscriber_id=22144c1cfb224f33a121b1b165af6256 2022-05-09T18:45:01.336668+0000 | fastapi_websocket_pubsub.event_broadc...| INFO | Broadcasting incoming event: {'topic': 'policy_data', 'notifier_id': '22144c1cfb224f33a121b1b165af6256'} 2022-05-09T18:45:01.604053+0000 | fastapi_websocket_pubsub.event_notifier | INFO | Removing Subscription of topic='EventNotifier_ALL_TOPICS' for subscriber=22144c1cfb224f33a121b1b165af6256

  • server 2 log

2022-05-09T18:45:01.602723+0000 | fastapi_websocket_pubsub.event_broadc...| INFO | Handling incoming broadcast event: {'topics': ['policy_data'], 'src': '22144c1cfb224f33a121b1b165af6256'} 2022-05-09T18:45:01.603099+0000 | fastapi_websocket_pubsub.event_notifier | INFO | calling subscription callbacks: topic=policy_data, subscription_id=5240d9ad555e4f06900c1107ac85a885, subscriber_id=fa512a4afe2741a494067e4bdddad3a4 2022-05-09T18:45:01.603381+0000 | fastapi_websocket_pubsub.rpc_event_me...| INFO | Notifying other side: {'subscriber_id': 'fa512a4afe2741a494067e4bdddad3a4', 'subscription_id': '5240d9ad555e4f06900c1107ac85a885', 'topic': 'policy_data'}

  • client 1 log

2022-05-09T18:45:01.605106+0000 | opal_client.data.rpc | INFO | Received notification of event: policy_data 2022-05-09T18:45:01.605271+0000 | opal_client.data.updater | INFO | Updating policy data, reason: None 2022-05-09T18:45:01.605461+0000 | opal_client.data.updater | INFO | Triggering data update with id: 980b954b0e054d0f98bc5a12a147ade8 2022-05-09T18:45:01.605552+0000 | opal_client.data.updater | INFO | Fetching policy data 2022-05-09T18:45:01.605663+0000 | opal_client.data.fetcher | INFO | Fetching data from url: REDACTED 2022-05-09T18:45:01.693665+0000 | opal_client.data.updater | INFO | Saving fetched data to policy-store: source url='REDACTED', destination path='/' 2022-05-09T18:45:01.695160+0000 | opal_client.opa.logger | INFO | Received request. PUT /v1/data 2022-05-09T18:45:01.695783+0000 | opal_client.opa.logger | INFO | Sent response. PUT /v1/data -> 204 2022-05-09T18:45:01.696268+0000 | opal_client.policy_store.opa_client | INFO | processing store transaction: {'id': '980b954b0e054d0f98bc5a12a147ade8', 'actions': ['set_policy_data'], 'transaction_type': <TransactionType.data: 'data'>, 'success': True, 'error': '', 'creation_time': '2022-05-09T18:45:01.693537', 'end_time': '2022-05-09T18:45:01.696087', 'remotes_status': [{'remote_url': 'REDACTED', 'succeed': True, 'error': 'None'}]} 2022-05-09T18:45:01.696403+0000 | opal_client.policy_store.opa_client | INFO | persisting health check policy: ready=true, healthy=true 2022-05-09T18:45:01.696463+0000 | opal_client.policy_store.opa_client | INFO | Policy and data statistics: policy: (successful 1, failed 0); data: (successful 367, failed 1)

  • there's nothing in client 2
  • server config

OPAL_POLICY_REPO_URL=REDACTED OPAL_POLICY_REPO_MAIN_BRANCH=REDACTED OPAL_POLICY_REPO_SSH_KEY=REDACTED OPAL_POLICY_REPO_POLLING_INTERVAL=30 UVICORN_NUM_WORKERS=4 OPAL_DATA_CONFIG_SOURCES="{"config":{"entries":[{"url":"REDACTED/policy-data","topics":["policy_data"]}]}}" OPAL_BROADCAST_URI=REDACTED

  • client config

OPAL_SERVER_URL=REDACTED OPAL_INLINE_OPA_LOG_FORMAT=http OPAL_OPA_HEALTH_CHECK_POLICY_ENABLED=true

Expected behavior both opal clients pick up the policy data update on every request

Screenshots N/A

OPAL version

  • Version: 0.1.21

tyowen avatar May 09 '22 19:05 tyowen

Hi @tyowen! welcome to the OPAL community :)

I am missing some data in your report, please provide the following details:

  • full OPAL server config (you can redact secrets but there could be many things that went wrong: two of the most common issues are not configuring the broadcast channel correctly or not configuring OPAL_DATA_CONFIG_SOURCES correctly)
  • full OPAL server logs (except secrets)
  • did you immediately sent the data update after OPAL server was up? (did you wait for it to stabilize - i.e: finish cloning the policy).
  • did you wait a long time to send the data update? the broadcaster might fail if your postgres is configured to terminate idle sessions quickly.
  • does it work consistently with 1 worker and 1 server pod?

asafc avatar May 09 '22 20:05 asafc

hi @asafc thanks for the reply. i need to clarify something before answering your questions. the behaviour is not consistent. when i post data update, sometimes both clients pick up the update, other times only one does.

  • both server and client's configs were posted in the report. was there anything else you were looking for?
  • the full logs were pretty large. was there any particular section you were looking for? like when the application started?
  • i did not immediately send the data update. and the frequency for data updates is pretty low ~15min
  • broadcaster succeeded based on the logs i posted
  • 1 client and 1 server works

hope these help. thanks

tyowen avatar May 10 '22 01:05 tyowen

Hi @tyowen, i cannot be sure without the logs (you can send them to me privately on slack if that is an issue) but i have a strong suspicion.

We have observed a bug with the broadcaster library that happens only when you are running with multiple workers and with postgresql as a backbone, where the DB session used for LISTEN/NOTIFY terminates on the postgres side and does not terminate on the library side. It happens rarely, only with a certain configuration of Postgres and usually after a certain time passes from booting the OPAL server.

We are currently working on a fix to that issue :)

asafc avatar May 10 '22 06:05 asafc

Hello, Any update on this bug? I have the exact same issue. Using your helmchart, i am deploying 3 clients (having all the same topic), 3 postgresql and 2 servers on a kubernetes infra. When i trigger the POST /data/config, sometime 1-2 or 3 of the clients are notify. Until now i was working on staging with 1 replicate on each component, but for a prod deployment it's a NOGO if the cluster mode isn't working properly. I will investigate. Do you have more clue/suspicion to share @asafc ?

loan75 avatar Oct 03 '22 13:10 loan75

HI @loan75 - I believe this issue was fixed on most recent version of OPAL (> 0.2.0), and this ticket shoudl actually be closed. @asafc correct me if I'm wrong

Are you setting up a broadcaster / Backbone pub/sub, and if so which one (Kafka, Postgres, Redis) ?

orweis avatar Oct 03 '22 19:10 orweis

Hi @loan75,

this should be fixed in the latest version of OPAL, can you please add this config to OPAL server and let us know if it solved your issue:

OPAL_BROADCAST_CONN_LOSS_BUGFIX_EXPERIMENT_ENABLED=true

are you using postgresql as broadcast backbone?

asafc avatar Oct 03 '22 20:10 asafc

Hello asafc, At the moment its using postgresql as broadcast backbone (default setup in the helmchart, so i expect it to be the most commun setup). Ok will add this ENV and see how it behave.

loan75 avatar Oct 03 '22 21:10 loan75

Hi, i added this Env but i got the same behavior. Opal is in 0.2.1, i check and the env/fix is present in the opal server. I will put all that in debug and investigate during the next few days.

loan75 avatar Oct 04 '22 15:10 loan75

Hi @asafc was this issue ever resolved? The websocket connections keep dropping on v0.18. Want to upgrade to the right version if so.

yadunandan004 avatar Feb 12 '24 19:02 yadunandan004

This is almost definitely resolved by @roekatz , i am not sure if he merged it yet to the main branch. The fix is either already merged, or exists in this PR that is about to get merged: https://github.com/permitio/opal/pull/546

cc @roekatz

asafc avatar Feb 13 '24 08:02 asafc

There should be no issues with broadcasting updates across instances in recent OPAL releases.

@asafc The fixes introduced by that PR are of a more nuanced nature. @yadunandan004 Yes I would say you should definitely upgrade :) For now I'm closing the ticket - If you continue to experience issues with WebSocket connections, we would greatly appreciate it if you could report a new issue.

roekatz avatar Feb 13 '24 08:02 roekatz