
Cross-process lock doesn't seem to work

Open · richvdh opened this issue 2 years ago · 4 comments

We have a lock whose job is to prevent the main process and the NSE (notifications) process on Element X from using the OlmAccount at the same time. It also stops the two processes from performing an "encryption sync" (i.e., a sliding-sync request in which we request to-device messages) concurrently.

However, we have evidence that it doesn't actually work. We have a rageshake where the main process makes a sync request:

2024-04-03T16:18:41.253319Z DEBUG matrix_sdk::http_client: Sending request 
  | crates/matrix-sdk/src/http_client/mod.rs:179 
  | spans: 
    root
    next_sync_with_lock
      > store_generation=26880
    sync_once
      > pos="29"
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
      > config=RequestConfig { timeout: 60s }
      > request_id="REQ-351"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="72 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
...
2024-04-03T16:19:02.361031Z DEBUG matrix_sdk::http_client: Got response 
  | crates/matrix-sdk/src/http_client/mod.rs:185 
  | spans: 
    root
    next_sync_with_lock
      > store_generation=26880
    sync_once
      > pos="29"
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
      > config=RequestConfig { timeout: 60s }
      > request_id="REQ-351"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="72 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
      > status=200
      > response_size="1.4 kiB"

... and overlapping with that, the NSE process makes another /sync request:

2024-04-03T16:18:59.168928Z DEBUG matrix_sdk::http_client: Sending request 
  | crates/matrix-sdk/src/http_client/mod.rs:179 
  | spans: 
    root
    get_notification
      > room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
      > event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
    retry_decryption
    run_fixed_iterations
      > store_generation=26881
    sync_once
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
      > config=RequestConfig { timeout: 7s }
      > request_id="REQ-209"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="155 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
...
2024-04-03T16:18:59.328466Z DEBUG matrix_sdk::http_client: Got response 
  | crates/matrix-sdk/src/http_client/mod.rs:185 
  | spans: 
    root
    get_notification
      > room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
      > event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
    retry_decryption
    run_fixed_iterations
      > store_generation=26881
    sync_once
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
      > config=RequestConfig { timeout: 7s }
      > request_id="REQ-209"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="155 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
      > status=200
      > response_size="1.6 kiB"

This shouldn't be possible, because both operations are done holding the cross-process lock (the main process in next_sync_with_lock, and the NSE process in run_fixed_iterations).

Now, I also see this in the log from the main process, in the middle of that next_sync_with_lock operation:

2024-04-03T16:18:51.391736Z  INFO matrix_sdk_common::store_locks: exiting the lease extension loop 
  | crates/matrix-sdk-common/src/store_locks.rs:237 
  | spans: 
    root

... which I believe means we are dropping the cross-process lock. That would obviously explain a lot of things, but I can't figure out how it happens.

richvdh avatar Apr 09 '24 15:04 richvdh

I'm not really sure how best to debug this. We could probably start by turning on all the trace logging in CrossProcessStoreLock. It might also be informative to give each CrossProcessStoreLockGuard a sequence number, and log when each is allocated and dropped.

richvdh avatar Apr 09 '24 15:04 richvdh

I'll try to check.

Hywan avatar Apr 18 '24 09:04 Hywan

I think I repro'd this. It's flaky though. https://github.com/matrix-org/complement-crypto/actions/runs/9094842308/job/24996764717?pr=52#step:15:790 shows TestMultiprocessDupeOTKUpload failing. That test:

  • Creates Alice and Bob and joins them to a room
  • Bob sends a message. This consumes one of Alice's OTKs.
  • Alice sees the OTK get consumed in /sync and tries to upload a new OTK.
  • This upload is tarpitted in the test, artificially delaying the response by 4 seconds.
  • During this gap, an NSE process is made and NotificationClient.GetNotification is called.
  • If the lock isn't working, the NSE process will also do /sync, see the OTK get consumed, and upload a different key with the same key ID, causing an HTTP 400 to be returned.

This is exactly what the test shows:

  notification_test.go:466: /keys/upload returned an error, duplicate key upload? POST http://hs1:8008/_matrix/client/v3/keys/upload (token=syt_dXNlci0zNS1hbGljZQ_YwyfILDCncfsOAbiZBSL_3CgJtE) req_len=538 => HTTP 400

kegsay avatar May 15 '24 12:05 kegsay

I'll try to come back to this next week to see if I can get a reliable repro case.

kegsay avatar Jun 28 '24 16:06 kegsay