Cross-process lock doesn't seem to work
We have a lock whose job is to prevent the main process and the NSE (notifications) process in Element X from using the `OlmAccount` at the same time. It also stops the two processes from performing an "encryption sync" (i.e., a sliding-sync request in which we request to-device messages) concurrently.
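For context, the intended pattern looks roughly like the sketch below: whichever process wants to touch the `OlmAccount` or run an encryption sync must first take a guard, and the other process must be unable to acquire it until the guard is dropped. The `CrossProcessLock` type and an in-process `Mutex` stand in for the real store-backed lease; this is an illustration of the invariant, not the matrix-sdk API.

```rust
use std::sync::{Mutex, MutexGuard};

// Illustrative stand-in for the real cross-process lock; the actual
// implementation leases a record in the shared store, not a Mutex.
struct CrossProcessLock {
    inner: Mutex<()>,
}

impl CrossProcessLock {
    fn try_lock(&self) -> Option<MutexGuard<'_, ()>> {
        // Err covers "held elsewhere" (and poisoning); fine for a sketch.
        self.inner.try_lock().ok()
    }
}

fn main() {
    let lock = CrossProcessLock { inner: Mutex::new(()) };

    // "Main process" holds the guard for the duration of its encryption sync.
    let guard = lock.try_lock().expect("first acquisition succeeds");

    // "NSE process" must fail to acquire while the guard is alive...
    assert!(lock.try_lock().is_none());

    // ...and succeed only once the guard is dropped.
    drop(guard);
    assert!(lock.try_lock().is_some());
}
```

The rageshake below shows this invariant being violated: both processes get to run a sync at the same time.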
However, we have evidence that it doesn't actually work. We have a rageshake where the main process makes a sync request:
```
2024-04-03T16:18:41.253319Z DEBUG matrix_sdk::http_client: Sending request
| crates/matrix-sdk/src/http_client/mod.rs:179
| spans:
root
next_sync_with_lock
> store_generation=26880
sync_once
> pos="29"
send
> server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
> config=RequestConfig { timeout: 60s }
> request_id="REQ-351"
> method=POST
> uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
> request_size="72 B"
> request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
...
2024-04-03T16:19:02.361031Z DEBUG matrix_sdk::http_client: Got response
| crates/matrix-sdk/src/http_client/mod.rs:185
| spans:
root
next_sync_with_lock
> store_generation=26880
sync_once
> pos="29"
send
> server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
> config=RequestConfig { timeout: 60s }
> request_id="REQ-351"
> method=POST
> uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
> request_size="72 B"
> request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
> status=200
> response_size="1.4 kiB"
```
... and overlapping with that, the NSE process makes another /sync request:
```
2024-04-03T16:18:59.168928Z DEBUG matrix_sdk::http_client: Sending request
| crates/matrix-sdk/src/http_client/mod.rs:179
| spans:
root
get_notification
> room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
> event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
retry_decryption
run_fixed_iterations
> store_generation=26881
sync_once
send
> server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
> config=RequestConfig { timeout: 7s }
> request_id="REQ-209"
> method=POST
> uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
> request_size="155 B"
> request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
...
2024-04-03T16:18:59.328466Z DEBUG matrix_sdk::http_client: Got response
| crates/matrix-sdk/src/http_client/mod.rs:185
| spans:
root
get_notification
> room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
> event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
retry_decryption
run_fixed_iterations
> store_generation=26881
sync_once
send
> server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
> config=RequestConfig { timeout: 7s }
> request_id="REQ-209"
> method=POST
> uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
> request_size="155 B"
> request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
> status=200
> response_size="1.6 kiB"
```
This shouldn't be possible, because both operations are performed while holding the cross-process lock (the main process in `next_sync_with_lock`, and the NSE process in `run_fixed_iterations`).
Now, I also see this in the log from the main process, in the middle of that `next_sync_with_lock` operation:
```
2024-04-03T16:18:51.391736Z INFO matrix_sdk_common::store_locks: exiting the lease extension loop
| crates/matrix-sdk-common/src/store_locks.rs:237
| spans:
root
```
... which I believe means we are dropping the cross-process lock. That would obviously explain a lot of things, but I can't figure out how it happens.
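To make the hypothesis concrete, here is a minimal sketch of how a lease-extension loop of this shape typically works, *assuming* it behaves roughly like the one in matrix-sdk-common: a background task keeps renewing the lease while at least one guard is alive, and exits (letting the lease expire, i.e. releasing the lock) once the holder count reaches zero. All names are illustrative.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Background task that renews the lease while guards are alive.
fn spawn_lease_extension_loop(num_holders: Arc<AtomicU32>) -> thread::JoinHandle<u32> {
    thread::spawn(move || {
        let mut renewals: u32 = 0;
        loop {
            if num_holders.load(Ordering::SeqCst) == 0 {
                // Corresponds to "exiting the lease extension loop":
                // nobody holds a guard any more, so stop renewing and
                // let another process grab the lock.
                return renewals;
            }
            renewals += 1; // in the real code: bump the lease in the store
            thread::sleep(Duration::from_millis(10));
        }
    })
}

fn main() {
    let holders = Arc::new(AtomicU32::new(1)); // one guard alive
    let handle = spawn_lease_extension_loop(holders.clone());
    thread::sleep(Duration::from_millis(50));
    holders.store(0, Ordering::SeqCst); // last guard dropped
    let renewals = handle.join().unwrap();
    assert!(renewals >= 1); // the loop renewed at least once before exiting
}
```

Under that model, the log line can only be emitted after the holder count hits zero, which is why it reads as "some guard was dropped too early" rather than a renewal failure.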
I'm not really sure how best to debug this. We could probably start by turning on all the trace logging in `CrossProcessStoreLock`. It might also be informative to give each `CrossProcessStoreLockGuard` a sequence number, and log when each one is allocated and dropped.
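One way to implement that debugging aid, sketched below: a global counter hands out a sequence number per guard, and `Drop` records when each one goes away, so a premature release shows up as a `drop` with an unexpectedly early sequence number. `DebugGuard` is a hypothetical stand-in, not the real `CrossProcessStoreLockGuard`; the event vector stands in for `tracing` log lines.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

static NEXT_SEQ: AtomicU64 = AtomicU64::new(0);
// Stand-in for log output, so the sketch is testable.
static EVENTS: Mutex<Vec<String>> = Mutex::new(Vec::new());

struct DebugGuard {
    seq: u64,
}

impl DebugGuard {
    fn new() -> Self {
        let seq = NEXT_SEQ.fetch_add(1, Ordering::SeqCst);
        EVENTS.lock().unwrap().push(format!("alloc {seq}"));
        DebugGuard { seq }
    }
}

impl Drop for DebugGuard {
    fn drop(&mut self) {
        // In the real SDK this would be a tracing::debug! call.
        EVENTS.lock().unwrap().push(format!("drop {}", self.seq));
    }
}

fn main() {
    let a = DebugGuard::new();
    let b = DebugGuard::new();
    drop(a);
    drop(b);
    let events = EVENTS.lock().unwrap().clone();
    assert_eq!(events, vec!["alloc 0", "alloc 1", "drop 0", "drop 1"]);
}
```

Correlating those alloc/drop pairs with the "exiting the lease extension loop" line should pinpoint which guard is being dropped while the sync is still in flight.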
I'll try to check.
I think I repro'd this. It's flaky, though. https://github.com/matrix-org/complement-crypto/actions/runs/9094842308/job/24996764717?pr=52#step:15:790 shows `TestMultiprocessDupeOTKUpload` failing. That test:
- Creates Alice and Bob and joins them to a room.
- Bob sends a message. This consumes one of Alice's OTKs.
- Alice sees the OTK get consumed in /sync and tries to upload a new OTK.
- This OTK upload is tarpitted in the test, artificially delaying the response by 4 seconds.
- During this gap, an NSE process is created and `NotificationClient.GetNotification` is called.
- If the lock isn't working, the NSE process will also do /sync, see the OTK get consumed, and upload a different key with the same key ID, causing an HTTP 400 to be returned.
This is exactly what the test shows:

```
notification_test.go:466: /keys/upload returned an error, duplicate key upload? POST http://hs1:8008/_matrix/client/v3/keys/upload (token=syt_dXNlci0zNS1hbGljZQ_YwyfILDCncfsOAbiZBSL_3CgJtE) req_len=538 => HTTP 400
```
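The duplicate-key-ID failure mode can be sketched in a few lines. One-time key IDs in Olm are derived from a monotonically increasing counter in the account; if two processes each load the same persisted account state and then independently generate "the next" key, they mint *different* key material under the *same* ID, and the second `/keys/upload` is rejected. The exact ID format below is an assumption for illustration.

```rust
// Hypothetical key-ID derivation: the real format differs, but the
// counter-collision logic is the point.
fn next_key_id(stored_counter: u64) -> String {
    format!("signed_curve25519:{}", stored_counter + 1)
}

fn main() {
    // Both processes read the same persisted counter before either
    // writes back, because the cross-process lock failed to serialize them.
    let counter_on_disk = 41;
    let main_process_id = next_key_id(counter_on_disk);
    let nse_process_id = next_key_id(counter_on_disk);
    // Same key ID, different key pairs => the second upload gets HTTP 400.
    assert_eq!(main_process_id, nse_process_id);
}
```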
I'll try to come back to this next week to see if I can get a reliable repro case.