kalium
kalium copied to clipboard
feat(mls): Add users to out-of-sync groups [WPB-20862]
|
|
PR Submission Checklist for internal contributors
-
The PR Title
- [X] conforms to the style of semantic commits messagesΒΉ supported in Wire's Github WorkflowΒ²
- [X] contains a reference JIRA issue number like
SQPIT-764 - [X] answers the question: If merged, this PR will: ... Β³
-
The PR Description
- [X] is free of optional paragraphs and you have filled the relevant parts to the best of your ability
What's new in this PR?
Issues
There is a new feature on the DS, where users can be added to a conversation without being added to the MLS group itself.
When a user sends a message or commit, it will be rejected with a mls-group-out-of-sync response, and a list of missing user IDs.
Solutions
The new response parsing was already added, now I am adding the code to actually handle the logic.
- Add
MLSMissingUsersMessageRejectionHandlerto resolve GroupOutOfSync errors when sending messages by adding missing users to MLS groups. - Update
MessageSenderto retry sending messages after handling missing users. - Update the
Retrylogic when failing to send commits
[!NOTE] To make writing tests easier without having to rely on mocking libs or creating huge fakes, I've created fun interfaces, like
ConversationProtocolGetterandMLSMemberAdderinstead of having to rely on big interfaces likeConversationRepositoryorMLSConversationRepository
Testing
Test Coverage
- [X] I have added automated test to this contribution
Test Results
0 testsβ β-β4β095βββ0 β β-β3β984βββ0s β±οΈ - 5m 40s 0 suites β-βββ689βββ0 π€ β-βββ111β 0 filesββ β-βββ689βββ0 β Β±ββββ0β
Results for commit 3d0a466e.βΒ± Comparison against base commit 8476fc0e.
:recycle: This comment has been updated with latest results.
Bencher Report
| Branch | mls/feat/add-users-to-out-of-sync-groups |
| Testbed | ubuntu-latest |
β οΈ WARNING: No Threshold found!
Without a Threshold, no Alerts will ever be generated.
Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the--ci-only-thresholdsflag.
Click to view all benchmark results
| Benchmark | Latency | microseconds (Β΅s) |
|---|---|---|
| com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInFiles | π view plot β οΈ NO THRESHOLD | 555.02 Β΅s |
| com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInMemory | π view plot β οΈ NO THRESHOLD | 424,667.66 Β΅s |
| com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.messageInsertionBenchmark | π view plot β οΈ NO THRESHOLD | 1,431,446.44 Β΅s |
| com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.queryMessagesBenchmark | π view plot β οΈ NO THRESHOLD | 28,990.19 Β΅s |
Codecov Report
:x: Patch coverage is 76.08696% with 33 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 57.32%. Comparing base (8476fc0) to head (3d0a466).
Additional details and impacted files
@@ Coverage Diff @@
## develop #3712 +/- ##
===========================================
+ Coverage 57.29% 57.32% +0.02%
===========================================
Files 1791 1794 +3
Lines 59879 59965 +86
Branches 5865 5880 +15
===========================================
+ Hits 34309 34376 +67
- Misses 23136 23146 +10
- Partials 2434 2443 +9
| Files with missing lines | Coverage Ξ | |
|---|---|---|
| .../com/wire/kalium/logic/data/e2ei/E2EIRepository.kt | 80.19% <100.00%> (ΓΈ) |
|
| ...data/mls/MLSMissingUsersMessageRejectionHandler.kt | 100.00% <100.00%> (ΓΈ) |
|
| ...ure/conversation/RecoverMLSConversationsUseCase.kt | 78.12% <100.00%> (ΓΈ) |
|
| .../kalium/logic/feature/message/MessageSenderImpl.kt | 95.60% <100.00%> (+0.09%) |
:arrow_up: |
| ...kalium/logic/feature/message/StaleEpochVerifier.kt | 100.00% <100.00%> (ΓΈ) |
|
| ...r/conversation/message/MLSMessageFailureHandler.kt | 16.66% <0.00%> (ΓΈ) |
|
| .../com/wire/kalium/logic/feature/debug/DebugScope.kt | 0.00% <0.00%> (ΓΈ) |
|
| .../wire/kalium/logic/feature/message/MessageScope.kt | 0.00% <0.00%> (ΓΈ) |
|
| ...lin/com/wire/kalium/logic/data/MockProtocolInfo.kt | 50.00% <50.00%> (ΓΈ) |
|
| ...alium/logic/data/mls/ConversationProtocolGetter.kt | 0.00% <0.00%> (ΓΈ) |
|
| ... and 1 more |
... and 3 files with indirect coverage changes
Continue to review full report in Codecov by Sentry.
Legend - Click here to learn more
Ξ = absolute <relative> (impact),ΓΈ = not affected,? = missing dataPowered by Codecov. Last update 8476fc0...3d0a466. Read the comment docs.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
@claude review
Code Review for PR #3712: Add users to out-of-sync groups WPB-20862
Summary
This PR implements functionality to handle MLS group out-of-sync errors by automatically adding missing users when a message is rejected by the backend. The implementation is well-structured with good separation of concerns through functional interfaces.
β Strengths
1. Excellent Architecture & Design
- Functional Interfaces: Great use of
MLSMemberAdderandConversationProtocolGetterfor focused, testable abstractions - Single Responsibility:
MLSMissingUsersMessageRejectionHandlerhas a clear, focused purpose - Separation of Concerns: Clean separation between message sending, error handling, and MLS operations
2. Comprehensive Test Coverage
- β
New handler has dedicated test file (
MLSMissingUsersMessageRejectionHandlerTest.kt) - β
Integration tests in
MessageSenderTest.ktupdated - β Test arrangement helpers created for easier testing
- β Edge cases considered (empty user lists, protocol mismatches)
3. Good Error Handling
- Proper use of
Eithermonad for error propagation - Appropriate logging at different levels (
logger.i,logger.w,logger.d) - Clear error messages with context
4. Code Quality
- Consistent Kotlin idioms
- Good use of exhaustive when expressions
- Proper visibility modifiers (
internal) - Clear naming conventions
β οΈ Issues & Concerns
1. Recursive Retry Logic Risk π΄ CRITICAL
Location: MLSConversationDataSource.kt:803-826
CommitStrategy.AddMissingUsers -> {
// ... adds users ...
retryOperation().flatMapLeft {
handleCommitFailure( // β οΈ Recursive call
mlsContext = mlsContext,
failure = it,
groupID = groupID,
remainingAttempts = remainingAttempts - 1, // β οΈ Not decremented for initial AddMissingUsers
// ...
)
}
}
Problem: When AddMissingUsers is triggered, it retries the operation and then calls handleCommitFailure again with the same remainingAttempts value. This means:
- If adding users succeeds but the retry fails with another
GroupOutOfSync, you'll attempt to add users again - This could lead to infinite loops or many more retries than intended
- The retry counter doesn't account for the
AddMissingUsersstrategy attempt
Recommendation: Decrement remainingAttempts when entering the AddMissingUsers path:
is CommitStrategy.AddMissingUsers -> {
wrapStorageRequest { conversationDAO.getConversationByGroupID(groupID.value) }.flatMap {
// ... get cipher suite ...
}.flatMap { cipherSuite ->
addMemberToMLSGroup(mlsContext, groupID, strategy.users, cipherSuite)
}.flatMap {
retryOperation()
}.flatMapLeft {
handleCommitFailure(
mlsContext = mlsContext,
failure = it,
groupID = groupID,
remainingAttempts = remainingAttempts - 1, // β
Decrement here
retryOnClientMismatch = retryOnClientMismatch,
retryOnStaleMessage = retryOnStaleMessage,
retryOperation = retryOperation
)
}
}
2. Potential Infinite Loop on Persistent GroupOutOfSync π‘ MEDIUM
Location: MLSConversationDataSource.kt:803-826
Problem: If the backend consistently returns GroupOutOfSync (e.g., due to a bug or race condition), the code will:
- Add missing users
- Retry
- Get
GroupOutOfSyncagain (possibly with different users) - Add those users
- Repeat until
remainingAttemptsexhausted
This could result in multiple user additions that may not be intended.
Recommendation: Track which users have already been added and avoid re-adding:
// In handleCommitFailure signature
suspend fun <T> handleCommitFailure(
// ... existing params ...
alreadyAddedUsers: Set<UserId> = emptySet() // Track added users
): Either<CoreFailure, T>
// In AddMissingUsers branch
is CommitStrategy.AddMissingUsers -> {
val usersToAdd = strategy.users.filterNot { it in alreadyAddedUsers }
if (usersToAdd.isEmpty()) {
Left(failure) // Avoid infinite loop
} else {
// ... add users ...
.flatMapLeft {
handleCommitFailure(
// ...
alreadyAddedUsers = alreadyAddedUsers + usersToAdd
)
}
}
}
3. Inconsistent Naming π‘ MEDIUM
Location: MLSConversationRepository.kt:189
suspend fun isGroupEpochStale(
mlsContext: MlsCoreCryptoContext,
groupID: GroupID,
currentEpoch: ULong
): Either<CoreFailure, Boolean>
Problem: The KDoc says "Determines whether the known group epoch is stale (outdated), when compared to the provided epoch" but the parameter is named currentEpoch. This is confusingβis currentEpoch the known epoch or the provided epoch?
Looking at the implementation:
mlsContext.conversationEpoch(groupID.value.decodeBase64Bytes()).map { knownEpoch ->
currentEpoch < knownEpoch // currentEpoch is compared against knownEpoch
}
Clarification Needed:
- If
currentEpochis the provided epoch we're checking, the logic seems backward (returns true if provided < known) - If
currentEpochis the known epoch, the parameter name is misleading
Recommendation: Rename for clarity:
suspend fun isGroupEpochStale(
mlsContext: MlsCoreCryptoContext,
groupID: GroupID,
providedEpoch: ULong // or 'remoteEpoch'
): Either<CoreFailure, Boolean>
And update KDoc:
/**
* Determines whether the provided epoch is stale compared to the locally known epoch.
*
* @param providedEpoch The epoch to check (e.g., from a remote message)
* @return true if providedEpoch < locally known epoch (meaning it's stale)
*/
4. TODO Comment Left in Production Code π‘ MEDIUM
Location: MessageSenderImpl.kt:68-70
// TODO(modularisation): Move to :messaging:sending.
// It ain't gonna be easy :)
// TODO(refactor): Split MLS and Proteus logic.
These TODOs are valuable but should be tracked in issue tracker rather than code comments that may become stale.
Recommendation: Create tracking issues and remove or minimize TODOs.
5. Sealed Interface with Single Level π΅ LOW
Location: MLSConversationDataSource.kt:238-241
private sealed interface CommitStrategy {
data object Retry : CommitStrategy
data object Abort : CommitStrategy
data class AddMissingUsers(val users: List<UserId>) : CommitStrategy
}
Observation: Changed from enum to sealed interface, which is good for extensibility. However, Retry and Abort are now data objects with no data.
Recommendation: Consider if object (without data) is more appropriate:
private sealed interface CommitStrategy {
object Retry : CommitStrategy
object Abort : CommitStrategy
data class AddMissingUsers(val users: List<UserId>) : CommitStrategy
}
data object provides structural equality and toString, but for singleton markers, plain object is more idiomatic.
6. Missing Edge Case Handling π‘ MEDIUM
Location: MLSMissingUsersMessageRejectionHandlerImpl.kt:56-65
Problem: What if:
missingUserscontains users that don't exist?missingUserscontains the self user?addMemberToMLSGrouppartially fails (some users added, some not)?
Recommendation: Add defensive checks:
val validUsers = missingUsers.filterNot { it == selfUserId }
if (validUsers.isEmpty()) {
logger.w("No valid users to add after filtering")
return Either.Right(Unit)
}
7. Resource Management in MLSConversationDataSource π΅ LOW
Location: MLSConversationDataSource.kt:820-826
Observation: The code fetches conversationDAO.getConversationByGroupID on every retry when handling AddMissingUsers. This is executed inside handleCommitFailure, which may be called multiple times.
Recommendation: Consider caching the cipher suite to avoid redundant DB queries:
// Pass cipherSuite as parameter to handleCommitFailure if available
π Security Considerations
β Good
- No obvious SQL injection risks (using parameterized queries through DAO)
- No exposed credentials or sensitive data in logs
- Proper authentication context management with
MlsCoreCryptoContext
β οΈ Concerns
- Authorization: Does the code verify that the current user has permission to add users to the group? Or is this handled by the backend?
- Denial of Service: Could a malicious backend repeatedly send
GroupOutOfSyncto cause resource exhaustion? Consider adding rate limiting or maximum retry bounds.
π Performance Considerations
Potential Issues
- Recursive Retries: As mentioned above, could lead to many network calls
- Sequential User Addition:
addMemberToMLSGroupmay add users one-by-one. Consider batch operations if supported. - Database Queries in Retry Loop: Fetching conversation protocol info on each retry attempt
Recommendations
- Add metrics/telemetry to track:
- Frequency of
GroupOutOfSyncerrors - Number of retries needed
- Number of users added per retry
- Frequency of
- Consider exponential backoff for retries
π Code Style & Best Practices
β Good
- Consistent with Kotlin and project conventions
- Proper use of coroutines
- Good separation of test arrangements
- Clear interface definitions
Minor Suggestions
- Consider adding
@VisibleForTestingannotation to test-only functions - The
MockProtocolInfoobject could have a more descriptive package (currently in root of data package)
π§ͺ Testing
β Strengths
- Comprehensive unit tests for new handler
- Integration tests updated
- Good use of test arrangements and fakes
- Tests cover happy path and error cases
π Suggestions
- Add tests for retry exhaustion: What happens when all attempts are used up?
- Test concurrent GroupOutOfSync: What if multiple messages fail simultaneously?
- Test performance under load: How does the system behave with large user lists?
- Add integration test: End-to-end test simulating the full flow from message send to user addition
π― Recommendations Summary
Must Fix (Before Merge)
- π΄ Fix recursive retry logic - Decrement
remainingAttemptsinAddMissingUsersbranch - π‘ Add infinite loop protection - Track already-added users
- π‘ Clarify
isGroupEpochStalenaming - Rename parameter for clarity
Should Fix (High Priority)
- π‘ Edge case handling - Filter self user, handle invalid users
- π‘ Remove TODO comments - Create issues instead
Consider (Nice to Have)
- π΅ Use
objectinstead ofdata objectfor marker types - π΅ Cache cipher suite to avoid redundant DB queries
- π΅ Add telemetry for monitoring retry behavior
- π΅ Security review - Verify authorization checks
β¨ Overall Assessment
Rating: 7.5/10 - Good implementation with strong architecture, but has a critical retry logic issue that needs addressing.
Positive Highlights
- Clean architecture with excellent use of functional interfaces
- Comprehensive test coverage
- Good error handling and logging
- Follows Kotlin best practices
Key Improvements Needed
- Fix retry counter logic to prevent potential infinite loops
- Clarify epoch comparison semantics
- Add protection against persistent GroupOutOfSync scenarios
Once the retry logic issue is addressed, this will be a solid, production-ready implementation. Great work on the architectural design and test coverage! π
Reviewed by Claude Code π€
Clunker is wrong :)
β¨ Overall Assessment
Rating: 6/10 - It helps with some stuff, but it's clearly dreaming during working hours, and raising false alarms.
There is no infinite loop. It literally decreases the retry counter and the retry operation does not reset the counter or anything like that.
I'll write a test to cover this case as well. Good for the future, and maybe it stops complaining.
@claude review
π€ Code Review Complete
Overview
This PR implements handling for GroupOutOfSync errors in MLS conversations by automatically adding missing users to MLS groups when messages/commits are rejected. The implementation is well-structured with good separation of concerns.
β Strengths
1. Excellent Architecture & Design
- Clean separation of concerns: New interfaces
MLSMemberAdderandConversationProtocolGettermake the code testable without heavy mocking - Single Responsibility:
MLSMissingUsersMessageRejectionHandlerhas one clear purpose - Strategic pattern use: The
CommitStrategysealed interface is a great improvement over the enum, enabling type-safe data passing
2. Strong Test Coverage
- Comprehensive unit tests for
MLSMissingUsersMessageRejectionHandlercovering all edge cases:- Non-MLS conversations
- Empty user lists
- Protocol fetch failures
- Member addition failures
- Successful scenarios
- Good use of test arrangements and mock-free testing via functional interfaces
3. Improved Code Quality
- Renamed
isGroupOutOfSyncβisLocalGroupEpochStalewith clear documentation (lines 189-192 in MLSConversationRepository.kt) - Changed
CommitStrategyfrom enum to sealed interface for better type safety - Consistent error handling patterns throughout
β οΈ Issues & Recommendations
π΄ Critical: Potential Infinite Retry Loop
Location: MLSConversationDataSource.kt:836-860 (in the handleCommitFailure for AddMissingUsers)
Issue: When handling GroupOutOfSync, the code adds missing users and retries, but then recursively calls handleCommitFailure again with the same remainingAttempts - 1. If adding users succeeds but the retry operation fails with another GroupOutOfSync error (with different users), this could consume all retry attempts quickly.
Current Flow:
is CommitStrategy.AddMissingUsers -> {
internalAddMemberToMLSGroup(..., retryOnMissingUsers = false, ...) // Good: prevents recursion in add
.flatMap { retryOperation() } // Retry the original operation
.flatMapLeft {
handleCommitFailure(..., remainingAttempts = remainingAttempts - 1, ...) // Uses same counter
}
}
Scenario:
- Attempt 1: Fails with
GroupOutOfSync(users=[A, B])β adds A, B β retry fails withGroupOutOfSync(users=[C]) - Attempt 2: Adds C β retry fails with
ClientMismatch - Attempt 3: Out of retries
Recommendation: Consider separate retry budget for AddMissingUsers or reset attempts after successful user addition:
is CommitStrategy.AddMissingUsers -> {
internalAddMemberToMLSGroup(...).flatMap {
retryOperation()
}.flatMapLeft {
// Reset attempts since we made progress
handleCommitFailure(..., remainingAttempts = 2, ...)
}
}
π‘ High: Inconsistent Retry Flags
Location: MLSConversationDataSource.kt:457 and MessageSenderImpl.kt:294
Issue 1: In internalAddMemberToMLSGroup, when handling GroupOutOfSync:
// Line 807: When handling commits
internalAddMemberToMLSGroup(mlsContext, groupID, strategy.users,
retryOnStaleMessage = true, // β οΈ true here
retryOnMissingUsers = false, // Correct: prevents recursion
cipherSuite
)
But the normal call path uses:
// Line 649: Normal establishment
internalAddMemberToMLSGroup(
retryOnStaleMessage = false, // β οΈ false here
retryOnMissingUsers = true,
...
)
Why it matters: Stale message errors during user addition in the recovery flow will be retried, but during initial establishment they won't be. This inconsistency could lead to hard-to-debug issues.
Issue 2: MessageSender doesn't use the new retry mechanism:
// MessageSenderImpl.kt:294
mlsMissingUsersMessageRejectionHandler.handle(...)
.flatMap { attemptToSend(transactionContext, message) }
This directly retries after adding users, bypassing the retryOnMissingUsers flag entirely. If the retry fails with another GroupOutOfSync, it will not be handled again.
Recommendation:
- Document why
retryOnStaleMessagediffers between call sites, or unify the behavior - Consider if
MessageSendershould respectretryOnMissingUsersor have its own retry limit
π‘ Medium: Race Condition Potential
Location: MLSConversationDataSource.kt:806
Issue: The code fetches cipher suite from the database during commit failure handling:
is CommitStrategy.AddMissingUsers -> {
wrapStorageRequest { conversationDAO.getConversationByGroupID(groupID.value) }
.flatMap { /* extract cipher suite */ }
.flatMap { cipherSuite -> internalAddMemberToMLSGroup(...) }
}
Concern: If the conversation's cipher suite changes between the original operation and the failure handler (unlikely but possible during migration), this could cause issues. The original produceAndSendCommitWithRetryAndResult call had access to the cipher suite from the call site.
Recommendation: Consider passing cipher suite through the retry chain or add validation that it matches expectations.
π‘ Medium: Missing Integration Tests
Observation: The new test files provide excellent unit test coverage, but I don't see integration tests that verify:
- End-to-end flow: Message send β GroupOutOfSync β Auto-add users β Retry β Success
- Multiple sequential GroupOutOfSync errors
- Interaction with existing retry mechanisms (StaleMessage, ClientMismatch)
Recommendation: Add integration tests in MessageSenderTest.kt covering:
@Test
fun givenGroupOutOfSync_whenSendingMessage_thenAddsUsersAndRetries()
@Test
fun givenMultipleGroupOutOfSync_whenSendingMessage_thenAddsAllUsersInSequence()
@Test
fun givenGroupOutOfSyncThenStaleMessage_whenSendingMessage_thenHandlesBothErrors()
π’ Low: Minor Code Quality
- Line 809: Formatting inconsistency -
elseon same line as closing brace - Line 457: Parameter order in
internalAddMemberToMLSGroup-allowPartialMemberListbetween retry flags breaks logical grouping - MockProtocolInfo.kt: Good addition, but consider adding a factory for
ESTABLISHEDstate since that's what's typically needed in tests
π Security Assessment
β No security concerns identified
- User IDs come from server response, not user input
- No elevation of privileges
- Proper validation of conversation protocol types
- Failed operations result in message failure, not silent success
β‘ Performance Considerations
β Generally good, minor observations:
- Database queries during retry: The cipher suite lookup happens during error handling, adding latency. Not critical since this is already a failure path.
- Retry amplification: If many messages hit
GroupOutOfSyncsimultaneously, this could amplify key package claiming load. The existing key package limits should mitigate this. - Network efficiency: Automatically adding users is more efficient than failing and requiring manual intervention.
π Summary
| Category | Status |
|---|---|
| Code Quality | βββββ Excellent |
| Test Coverage | ββββ Good (missing integration tests) |
| Security | β No concerns |
| Performance | β Acceptable |
| Architecture | βββββ Excellent |
Verdict: Approve with Minor Concerns
This is high-quality work with excellent architecture. The critical issue around retry loop exhaustion should be addressed, but it's a refinement rather than a blocker. The code is ready to merge after addressing the retry attempt handling.
Action Items (Priority Order):
- π΄ Before merge: Address infinite retry loop concern - clarify retry attempt strategy for
AddMissingUsers - π‘ Before merge: Document or unify
retryOnStaleMessagebehavior across call sites - π‘ Follow-up: Add integration tests for multi-error scenarios
- π’ Optional: Minor code formatting improvements
Additional Notes
- β
The TODO comment removal at line 265 (
TODO("Handle Group Out Of Sync")) is satisfying to see! - β Great job avoiding heavy mocking by using functional interfaces
- β
Excellent KDoc on
isLocalGroupEpochStaleexplaining the renamed method - βΉοΈ The disabled
monkeys/build.gradle.ktsappears unrelated to this PR
Review completed with β€οΈ by Claude Code Analyzed 22 changed files, 737 additions, 97 deletions
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code