kalium icon indicating copy to clipboard operation
kalium copied to clipboard

feat(mls): Add users to out-of-sync groups [WPB-20862]

Open vitorhugods opened this issue 1 week ago β€’ 9 comments

TaskWPB-20862 [Android] Add clients to an existing channel


PR Submission Checklist for internal contributors

  • The PR Title

    • [X] conforms to the style of semantic commits messagesΒΉ supported in Wire's Github WorkflowΒ²
    • [X] contains a reference JIRA issue number like SQPIT-764
    • [X] answers the question: If merged, this PR will: ... Β³
  • The PR Description

    • [X] is free of optional paragraphs and you have filled the relevant parts to the best of your ability

What's new in this PR?

Issues

There is a new feature on the DS, where users can be added to a conversation without being added to the MLS group itself. When a user sends a message or commit, it will be rejected with a mls-group-out-of-sync response, and a list of missing user IDs.

Solutions

The new response parsing was already added, now I am adding the code to actually handle the logic.

  • Add MLSMissingUsersMessageRejectionHandler to resolve GroupOutOfSync errors when sending messages by adding missing users to MLS groups.
  • Update MessageSender to retry sending messages after handling missing users.
  • Update the Retry logic when failing to send commits

[!NOTE] To make writing tests easier without having to rely on mocking libs or creating huge fakes, I've created fun interfaces, like ConversationProtocolGetter and MLSMemberAdder instead of having to rely on big interfaces like ConversationRepository or MLSConversationRepository

Testing

Test Coverage

  • [X] I have added automated test to this contribution

vitorhugods avatar Nov 18 '25 08:11 vitorhugods

Test Results

0 tests   -β€Š4β€ˆ095   0 βœ…  -β€Š3β€ˆ984   0s ⏱️ - 5m 40s 0 suites  -β€Šβ€‡β€ˆ689   0 πŸ’€  -β€Šβ€‡β€ˆ111  0 files    -β€Šβ€‡β€ˆ689   0 ❌ Β±β€‡β€ˆβ€‡β€‡0 

Results for commit 3d0a466e. ± Comparison against base commit 8476fc0e.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Nov 18 '25 08:11 github-actions[bot]

🐰 Bencher Report

Branchmls/feat/add-users-to-out-of-sync-groups
Testbedubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Click to view all benchmark results
BenchmarkLatencymicroseconds (Β΅s)
com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInFilesπŸ“ˆ view plot
⚠️ NO THRESHOLD
555.02 Β΅s
com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInMemoryπŸ“ˆ view plot
⚠️ NO THRESHOLD
424,667.66 Β΅s
com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.messageInsertionBenchmarkπŸ“ˆ view plot
⚠️ NO THRESHOLD
1,431,446.44 Β΅s
com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.queryMessagesBenchmarkπŸ“ˆ view plot
⚠️ NO THRESHOLD
28,990.19 Β΅s
🐰 View full continuous benchmarking report in Bencher

github-actions[bot] avatar Nov 18 '25 09:11 github-actions[bot]

Codecov Report

:x: Patch coverage is 76.08696% with 33 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 57.32%. Comparing base (8476fc0) to head (3d0a466).

Files with missing lines Patch % Lines
...gic/data/conversation/MLSConversationRepository.kt 77.27% 6 Missing and 9 partials :warning:
...alium/logic/data/mls/ConversationProtocolGetter.kt 0.00% 7 Missing :warning:
...lin/com/wire/kalium/logic/data/MockProtocolInfo.kt 50.00% 6 Missing :warning:
.../com/wire/kalium/logic/feature/debug/DebugScope.kt 0.00% 2 Missing :warning:
.../wire/kalium/logic/feature/message/MessageScope.kt 0.00% 2 Missing :warning:
...r/conversation/message/MLSMessageFailureHandler.kt 0.00% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3712      +/-   ##
===========================================
+ Coverage    57.29%   57.32%   +0.02%     
===========================================
  Files         1791     1794       +3     
  Lines        59879    59965      +86     
  Branches      5865     5880      +15     
===========================================
+ Hits         34309    34376      +67     
- Misses       23136    23146      +10     
- Partials      2434     2443       +9     
Files with missing lines Coverage Ξ”
.../com/wire/kalium/logic/data/e2ei/E2EIRepository.kt 80.19% <100.00%> (ΓΈ)
...data/mls/MLSMissingUsersMessageRejectionHandler.kt 100.00% <100.00%> (ΓΈ)
...ure/conversation/RecoverMLSConversationsUseCase.kt 78.12% <100.00%> (ΓΈ)
.../kalium/logic/feature/message/MessageSenderImpl.kt 95.60% <100.00%> (+0.09%) :arrow_up:
...kalium/logic/feature/message/StaleEpochVerifier.kt 100.00% <100.00%> (ΓΈ)
...r/conversation/message/MLSMessageFailureHandler.kt 16.66% <0.00%> (ΓΈ)
.../com/wire/kalium/logic/feature/debug/DebugScope.kt 0.00% <0.00%> (ΓΈ)
.../wire/kalium/logic/feature/message/MessageScope.kt 0.00% <0.00%> (ΓΈ)
...lin/com/wire/kalium/logic/data/MockProtocolInfo.kt 50.00% <50.00%> (ΓΈ)
...alium/logic/data/mls/ConversationProtocolGetter.kt 0.00% <0.00%> (ΓΈ)
... and 1 more

... and 3 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Ξ” = absolute <relative> (impact), ΓΈ = not affected, ? = missing data Powered by Codecov. Last update 8476fc0...3d0a466. Read the comment docs.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Nov 18 '25 09:11 codecov-commenter

@claude review

MohamadJaara avatar Nov 19 '25 09:11 MohamadJaara

Code Review for PR #3712: Add users to out-of-sync groups WPB-20862

Summary

This PR implements functionality to handle MLS group out-of-sync errors by automatically adding missing users when a message is rejected by the backend. The implementation is well-structured with good separation of concerns through functional interfaces.


βœ… Strengths

1. Excellent Architecture & Design

  • Functional Interfaces: Great use of MLSMemberAdder and ConversationProtocolGetter for focused, testable abstractions
  • Single Responsibility: MLSMissingUsersMessageRejectionHandler has a clear, focused purpose
  • Separation of Concerns: Clean separation between message sending, error handling, and MLS operations

2. Comprehensive Test Coverage

  • βœ… New handler has dedicated test file (MLSMissingUsersMessageRejectionHandlerTest.kt)
  • βœ… Integration tests in MessageSenderTest.kt updated
  • βœ… Test arrangement helpers created for easier testing
  • βœ… Edge cases considered (empty user lists, protocol mismatches)

3. Good Error Handling

  • Proper use of Either monad for error propagation
  • Appropriate logging at different levels (logger.i, logger.w, logger.d)
  • Clear error messages with context

4. Code Quality

  • Consistent Kotlin idioms
  • Good use of exhaustive when expressions
  • Proper visibility modifiers (internal)
  • Clear naming conventions

⚠️ Issues & Concerns

1. Recursive Retry Logic Risk πŸ”΄ CRITICAL

Location: MLSConversationDataSource.kt:803-826

CommitStrategy.AddMissingUsers -> {
    // ... adds users ...
    retryOperation().flatMapLeft {
        handleCommitFailure(  // ⚠️ Recursive call
            mlsContext = mlsContext,
            failure = it,
            groupID = groupID,
            remainingAttempts = remainingAttempts - 1,  // ⚠️ Not decremented for initial AddMissingUsers
            // ...
        )
    }
}

Problem: When AddMissingUsers is triggered, it retries the operation and then calls handleCommitFailure again with the same remainingAttempts value. This means:

  • If adding users succeeds but the retry fails with another GroupOutOfSync, you'll attempt to add users again
  • This could lead to infinite loops or many more retries than intended
  • The retry counter doesn't account for the AddMissingUsers strategy attempt

Recommendation: Decrement remainingAttempts when entering the AddMissingUsers path:

is CommitStrategy.AddMissingUsers -> {
    wrapStorageRequest { conversationDAO.getConversationByGroupID(groupID.value) }.flatMap {
        // ... get cipher suite ...
    }.flatMap { cipherSuite ->
        addMemberToMLSGroup(mlsContext, groupID, strategy.users, cipherSuite)
    }.flatMap {
        retryOperation()
    }.flatMapLeft {
        handleCommitFailure(
            mlsContext = mlsContext,
            failure = it,
            groupID = groupID,
            remainingAttempts = remainingAttempts - 1,  // βœ… Decrement here
            retryOnClientMismatch = retryOnClientMismatch,
            retryOnStaleMessage = retryOnStaleMessage,
            retryOperation = retryOperation
        )
    }
}

2. Potential Infinite Loop on Persistent GroupOutOfSync 🟑 MEDIUM

Location: MLSConversationDataSource.kt:803-826

Problem: If the backend consistently returns GroupOutOfSync (e.g., due to a bug or race condition), the code will:

  1. Add missing users
  2. Retry
  3. Get GroupOutOfSync again (possibly with different users)
  4. Add those users
  5. Repeat until remainingAttempts exhausted

This could result in multiple user additions that may not be intended.

Recommendation: Track which users have already been added and avoid re-adding:

// In handleCommitFailure signature
suspend fun <T> handleCommitFailure(
    // ... existing params ...
    alreadyAddedUsers: Set<UserId> = emptySet()  // Track added users
): Either<CoreFailure, T>

// In AddMissingUsers branch
is CommitStrategy.AddMissingUsers -> {
    val usersToAdd = strategy.users.filterNot { it in alreadyAddedUsers }
    if (usersToAdd.isEmpty()) {
        Left(failure)  // Avoid infinite loop
    } else {
        // ... add users ...
        .flatMapLeft {
            handleCommitFailure(
                // ...
                alreadyAddedUsers = alreadyAddedUsers + usersToAdd
            )
        }
    }
}

3. Inconsistent Naming 🟑 MEDIUM

Location: MLSConversationRepository.kt:189

suspend fun isGroupEpochStale(
    mlsContext: MlsCoreCryptoContext,
    groupID: GroupID,
    currentEpoch: ULong
): Either<CoreFailure, Boolean>

Problem: The KDoc says "Determines whether the known group epoch is stale (outdated), when compared to the provided epoch" but the parameter is named currentEpoch. This is confusingβ€”is currentEpoch the known epoch or the provided epoch?

Looking at the implementation:

mlsContext.conversationEpoch(groupID.value.decodeBase64Bytes()).map { knownEpoch ->
    currentEpoch < knownEpoch  // currentEpoch is compared against knownEpoch
}

Clarification Needed:

  • If currentEpoch is the provided epoch we're checking, the logic seems backward (returns true if provided < known)
  • If currentEpoch is the known epoch, the parameter name is misleading

Recommendation: Rename for clarity:

suspend fun isGroupEpochStale(
    mlsContext: MlsCoreCryptoContext,
    groupID: GroupID,
    providedEpoch: ULong  // or 'remoteEpoch'
): Either<CoreFailure, Boolean>

And update KDoc:

/**
 * Determines whether the provided epoch is stale compared to the locally known epoch.
 * 
 * @param providedEpoch The epoch to check (e.g., from a remote message)
 * @return true if providedEpoch < locally known epoch (meaning it's stale)
 */

4. TODO Comment Left in Production Code 🟑 MEDIUM

Location: MessageSenderImpl.kt:68-70

// TODO(modularisation): Move to :messaging:sending.
//                       It ain't gonna be easy :)
// TODO(refactor): Split MLS and Proteus logic.

These TODOs are valuable but should be tracked in issue tracker rather than code comments that may become stale.

Recommendation: Create tracking issues and remove or minimize TODOs.

5. Sealed Interface with Single Level πŸ”΅ LOW

Location: MLSConversationDataSource.kt:238-241

private sealed interface CommitStrategy {
    data object Retry : CommitStrategy
    data object Abort : CommitStrategy
    data class AddMissingUsers(val users: List<UserId>) : CommitStrategy
}

Observation: Changed from enum to sealed interface, which is good for extensibility. However, Retry and Abort are now data objects with no data.

Recommendation: Consider if object (without data) is more appropriate:

private sealed interface CommitStrategy {
    object Retry : CommitStrategy
    object Abort : CommitStrategy
    data class AddMissingUsers(val users: List<UserId>) : CommitStrategy
}

data object provides structural equality and toString, but for singleton markers, plain object is more idiomatic.

6. Missing Edge Case Handling 🟑 MEDIUM

Location: MLSMissingUsersMessageRejectionHandlerImpl.kt:56-65

Problem: What if:

  • missingUsers contains users that don't exist?
  • missingUsers contains the self user?
  • addMemberToMLSGroup partially fails (some users added, some not)?

Recommendation: Add defensive checks:

val validUsers = missingUsers.filterNot { it == selfUserId }
if (validUsers.isEmpty()) {
    logger.w("No valid users to add after filtering")
    return Either.Right(Unit)
}

7. Resource Management in MLSConversationDataSource πŸ”΅ LOW

Location: MLSConversationDataSource.kt:820-826

Observation: The code fetches conversationDAO.getConversationByGroupID on every retry when handling AddMissingUsers. This is executed inside handleCommitFailure, which may be called multiple times.

Recommendation: Consider caching the cipher suite to avoid redundant DB queries:

// Pass cipherSuite as parameter to handleCommitFailure if available

πŸ” Security Considerations

βœ… Good

  • No obvious SQL injection risks (using parameterized queries through DAO)
  • No exposed credentials or sensitive data in logs
  • Proper authentication context management with MlsCoreCryptoContext

⚠️ Concerns

  1. Authorization: Does the code verify that the current user has permission to add users to the group? Or is this handled by the backend?
  2. Denial of Service: Could a malicious backend repeatedly send GroupOutOfSync to cause resource exhaustion? Consider adding rate limiting or maximum retry bounds.

πŸ“Š Performance Considerations

Potential Issues

  1. Recursive Retries: As mentioned above, could lead to many network calls
  2. Sequential User Addition: addMemberToMLSGroup may add users one-by-one. Consider batch operations if supported.
  3. Database Queries in Retry Loop: Fetching conversation protocol info on each retry attempt

Recommendations

  • Add metrics/telemetry to track:
    • Frequency of GroupOutOfSync errors
    • Number of retries needed
    • Number of users added per retry
  • Consider exponential backoff for retries

πŸ“ Code Style & Best Practices

βœ… Good

  • Consistent with Kotlin and project conventions
  • Proper use of coroutines
  • Good separation of test arrangements
  • Clear interface definitions

Minor Suggestions

  1. Consider adding @VisibleForTesting annotation to test-only functions
  2. The MockProtocolInfo object could have a more descriptive package (currently in root of data package)

πŸ§ͺ Testing

βœ… Strengths

  • Comprehensive unit tests for new handler
  • Integration tests updated
  • Good use of test arrangements and fakes
  • Tests cover happy path and error cases

πŸ“Œ Suggestions

  1. Add tests for retry exhaustion: What happens when all attempts are used up?
  2. Test concurrent GroupOutOfSync: What if multiple messages fail simultaneously?
  3. Test performance under load: How does the system behave with large user lists?
  4. Add integration test: End-to-end test simulating the full flow from message send to user addition

🎯 Recommendations Summary

Must Fix (Before Merge)

  1. πŸ”΄ Fix recursive retry logic - Decrement remainingAttempts in AddMissingUsers branch
  2. 🟑 Add infinite loop protection - Track already-added users
  3. 🟑 Clarify isGroupEpochStale naming - Rename parameter for clarity

Should Fix (High Priority)

  1. 🟑 Edge case handling - Filter self user, handle invalid users
  2. 🟑 Remove TODO comments - Create issues instead

Consider (Nice to Have)

  1. πŸ”΅ Use object instead of data object for marker types
  2. πŸ”΅ Cache cipher suite to avoid redundant DB queries
  3. πŸ”΅ Add telemetry for monitoring retry behavior
  4. πŸ”΅ Security review - Verify authorization checks

✨ Overall Assessment

Rating: 7.5/10 - Good implementation with strong architecture, but has a critical retry logic issue that needs addressing.

Positive Highlights

  • Clean architecture with excellent use of functional interfaces
  • Comprehensive test coverage
  • Good error handling and logging
  • Follows Kotlin best practices

Key Improvements Needed

  • Fix retry counter logic to prevent potential infinite loops
  • Clarify epoch comparison semantics
  • Add protection against persistent GroupOutOfSync scenarios

Once the retry logic issue is addressed, this will be a solid, production-ready implementation. Great work on the architectural design and test coverage! πŸŽ‰


Reviewed by Claude Code πŸ€–

github-actions[bot] avatar Nov 19 '25 09:11 github-actions[bot]

Clunker is wrong :)

✨ Overall Assessment

Rating: 6/10 - It helps with some stuff, but it's clearly dreaming during working hours, and raising false alarms.


There is no infinite loop. It literally decreases the retry counter and the retry operation does not reset the counter or anything like that.

I'll write a test to cover this case as well. Good for the future, and maybe it stops complaining.

vitorhugods avatar Nov 19 '25 09:11 vitorhugods

@claude review

vitorhugods avatar Nov 19 '25 10:11 vitorhugods

πŸ€– Code Review Complete

Overview

This PR implements handling for GroupOutOfSync errors in MLS conversations by automatically adding missing users to MLS groups when messages/commits are rejected. The implementation is well-structured with good separation of concerns.


βœ… Strengths

1. Excellent Architecture & Design

  • Clean separation of concerns: New interfaces MLSMemberAdder and ConversationProtocolGetter make the code testable without heavy mocking
  • Single Responsibility: MLSMissingUsersMessageRejectionHandler has one clear purpose
  • Strategic pattern use: The CommitStrategy sealed interface is a great improvement over the enum, enabling type-safe data passing

2. Strong Test Coverage

  • Comprehensive unit tests for MLSMissingUsersMessageRejectionHandler covering all edge cases:
    • Non-MLS conversations
    • Empty user lists
    • Protocol fetch failures
    • Member addition failures
    • Successful scenarios
  • Good use of test arrangements and mock-free testing via functional interfaces

3. Improved Code Quality

  • Renamed isGroupOutOfSync β†’ isLocalGroupEpochStale with clear documentation (lines 189-192 in MLSConversationRepository.kt)
  • Changed CommitStrategy from enum to sealed interface for better type safety
  • Consistent error handling patterns throughout

⚠️ Issues & Recommendations

πŸ”΄ Critical: Potential Infinite Retry Loop

Location: MLSConversationDataSource.kt:836-860 (in the handleCommitFailure for AddMissingUsers)

Issue: When handling GroupOutOfSync, the code adds missing users and retries, but then recursively calls handleCommitFailure again with the same remainingAttempts - 1. If adding users succeeds but the retry operation fails with another GroupOutOfSync error (with different users), this could consume all retry attempts quickly.

Current Flow:

is CommitStrategy.AddMissingUsers -> {
    internalAddMemberToMLSGroup(..., retryOnMissingUsers = false, ...) // Good: prevents recursion in add
        .flatMap { retryOperation() } // Retry the original operation
        .flatMapLeft { 
            handleCommitFailure(..., remainingAttempts = remainingAttempts - 1, ...) // Uses same counter
        }
}

Scenario:

  1. Attempt 1: Fails with GroupOutOfSync(users=[A, B]) β†’ adds A, B β†’ retry fails with GroupOutOfSync(users=[C])
  2. Attempt 2: Adds C β†’ retry fails with ClientMismatch
  3. Attempt 3: Out of retries

Recommendation: Consider separate retry budget for AddMissingUsers or reset attempts after successful user addition:

is CommitStrategy.AddMissingUsers -> {
    internalAddMemberToMLSGroup(...).flatMap { 
        retryOperation() 
    }.flatMapLeft { 
        // Reset attempts since we made progress
        handleCommitFailure(..., remainingAttempts = 2, ...) 
    }
}

🟑 High: Inconsistent Retry Flags

Location: MLSConversationDataSource.kt:457 and MessageSenderImpl.kt:294

Issue 1: In internalAddMemberToMLSGroup, when handling GroupOutOfSync:

// Line 807: When handling commits
internalAddMemberToMLSGroup(mlsContext, groupID, strategy.users, 
    retryOnStaleMessage = true,  // ⚠️ true here
    retryOnMissingUsers = false,  // Correct: prevents recursion
    cipherSuite
)

But the normal call path uses:

// Line 649: Normal establishment
internalAddMemberToMLSGroup(
    retryOnStaleMessage = false,  // ⚠️ false here
    retryOnMissingUsers = true,
    ...
)

Why it matters: Stale message errors during user addition in the recovery flow will be retried, but during initial establishment they won't be. This inconsistency could lead to hard-to-debug issues.

Issue 2: MessageSender doesn't use the new retry mechanism:

// MessageSenderImpl.kt:294
mlsMissingUsersMessageRejectionHandler.handle(...)
    .flatMap { attemptToSend(transactionContext, message) }

This directly retries after adding users, bypassing the retryOnMissingUsers flag entirely. If the retry fails with another GroupOutOfSync, it will not be handled again.

Recommendation:

  1. Document why retryOnStaleMessage differs between call sites, or unify the behavior
  2. Consider if MessageSender should respect retryOnMissingUsers or have its own retry limit

🟑 Medium: Race Condition Potential

Location: MLSConversationDataSource.kt:806

Issue: The code fetches cipher suite from the database during commit failure handling:

is CommitStrategy.AddMissingUsers -> {
    wrapStorageRequest { conversationDAO.getConversationByGroupID(groupID.value) }
        .flatMap { /* extract cipher suite */ }
        .flatMap { cipherSuite -> internalAddMemberToMLSGroup(...) }
}

Concern: If the conversation's cipher suite changes between the original operation and the failure handler (unlikely but possible during migration), this could cause issues. The original produceAndSendCommitWithRetryAndResult call had access to the cipher suite from the call site.

Recommendation: Consider passing cipher suite through the retry chain or add validation that it matches expectations.

🟑 Medium: Missing Integration Tests

Observation: The new test files provide excellent unit test coverage, but I don't see integration tests that verify:

  • End-to-end flow: Message send β†’ GroupOutOfSync β†’ Auto-add users β†’ Retry β†’ Success
  • Multiple sequential GroupOutOfSync errors
  • Interaction with existing retry mechanisms (StaleMessage, ClientMismatch)

Recommendation: Add integration tests in MessageSenderTest.kt covering:

@Test
fun givenGroupOutOfSync_whenSendingMessage_thenAddsUsersAndRetries()

@Test  
fun givenMultipleGroupOutOfSync_whenSendingMessage_thenAddsAllUsersInSequence()

@Test
fun givenGroupOutOfSyncThenStaleMessage_whenSendingMessage_thenHandlesBothErrors()

🟒 Low: Minor Code Quality

  1. Line 809: Formatting inconsistency - else on same line as closing brace
  2. Line 457: Parameter order in internalAddMemberToMLSGroup - allowPartialMemberList between retry flags breaks logical grouping
  3. MockProtocolInfo.kt: Good addition, but consider adding a factory for ESTABLISHED state since that's what's typically needed in tests

πŸ”’ Security Assessment

βœ… No security concerns identified

  • User IDs come from server response, not user input
  • No elevation of privileges
  • Proper validation of conversation protocol types
  • Failed operations result in message failure, not silent success

⚑ Performance Considerations

βœ… Generally good, minor observations:

  1. Database queries during retry: The cipher suite lookup happens during error handling, adding latency. Not critical since this is already a failure path.
  2. Retry amplification: If many messages hit GroupOutOfSync simultaneously, this could amplify key package claiming load. The existing key package limits should mitigate this.
  3. Network efficiency: Automatically adding users is more efficient than failing and requiring manual intervention.

πŸ“‹ Summary

Category Status
Code Quality ⭐⭐⭐⭐⭐ Excellent
Test Coverage ⭐⭐⭐⭐ Good (missing integration tests)
Security βœ… No concerns
Performance βœ… Acceptable
Architecture ⭐⭐⭐⭐⭐ Excellent

Verdict: Approve with Minor Concerns

This is high-quality work with excellent architecture. The critical issue around retry loop exhaustion should be addressed, but it's a refinement rather than a blocker. The code is ready to merge after addressing the retry attempt handling.

Action Items (Priority Order):

  1. πŸ”΄ Before merge: Address infinite retry loop concern - clarify retry attempt strategy for AddMissingUsers
  2. 🟑 Before merge: Document or unify retryOnStaleMessage behavior across call sites
  3. 🟑 Follow-up: Add integration tests for multi-error scenarios
  4. 🟒 Optional: Minor code formatting improvements

Additional Notes

  • βœ… The TODO comment removal at line 265 (TODO("Handle Group Out Of Sync")) is satisfying to see!
  • βœ… Great job avoiding heavy mocking by using functional interfaces
  • βœ… Excellent KDoc on isLocalGroupEpochStale explaining the renamed method
  • ℹ️ The disabled monkeys/build.gradle.kts appears unrelated to this PR

Review completed with ❀️ by Claude Code Analyzed 22 changed files, 737 additions, 97 deletions

github-actions[bot] avatar Nov 19 '25 10:11 github-actions[bot]