
logout feature

TigerInYourDream opened this issue 9 months ago • 12 comments

Based on PR #293, I found an error message during logout:

2024-12-24T07:58:55.514036Z ERROR matrix_sdk_ui::sync_service: Error while processing encryption in sync service: Something wrong happened in sliding sync: the server returned an error: [401 / M_UNKNOWN_TOKEN] Invalid access token passed.

Root cause: the error is related to SYNC_SERVICE handling during the logout process (see the ordering sketch below).

Changes:

  • Updated SYNC_SERVICE handling during logout
  • Tested the fix with Google SSO login/logout flow
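
A minimal sketch of the ordering this implies (helper names such as get_sync_service follow the ones used later in this thread; this is illustrative, not the PR's actual code):

async fn logout_in_order(client: &matrix_sdk::Client) -> anyhow::Result<()> {
    // Stop the sync service first, so it stops using the access token...
    if let Some(sync_service) = get_sync_service() {
        sync_service.stop().await;
    }
    // ...before the server invalidates that token here. Doing these two steps
    // in the opposite order yields the 401 M_UNKNOWN_TOKEN error quoted above.
    client.matrix_auth().logout().await?;
    Ok(())
}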

TigerInYourDream avatar Mar 11 '25 15:03 TigerInYourDream

Thanks Alvin. I'm a bit confused -- is this a replacement for #293, or something that should be merged in after #293 as an addition to it?

Also, is it ready for review or shall I wait for something else?

kevinaboos avatar Mar 11 '25 20:03 kevinaboos

Thanks Alvin. I'm a bit confused -- is this a replacement for #293, or something that should be merged in after #293 as an addition to it?

Also, is it ready for review or shall I wait for something else?

Due to the age of PR #293 and substantial changes in the codebase, resolving merge conflicts became quite challenging. I've contacted the original author, Guocork, who also suggested creating a new PR. Therefore, I've created a fresh PR implementing the same functionality with updated code and comprehensive testing. Could you please review this new implementation?

TigerInYourDream avatar Mar 12 '25 04:03 TigerInYourDream

Thanks Alvin, great start on this feature!

I left some comments below. In general, make sure to name things carefully --- try to give variables and types very specific names that cannot be misinterpreted. (I know naming is hard, so it'll take some time to get that right. No worries.)

Thank you for the review. I will try to improve the naming conventions and continue working on it.

TigerInYourDream avatar Mar 13 '25 04:03 TigerInYourDream

  1. Suggest adding a modal for logout confirmation
  2. Perhaps delete the "latest_user_id.txt" in the app data directory
  3. After logout, restarting the app shows a failure prompt saying "Could not restore previous user session". (Screenshot attached: 2025-03-18 at 3:51 PM)

alanpoon avatar Mar 18 '25 07:03 alanpoon

  1. Suggest adding a modal for logout confirmation
  2. Perhaps delete the "latest_user_id.txt" in the app data directory
  3. After logout, restarting the app shows a failure prompt saying "Could not restore previous user session". (Screenshot attached: 2025-03-18 at 3:51 PM)

Thanks for the review.

TigerInYourDream avatar Mar 18 '25 13:03 TigerInYourDream

In addition to specific review comments, the following suggestions are also made:

  1. It may be necessary to delete or mark session files as invalid simultaneously.
  2. Improve error handling and logging to ensure that users are aware of whether the logout was successful.
  3. Ensure consistency of UI state regardless of whether the logout was successful or not.

Updated: gave more information in error messages, improved error handling, and added a new function that closes all tabs before logout.

TigerInYourDream avatar Apr 06 '25 16:04 TigerInYourDream

It might be better to use Tokio's shutdown_background to close all asynchronous tasks. I asked AI about it and roughly tried it. It looks promising and requires fewer code changes. (See b.txt.) Perhaps the timeout with an arbitrary duration can be removed.

alanpoon avatar May 02 '25 07:05 alanpoon

I once implemented a modal for Moly: https://github.com/moxin-org/moly/pull/341/files. There is a very simple modal in Moly: https://github.com/moxin-org/moly/blob/main/src/shared/modal.rs you can check out both of these for reference.

Guocork avatar May 08 '25 02:05 Guocork

It might be better to use Tokio's shutdown_background to close all asynchronous tasks. I asked AI about it and roughly tried it. It looks promising and requires fewer code changes. (See b.txt.) Perhaps the timeout with an arbitrary duration can be removed.

I opted against using both shutdown_background and shutdown_timeout because they immediately begin closing the Tokio runtime, which causes subsequent async tasks to fail. Our logout flow still requires the runtime for critical operations like cleaning up state and sending notifications after the core tasks are aborted. I tested both shutdown methods and they proved problematic, resulting in incomplete logout processes where the application wouldn't properly transition to the login state.
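
For illustration, here is a minimal standalone sketch of the behavior I mean (plain tokio, not Robrix code):

use std::time::Duration;

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.spawn(async {
        tokio::time::sleep(Duration::from_secs(60)).await;
        println!("never printed: the task is dropped, not awaited");
    });
    // Returns immediately without waiting for spawned tasks to finish;
    // any later work that still needs this runtime has nowhere to run.
    rt.shutdown_background();
}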

TigerInYourDream avatar May 17 '25 03:05 TigerInYourDream

It might be better to use Tokio's shutdown_background to close all asynchronous tasks. I asked AI about it and roughly tried it. It looks promising and requires fewer code changes. (See b.txt.) Perhaps the timeout with an arbitrary duration can be removed.

I opted against using both shutdown_background and shutdown_timeout because they immediately begin closing the Tokio runtime, which causes subsequent async tasks to fail. Our logout flow still requires the runtime for critical operations like cleaning up state and sending notifications after the core tasks are aborted. I tested both shutdown methods and they proved problematic, resulting in incomplete logout processes where the application wouldn't properly transition to the login state.

I don't see what the issue is: https://github.com/alanpoon/robrix/tree/fix-logout-shutdown. The problem is that your "logout_successful = true" only sits inside the successful-response arm of client.matrix_auth().logout(), which is not necessary. If the access token is invalid, client.matrix_auth().logout() will fail, but an invalid access token is also a logged-out state.
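
A sketch of what I mean (logout_successful is the variable from your branch; matching the error via its message string is an assumption for illustration):

match client.matrix_auth().logout().await {
    Ok(_) => logout_successful = true,
    Err(e) if e.to_string().contains("M_UNKNOWN_TOKEN") => {
        // The server no longer recognizes the token, so the session is
        // already gone, which is exactly the state logout is aiming for.
        logout_successful = true;
    }
    Err(e) => log!("Server-side logout failed: {e}"),
}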

alanpoon avatar May 20 '25 11:05 alanpoon

I noticed that my review has been requested here and that this has been marked as waiting-on-review. However, there's also another PR #494 that seems to supersede this.

Can you guys clarify the relationship between these PRs and which one I should be looking at?

kevinaboos avatar May 29 '25 20:05 kevinaboos

Regarding PR #494, I had a discussion with @alanpoon about which method to use for terminating asynchronous tasks in Tokio. After our discussion, we concluded that using shutdown_background would make this PR more concise and efficient.

As a result, I have updated the original code to implement the shutdown_background version. This change simplifies the task termination process while maintaining the desired functionality.

I kindly request a review of the current PR to ensure that the implementation aligns with our project standards and effectively addresses the task termination requirements.

Your feedback on this updated approach would be greatly appreciated. Please let me know if you need any further information or clarification on the changes made.

Thank you for your time and consideration. @kevinaboos

TigerInYourDream avatar May 30 '25 03:05 TigerInYourDream

Ok, that adds a bit of detail, but doesn't fully answer my question. Is #494 being closed? Is this being merged in and then #494 afterwards? @alanpoon's previous comments seem to indicate that the approach hasn't been settled on.

kevinaboos avatar May 30 '25 17:05 kevinaboos

#494

@kevinaboos I've discussed this with alanpoon, and as a result, he has closed PR #494.

Please review the current PR, which incorporates the agreed-upon changes and approach.

Thank you for your time and consideration.

TigerInYourDream avatar Jun 04 '25 02:06 TigerInYourDream

@ZhangHanDong can you please review this PR and the changes related to the room subscription stuff? There are a few modifications to files that you own.

kevinaboos avatar Jun 05 '25 17:06 kevinaboos

Thank you for your thorough code review and valuable feedback. I appreciate the time you've taken to examine my code in detail.

I will carefully address each of your suggestions, including removing excess logging and improving naming conventions. The initial verbose logging was intended to provide clarity during the review process.

I'll make the necessary improvements based on your comments and submit an updated version soon.

Thank you again for your insights and guidance.

TigerInYourDream avatar Jun 06 '25 04:06 TigerInYourDream

https://github.com/project-robius/robrix/blob/e888e99fa0a4b965a2228b58c56831ad2eea8838/src/sliding_sync.rs#L1187 Shall we add a break in the loop when it returns an error, rather than showing an error popup?

alanpoon avatar Jun 20 '25 07:06 alanpoon

https://github.com/project-robius/robrix/blob/e888e99fa0a4b965a2228b58c56831ad2eea8838/src/sliding_sync.rs#L1187

Shall we add a break in the loop when it returns an error, rather than showing an error popup?

I don't think it matters much. We could break out of the loop and try to restart all of the async loop tasks; if you want to do that, then it would be appropriate to not show an error message. Since we're not doing that now, this error is pretty much "fatal" and will require restarting Robrix itself.
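
Roughly what that alternative would look like (subscriber and handle_update are hypothetical stand-ins for the loop at sliding_sync.rs#L1187):

loop {
    match subscriber.recv().await {
        Ok(update) => handle_update(update),
        Err(e) => {
            // If we restarted the async loop tasks after this, a quiet break
            // would be fine; since we don't, the error is surfaced as fatal.
            log!("room subscription loop ended: {e}");
            break;
        }
    }
}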

kevinaboos avatar Jun 20 '25 18:06 kevinaboos

Thank you for your detailed feedback, Kevin. I'll address both these new comments and the previous suggestions.

TigerInYourDream avatar Jun 21 '25 09:06 TigerInYourDream

Relogin does not restore the dock state. Do you want to fix it in this PR or in another PR?

As I said on the Robrix matrix channel, I don't think re-logging in should restore the dock state by default. Generally, when logging out, most users would expect that the app be "reset" to its default state.

However, in the future, we could show a prompt after log-in asking the user if they want to restore their previous dock/UI layout.

kevinaboos avatar Jun 27 '25 18:06 kevinaboos

Currently, the logout function still logs panic messages, but these do not cause the robrix program to crash.

thread 'tokio-runtime-worker' panicked at /Users/zhangalex/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/deadpool-runtime-0.1.4/src/lib.rs:77:58:
`JoinError` reason is not a panic.: JoinError::Cancelled(Id(18764))

This is because the Matrix SDK does not implement a graceful shutdown mechanism for deadpool-runtime. As a result, after the main loop of robrix's tokio runtime exits, deadpool-runtime continues to run. Eventually, when robrix is terminated with ctrl-c, crash messages appear in the logs:

thread 'main' panicked at /Users/zhangalex/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/deadpool-runtime-0.1.4/src/lib.rs:101:22:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'main' panicked at /Users/zhangalex/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/deadpool-runtime-0.1.4/src/lib.rs:101:22:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
stack backtrace:
   0:        0x1041bef7c - std::backtrace_rs::backtrace::libunwind::trace::hd09570c029a6744a
   ...
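
For reference, this class of panic can be reproduced outside Robrix with a one-liner (a minimal illustration, not the SDK's actual code path):

fn main() {
    // With no Tokio runtime on the current thread, this panics with the same
    // message as above: "there is no reactor running, must be called from
    // the context of a Tokio 1.x runtime".
    tokio::spawn(async {});
}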

@kevinaboos Do we need to report this issue in the matrix repository?

Update: after the refactoring, these exceptions should all be resolved, so there's no need to report them upstream.

ZhangHanDong avatar Jul 15 '25 11:07 ZhangHanDong

It sounds like we still have some major confusion about what is happening after logout, primarily when we're clearing/resetting a lot of key application states.

Please refrain from suggesting or implementing alternative solutions until we all clearly understand what the root/core problem is, i.e., the reason behind these "deadpool" panics. Kindly provide as much detailed information as possible about these panics, including backtraces, any knowledge you have, any discoveries or inferences you have made while debugging it. Also, please survey tokio docs and internet discussion forums for how to properly shut down these worker task pools; I'd be surprised if we're the first people to encounter an issue like this.

If this is a Matrix-specific problem, let's collect the relevant info here, and then we can present it to the Matrix SDK room(s) for them to help. But let's discuss it here first before we go to them. I want to make sure we're asking the right questions.

(Edited to add this):

To me, it feels like this is likely an ordering problem. By that, I mean that you're probably dropping some key Matrix state (e.g., the Client or something) before you have dropped all of the other things that depend on that client to exist. I haven't dug into it, but in general, we need to ensure that we're cleaning things up in the proper order. If that requires us to add this functionality into the Matrix SDK, then we should do that as well, and/or contact the SDK team to ask about this functionality.
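
A tiny illustration of that ordering hazard (NeedsRuntime is hypothetical; deadpool's pool objects appear to hit the same pattern):

struct NeedsRuntime;

impl Drop for NeedsRuntime {
    fn drop(&mut self) {
        // If no Tokio context exists anymore, this panics with
        // "there is no reactor running, ...".
        tokio::spawn(async { /* flush caches, close pools, ... */ });
    }
}

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let guard = rt.enter();  // make the runtime the ambient context
    let obj = NeedsRuntime;
    drop(obj);               // fine: a runtime context still exists
    drop(guard);
    drop(rt);                // runtime torn down first...
    let obj2 = NeedsRuntime;
    drop(obj2);              // ...so this destructor runs with no runtime and panics
}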

kevinaboos avatar Jul 28 '25 20:07 kevinaboos

Logout Issues

Before describing the logout issues, let's define two concepts:

  • logout: clicking the logout button to sign out of the Matrix session from within robrix
  • shut down: clicking the X to exit the robrix process

The Earliest 'deadpool' Issue in Logout

https://github.com/project-robius/robrix/pull/432#discussion_r2171202111 The commit hash at that time was e1f80ea. The key code was as follows:

/// Logs out the current user and prepares the application for a new login session.
///
/// Performs server-side logout, cleans up client state, closes all tabs,
/// and restarts the Matrix runtime. Reports success or failure via LoginAction.
///
/// # Parameters
/// - is_desktop - Boolean indicating if the current UI mode is desktop (true) or mobile (false).
///
/// # Returns
/// - Ok(()) - Logout succeeded (possibly with cleanup warnings)
/// - Err(...) - Logout failed with detailed error
async fn logout_and_refresh(is_desktop: bool) -> Result<()> {
    // Collect all errors encountered during the logout process
    let mut errors = Vec::new();
    log!("Starting logout process...");
    let Some(client) = get_client() else {
        let error_msg = "Logout failed: No active client found";
        log!("Error: {}", error_msg);
        Cx::post_action(LogoutAction::LogoutFailure(error_msg.to_string()));
        return Err(anyhow::anyhow!(error_msg));
    };
    if !client.matrix_auth().logged_in() {
        let error_msg = "Client not logged in, skipping server-side logout";
        log!("Error: {}", error_msg);
        Cx::post_action(LogoutAction::LogoutFailure(error_msg.to_string()));
        return Err(anyhow::anyhow!(error_msg));
    }

    get_sync_service().unwrap().stop().await;

    log!("Performing server-side logout...");
    match tokio::time::timeout(tokio::time::Duration::from_secs(5), client.matrix_auth().logout()).await {
        Ok(Ok(_)) => {
            log!("Server-side logout successful.")
        },
        Ok(Err(e)) => {
            let error_msg = format!("Server-side logout failed: {}. Please try again later", e);
            log!("Error :{}", error_msg);
            Cx::post_action(LogoutAction::LogoutFailure(error_msg.to_string()));
            return Err(anyhow::anyhow!(error_msg));
        },
        Err(_) => {
            let error_msg = "Server-side logout timed out after 5 seconds. Please try again later";
            log!("Error: {}", error_msg);
            Cx::post_action(LogoutAction::LogoutFailure(error_msg.to_string()));
            return Err(anyhow::anyhow!(error_msg));
        },
    }

    // Clean up client state and caches
    log!("Cleaning up client state and caches...");
    CLIENT.lock().unwrap().take();
    SYNC_SERVICE.lock().unwrap().take();
    TOMBSTONED_ROOMS.lock().unwrap().clear();
    IGNORED_USERS.lock().unwrap().clear();
    DEFAULT_SSO_CLIENT.lock().unwrap().take();
    // Note: Taking REQUEST_SENDER closes the channel sender, causing the async_worker task to exit its loop
    // This triggers the "async_worker task ended unexpectedly" error in the monitor task, but this is expected during logout
    REQUEST_SENDER.lock().unwrap().take();
    log!("Client state and caches cleared after successful server logout.");

    // Desktop UI has tabs that must be properly closed, while mobile UI has no tabs concept.
    if is_desktop {
        log!("Requesting to close all tabs in desktop");
        let (tx, rx)  = oneshot::channel::<bool>();
        Cx::post_action(MainDesktopUiAction::CloseAllTabs { on_close_all: tx });
        match rx.await {
            Ok(_) => {
                log!("Received signal that the MainDesktopUI successfully closed all tabs");
            },
            Err(e)=> {
                let error_msg = format!("Close all tab failed {e}");
                log!("Error :{}", error_msg);
                Cx::post_action(LogoutAction::LogoutFailure(error_msg.to_string()));
                return Err(anyhow::anyhow!(error_msg));
            },
        }
    }

    log!("Deleting latest user ID file...");
    // We delete latest_user_id here for the following reasons:
    // 1. we delete the latest user ID such that Robrix won't auto-login the next time it starts,
    // 2. we don't delete the session file, such that the user could re-login using that session in the future.
    if let Err(e) = delete_latest_user_id().await {
        errors.push(e.to_string());
    }

    shutdown_background_tasks().await;
    // Restart the Matrix tokio runtime
    // This is a critical step; failure might prevent future logins
    log!("Restarting Matrix tokio runtime...");
    if start_matrix_tokio().is_err() {
        // Send failure notification and return immediately, as the runtime is fundamental
        let final_error_msg = String::from("Logout succeeded, but Robrix could not re-connect to the Matrix backend. Please exit and restart Robrix");
        Cx::post_action(LogoutAction::LogoutFailure(final_error_msg.clone()));
        return Err(anyhow::anyhow!(final_error_msg));
    }
    log!("Matrix tokio runtime restarted successfully.");

    // --- Final result handling ---
    if errors.is_empty() {
        // Complete success
        log!("Logout process completed successfully.");
        Cx::post_action(LogoutAction::LogoutSuccess);
        Ok(())
    } else {
        // Partial success (server logout ok, but cleanup errors)
        let warning_msg = format!(
            "Logout completed, but some cleanup operations failed: {}",
            errors.join("; ")
        );
        log!("Warning: {}", warning_msg);
        Cx::post_action(LogoutAction::LogoutSuccess);
        Ok(())
    }
}

async fn shutdown_background_tasks() {
    let mut rt_guard = TOKIO_RUNTIME.lock().unwrap();
    if let Some(existing_rt) = rt_guard.take() {
        existing_rt.shutdown_background();
    }
}

The operations that triggered the panic were:

  1. In desktop mode, open multiple rooms.
  2. Switch to mobile mode.
  3. Log out in mobile mode.
  4. Enlarge the window enough to switch back to desktop mode.
  5. Log in again.
  6. (panic)

At that time, two issues were identified:

  • The memory state left by desktop mode was not properly handled during logout in mobile mode.
  • According to the initial report, the program's backtrace showed:

thread 'main' panicked at /Users/alanpoon/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/deadpool-runtime-0.1.4/src/lib.rs:101:22: there is no reactor running, must be called from the context of a Tokio 1.x runtime

The matrix-sdk version at that time was 9676daee5ab864088993f869de52ec5d8b72cce9.

Issue 1 is not very relevant to the current analysis, as the handling omission was fixed in subsequent versions. We'll mainly discuss issue 2.

The core issues are as follows:

1. The most "dangerous" operation in the entire process is closing the tokio runtime and then starting a brand-new one.
2. The panic message originates in deadpool-runtime-0.1.4.

Using Cargo.lock for analysis, we get a dependency chain: matrix-sdk -> matrix-sdk-sqlite -> deadpool-sqlite -> deadpool-runtime (deadpool-sqlite also pulls in rusqlite). Since it's a tokio-related issue, we used tokio-console to analyze the problem.

Note: frame 77 in https://github.com/project-robius/robrix/pull/432#discussion_r2171202111 is related to the room_member_manager in the version at that time. (Alex later updated the code to remove room_member_manager, so this issue no longer exists in subsequent versions.)

I added a relatively long sleep after shutdown_background and before the restart, and observed the asynchronous tasks in tokio-console. I found that after calling shutdown_background, many deadpool-related asynchronous tasks still existed.
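
For anyone reproducing this, tokio-console is typically wired up like so (this is the standard console-subscriber setup, not Robrix-specific code; it also requires building with RUSTFLAGS="--cfg tokio_unstable"):

fn main() {
    // Exposes task data for the `tokio-console` CLI (default 127.0.0.1:6669).
    console_subscriber::init();
    // ... build the Matrix tokio runtime and run the app as usual ...
}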

This is consistent with the tokio documentation: shutdown_background closes the tokio runtime but does not wait for the asynchronous tasks within it to finish. So we conclude that the panic occurs because we closed the tokio runtime while asynchronous tasks within it still existed; with no runtime, those tasks still try to execute, causing the panic.

So our problematic code is:

CLIENT.lock().unwrap().take();
.....
shutdown_background_tasks().await

When we call CLIENT.lock().unwrap().take();, the matrix client starts to destruct, but by the time shutdown_background_tasks runs there is no guarantee that all asynchronous tasks in the client have been fully processed, leading to the panic.
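
A sketch of the fix this conclusion points to (illustrative only; the sleep is a stand-in for the extra cleanup steps the state machine later inserts between these two calls):

async fn drop_client_then_shutdown() {
    CLIENT.lock().unwrap().take(); // begins destructing the matrix client's tasks
    // The more work that happens here (closing tabs, awaiting UI acks, etc.),
    // the fewer SDK tasks are still alive when the runtime goes away.
    tokio::time::sleep(std::time::Duration::from_millis(500)).await;
    shutdown_background_tasks().await; // shuts down a mostly drained runtime
}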

Since the crash happens in deadpool-runtime, I read through the Matrix SDK API looking for cleanup logic that could reclaim deadpool-runtime in advance. Unfortunately, the SDK does not provide an API to reclaim deadpool-runtime proactively.

Design of the Logout State Machine

Note: Strictly speaking, the design of the state machine is not directly related to the crash, but the handling of this crash issue benefits from the design of the logout state machine.

//! The logout process is complex and error-prone due to:
//! - Network operations that can fail or timeout
//! - Resource cleanup that must happen in specific order
//! - UI synchronization across desktop tabs
//! - Matrix SDK objects that can panic during destruction
//! - Need for progress feedback and cancellation support
//!
//! ## State Flow
//!
//! ```text
//! Idle (0%) → PreChecking (10%) → StoppingSyncService (20%) → LoggingOutFromServer (30%)
//!     ↓                                                                    ↓
//!   Failed ←─────────────────────────────────────────────────────── PointOfNoReturn (50%) ⚠️
//!                                                                           ↓
//!                                                                   ClosingTabs (60%) [Desktop Only]
//!                                                                           ↓
//!                                                                   CleaningAppState (70%)
//!                                                                           ↓
//!                                                                   ShuttingDownTasks (80%)
//!                                                                           ↓
//!                                                                   RestartingRuntime (90%)
//!                                                                           ↓
//!                                                                     Completed (100%)
//!                                                                           ↓
//!                                                                        Failed
//! ```
//!
//! ## Critical Design Points

The code above illustrates the state transition diagram of the state machine.

Now the core logout code is as follows:

/// Execute the logout process
    pub async fn execute(&self) -> Result<()> {
        log!("LogoutStateMachine::execute() started");
        
        // Set logout in progress flag
        LOGOUT_IN_PROGRESS.store(true, Ordering::Relaxed);
        
        // Reset global point of no return flag
        LOGOUT_POINT_OF_NO_RETURN.store(false, Ordering::Relaxed);
        
        // Start from Idle state
        self.transition_to(
            LogoutState::PreChecking,
            "Checking prerequisites...".to_string(),
            10
        ).await?;
        
        // Pre-checks
        if let Err(e) = self.perform_prechecks().await {
            self.transition_to(
                LogoutState::Failed(e.clone()),
                format!("Precheck failed: {}", e),
                0
            ).await?;
            self.handle_error(&e).await;
            return Err(anyhow!(e));
        }
        
        // Stop sync service
        self.transition_to(
            LogoutState::StoppingSyncService,
            "Stopping sync service...".to_string(),
            20
        ).await?;
        
        if let Err(e) = self.stop_sync_service().await {
            self.transition_to(
                LogoutState::Failed(e.clone()),
                format!("Failed to stop sync service: {}", e),
                0
            ).await?;
            self.handle_error(&e).await;
            return Err(anyhow!(e));
        }
        
        // Server logout
        self.transition_to(
            LogoutState::LoggingOutFromServer,
            "Logging out from server...".to_string(),
            30
        ).await?;
        
        match self.perform_server_logout().await {
            Ok(_) => {
                self.point_of_no_return.store(true, Ordering::Release);
                LOGOUT_POINT_OF_NO_RETURN.store(true, Ordering::Release);
                self.transition_to(
                    LogoutState::PointOfNoReturn,
                    "Point of no return reached".to_string(),
                    50
                ).await?;
                
                // We delete latest_user_id after reaching LOGOUT_POINT_OF_NO_RETURN:
                // 1. To prevent auto-login with invalid session on next start
                // 2. While keeping session file intact for potential future login
                if let Err(e) = delete_latest_user_id().await {
                    log!("Warning: Failed to delete latest user ID: {}", e);
                }
            }
            Err(e) => {
                // Check if it's an M_UNKNOWN_TOKEN error
                if matches!(&e, LogoutError::Recoverable(RecoverableError::ServerLogoutFailed(msg)) if msg.contains("M_UNKNOWN_TOKEN")) {
                    log!("Token already invalidated, continuing with logout");
                    self.point_of_no_return.store(true, Ordering::Release);
                    LOGOUT_POINT_OF_NO_RETURN.store(true, Ordering::Release);
                    self.transition_to(
                        LogoutState::PointOfNoReturn,
                        "Token already invalidated".to_string(),
                        50
                    ).await?;
                    
                    // Same delete operation as in the success case above
                    if let Err(e) = delete_latest_user_id().await {
                        log!("Warning: Failed to delete latest user ID: {}", e);
                    }
                } else {
                    // Restart sync service since we haven't reached point of no return
                    if let Some(sync_service) = get_sync_service() {
                        sync_service.start().await;
                    }
                    
                    self.transition_to(
                        LogoutState::Failed(e.clone()),
                        format!("Server logout failed: {}", e),
                        0
                    ).await?;
                    self.handle_error(&e).await;
                    return Err(anyhow!(e));
                }
            }
        }
        
        // From here on, all failures are unrecoverable
        
        // Close tabs (desktop only)
        if self.config.is_desktop {
            self.transition_to(
                LogoutState::ClosingTabs,
                "Closing all tabs...".to_string(),
                60
            ).await?;
            
            if let Err(e) = self.close_all_tabs().await {
                let error = LogoutError::Unrecoverable(UnrecoverableError::PostPointOfNoReturnFailure(e.to_string()));
                self.transition_to(
                    LogoutState::Failed(error.clone()),
                    "Failed to close tabs".to_string(),
                    0
                ).await?;
                self.handle_error(&error).await;
                return Err(anyhow!(error));
            }
        }
        
        // Clean app state
        self.transition_to(
            LogoutState::CleaningAppState,
            "Cleaning up application state...".to_string(),
            70
        ).await?;
        
        // All static resources (CLIENT, SYNC_SERVICE, etc.) are defined in the sliding_sync module,
        // so the state machine delegates the cleanup operation to sliding_sync's clean_app_state function
        // rather than accessing these static variables directly from outside the module.
        if let Err(e) = clean_app_state(&self.config).await {
            let error = LogoutError::Unrecoverable(UnrecoverableError::PostPointOfNoReturnFailure(e.to_string()));
            self.transition_to(
                LogoutState::Failed(error.clone()),
                "Failed to clean app state".to_string(),
                0
            ).await?;
            self.handle_error(&error).await;
            return Err(anyhow!(error));
        }
        
        // Shutdown tasks
        self.transition_to(
            LogoutState::ShuttingDownTasks,
            "Shutting down background tasks...".to_string(),
            80
        ).await?;
        
        self.shutdown_background_tasks().await;
        
        // Restart runtime
        self.transition_to(
            LogoutState::RestartingRuntime,
            "Restarting Matrix runtime...".to_string(),
            90
        ).await?;
        
        if let Err(e) = self.restart_runtime().await {
            let error = LogoutError::Unrecoverable(UnrecoverableError::RuntimeRestartFailed);
            self.transition_to(
                LogoutState::Failed(error.clone()),
                format!("Failed to restart runtime: {}", e),
                0
            ).await?;
            self.handle_error(&error).await;
            return Err(anyhow!(error));
        }
        
        // Success!
        self.transition_to(
            LogoutState::Completed,
            "Logout completed successfully".to_string(),
            100
        ).await?;

        // CloseSetting after logout
        Cx::post_action(SettingsAction::CloseSettings);

        // Reset logout in progress flag
        LOGOUT_IN_PROGRESS.store(false, Ordering::Relaxed);
        Cx::post_action(LogoutAction::LogoutSuccess);
        Ok(())
    }

Additionally, here's the supplementary code:

pub async fn clean_app_state(config: &LogoutConfig) -> Result<()> {
    // Clear resources normally, allowing them to be properly dropped
    // This prevents memory leaks when users logout and login again without closing the app
    CLIENT.lock().unwrap().take();
    log!("Client cleared during logout");

    SYNC_SERVICE.lock().unwrap().take();
    log!("Sync service cleared during logout");

    REQUEST_SENDER.lock().unwrap().take();
    log!("Request sender cleared during logout");

    // Only clear collections that don't contain Matrix SDK objects
    TOMBSTONED_ROOMS.lock().unwrap().clear();
    IGNORED_USERS.lock().unwrap().clear();
    ALL_JOINED_ROOMS.lock().unwrap().clear();

    let (tx, rx) = oneshot::channel::<bool>();
    Cx::post_action(LogoutAction::CleanAppState { on_clean_appstate: tx });

    match tokio::time::timeout(config.app_state_cleanup_timeout, rx).await {
        Ok(Ok(_)) => {
            log!("Received signal that app state was cleaned successfully");
            Ok(())
        }
        Ok(Err(e)) => Err(anyhow!("Failed to clean app state: {}", e)),
        Err(_) => Err(anyhow!("Timed out waiting for app state cleanup")),
    }
}

Now, the reclamation of tokio asynchronous tasks, especially the deadpool-runtime tasks inside the Matrix SDK, happens between clean_app_state and shutdown_background_tasks. In the new state machine there are more operations between CLIENT.lock().unwrap().take(); (which starts destructing the asynchronous tasks in the matrix client) and shutdown_background_tasks, giving the program more time for destruction. Note: we still don't have an API to actively reclaim deadpool-runtime in the Matrix SDK.

Now, there are no more panics related to deadpool in the code.

As a comparison:

pub async fn clean_app_state(config: &LogoutConfig) -> Result<()> {
    // Clear resources normally, allowing them to be properly dropped
    // This prevents memory leaks when users logout and login again without closing the app
    CLIENT.lock().unwrap().take();
    log!("Client cleared during logout");

    SYNC_SERVICE.lock().unwrap().take();
    log!("Sync service cleared during logout");

    REQUEST_SENDER.lock().unwrap().take();
    log!("Request sender cleared during logout");

    // Only clear collections that don't contain Matrix SDK objects
    TOMBSTONED_ROOMS.lock().unwrap().clear();
    IGNORED_USERS.lock().unwrap().clear();
    ALL_JOINED_ROOMS.lock().unwrap().clear();

    // let (tx, rx) = oneshot::channel::<bool>();
    // Cx::post_action(LogoutAction::CleanAppState { on_clean_appstate: tx });

    // match tokio::time::timeout(config.app_state_cleanup_timeout, rx).await {
    //     Ok(Ok(_)) => {
    //         log!("Received signal that app state was cleaned successfully");
    //         Ok(())
    //     }
    //     Ok(Err(e)) => Err(anyhow!("Failed to clean app state: {}", e)),
    //     Err(_) => Err(anyhow!("Timed out waiting for app state cleanup")),
    // }

    Ok(())
}

If we comment out the code as shown above, removing this time-consuming operation, and then log out, we can still see panics related to deadpool-runtime.

About Shutdown and Leak

The above is an analysis of logout. Since shutdown is the other way out of the app, I feel obligated to address issues in shutdown (closing the process) as well.

Through the above analysis, I have reached the conclusion:

The deadpool-runtime panic that occurs during the exit process is caused by not fully ending asynchronous tasks before closing the runtime. In other words, it is entirely an issue of destruction order.

In the logout state machine, because we adjusted the order of operations, the client gets more time to destruct, which avoids the deadpool-runtime panic. However, we have no state machine performing such a sequence of operations during shutdown, and no API to pre-release deadpool-runtime in the matrix client. So if we shut robrix down directly, this crash will be attributed to the robrix program.

Therefore, I believe that since we have no way to release these asynchronous tasks, we might as well forget them outright, especially the tokio runtime. Since the program is about to shut down, leaking might be acceptable.

This is the core idea behind using leak during shutdown.
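
What a leak_* helper could look like (leak_runtime and friends are Robrix functions named below; this body is an assumption, not the actual implementation):

fn leak_runtime() {
    if let Some(rt) = TOKIO_RUNTIME.lock().unwrap().take() {
        // Skip the runtime's destructor entirely; since the process is about
        // to exit, the OS reclaims its threads and memory anyway.
        std::mem::forget(rt);
    }
}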

Current Situation

The current situation is a bit interesting. In the current version, I tried:

fn cleanup_before_shutdown(&mut self) {

    log!("Starting shutdown cleanup...");

    // Clear user profile cache first to prevent thread-local destructor issues
    // This must be done before leaking the tokio runtime
    clear_all_caches();
    log!("Cleared user profile cache");

    // Set logout in progress to suppress error messages
    LOGOUT_IN_PROGRESS.store(true, Ordering::Relaxed);

    // Immediately take and leak all resources to prevent any destructors from running
    // This is a controlled leak at shutdown to avoid the deadpool panic

    // Take the runtime first and leak it
    leak_runtime();

    // Take and leak the client
    leak_client();

    // Take and leak the sync service
    leak_sync_service();

    // Take and leak the request sender
    leak_request_sender();

    // Don't clear any collections or caches as they might contain references
    // to Matrix SDK objects that would trigger the deadpool panic
    log!("Shutdown cleanup completed - all resources leaked to prevent panics");
}

This code is in src/app.rs. I now try not to call this method, meaning these resources are not actively leaked during shutdown, and the program still does not crash. (Note that I did encounter crashes in previous versions.)

Here are some possible analyses of the change in execution order:

Previous issue: shutdown → clear_all_caches() → Matrix SDK objects in the cache destruct → access a potentially closed runtime → crash

Current safe process: logout → clean_app_state() → Matrix SDK core resources already cleared → clear_all_caches() → safe shutdown → do nothing → operating system reclaims resources → safe

Key contributions of the state machine:

  1. Separated the handling logic of logout and shutdown
  2. Ensured the correct cleanup order (clear dependents first, then the things they depend on)
  3. Provided clear timing guarantees (perform dangerous operations only when it is safe)

Leaking is indeed a dangerous operation. Given the current behavior, the previous code might have been over-protective, and we might consider removing the related leak code.

TigerInYourDream avatar Jul 31 '25 02:07 TigerInYourDream

Great, glad you were able to find a non-leak solution. I figured it was just an ordering problem.

There are still many outstanding comments from my last review round that haven't been addressed. Can you address each of those first, and then re-request a review from me once everything has been resolved? Thanks.

kevinaboos avatar Aug 02 '25 21:08 kevinaboos

Logout timeout does not behave properly when tested with MITM tools.

https://github.com/user-attachments/assets/8765793f-8d80-4988-966f-501b5b0493a5

  1. Logout progress is stuck at 30%.
  2. The cancel button cannot be pressed while waiting for matrix logout response.

After the logout timeout:

  • Unable to send messages.

alanpoon avatar Aug 05 '25 02:08 alanpoon

(Quoting TigerInYourDream's "Logout Issues" analysis above in full.)

I see that you are making some progress. Just to be clear, CLIENT.lock().unwrap().take(); deallocating the Client does nothing to the underlying inner Matrix client, since Client is a cheap clone of an atomic reference (Arc) to inner_client.
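
In shape, something like this (field names are assumptions for illustration; only the Arc structure is the point):

struct ClientInner; // stand-in for the SDK's actual inner state

#[derive(Clone)]
pub struct Client {
    inner: std::sync::Arc<ClientInner>, // cheap to clone
}

// Taking the global CLIENT drops one Arc handle; the inner client lives on
// until every clone held by spawned tasks has been dropped as well.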

alanpoon avatar Aug 05 '25 03:08 alanpoon


Logout timeout does not behave properly when tested with MITM tools.

Screen.Recording.2025-08-05.at.10.42.44.AM.mov

  1. Logout progress is stuck at 30%.
  2. The cancel button cannot be pressed while waiting for matrix logout response.

After the logout timeout:

  • Unable to send messages.

I have improved the UX: you can now click abort during logout.
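
One way to support that abort is to race the server logout against a cancel signal instead of relying only on the fixed timeout (a sketch; cancel_rx is a hypothetical tokio::sync::oneshot receiver wired to the abort button):

tokio::select! {
    res = client.matrix_auth().logout() => {
        // continue through the state machine with `res` as before
    }
    _ = &mut cancel_rx => {
        log!("Logout aborted by user before the point of no return");
        // restart the sync service and return to the Idle state
    }
}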

TigerInYourDream avatar Aug 07 '25 10:08 TigerInYourDream