
Memory leak and channel closure issues when reusing/dropping Model

Open solaoi opened this issue 1 year ago • 2 comments

Describe the bug

When initializing and dropping the Model repeatedly:

  1. Memory usage continuously increases as GGUF models aren't properly cleaned up
  2. Channel is erroneously closed after the first iteration

Steps to Reproduce

  1. Create a service that initializes and drops the model multiple times
  2. Run the following code:
use anyhow::Result;
use mistralrs::{GgufModelBuilder, PagedAttentionMetaBuilder, TextMessageRole, TextMessages};
use std::time::Duration;
use tokio::time::sleep;

struct ChatService {
    model: Option<mistralrs::Model>,
}

impl ChatService {
    async fn new() -> Result<Self> {
        Ok(Self { model: None })
    }

    async fn initialize_model(&mut self) -> Result<()> {
        self.model = Some(
            GgufModelBuilder::new(
                "gguf_models/mistral_v0.1/",
                vec!["mistral-7b-instruct-v0.1.Q4_K_M.gguf"],
            )
            .with_chat_template("chat_templates/mistral.json")
            .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
            .build()
            .await?,
        );
        Ok(())
    }

    async fn chat(&self, prompt: &str) -> Result<String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, prompt);

        let response = self
            .model
            .as_ref()
            .unwrap()
            .send_chat_request(messages)
            .await?;

        Ok(response.choices[0]
            .message
            .content
            .clone()
            .unwrap_or_default())
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    for i in 0..3 {
        println!("Iteration {}", i);

        let mut service = ChatService::new().await?;
        service.initialize_model().await?;

        let response = service.chat("Write a short greeting").await?;
        println!("Response: {}", response);

        // Model is dropped here, but GGUF remains in memory
        drop(service);

        // Wait to make memory usage observable
        sleep(Duration::from_secs(5)).await;
    }

    Ok(())
}
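
To put numbers on the growth rather than just watching it externally, the process RSS can be printed at the end of each loop iteration. A minimal sketch, assuming the memory-stats crate (it is not part of the Cargo.toml below, and the helper name is made up):

// Hypothetical helper: log resident memory after each iteration of the loop above.
// Assumes `memory-stats = "1"` has been added to [dependencies].
fn log_rss(iteration: usize) {
    if let Some(stats) = memory_stats::memory_stats() {
        println!(
            "Iteration {}: physical = {} MiB, virtual = {} MiB",
            iteration,
            stats.physical_mem / (1024 * 1024),
            stats.virtual_mem / (1024 * 1024)
        );
    } else {
        println!("Iteration {}: could not read memory stats", iteration);
    }
}

Calling this right after the sleep in each iteration makes the per-iteration growth visible in the program output.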

Cargo.toml is here:

[package]
name = "memory_bug_mistral"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
tokio = { version = "1", features = ["full"] }
anyhow = "1.0"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = [
    "metal",
] }
regex="1.10.6"

Observed Behavior

  1. Memory usage increases with each iteration, even after the explicit drop.
  2. After the first iteration, the following error is returned:

Error: Channel was erroneously closed!

Expected Behavior

  1. Memory should be properly freed when model is dropped
  2. Channel should remain functional for subsequent iterations

Latest commit or version

solaoi avatar Oct 19 '24 05:10 solaoi

@EricLBuehler I'm wondering if you have any plans to address this memory management issue in the library? While I could work around it using a web server or child processes for now, I'd like to understand your timeline for implementing a native solution. This would help me decide whether to proceed with a temporary workaround or wait for an official fix. Could you share your thoughts on this?
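
For reference, the child-process workaround would look roughly like the sketch below: run each chat in a short-lived worker process so the OS reclaims the model memory when that process exits. The chat_worker binary and its command-line interface are hypothetical names used only for illustration:

use anyhow::Result;
use tokio::process::Command;

// Sketch of the workaround: spawn a worker process per request (or per batch of
// requests); all GGUF memory is returned to the OS when the worker exits.
// "chat_worker" and its CLI are hypothetical, not part of mistral.rs.
async fn chat_in_child_process(prompt: &str) -> Result<String> {
    let output = Command::new("./target/release/chat_worker")
        .arg(prompt)
        .output()
        .await?;
    anyhow::ensure!(output.status.success(), "worker failed: {}", output.status);
    Ok(String::from_utf8(output.stdout)?)
}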

solaoi avatar Oct 31 '24 04:10 solaoi

Hi everyone,

The memory leak happens because every new engine is numbered starting at 0 instead of the current engine ID + 1. Because of this, the engine cannot terminate correctly.

FIX

File: mistralrs-core/src/engine/mod.rs

Change from:

Self {
    rx,
    pipeline,
    scheduler: config.into_scheduler(),
    id: 0,
    truncate_sequence,
    no_kv_cache,
    prefix_cacher: PrefixCacheManagerV2::new(prefix_cache_n, no_prefix_cache),
    is_debug: DEBUG.load(Ordering::Relaxed),
    disable_eos_stop,
    throughput_logging_enabled,
}

Change to:

Self {
    rx,
    pipeline,
    scheduler: config.into_scheduler(),
    id: ENGINE_ID.fetch_add(1, std::sync::atomic::Ordering::SeqCst),
    truncate_sequence,
    no_kv_cache,
    prefix_cacher: PrefixCacheManagerV2::new(prefix_cache_n, no_prefix_cache),
    is_debug: DEBUG.load(Ordering::Relaxed),
    disable_eos_stop,
    throughput_logging_enabled,
}

Fixes the memory leak.
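
For context, ENGINE_ID is a process-wide atomic counter. A minimal sketch of the pattern (the declaration below is illustrative; the actual definition in mistral.rs may differ):

use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter: each new Engine takes the next ID, so a new engine can
// never be confused with an earlier one that was also numbered 0.
static ENGINE_ID: AtomicUsize = AtomicUsize::new(0);

fn next_engine_id() -> usize {
    // fetch_add returns the previous value, so successive engines get 0, 1, 2, ...
    ENGINE_ID.fetch_add(1, Ordering::SeqCst)
}

With unique IDs, shutdown can be matched to the engine actually being dropped rather than always to id 0, which appears to be what kept the old engine and its loaded weights alive.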

Andrew Lim

andrewlimmer avatar Feb 26 '25 01:02 andrewlimmer

@solaoi are you still having this issue?

I was experiencing something similar, but the issue I had was either caused by my own implementation or has since been fixed by a recent commit.

I tried your reproduction code and was unable to reproduce the bug.

I did change the model, and I am not using a chat template, but everything else is the same.

use anyhow::Result;
use mistralrs::{GgufModelBuilder, PagedAttentionMetaBuilder, TextMessageRole, TextMessages};
use std::time::Duration;
use tokio::time::sleep;

struct ChatService {
    model: Option<mistralrs::Model>,
}

impl ChatService {
    async fn new() -> Result<Self> {
        Ok(Self { model: None })
    }

    async fn initialize_model(&mut self) -> Result<()> {
        self.model = Some(
            GgufModelBuilder::new(
                "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
                vec!["Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"],
            )
            .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
            .build()
            .await?,
        );
        Ok(())
    }

    async fn chat(&self, prompt: &str) -> Result<String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, prompt);

        let response = self
            .model
            .as_ref()
            .unwrap()
            .send_chat_request(messages)
            .await?;

        Ok(response.choices[0]
            .message
            .content
            .clone()
            .unwrap_or_default())
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    for i in 0..3 {
        println!("Iteration {}", i);

        let mut service = ChatService::new().await?;
        service.initialize_model().await?;

        let response = service.chat("Write a short greeting").await?;
        println!("Response: {}", response);

        // Model is dropped here, but GGUF remains in memory
        drop(service);

        // Wait to make memory usage observable
        sleep(Duration::from_secs(5)).await;
    }

    Ok(())
}
Cargo.toml:

[package]
name = "channel_closed"
version = "0.1.0"
edition = "2024"

[dependencies]
anyhow = "1.0.98"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = ["metal"]}
tokio = { version = "1.46.1", features = ["full"] }

Output

Iteration 0
Response: Hello, how are you today?
Iteration 1
Response: Hello, how are you today?
Iteration 2
Response: Hello, how are you today?

I think this issue can potentially be closed.

eldyl avatar Jul 13 '25 06:07 eldyl