
Memory leak and channel closure issues when reusing/dropping Model

Open solaoi opened this issue 1 year ago • 2 comments

Describe the bug

When initializing and dropping the Model repeatedly:

  1. Memory usage continuously increases as GGUF models aren't properly cleaned up
  2. Channel is erroneously closed after the first iteration

Steps to Reproduce

  1. Create a service that initializes and drops the model multiple times
  2. Run the following code:
use anyhow::Result;
use mistralrs::{GgufModelBuilder, PagedAttentionMetaBuilder, TextMessageRole, TextMessages};
use std::time::Duration;
use tokio::time::sleep;

struct ChatService {
    model: Option<mistralrs::Model>,
}

impl ChatService {
    async fn new() -> Result<Self> {
        Ok(Self { model: None })
    }

    async fn initialize_model(&mut self) -> Result<()> {
        self.model = Some(
            GgufModelBuilder::new(
                "gguf_models/mistral_v0.1/",
                vec!["mistral-7b-instruct-v0.1.Q4_K_M.gguf"],
            )
            .with_chat_template("chat_templates/mistral.json")
            .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
            .build()
            .await?,
        );
        Ok(())
    }

    async fn chat(&self, prompt: &str) -> Result<String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, prompt);

        let response = self
            .model
            .as_ref()
            .unwrap()
            .send_chat_request(messages)
            .await?;

        Ok(response.choices[0]
            .message
            .content
            .clone()
            .unwrap_or_default())
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    for i in 0..3 {
        println!("Iteration {}", i);

        let mut service = ChatService::new().await?;
        service.initialize_model().await?;

        let response = service.chat("Write a short greeting").await?;
        println!("Response: {}", response);

        // Model is dropped here, but GGUF remains in memory
        drop(service);

        // Wait to make memory usage observable
        sleep(Duration::from_secs(5)).await;
    }

    Ok(())
}
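
To put numbers on the growth rather than just watching it externally, the process RSS can be printed at the end of each loop iteration. A minimal sketch, assuming the memory-stats crate (it is not part of the Cargo.toml below, and the helper name is made up):

// Hypothetical helper: log resident memory after each iteration of the loop above.
// Assumes `memory-stats = "1"` has been added to [dependencies].
fn log_rss(iteration: usize) {
    if let Some(stats) = memory_stats::memory_stats() {
        println!(
            "Iteration {}: physical = {} MiB, virtual = {} MiB",
            iteration,
            stats.physical_mem / (1024 * 1024),
            stats.virtual_mem / (1024 * 1024)
        );
    } else {
        println!("Iteration {}: could not read memory stats", iteration);
    }
}

Calling this right after the sleep in each iteration makes the per-iteration growth visible in the program output.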

Cargo.toml is here:

[package]
name = "memory_bug_mistral"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
tokio = { version = "1", features = ["full"] }
anyhow = "1.0"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", branch = "master", features = [
    "metal",
] }
regex="1.10.6"

Observed Behavior

  1. Memory usage increases with each iteration, even after the explicit drop.
  2. After the first iteration, the following error is returned:

Error: Channel was erroneously closed!

Expected Behavior

  1. Memory should be properly freed when model is dropped
  2. Channel should remain functional for subsequent iterations

Latest commit or version

solaoi avatar Oct 19 '24 05:10 solaoi

@EricLBuehler I'm wondering if you have any plans to address this memory management issue in the library? While I could work around it using a web server or child processes for now, I'd like to understand your timeline for implementing a native solution. This would help me decide whether to proceed with a temporary workaround or wait for an official fix. Could you share your thoughts on this?
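
For reference, the child-process workaround would look roughly like the sketch below: run each chat in a short-lived worker process so the OS reclaims the model memory when that process exits. The chat_worker binary and its command-line interface are hypothetical names used only for illustration:

use anyhow::Result;
use tokio::process::Command;

// Sketch of the workaround: spawn a worker process per request (or per batch of
// requests); all GGUF memory is returned to the OS when the worker exits.
// "chat_worker" and its CLI are hypothetical, not part of mistral.rs.
async fn chat_in_child_process(prompt: &str) -> Result<String> {
    let output = Command::new("./target/release/chat_worker")
        .arg(prompt)
        .output()
        .await?;
    anyhow::ensure!(output.status.success(), "worker failed: {}", output.status);
    Ok(String::from_utf8(output.stdout)?)
}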

solaoi avatar Oct 31 '24 04:10 solaoi

Hi everyone,

The memory leak happens because every new engine is numbered starting at 0 instead of the current engine ID + 1. Because of this, the engine cannot terminate correctly.

FIX

File: mistralrs-core/src/engine/mod.rs

Change from:

Self {
    rx,
    pipeline,
    scheduler: config.into_scheduler(),
    id: 0,
    truncate_sequence,
    no_kv_cache,
    prefix_cacher: PrefixCacheManagerV2::new(prefix_cache_n, no_prefix_cache),
    is_debug: DEBUG.load(Ordering::Relaxed),
    disable_eos_stop,
    throughput_logging_enabled,
}

Change to:

Self {
    rx,
    pipeline,
    scheduler: config.into_scheduler(),
    id: ENGINE_ID.fetch_add(1, std::sync::atomic::Ordering::SeqCst),
    truncate_sequence,
    no_kv_cache,
    prefix_cacher: PrefixCacheManagerV2::new(prefix_cache_n, no_prefix_cache),
    is_debug: DEBUG.load(Ordering::Relaxed),
    disable_eos_stop,
    throughput_logging_enabled,
}

Fixes the memory leak.
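
For context, ENGINE_ID is a process-wide atomic counter. A minimal sketch of the pattern (the declaration below is illustrative; the actual definition in mistral.rs may differ):

use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter: each new Engine takes the next ID, so a new engine can
// never be confused with an earlier one that was also numbered 0.
static ENGINE_ID: AtomicUsize = AtomicUsize::new(0);

fn next_engine_id() -> usize {
    // fetch_add returns the previous value, so successive engines get 0, 1, 2, ...
    ENGINE_ID.fetch_add(1, Ordering::SeqCst)
}

With unique IDs, shutdown can be matched to the engine actually being dropped rather than always to id 0, which appears to be what kept the old engine and its loaded weights alive.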

Andrew Lim

andrewlimmer avatar Feb 26 '25 01:02 andrewlimmer

@solaoi are you still having this issue?

I was experiencing something similar, but the issue I had was either caused by my own implementation or has since been fixed by a recent commit.

I tried your reproduction code and was unable to reproduce the bug.

I did change the model, and I am not using a chat template, but everything else is the same.

use anyhow::Result;
use mistralrs::{GgufModelBuilder, PagedAttentionMetaBuilder, TextMessageRole, TextMessages};
use std::time::Duration;
use tokio::time::sleep;

struct ChatService {
    model: Option<mistralrs::Model>,
}

impl ChatService {
    async fn new() -> Result<Self> {
        Ok(Self { model: None })
    }

    async fn initialize_model(&mut self) -> Result<()> {
        self.model = Some(
            GgufModelBuilder::new(
                "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
                vec!["Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"],
            )
            .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
            .build()
            .await?,
        );
        Ok(())
    }

    async fn chat(&self, prompt: &str) -> Result<String> {
        let messages = TextMessages::new().add_message(TextMessageRole::User, prompt);

        let response = self
            .model
            .as_ref()
            .unwrap()
            .send_chat_request(messages)
            .await?;

        Ok(response.choices[0]
            .message
            .content
            .clone()
            .unwrap_or_default())
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    for i in 0..3 {
        println!("Iteration {}", i);

        let mut service = ChatService::new().await?;
        service.initialize_model().await?;

        let response = service.chat("Write a short greeting").await?;
        println!("Response: {}", response);

        // Model is dropped here, but GGUF remains in memory
        drop(service);

        // Wait to make memory usage observable
        sleep(Duration::from_secs(5)).await;
    }

    Ok(())
}
Cargo.toml:

[package]
name = "channel_closed"
version = "0.1.0"
edition = "2024"

[dependencies]
anyhow = "1.0.98"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = ["metal"]}
tokio = { version = "1.46.1", features = ["full"] }

Output

Iteration 0
Response: Hello, how are you today?
Iteration 1
Response: Hello, how are you today?
Iteration 2
Response: Hello, how are you today?

I think this issue can potentially be closed.

eldyl avatar Jul 13 '25 06:07 eldyl