mlx-examples

Allocation error when running generation script

Open · MDR4356 opened this issue on Jul 24, 2024 · 2 comments

I am trying to run a simple generation script that reads its prompt from a file.

import sys

from mlx_lm import load, generate

# Model settings
MODEL_REPO = "mlx-community/Meta-Llama-3.1-70B-4bit"

def process_file(input_file: str, output_file: str):
    print("Loading model...")
    model, tokenizer = load(MODEL_REPO)
    
    print("Reading input file...")
    with open(input_file, 'r') as f:
        prompt = f.read()
    
    print("Generating response...")
    response = generate(model, tokenizer, prompt=prompt, max_tokens=30000, temp=0.7)
    
    print("Saving response to output file...")
    with open(output_file, 'w') as f:
        f.write(response)
    
    print(f"Process complete. Response saved to {output_file}")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python script_name.py input_file output_file")
        sys.exit(1)
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    process_file(input_file, output_file)

I keep getting this error:

Loading model...
Fetching 13 files: 100%|████████████████████| 13/13 [00:00<00:00, 144631.17it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Reading input file...
Generating response...
libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 90935771648 bytes which is greater than the maximum allowed buffer size of 77309411328 bytes.

I'm not sure where those numbers are coming from. In any event, I have 128 GB of unified memory on my MacBook, and I've manually raised the wired limit to about 100 GB using the sysctl iogpu.wired_limit_mb command, thinking that might help, but it doesn't. Any ideas?

MDR4356 · Jul 24 '24 23:07

There is a maximum allowed buffer size for a given machine and you can't have arrays which require more than that. You can check the max_buffer_length in mx.metal.device_info().
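
For reference, you can check it with something like this (a minimal sketch; the key name is the one reported by mx.metal.device_info()):

import mlx.core as mx

# Reports Metal device properties; max_buffer_length is the largest single
# buffer (in bytes) that can be allocated on this machine.
info = mx.metal.device_info()
print(info["max_buffer_length"])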

Presumably you are trying to process a very long prompt. How long is the prompt in your file? MLX LM doesn't handle super long prompts right now. In order to make that work we'll need either or both of:

  • Some kind of strategy for splitting the prompt and processing it in pieces
  • Memory efficient attention

The second is already WIP but I don't have a timeline yet. I can spend some time looking into the first.

awni · Jul 24 '24 23:07

Thanks Awni! That makes sense. Yes, it's long, roughly 20,000 words. I have previously used scripts implementing your first solution. I was just trying to test Llama 3.1's new, longer context window. But I understand now that won't be possible for a bit.

Here, for example, I was trying to summarize a long transcript:

import re
from typing import List, Tuple

def chunk_transcript(file_path: str, target_chunk_size: int = 2500) -> List[Tuple[str, str]]:
    with open(file_path, 'r') as file:
        content = file.read()
    
    # Split the content at "THE COURT:" occurrences, keeping the delimiter
    sections = re.split(r'(?=THE COURT:)', content)
    
    chunks = []
    current_chunk = ""
    current_word_count = 0
    current_context = ""
    
    for section in sections:
        words = section.split()
        word_count = len(words)
        
        if section.strip().startswith("THE COURT:"):
            current_context = "THE COURT:"
        
        if current_word_count + word_count > target_chunk_size and current_chunk:
            chunks.append((current_context, current_chunk.strip()))
            current_chunk = section
            current_word_count = word_count
        else:
            current_chunk += section
            current_word_count += word_count
    
    # Add the last chunk if it's not empty
    if current_chunk.strip():
        chunks.append((current_context, current_chunk.strip()))
    
    return chunks
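
And a rough sketch of the driver on top of it, summarizing each chunk and then combining the partial summaries (the prompt wording and token limits here are just placeholders, not anything from MLX LM itself):

from mlx_lm import load, generate

def summarize_transcript(chunks, model, tokenizer):
    # Summarize each chunk independently so no single prompt exceeds the
    # maximum buffer size, then summarize the partial summaries.
    partial_summaries = []
    for context, text in chunks:
        prompt = f"Summarize this court transcript excerpt ({context}):\n\n{text}\n\nSummary:"
        partial_summaries.append(
            generate(model, tokenizer, prompt=prompt, max_tokens=500, temp=0.7)
        )
    combined = "\n\n".join(partial_summaries)
    final_prompt = f"Combine these partial summaries into one summary:\n\n{combined}\n\nSummary:"
    return generate(model, tokenizer, prompt=final_prompt, max_tokens=1000, temp=0.7)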

MDR4356 · Jul 25 '24 00:07

A recent update to MLX LM can handle much longer prompts now.

You can also play with the --max-kv-size parameter, which trades off memory use and speed against accuracy. I would set it to something high (4096) or None to start and then shrink it (e.g. 1024) if that's too slow / consumes too much RAM.
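
For example, something like this (assuming your installed mlx_lm version forwards max_kv_size through generate; if you use the command-line entry point, the equivalent is the --max-kv-size flag):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-70B-4bit")

with open("input.txt") as f:
    prompt = f.read()

# Capping the KV cache bounds memory growth on long prompts at some cost in
# accuracy; None keeps the full cache.
response = generate(model, tokenizer, prompt=prompt, max_tokens=1000, max_kv_size=1024)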

awni · Aug 17 '24 00:08

Closing for now, let us know if you still have issues with long prompts.

awni · Aug 17 '24 00:08