Batch Processing Feature
Overview
The goal is to add support for efficient batch processing of inputs to the MLX-VLM library. This will let users process multiple images and text prompts in a single batch and receive the corresponding outputs together, improving throughput over one-at-a-time processing.
Use cases:
- Generating captions for a large dataset of images.
- Localizing objects or regions in a batch of images based on textual descriptions.
- Classifying a large number of images into predefined categories, considering accompanying text information.
- Answering questions based on a batch of images (single and multiple question prompts).
- Video processing.
Note: Tag @Blaizzy for code reviews and questions.
Requirements
Support batched inputs:
- Accept a batch of images as input, provided as a list or array of image objects.
- Accept a batch of text prompts as input, provided as a list or array of strings.
- Accept a single text prompt as input, provided as a string, to be applied to every image in the batch (see the normalization sketch below).
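A minimal sketch of how input normalization could work; the helper name `normalize_batch_inputs` and the broadcast rule (one shared prompt reused for every image) are assumptions for illustration, not existing MLX-VLM API:

```python
from typing import Any, List, Union

def normalize_batch_inputs(
    images: List[Any],                  # e.g. PIL images or file paths
    prompts: Union[str, List[str]],
) -> List[str]:
    """Hypothetical helper: align prompts with images one-to-one.

    A single string is broadcast to every image; a list must already
    match the number of images.
    """
    if isinstance(prompts, str):
        prompts = [prompts] * len(images)
    if len(prompts) != len(images):
        raise ValueError(
            f"Got {len(images)} images but {len(prompts)} prompts; "
            "pass one prompt per image or a single shared prompt."
        )
    return prompts
```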
Perform batch processing:
- Process the batch of images and text prompts together with the MLX-VLM model, rather than one input at a time.
- Utilize parallel processing or GPU acceleration to optimize batch processing performance.
- Ensure that the processing of one input in the batch does not affect the processing of other inputs (see the padding sketch below).
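One common way to keep batch items independent is to left-pad token sequences to a shared length and carry a mask so padded positions are ignored. A minimal sketch using `mlx.core`; the function name and the assumption that prompts are already tokenized are mine:

```python
import mlx.core as mx

def pad_and_stack(token_ids_per_prompt, pad_id: int):
    """Left-pad variable-length token id lists into one [batch, seq] array.

    Also returns a mask (1 = real token, 0 = padding) so attention can
    ignore the padding and one row of the batch cannot leak into another.
    """
    max_len = max(len(ids) for ids in token_ids_per_prompt)
    padded, mask = [], []
    for ids in token_ids_per_prompt:
        n_pad = max_len - len(ids)
        padded.append([pad_id] * n_pad + list(ids))
        mask.append([0] * n_pad + [1] * len(ids))
    return mx.array(padded), mx.array(mask)
```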
Generate batched outputs:
- Return the generated outputs for each input in the batch.
- Maintain the order of the outputs corresponding to the order of the inputs.
- Support different output formats such as text, embeddings, or visual representations based on the specific task.
Error handling:
- Handle errors gracefully during batch processing.
- Provide informative error messages for invalid inputs or processing failures.
- Continue processing the remaining inputs in the batch if an error occurs for a specific input (a per-item isolation sketch follows this list).
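A sketch of per-item error isolation; all names here (`BatchItemResult`, `run_with_isolation`) are illustrative, and a real implementation might first attempt the whole batch and only fall back to per-item runs on failure:

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

@dataclass
class BatchItemResult:
    index: int                # position of the input in the batch
    output: Optional[Any]     # generated output, or None on failure
    error: Optional[str]      # error message, or None on success

def run_with_isolation(
    fn: Callable[[Any], Any], inputs: List[Any]
) -> List[BatchItemResult]:
    """Run `fn` on each input, recording failures instead of aborting the batch."""
    results = []
    for i, item in enumerate(inputs):
        try:
            results.append(BatchItemResult(i, fn(item), None))
        except Exception as exc:  # keep going; report the cause per item
            results.append(BatchItemResult(i, None, f"{type(exc).__name__}: {exc}"))
    return results
```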
API design:
- Provide a clear and intuitive API for users to perform batch processing.
- Allow users to specify the maximum batch size supported by their system.
- Provide options to control the batch processing behavior, such as enabling/disabling parallel processing (an API sketch follows this list).
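A hedged sketch of what the public entry point might look like; the name `batch_generate` and every parameter below are proposals, not confirmed MLX-VLM API:

```python
from typing import Any, List, Union

def batch_generate(
    model: Any,
    processor: Any,
    images: List[Any],                  # one image per prompt
    prompts: Union[str, List[str]],     # a single prompt is broadcast
    max_batch_size: int = 8,            # cap per forward pass; tune to memory
    max_tokens: int = 256,
    temperature: float = 0.0,
    parallel: bool = True,              # False falls back to a serial loop
) -> List[str]:
    """Proposed entry point: one completion per input, in input order."""
    ...

# Usage (hypothetical):
# captions = batch_generate(model, processor, images, "Describe this image.")
```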
Documentation and examples:
- Update the library documentation to include information about the batch processing feature.
- Provide code examples demonstrating how to use the batch processing API effectively.
- Include performance benchmarks and guidelines for optimal batch sizes based on system resources.
Implementation
- Modify the existing input handling logic to accept batches of images and text prompts.
- Implement batch processing functionality using parallel processing techniques or GPU acceleration libraries.
- Optimize memory usage and performance for efficient batch processing (see the chunking sketch after this list).
- Update the output generation logic to handle batched outputs and maintain the correct order.
- Implement error handling mechanisms to gracefully handle and report errors during batch processing.
- Design and expose a user-friendly API for performing batch processing.
- Write unit tests to verify the correctness and performance of the batch processing implementation.
- Update the library documentation and provide code examples for using the batch processing feature.
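One way to reconcile large input lists with limited memory is simple chunking; `run_batch` below is a stand-in for whatever function executes a single batched forward pass:

```python
from typing import Any, Callable, Iterable, List, Sequence

def chunked(items: Sequence[Any], size: int) -> Iterable[Sequence[Any]]:
    """Yield successive slices of at most `size` items, preserving order."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process_in_chunks(
    inputs: List[Any],
    run_batch: Callable[[Sequence[Any]], List[Any]],
    max_batch_size: int = 8,
) -> List[Any]:
    """Process a long input list in memory-bounded chunks.

    Because chunks are taken in order and outputs are extended in order,
    result i always corresponds to input i.
    """
    outputs: List[Any] = []
    for chunk in chunked(inputs, max_batch_size):
        outputs.extend(run_batch(chunk))  # one output per item in the chunk
    return outputs
```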
Testing
- Prepare a comprehensive test suite to validate the batch processing functionality.
- Test with different batch sizes and input variations to ensure robustness.
- Verify that the generated outputs match the expected results for each input in the batch (see the test sketch after this list).
- Measure the performance improvement gained by batch processing compared to individual processing.
- Conduct error handling tests to ensure graceful handling of invalid inputs and processing failures.
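A sketch of the key correctness test: under greedy decoding, batched generation should reproduce the one-at-a-time outputs exactly. `batch_generate` and `generate_one` are the hypothetical APIs under test, and `model_and_processor` / `sample_images` are assumed pytest fixtures:

```python
def test_batch_matches_individual(model_and_processor, sample_images):
    """Greedy decoding should make batched and serial outputs identical."""
    model, processor = model_and_processor
    prompt = "Describe this image."

    batched = batch_generate(model, processor, sample_images, prompt,
                             temperature=0.0)
    serial = [generate_one(model, processor, img, prompt, temperature=0.0)
              for img in sample_images]

    assert batched == serial
    assert len(batched) == len(sample_images)  # order and count preserved
```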
Delivery
- Integrate the batch processing feature into the existing MLX-VLM library codebase.
- Ensure backward compatibility with previous versions of the library.
- Provide release notes highlighting the new batch processing capability and any breaking changes.
- Update the library version number following semantic versioning conventions.
- Publish the updated library package to the relevant package repositories or distribution channels.
By implementing this batch processing feature, MLX-VLM will give users the ability to process multiple inputs efficiently in a single pass, improving the performance and usability of the library across a range of vision-language tasks.
I'll take this on for implementation! Hope to meet the standards :)
Here are some extra details: @willccbb
Sorry, just saw this -- will take a swing when #53 is merged.
@willccbb done ✅
#53 is merged
Hey @willccbb, any update on this? Would be super helpful to have
@willccbb doesn't have the bandwidth.
This feature is now open and back in the backlog.
Hi @Blaizzy, I'm interested in implementing the batch processing feature for MLX-VLM.
After reviewing the issue requirements and the existing codebase, I understand this involves:
- Implementing a BatchedKVCache to enable parallel inference
- Adding batch_generate utilities for efficient parallel processing
- Creating new API endpoints for batch operations
- Updating model architectures to handle batched inputs
PR #53, which refactored the KVCache implementation, provides a good foundation for this work.
I plan to implement this in stages:
- First, the core BatchedKVCache class
- Then batch processing utilities
- API endpoints for batch operations
- Finally, updating model architectures to use batched operations (a rough cache sketch follows this list)
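For context, here is a rough sketch of what a BatchedKVCache might look like, modeled loosely on the step-allocated KVCache pattern used in MLX projects but with a leading batch dimension; the shapes, step size, and method signature are my assumptions:

```python
import mlx.core as mx

class BatchedKVCache:
    """Sketch: KV cache whose buffers are [batch, n_kv_heads, seq, head_dim]."""

    def __init__(self, head_dim: int, n_kv_heads: int, batch_size: int, step: int = 256):
        self.head_dim = head_dim
        self.n_kv_heads = n_kv_heads
        self.batch_size = batch_size
        self.step = step        # grow buffers in fixed steps, not per token
        self.keys = None
        self.values = None
        self.offset = 0

    def update_and_fetch(self, keys: mx.array, values: mx.array):
        """Append keys/values for every sequence in the batch at once."""
        prev = self.offset
        needed = prev + keys.shape[2]
        if self.keys is None or needed > self.keys.shape[2]:
            n_steps = (needed + self.step - 1) // self.step
            shape = (self.batch_size, self.n_kv_heads,
                     n_steps * self.step, self.head_dim)
            new_k = mx.zeros(shape, keys.dtype)
            new_v = mx.zeros(shape, values.dtype)
            if self.keys is not None:  # carry over what is already cached
                new_k[..., :prev, :] = self.keys[..., :prev, :]
                new_v[..., :prev, :] = self.values[..., :prev, :]
            self.keys, self.values = new_k, new_v
        self.keys[..., prev:needed, :] = keys
        self.values[..., prev:needed, :] = values
        self.offset = needed
        return self.keys[..., :needed, :], self.values[..., :needed, :]
```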
My implementation will include configurable parameters for batch generation with sensible defaults (sketched after this list):
- Default batch size with options to configure larger sizes for high-memory systems
- Memory management options to balance efficiency vs throughput
- Configuration flags to optimize for different hardware capabilities
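Concretely, I am imagining something like the following; every name and default here is a placeholder to be settled during review:

```python
from dataclasses import dataclass

@dataclass
class BatchConfig:
    """Placeholder knobs for batch generation; defaults are illustrative."""
    max_batch_size: int = 8         # raise on high-memory systems
    max_tokens: int = 256
    prefer_throughput: bool = True  # False trades speed for a smaller footprint
```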
Would you be open to my contribution?
Looking forward to your feedback!
Hi! Any updates on this? :)