multimodal-maestro
Proposed repository structure
Proposed Code Structure
Every prompting pipeline comes with a `prompt_creator` and a `result_processor`. You can manually instantiate instances of those classes or call the `pipeline` function, providing the `name` argument.
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

import numpy as np
import supervision as sv
class BasePromptCreator(ABC):

    @abstractmethod
    def create(self, image: np.ndarray, *args, **kwargs) -> Tuple[np.ndarray, sv.Detections]:
        """
        Create a prompt from an image and additional arguments.

        Args:
            image (np.ndarray): The input image.
            *args, **kwargs: Additional arguments.

        Returns:
            Tuple[np.ndarray, sv.Detections]: A tuple containing a processed image and detections.
        """
        pass


class BaseResultProcessor(ABC):

    @abstractmethod
    def process(self, text: str, marks: sv.Detections, *args, **kwargs) -> Dict[str, str]:
        """
        Process the results with given text and detections.

        Args:
            text (str): The input text.
            marks (sv.Detections): Detections to be used in processing.
            *args, **kwargs: Additional arguments.

        Returns:
            Dict[str, str]: Processed results.
        """
        pass

    @abstractmethod
    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections, *args, **kwargs) -> np.ndarray:
        """
        Visualize the results on an image.

        Args:
            text (str): The input text.
            image (np.ndarray): The input image.
            marks (sv.Detections): Detections to be visualized.
            *args, **kwargs: Additional arguments.

        Returns:
            np.ndarray: The image with visualizations.
        """
        pass
class SamPromptCreator(BasePromptCreator):

    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, sv.Detections]:
        pass


class SamResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> List[str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


class GroundingDinoPromptCreator(BasePromptCreator):

    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, categories: List[str]) -> Tuple[np.ndarray, sv.Detections]:
        pass


class GroundingDinoResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> Dict[str, str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass
PIPELINES = {
    'sam': (SamPromptCreator, SamResultProcessor),
    'grounding-dino': (GroundingDinoPromptCreator, GroundingDinoResultProcessor)
}
def pipeline(name: str, **kwargs) -> Tuple[BasePromptCreator, BaseResultProcessor]:
    """Retrieves the prompt creator and result processor for the specified pipeline.

    Args:
        name (str): The name of the pipeline.
        **kwargs: Additional keyword arguments for initializing the classes.

    Returns:
        Tuple[BasePromptCreator, BaseResultProcessor]: Instances of the prompt creator and result processor.

    Raises:
        ValueError: If the pipeline name is not in the PIPELINES dictionary.
    """
    pipeline_classes = PIPELINES.get(name)
    if pipeline_classes is None:
        raise ValueError(f"Pipeline '{name}' not found. Please choose from {list(PIPELINES.keys())}.")
    PromptCreatorClass, ResultProcessorClass = pipeline_classes
    prompt_creator = PromptCreatorClass(**kwargs)
    result_processor = ResultProcessorClass(**kwargs)
    return prompt_creator, result_processor
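For illustration, the two ways of obtaining a creator/processor pair mentioned above would look like this (a minimal sketch against the stub classes):

# Manual instantiation of a concrete pair:
prompt_creator = SamPromptCreator(device='cuda')
result_processor = SamResultProcessor()

# Equivalent lookup through the registry; note that pipeline() forwards the
# same **kwargs to both classes, so both constructors must accept them:
prompt_creator, result_processor = pipeline('sam', device='cuda')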
Example Usage
LMM inference gets sandwiched between `prompt_creator` and `result_processor` calls.
import cv2

from maestro import pipeline, prompt_gpt4_vision

image = cv2.imread('path/to/image.jpg')

prompt_creator, result_processor = pipeline('sam', device='cuda')
image_prompt, marks = prompt_creator.create(image=image)

text_prompt = 'Find dog.'
api_key = '...'
response = prompt_gpt4_vision(
    text_prompt=text_prompt,
    image_prompt=image_prompt,
    api_key=api_key)

visualization = result_processor.visualize(
    text=response,
    image=image,
    marks=marks)
Looks good as a baseline. I am just wondering whether a change along this theme would be less verbose:
maestro = build_maestro('sam', device='cuda').with("gpt-4")
result = maestro.prompt("Find a dog").with_image(image).visualize()
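One caveat: `with` is a reserved keyword in Python, so a real implementation would need a different method name. A minimal sketch of how such a builder could sit on top of the proposal above (all names here are hypothetical, not an agreed API):

from typing import Callable, Optional, Union

import numpy as np


class Maestro:
    # Hypothetical builder wrapping a creator/processor pair.
    def __init__(self, prompt_creator: BasePromptCreator, result_processor: BaseResultProcessor):
        self.prompt_creator = prompt_creator
        self.result_processor = result_processor
        self.lmm: Optional[Union[str, Callable[[str, np.ndarray], str]]] = None
        self.text: Optional[str] = None
        self.image: Optional[np.ndarray] = None

    def with_lmm(self, lmm: Union[str, Callable[[str, np.ndarray], str]]) -> "Maestro":
        # Accepts a model name like "gpt-4" or a client-provided callable;
        # `with` itself is a reserved keyword, hence `with_lmm`.
        self.lmm = lmm
        return self

    def prompt(self, text: str) -> "Maestro":
        self.text = text
        return self

    def with_image(self, image: np.ndarray) -> "Maestro":
        self.image = image
        return self

    def visualize(self) -> np.ndarray:
        image_prompt, marks = self.prompt_creator.create(image=self.image)
        # String names would be resolved to built-in callables here;
        # this sketch only handles the callable case.
        response = self.lmm(self.text, image_prompt)
        return self.result_processor.visualize(text=response, image=self.image, marks=marks)


def build_maestro(name: str, **kwargs) -> Maestro:
    # Reuses the registry lookup from the proposal above.
    return Maestro(*pipeline(name, **kwargs))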
Naming conventions are still to be agreed on - I would just like to point out that the usage of `prompt_creator` and `result_processor`, with custom things (that cannot be fully custom) in between, may confuse less advanced users - especially since `result_processor` probably assumes some structure of the response that is not guaranteed once a client uses their own logic instead of `prompt_gpt4_vision()`. For more advanced use cases, however, I would allow `.with("gpt-4")` to be replaced with `.with(my_callable)`, where `my_callable` takes agreed parameters and clients can inject their own implementation.
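For instance, the agreed parameters for `my_callable` could mirror `prompt_gpt4_vision()`: the text prompt and the marked-up image in, the raw model response out (a sketch only; the actual contract is still to be decided):

from typing import Callable

import numpy as np

# Hypothetical contract for an injected LMM callable.
LMMCallable = Callable[[str, np.ndarray], str]


def my_callable(text_prompt: str, image_prompt: np.ndarray) -> str:
    # Clients plug their own inference logic in here, e.g. a self-hosted
    # model, instead of the built-in prompt_gpt4_vision().
    ...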
This makes sense to me for set-of-marks style prompts where you're annotating an image.
I think we may want to have some aspirational things that we may implement some day and that we keep in mind as we design the API structure. Some thoughts on potential future directions of exploration:
- Chaining - taking the output of one response, doing another transformation, and passing it back (e.g. "find the dog" -> it finds it -> we crop the photo to isolate the object of interest -> "describe this dog"; see the sketch after this list)
- Few-shot - pulling similar images (and captions/annotations) from a vector DB & passing them along with your prompt to show by example what you want (or "spot the difference" style prompting against a reference image)
- RAG - pulling relevant images from a vector DB to add additional context
- Temporal / Video - to help with e.g. the sports broadcasting example
- Tool use - using another model, like a fine-tuned CNN, to be able to add additional context
- Integration with existing tools like LangChain (so you can e.g. use these prompting techniques as part of agent flows)
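As a rough illustration of the chaining idea from the first bullet, reusing the objects from the usage example above (the crop step and the assumption that the response maps to the first mark are mine):

# First hop: locate the object.
image_prompt, marks = prompt_creator.create(image=image)
response = prompt_gpt4_vision(
    text_prompt='Find the dog.',
    image_prompt=image_prompt,
    api_key=api_key)

# Transformation: crop to the detection the response refers to
# (here naively assumed to be the first mark).
x1, y1, x2, y2 = marks.xyxy[0].astype(int)
cropped = image[y1:y2, x1:x2]

# Second hop: pass the crop back with a follow-up prompt.
description = prompt_gpt4_vision(
    text_prompt='Describe this dog.',
    image_prompt=cropped,
    api_key=api_key)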
Cool! I'll keep that in mind. We had a call with @PawelPeczek-Roboflow. We agreed on the PromptCreator and ResultProcessor structure. Those can encapsulate a lot of the logic you just described. We just need to make sure the top layer allows us to freely pass various arguments. But because we are still not sure what we want to support, we'll add the high-level API at the very end.
We are changing the profile of the project, making these old ideas obsolete.