multimodal-maestro
Proposed repository structure
Proposed Code Structure
Every prompting pipeline comes with a `prompt_creator` and a `result_processor`. You can manually instantiate instances of those classes or call the `pipeline` function, providing the `name` argument.
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

import numpy as np
import supervision as sv
class BasePromptCreator(ABC):

    @abstractmethod
    def create(self, image: np.ndarray, *args, **kwargs) -> Tuple[np.ndarray, sv.Detections]:
        """
        Create a prompt from an image and additional arguments.

        Args:
            image (np.ndarray): The input image.
            *args, **kwargs: Additional arguments.

        Returns:
            Tuple[np.ndarray, sv.Detections]: A tuple containing a processed image and detections.
        """
        pass


class BaseResultProcessor(ABC):

    @abstractmethod
    def process(self, text: str, marks: sv.Detections, *args, **kwargs) -> Dict[str, str]:
        """
        Process the results with given text and detections.

        Args:
            text (str): The input text.
            marks (sv.Detections): Detections to be used in processing.
            *args, **kwargs: Additional arguments.

        Returns:
            Dict[str, str]: Processed results.
        """
        pass

    @abstractmethod
    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections, *args, **kwargs) -> np.ndarray:
        """
        Visualize the results on an image.

        Args:
            text (str): The input text.
            image (np.ndarray): The input image.
            marks (sv.Detections): Detections to be visualized.
            *args, **kwargs: Additional arguments.

        Returns:
            np.ndarray: The image with visualizations.
        """
        pass
class SamPromptCreator(BasePromptCreator):

    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, sv.Detections]:
        pass


class SamResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> List[str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


class GroundingDinoPromptCreator(BasePromptCreator):

    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, categories: List[str]) -> Tuple[np.ndarray, sv.Detections]:
        pass


class GroundingDinoResultProcessor(BaseResultProcessor):

    def process(self, text: str, marks: sv.Detections) -> Dict[str, str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass
PIPELINES = {
    'sam': (SamPromptCreator, SamResultProcessor),
    'grounding-dino': (GroundingDinoPromptCreator, GroundingDinoResultProcessor)
}
def pipeline(name: str, **kwargs) -> Tuple[BasePromptCreator, BaseResultProcessor]:
    """Retrieves the prompt creator and result processor for the specified pipeline.

    Args:
        name (str): The name of the pipeline.
        **kwargs: Additional keyword arguments for initializing the classes.

    Returns:
        Tuple[BasePromptCreator, BaseResultProcessor]: Instances of the prompt creator and result processor.

    Raises:
        ValueError: If the pipeline name is not in the PIPELINES dictionary.
    """
    pipeline_classes = PIPELINES.get(name)
    if pipeline_classes is None:
        raise ValueError(f"Pipeline '{name}' not found. Please choose from {list(PIPELINES.keys())}.")
    PromptCreatorClass, ResultProcessorClass = pipeline_classes
    prompt_creator = PromptCreatorClass(**kwargs)
    result_processor = ResultProcessorClass(**kwargs)
    return prompt_creator, result_processor
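For illustration, the two ways of obtaining a creator/processor pair mentioned above would look like this (a minimal sketch against the stub classes):

# Manual instantiation of a concrete pair:
prompt_creator = SamPromptCreator(device='cuda')
result_processor = SamResultProcessor()

# Equivalent lookup through the registry; note that pipeline() forwards the
# same **kwargs to both classes, so both constructors must accept them:
prompt_creator, result_processor = pipeline('sam', device='cuda')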
Example Usage
LMM inference gets sandwiched between `prompt_creator` and `result_processor` calls.
import cv2

from maestro import pipeline, prompt_gpt4_vision

image = cv2.imread('path/to/image.jpg')

prompt_creator, result_processor = pipeline('sam', device='cuda')
image_prompt, marks = prompt_creator.create(image=image)

text_prompt = 'Find dog.'
api_key = '...'
response = prompt_gpt4_vision(
    text_prompt=text_prompt,
    image_prompt=image_prompt,
    api_key=api_key)

visualization = result_processor.visualize(
    text=response,
    image=image,
    marks=marks)
Looks good as a baseline. I am just wondering whether a change along this theme would be less verbose:
maestro = build_maestro('sam', device='cuda').with("gpt-4")
result = maestro.prompt("Find a dog").with_image(image).visualize()
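One caveat: `with` is a reserved keyword in Python, so a real implementation would need a different method name. A minimal sketch of how such a builder could sit on top of the proposal above (all names here are hypothetical, not an agreed API):

from typing import Callable, Optional, Union

import numpy as np


class Maestro:
    # Hypothetical builder wrapping a creator/processor pair.
    def __init__(self, prompt_creator: BasePromptCreator, result_processor: BaseResultProcessor):
        self.prompt_creator = prompt_creator
        self.result_processor = result_processor
        self.lmm: Optional[Union[str, Callable[[str, np.ndarray], str]]] = None
        self.text: Optional[str] = None
        self.image: Optional[np.ndarray] = None

    def with_lmm(self, lmm: Union[str, Callable[[str, np.ndarray], str]]) -> "Maestro":
        # Accepts a model name like "gpt-4" or a client-provided callable;
        # `with` itself is a reserved keyword, hence `with_lmm`.
        self.lmm = lmm
        return self

    def prompt(self, text: str) -> "Maestro":
        self.text = text
        return self

    def with_image(self, image: np.ndarray) -> "Maestro":
        self.image = image
        return self

    def visualize(self) -> np.ndarray:
        image_prompt, marks = self.prompt_creator.create(image=self.image)
        # String names would be resolved to built-in callables here;
        # this sketch only handles the callable case.
        response = self.lmm(self.text, image_prompt)
        return self.result_processor.visualize(text=response, image=self.image, marks=marks)


def build_maestro(name: str, **kwargs) -> Maestro:
    # Reuses the registry lookup from the proposal above.
    return Maestro(*pipeline(name, **kwargs))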
Naming conventions are still to be agreed on - I would just like to point out that the usage of `prompt_creator` and `result_processor`, with custom things (that cannot be fully custom) in between, may confuse less advanced users - especially since `result_processor` probably assumes some structure of the response that is not guaranteed once a client uses their own logic instead of `prompt_gpt4_vision()`. For more advanced use cases, however, I would allow `.with("gpt-4")` to be replaced with `.with(my_callable)`, where `my_callable` takes agreed parameters and clients can inject their own implementation.
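For instance, the agreed parameters for `my_callable` could mirror `prompt_gpt4_vision()`: the text prompt and the marked-up image in, the raw model response out (a sketch only; the actual contract is still to be decided):

from typing import Callable

import numpy as np

# Hypothetical contract for an injected LMM callable.
LMMCallable = Callable[[str, np.ndarray], str]


def my_callable(text_prompt: str, image_prompt: np.ndarray) -> str:
    # Clients plug their own inference logic in here, e.g. a self-hosted
    # model, instead of the built-in prompt_gpt4_vision().
    ...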
This makes sense to me for set-of-marks style prompts where you're annotating an image.
I think we may want to have some aspirational things that we may implement some day and that we keep in mind as we design the API structure. Some thoughts on potential future directions of exploration:
- Chaining - taking the output of one response, doing another transformation, and passing it back (e.g. "find the dog" -> it finds it -> we crop the photo to isolate the object of interest -> "describe this dog"; see the sketch after this list)
- Few-shot - pulling similar images (and captions/annotations) from a vector DB & passing them along with your prompt to show by example what you want (or "spot the difference" style prompting against a reference image)
- RAG - pulling relevant images from a vector DB to add additional context
- Temporal / Video - to help with e.g. the sports broadcasting example
- Tool use - using another model, like a fine-tuned CNN, to be able to add additional context
- Integration with existing tools like LangChain (so you can e.g. use these prompting techniques as part of agent flows)
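As a rough illustration of the chaining idea from the first bullet, reusing the objects from the usage example above (the crop step and the assumption that the response maps to the first mark are mine):

# First hop: locate the object.
image_prompt, marks = prompt_creator.create(image=image)
response = prompt_gpt4_vision(
    text_prompt='Find the dog.',
    image_prompt=image_prompt,
    api_key=api_key)

# Transformation: crop to the detection the response refers to
# (here naively assumed to be the first mark).
x1, y1, x2, y2 = marks.xyxy[0].astype(int)
cropped = image[y1:y2, x1:x2]

# Second hop: pass the crop back with a follow-up prompt.
description = prompt_gpt4_vision(
    text_prompt='Describe this dog.',
    image_prompt=cropped,
    api_key=api_key)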
Cool! I'll keep that in mind. We had a call with @PawelPeczek-Roboflow. We agreed on the PromptCreator and ResultProcessor structure. Those can encapsulate a lot of the logic you just described. We just need to make sure the top layer allows us to freely pass various arguments. But because we are still not sure what we want to support, we'll add the high-level API at the very end.
We are changing the profile of the project, making these old ideas obsolete.