
Describing Images Inline in PDFs for Better RAG

minhnghia2k3 opened this issue on Jul 17, 2025 · 2 comments


I've been using LLMs to describe images like this, but it only supports passing an image directly:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)

However, in many cases my PDF files contain both text and embedded images. I want to extract both and generate inline image descriptions in Markdown to improve retrieval quality in GenAI (RAG) pipelines.

Is there a way to achieve something like this?

Original PDF content:

Lorem ipsum dolor sit amet...  
<!-- image 1 - contains an orange cat -->

Expected Markdown output:

Lorem ipsum dolor sit amet...

The image shows a fat orange cat sleeping on a white background.

How can I process both text and images together like this in a single conversion step?

— minhnghia2k3, Jul 17, 2025

The issue involves converting a PDF file that contains both text and embedded images into a Markdown document that includes:

The original text content.

Descriptions of each image, placed inline where the image appears.

This is important for retrieval-augmented generation (RAG) scenarios where large PDFs must be semantically enriched without manual intervention.

Below is a complete working Python solution that automates the process:

Dependencies

You will need the following packages:

```bash
pip install pypdf pdf2image openai
```

Also make sure to install Poppler (used by pdf2image) to render pages as images; if it does not end up on your PATH, see the snippet after this list:

Windows: https://github.com/oschwartz10612/poppler-windows

macOS: brew install poppler

Linux: sudo apt install poppler-utils
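
If Poppler is installed but not on your PATH (common on Windows), pdf2image accepts an explicit poppler_path argument. A minimal sketch; the install location below is an assumption, adjust it to wherever you extracted Poppler:

```python
from pdf2image import convert_from_path

# Point pdf2image at Poppler's bin directory explicitly (the path is an example, not a requirement).
images = convert_from_path("example.pdf", poppler_path=r"C:\tools\poppler\Library\bin")
```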

Code

```python
import base64

from openai import OpenAI
from pdf2image import convert_from_path
from pypdf import PdfReader

client = OpenAI()
model_name = "gpt-4o"

def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n\n"
    return text.strip()

def describe_image(image_path):
    # Send the image as a base64 data URL and ask for a one-paragraph description.
    with open(image_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this image in one paragraph."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ]}],
    )
    return response.choices[0].message.content.strip()

def convert_pdf_to_markdown(pdf_path, output_md="output.md"):
    text_content = extract_text_from_pdf(pdf_path)
    # convert_from_path rasterizes whole pages, producing one PNG per page.
    images = convert_from_path(pdf_path)
    markdown_output = [text_content]

    for idx, image in enumerate(images):
        image_file = f"page_{idx + 1}.png"
        image.save(image_file, "PNG")
        try:
            description = describe_image(image_file)
            markdown_output.append(f"\n\n![Image {idx + 1}]({image_file})\n\n{description}")
        except Exception as e:
            markdown_output.append(f"\n\n![Image {idx + 1}]({image_file})\n\nFailed to describe image: {e}")

    final_output = "\n\n".join(markdown_output)

    with open(output_md, "w", encoding="utf-8") as f:
        f.write(final_output)

    print(f"Markdown saved to {output_md}")
```

Result

This script produces a .md file that contains:

All textual content from the original PDF.

Inline image tags.

AI-generated descriptions of each image, appended after the respective tag.

This provides a one-step conversion pipeline suitable for large documents and improves the usability of MarkItDown in AI-based indexing or summarization tasks.

Let me know if this should be adapted as a built-in function or module, for example a convert_to_markdown_enriched() helper, inside the MarkItDown ecosystem.
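
For discussion, a rough sketch of how such an API could be invoked; the method name and behaviour below are assumptions based on this proposal, not existing MarkItDown functionality:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Hypothetical: a single call that returns the page text plus inline image descriptions.
result = md.convert_to_markdown_enriched("example.pdf")
print(result.text_content)
```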

— yossefelnggar, Jul 19, 2025

For PDFs that contain only images, the output is currently just an empty string.

Example file: https://disk.sample.cat/samples/pdf/sample-a4.pdf
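
A minimal way to reproduce this, assuming the sample file above has been downloaded locally as sample-a4.pdf:

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("sample-a4.pdf")
print(repr(result.text_content))  # reportedly an empty string for image-only PDFs
```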

— Diluka, Nov 07, 2025