Describing Images Inline in PDFs for Better RAG
I've been using LLMs to describe images like this, but it only supports passing an image file directly:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
```
However, in many cases my PDF files contain both text and embedded images. I want to extract both and generate inline image descriptions in Markdown to improve retrieval quality in GenAI applications.
Is there a way to achieve something like this?
Original PDF content:
Lorem ipsum dolor sit amet...
<!-- image 1: contains an orange cat -->
Expected Markdown output:
Lorem ipsum dolor sit amet...
The image shows a fat orange cat sleeping on a white background.
How can I process both text and images together like this in a single conversion step?
The issue involves converting a PDF file that contains both text and embedded images into a Markdown document that includes:
The original text content.
Descriptions of each image, placed inline where the image appears.
This is important for retrieval-augmented generation (RAG) scenarios where large PDFs must be semantically enriched without manual intervention.
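To illustrate why inline placement matters for retrieval: most RAG pipelines split documents on paragraph boundaries, so a description that sits next to its surrounding text lands in the same chunk. A minimal paragraph-based chunker (illustrative only, not part of the solution below) makes this concrete:

```python
def chunk_markdown(md_text, max_chars=500):
    """Greedy paragraph-based chunking: image descriptions placed inline
    stay in the same chunk as the text around them."""
    chunks, current = [], ""
    for para in md_text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

With this kind of splitter, a description appended far from its source text would end up in an unrelated chunk, which is exactly what inline placement avoids.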
Below is a complete working Python solution that automates the process:
Dependencies

You will need the following packages:

```bash
pip install pypdf pdf2image openai
```

Also make sure to install Poppler (used by pdf2image) for image rendering:
Windows: https://github.com/oschwartz10612/poppler-windows
macOS: brew install poppler
Linux: sudo apt install poppler-utils
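Since pdf2image fails at runtime when Poppler is not on the PATH, a quick preflight check can save a confusing stack trace later (a small sketch; the helper name is illustrative):

```python
import shutil

def poppler_available():
    """Return True if Poppler's pdftoppm binary is on PATH,
    which is what pdf2image shells out to."""
    return shutil.which("pdftoppm") is not None
```

Call this once at startup and print an installation hint if it returns False.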
Code

```python
import base64

from pypdf import PdfReader
from pdf2image import convert_from_path
from openai import OpenAI

client = OpenAI()
model_name = "gpt-4o"


def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # extract_text() can return None for image-only pages.
        text += (page.extract_text() or "") + "\n\n"
    return text.strip()


def describe_image(image_path):
    # The data URL must carry base64-encoded bytes, not a hex dump.
    with open(image_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("ascii")
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": "Describe this image in one paragraph."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encoded}"}},
            ]}
        ],
    )
    return response.choices[0].message.content.strip()


def convert_pdf_to_markdown(pdf_path, output_md="output.md"):
    text_content = extract_text_from_pdf(pdf_path)
    images = convert_from_path(pdf_path)
    markdown_output = [text_content]

    for idx, image in enumerate(images):
        image_file = f"page_{idx + 1}.png"
        image.save(image_file, "PNG")
        try:
            description = describe_image(image_file)
            markdown_output.append(f"![Page {idx + 1}]({image_file})\n\n{description}")
        except Exception as e:
            markdown_output.append(f"![Page {idx + 1}]({image_file})\n\nFailed to describe image: {e}")

    final_output = "\n\n".join(markdown_output)
    with open(output_md, "w", encoding="utf-8") as f:
        f.write(final_output)
    print(f"Markdown saved to {output_md}")
```
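One caveat: the script above appends all page renders and their descriptions after the full extracted text, not truly inline where each image appears. If per-page placement is needed, one option is to extract text page by page and interleave it with that page's description. A sketch (the helper name is illustrative, and it assumes per-page texts and descriptions have already been collected):

```python
def interleave_pages(page_texts, page_descriptions):
    """Merge per-page text with per-page image descriptions so each
    description lands next to the text of the page it came from."""
    parts = []
    for text, desc in zip(page_texts, page_descriptions):
        parts.append(text)
        if desc:
            # Quote the description so it is visually distinct in the Markdown.
            parts.append(f"> Image description: {desc}")
    return "\n\n".join(parts)
```

This keeps descriptions close to their source pages, which matches the "inline where the image appears" requirement more faithfully.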
Result

This script produces a .md file that contains:
All textual content from the original PDF.
Inline image tags.
AI-generated descriptions of each image, appended after the respective tag.
This provides a one-step conversion pipeline suitable for large documents and improves the usability of MarkItDown in AI-based indexing or summarization tasks.
Let me know if this should be adapted into a built-in function or module, e.g. convert_to_markdown_enriched(), inside the MarkItDown ecosystem.
Note that for PDFs containing only images (no extractable text layer), the text-extraction step currently yields an empty string.
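A possible workaround for that image-only case (a sketch; the helper name is illustrative) is to fall back to the image descriptions alone when no text was extracted, so the output file is never empty:

```python
def enrich_or_fallback(text_content, image_descriptions):
    """Build the final Markdown body; if the PDF had no extractable text,
    emit only the image descriptions instead of a leading empty block."""
    parts = [text_content] if text_content.strip() else []
    parts.extend(image_descriptions)
    return "\n\n".join(parts)
```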
Example file: https://disk.sample.cat/samples/pdf/sample-a4.pdf