Support for pdf content

Open abh1shek-sh opened this issue 11 months ago • 5 comments

As of now, a Message's content can be of type :text, :image_url, or :image.

Would it be possible to also add the types :pdf and :pdf_url? To me, it looks like the process of sending a PDF to the LLM is very similar to that of sending an image, at least in the case of the Gemini APIs.

Here is a code snippet from the Gemini SDK:

from google import genai
from google.genai import types
import pathlib
import httpx

client = genai.Client()

doc_url = "https://discovery.ucl.ac.uk/id/eprint/10089234/1/343019_3_art_0_py4t4l_convrt.pdf"  # Replace with the actual URL of your PDF

# Download the PDF and save it to a local file
filepath = pathlib.Path('file.pdf')
filepath.write_bytes(httpx.get(doc_url).content)

prompt = "Summarize this document"
response = client.models.generate_content(
  model="gemini-1.5-flash",
  contents=[
      types.Part.from_bytes(
        data=filepath.read_bytes(),
        mime_type='application/pdf',
      ),
      prompt])
print(response.text)
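
The similarity between image and PDF handling can be sketched without the SDK. The following is a minimal illustration, not LangChain or Gemini SDK code: `inline_part` is a hypothetical helper, and the `inline_data` / `mime_type` / `data` field names are assumed to follow the Gemini REST request shape. The point is that only the MIME type differs between the two cases.

```python
import base64
import mimetypes


def inline_part(path: str, data: bytes) -> dict:
    """Build an inline-data part from raw bytes (hypothetical helper).

    The same code path serves images and PDFs; only the MIME type changes.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type for {path!r}")
    return {
        "inline_data": {
            "mime_type": mime_type,
            # Binary payloads are sent as base64 text inside the JSON body.
            "data": base64.b64encode(data).decode("ascii"),
        }
    }


# Placeholder bytes stand in for real file contents.
pdf_part = inline_part("file.pdf", b"%PDF-1.4 example bytes")
png_part = inline_part("chart.png", b"\x89PNG example bytes")

print(pdf_part["inline_data"]["mime_type"])
print(png_part["inline_data"]["mime_type"])
```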

Since the PDF parsing libraries in Elixir aren't that great, the general advice is to either use NIFs or rely on third-party services. I think it would be very helpful if this could be implemented.

Thanks!

abh1shek-sh, Mar 11 '25 11:03

There's probably a more general case to consider: OpenAI supports PDFs via general file attachments (see https://platform.openai.com/docs/guides/pdf-files; the relevant bit is below). The Message.ContentPart should probably support attaching arbitrary attributes. It looks like the options are constrained in the models (e.g. the for_api functions in LangChain.ChatModels.ChatOpenAI). Some way of overriding or forcing this would be great (with appropriate warnings!).

curl "https://api.openai.com/v1/responses" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
        "model": "gpt-4o",
        "input": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_file",
                        "filename": "draconomicon.pdf",
                        "file_data": "...base64 encoded PDF bytes here..."
                    },
                    {
                        "type": "input_text",
                        "text": "What is the first dragon in the book?"
                    }
                ]
            }
        ]
    }'
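
The same request body can be assembled programmatically. This is a rough sketch that only mirrors the JSON shape of the curl example above; `build_pdf_request` is a hypothetical helper, and the exact encoding expected in `file_data` (plain base64 versus a prefixed form) is an assumption that should be checked against the OpenAI documentation.

```python
import base64
import json


def build_pdf_request(pdf_bytes: bytes, filename: str, question: str,
                      model: str = "gpt-4o") -> dict:
    """Build a request body with one file part and one text part (hypothetical helper)."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {
        "model": model,
        "input": [
            {
                "role": "user",
                "content": [
                    # Assumption: file_data carries the base64 text, as in the curl example.
                    {"type": "input_file", "filename": filename, "file_data": encoded},
                    {"type": "input_text", "text": question},
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real PDF.
payload = build_pdf_request(
    b"%PDF-1.4 example bytes",
    "draconomicon.pdf",
    "What is the first dragon in the book?",
)
body = json.dumps(payload)  # JSON string suitable for use as the POST body
print(body[:60])
```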

mindok, Mar 13 '25 04:03

I can see a more general, MIME-type-like approach to supporting attachment types. The constraint would be that the model must support the given type.

brainlid, Mar 13 '25 16:03

Sounds good to me. Almost all of the major providers support attachments of various types. A list of supported attachment types could be maintained for each model/provider, I guess.
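
A per-provider list of attachment types could look roughly like the sketch below. Everything here is hypothetical: the `SUPPORTED_ATTACHMENTS` table, the provider keys, and `check_attachment` are illustrative names, and the MIME types listed are placeholders rather than an accurate statement of what any provider accepts.

```python
# Placeholder capability table: the entries are illustrative only and
# do not describe the real capabilities of any provider.
SUPPORTED_ATTACHMENTS = {
    "openai": {"image/png", "image/jpeg", "application/pdf"},
    "gemini": {"image/png", "image/jpeg", "application/pdf"},
    "text_only_model": set(),
}


class UnsupportedAttachment(ValueError):
    """Raised when a provider is not known to accept a given MIME type."""


def check_attachment(provider: str, mime_type: str) -> None:
    """Reject an attachment early instead of letting the API call fail."""
    supported = SUPPORTED_ATTACHMENTS.get(provider)
    if supported is None:
        raise KeyError(f"unknown provider: {provider!r}")
    if mime_type not in supported:
        raise UnsupportedAttachment(
            f"{provider!r} is not listed as supporting {mime_type!r}"
        )


check_attachment("gemini", "application/pdf")  # passes silently

try:
    check_attachment("text_only_model", "application/pdf")
except UnsupportedAttachment as err:
    print(err)
```

Keeping the table as data rather than code would make an override (with a warning) straightforward for types a provider has added but the library has not yet listed.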

abh1shek-sh, Mar 17 '25 06:03

I think at least the OpenAI and Gemini APIs now expect a "message" to be made of multiple "parts", each with its own type and object structure.

Here are OpenAI's input types:

[Image: OpenAI's input types]

Here are OpenAI's message part types:

message
file_search_call
function_call
web_search_call
computer_call
reasoning
image_generation_call
code_interpreter_call
local_shell_call
mcp_call
mcp_list_tools
mcp_approval_request

Basically, these vendors' APIs are increasingly designed for their "agents" and not just "LLMs". See #289.
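
One way to cope with a growing set of part types is to dispatch on each part's `type` field and keep unrecognized parts instead of failing. The sketch below assumes parts arrive as dictionaries with a `type` key; `KNOWN_TYPES` is a deliberately small subset of the list above, and the other fields in the sample data are invented for illustration.

```python
from collections import defaultdict

# A small subset of the part types listed above; extend as support is added.
KNOWN_TYPES = {"message", "function_call", "reasoning"}


def group_parts(parts: list[dict]) -> tuple[dict[str, list[dict]], list[dict]]:
    """Split parts into known types (grouped by type) and unknown leftovers."""
    known: dict[str, list[dict]] = defaultdict(list)
    unknown: list[dict] = []
    for part in parts:
        part_type = part.get("type")
        if part_type in KNOWN_TYPES:
            known[part_type].append(part)
        else:
            # Preserve unrecognized parts so newer API features are not silently dropped.
            unknown.append(part)
    return dict(known), unknown


# Invented sample data; only the "type" values come from the list above.
sample = [
    {"type": "reasoning", "summary": "thinking"},
    {"type": "message", "text": "Hello"},
    {"type": "mcp_call", "name": "lookup"},
]
known, unknown = group_parts(sample)
print(sorted(known))                  # ['message', 'reasoning']
print([p["type"] for p in unknown])   # ['mcp_call']
```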

nileshtrivedi, Jun 06 '25 08:06

Please check out v0.4.0-rc.1. It changes more things to use ContentParts, which have types. PDF is one of those types.

I've used PDFs with ChatAnthropic, but I haven't used them with Gemini personally.

brainlid, Jul 03 '25 21:07