[Bug]: PDF doesn't get parsed

Open loorisr opened this issue 10 months ago • 2 comments

crawl4ai version

0.4.3b3

Expected Behavior

When crawling a PDF file, I'd like it to return its content.

PS: I don't know if it's a bug or a feature request.

Current Behavior

I've tried with 2 PDFs: the crawl reports success but the content isn't parsed (see the logs).

Is this reproducible?

Yes

Inputs Causing the Bug

https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf
https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf

Steps to Reproduce


Code snippets

My Docker Compose file:

```yaml
services:
  # Local build services for different platforms
  crawl4ai-amd64:
    build:
      context: https://github.com/unclecode/crawl4ai.git
      args:
        PYTHON_VERSION: "3.10"
        INSTALL_TYPE: basic
        ENABLE_GPU: false
    ports:
      - "11235:11235"
      #- "8000:8000"
      #- "9222:9222"
      #- "8080:8080"
    environment:
      - CRAWL4AI_API_TOKEN=api_token
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

OS

Linux

Python version

3.10

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

{"urls":"https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf"}
{
  "status": "completed",
  "result": {
    "url": "https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf",
    "html": "<!DOCTYPE html><html><head></head><body style=\"height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);\"><embed name=\"BEB6FF08FC897DF1EAF51D15CD0A2EC2\" style=\"position:absolute; left: 0; top: 0;\" width=\"100%\" height=\"100%\" src=\"about:blank\" type=\"application/pdf\" internalid=\"BEB6FF08FC897DF1EAF51D15CD0A2EC2\"></body></html>",
    "success": true,
    "cleaned_html": "",
    "media": {
      "images": [],
      "videos": [],
      "audios": []
    },
    "links": {
      "internal": [],
      "external": []
    },
    "downloaded_files": null,
    "screenshot": null,
    "pdf": null,
    "markdown": "\n",
    "markdown_v2": {
      "raw_markdown": "\n",
      "markdown_with_citations": "\n",
      "references_markdown": "\n\n## References\n\n",
      "fit_markdown": "",
      "fit_html": ""
    },
    "fit_markdown": "",
    "fit_html": "",
    "extracted_content": null,
    "metadata": {
      "title": null,
      "description": null,
      "keywords": null,
      "author": null
    },
    "error_message": "",
    "session_id": null,
    "response_headers": {
      "accept-ranges": "bytes",
      "content-length": "63867",
      "content-type": "application/pdf",
      "date": "Fri, 14 Feb 2025 11:31:37 GMT",
      "etag": "\"65d35f77-f97b\"",
      "last-modified": "Mon, 19 Feb 2024 14:02:31 GMT",
      "server": "nginx"
    },
    "status_code": 200,
    "ssl_certificate": null,
    "dispatch_result": null,
    "redirected_url": "https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf"
  }
}
{"urls":"https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf"}

{
  "status": "completed",
  "result": {
    "url": "https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf",
    "html": "<!DOCTYPE html><html><head></head><body style=\"height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);\"><embed name=\"910BD86C54E86D12D86BD6A60E216138\" style=\"position:absolute; left: 0; top: 0;\" width=\"100%\" height=\"100%\" src=\"about:blank\" type=\"application/pdf\" internalid=\"910BD86C54E86D12D86BD6A60E216138\"></body></html>",
    "success": true,
    "cleaned_html": "",
    "media": {
      "images": [],
      "videos": [],
      "audios": []
    },
    "links": {
      "internal": [],
      "external": []
    },
    "downloaded_files": [],
    "screenshot": "",
    "pdf": null,
    "markdown": "\n",
    "markdown_v2": {
      "raw_markdown": "\n",
      "markdown_with_citations": "\n",
      "references_markdown": "\n\n## References\n\n",
      "fit_markdown": "",
      "fit_html": ""
    },
    "fit_markdown": null,
    "fit_html": null,
    "extracted_content": "",
    "metadata": {
      "title": null,
      "description": null,
      "keywords": null,
      "author": null
    },
    "error_message": null,
    "session_id": null,
    "response_headers": {
      "accept-ranges": "bytes",
      "cache-control": "max-age=3600, public",
      "content-length": "226213",
      "content-type": "application/pdf",
      "date": "Fri, 14 Feb 2025 11:30:59 GMT",
      "expires": "Fri, 14 Feb 2025 12:30:59 GMT",
      "last-modified": "Mon, 28 May 2018 04:02:13 GMT",
      "server": "OVHcloud",
      "vary": "Accept-Encoding"
    },
    "status_code": null,
    "ssl_certificate": null,
    "dispatch_result": null,
    "redirected_url": "https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf"
  }
}
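
In both responses the crawl reports `"success": true`, the `content-type` header is `application/pdf`, and `markdown` is only a newline, so the HTML is just the browser's PDF viewer shell. A minimal sketch of a client-side check for that symptom follows. The function name `looks_like_unparsed_pdf` is hypothetical and not part of crawl4ai; the field names are taken from the logs above, and the `dict` branch for `markdown` is an assumption in case a version returns an object there instead of a string.

```python
def looks_like_unparsed_pdf(result: dict) -> bool:
    """Return True when a crawl result reports success for a PDF URL
    but carries no extracted text (hypothetical helper, not crawl4ai API)."""
    headers = result.get("response_headers") or {}

    # Header names are lower-case in the logs; compare case-insensitively anyway.
    content_type = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = str(value).lower()
    is_pdf = content_type.startswith("application/pdf")

    # "markdown" is a plain string in the logs. Assumption: if it is a dict,
    # read its "raw_markdown" field, as in "markdown_v2" above.
    markdown = result.get("markdown") or ""
    if isinstance(markdown, dict):
        markdown = markdown.get("raw_markdown") or ""
    has_text = bool(str(markdown).strip())

    return bool(result.get("success")) and is_pdf and not has_text


# Trimmed copy of the fields from the first logged response.
sample = {
    "success": True,
    "markdown": "\n",
    "response_headers": {"content-type": "application/pdf"},
    "status_code": 200,
}

if looks_like_unparsed_pdf(sample):
    print("PDF was fetched but not parsed")
```

A check like this only flags the problem; it does not extract anything from the PDF.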

loorisr avatar Feb 14 '25 12:02 loorisr

@loorisr Docker is a little behind with respect to the core library. Just a heads-up, we are releasing a new Docker setup in the upcoming versions. I'll also investigate what's going on in this particular case.

aravindkarnam avatar Feb 17 '25 12:02 aravindkarnam

thanks!

loorisr avatar Feb 17 '25 17:02 loorisr

Thanks! Does it work now? I want to crawl a PDF and my log looks like yours. My crawl4ai version is 0.5.0.

ROBODRILL avatar Apr 17 '25 03:04 ROBODRILL

Using the current Docker image, v0.6.0rc1, I get the following output:

    {
      "url": "https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf",
      "html": "<!DOCTYPE html><html><head></head><body style=\"height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);\"><embed name=\"6883732B0D11540CA2051E108DE951B8\" style=\"position:absolute; left: 0; top: 0;\" width=\"100%\" height=\"100%\" src=\"about:blank\" type=\"application/pdf\" internalid=\"6883732B0D11540CA2051E108DE951B8\"></body></html>",
      "success": true,
      "cleaned_html": "",
      "media": {
        "images": [],
        "videos": [],
        "audios": [],
        "tables": []
      }

loorisr avatar Apr 22 '25 18:04 loorisr

@loorisr Thanks for taking the time to verify this. I'll check this out today!

aravindkarnam avatar Apr 24 '25 07:04 aravindkarnam

@loorisr

Hi. The Docker version doesn’t currently support PDF and crawler strategies. We’ll add it to our backlog for future updates.

@aravindkarnam please help add it to our backlog.

ntohidi avatar May 05 '25 11:05 ntohidi

I will close this issue, but feel free to continue the conversation.

ntohidi avatar May 05 '25 11:05 ntohidi