[Bug]: PDF doesn't get parsed
crawl4ai version
0.4.3b3
Expected Behavior
When crawling a PDF file, I'd like it to return its content.
PS: I don't know if it's a bug or a feature request.
Current Behavior
I tried two PDFs. Both crawls report success, but the content is not parsed (see the logs).
Is this reproducible?
Yes
Inputs Causing the Bug
https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf
https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf
Steps to Reproduce
Code snippets
My Docker Compose file:
```yaml
services:
  # Local build services for different platforms
  crawl4ai-amd64:
    build:
      context: https://github.com/unclecode/crawl4ai.git
      args:
        PYTHON_VERSION: "3.10"
        INSTALL_TYPE: basic
        ENABLE_GPU: false
    ports:
      - "11235:11235"
      #- "8000:8000"
      #- "9222:9222"
      #- "8080:8080"
    environment:
      - CRAWL4AI_API_TOKEN=api_token
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
OS
Linux
Python version
3.10
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
{"urls":"https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf"}
{
"status": "completed",
"result": {
"url": "https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf",
"html": "<!DOCTYPE html><html><head></head><body style=\"height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);\"><embed name=\"BEB6FF08FC897DF1EAF51D15CD0A2EC2\" style=\"position:absolute; left: 0; top: 0;\" width=\"100%\" height=\"100%\" src=\"about:blank\" type=\"application/pdf\" internalid=\"BEB6FF08FC897DF1EAF51D15CD0A2EC2\"></body></html>",
"success": true,
"cleaned_html": "",
"media": {
"images": [],
"videos": [],
"audios": []
},
"links": {
"internal": [],
"external": []
},
"downloaded_files": null,
"screenshot": null,
"pdf": null,
"markdown": "\n",
"markdown_v2": {
"raw_markdown": "\n",
"markdown_with_citations": "\n",
"references_markdown": "\n\n## References\n\n",
"fit_markdown": "",
"fit_html": ""
},
"fit_markdown": "",
"fit_html": "",
"extracted_content": null,
"metadata": {
"title": null,
"description": null,
"keywords": null,
"author": null
},
"error_message": "",
"session_id": null,
"response_headers": {
"accept-ranges": "bytes",
"content-length": "63867",
"content-type": "application/pdf",
"date": "Fri, 14 Feb 2025 11:31:37 GMT",
"etag": "\"65d35f77-f97b\"",
"last-modified": "Mon, 19 Feb 2024 14:02:31 GMT",
"server": "nginx"
},
"status_code": 200,
"ssl_certificate": null,
"dispatch_result": null,
"redirected_url": "https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf"
}
}
{"urls":"https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf"}
{
"status": "completed",
"result": {
"url": "https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf",
"html": "<!DOCTYPE html><html><head></head><body style=\"height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);\"><embed name=\"910BD86C54E86D12D86BD6A60E216138\" style=\"position:absolute; left: 0; top: 0;\" width=\"100%\" height=\"100%\" src=\"about:blank\" type=\"application/pdf\" internalid=\"910BD86C54E86D12D86BD6A60E216138\"></body></html>",
"success": true,
"cleaned_html": "",
"media": {
"images": [],
"videos": [],
"audios": []
},
"links": {
"internal": [],
"external": []
},
"downloaded_files": [],
"screenshot": "",
"pdf": null,
"markdown": "\n",
"markdown_v2": {
"raw_markdown": "\n",
"markdown_with_citations": "\n",
"references_markdown": "\n\n## References\n\n",
"fit_markdown": "",
"fit_html": ""
},
"fit_markdown": null,
"fit_html": null,
"extracted_content": "",
"metadata": {
"title": null,
"description": null,
"keywords": null,
"author": null
},
"error_message": null,
"session_id": null,
"response_headers": {
"accept-ranges": "bytes",
"cache-control": "max-age=3600, public",
"content-length": "226213",
"content-type": "application/pdf",
"date": "Fri, 14 Feb 2025 11:30:59 GMT",
"expires": "Fri, 14 Feb 2025 12:30:59 GMT",
"last-modified": "Mon, 28 May 2018 04:02:13 GMT",
"server": "OVHcloud",
"vary": "Accept-Encoding"
},
"status_code": null,
"ssl_certificate": null,
"dispatch_result": null,
"redirected_url": "https://www.electeursenherbe.fr/wp-content/uploads/2018/05/1LaPriseDeDecisionCollective_fichier_test-de-la-nasa.pdf"
}
}
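Both responses above share the same signature: `success` is true, the `content-type` response header is `application/pdf`, and `markdown` contains only whitespace. Here is a minimal client-side sketch for flagging that case, assuming the response shape shown in these logs. The helper name `is_unparsed_pdf` is hypothetical and not part of crawl4ai.

```python
def is_unparsed_pdf(result: dict) -> bool:
    """Return True when a crawl result points at a PDF that yielded no text.

    Hypothetical helper: it only inspects the fields visible in the logs
    above ("response_headers" and "markdown") and assumes "markdown" is a
    string, as it is in those logs.
    """
    headers = result.get("response_headers") or {}

    # Header names are lowercase in the logs; normalise anyway.
    content_type = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = str(value).lower()

    # Drop parameters such as "; charset=..." before comparing.
    is_pdf = content_type.split(";")[0].strip() == "application/pdf"

    markdown = result.get("markdown") or ""
    return is_pdf and not markdown.strip()


# Trimmed copy of the "result" object from the first log above.
sample = {
    "success": True,
    "markdown": "\n",
    "response_headers": {"content-type": "application/pdf"},
}
print(is_unparsed_pdf(sample))
```

A check like this lets a caller treat "success but empty PDF" as a failure instead of silently storing an empty document.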
@loorisr Docker is a little behind the core library. Just a heads-up: we are releasing a new Docker setup in an upcoming version. I'll also investigate what's going on in this particular case.
thanks!
Thanks! Does it work now? I want to crawl a PDF, and my log looks like yours. My crawl4ai version is 0.5.0.
Using the current Docker image, v0.6.0rc1, I get the following output:
{
"url": "https://red.educagri.fr/wp-content/uploads/2014/10/Perdu-sur-la-lune-2.pdf",
"html": "<!DOCTYPE html><html><head></head><body style=\"height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);\"><embed name=\"6883732B0D11540CA2051E108DE951B8\" style=\"position:absolute; left: 0; top: 0;\" width=\"100%\" height=\"100%\" src=\"about:blank\" type=\"application/pdf\" internalid=\"6883732B0D11540CA2051E108DE951B8\"></body></html>",
"success": true,
"cleaned_html": "",
"media": {
"images": [],
"videos": [],
"audios": [],
"tables": []
}
@loorisr Thanks for taking the time to verify this. I'll check this out today!
@loorisr
Hi. The Docker version doesn’t currently support PDF and crawler strategies. We’ll add it to our backlog for future updates.
@aravindkarnam please add it to our backlog.
I will close this issue, but feel free to continue the conversation.
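Until the Docker image handles PDFs, one possible client-side workaround is to download such URLs directly and pass the bytes to a PDF text extractor of your choice. Here is a minimal sketch using only the Python standard library. The names `looks_like_pdf` and `fetch_pdf` are hypothetical, text extraction itself is left out, and nothing here uses the crawl4ai API.

```python
import urllib.request

# Every PDF file starts with this signature, followed by the version number.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload begins with the PDF signature."""
    return data.startswith(PDF_MAGIC)


def fetch_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its bytes, raising if it is not a PDF.

    Hypothetical helper: the User-Agent value is an arbitrary placeholder.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "pdf-fetch-sketch"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError(f"{url} did not return a PDF")
    return data
```

Calling `fetch_pdf` with one of the URLs from "Inputs Causing the Bug" should return the raw PDF bytes, which can then go to whatever extraction library you already use. Checking the signature rather than trusting the `content-type` header guards against servers that return an HTML error page with a PDF content type.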