[Bug]: 422 Unprocessable Entity when processing raw HTML with Docker crawl4ai
crawl4ai version
unclecode/crawl4ai:gpu-arm64 212ac5ff9bb1
Expected Behavior
I get back the structured data extracted from the crawled HTML.
Current Behavior
When I call the crawl4ai crawl API with a URL in the format "raw:<html .... ", I get back the error 422 Unprocessable Entity.
Crawling with regular URLs works fine.
Is this reproducible?
Yes
Inputs Causing the Bug
The JSON POST data is:
`{"urls":"raw:\u003Chtml\u003E\u003Cbody\u003E\u003Ch1\u003EHello, World!\u003C/h1\u003E\u003C/body\u003E\u003C/html\u003E","bypass_cache":"True","CacheMode":"CacheMode.DISABLED","extraction_config":{"type":"llm","extraction_type":"block","input_format":"html","params":{"schema":"\r\n {\r\n \u0022$schema\u0022: \u0022http://json-schema.org/draft-07/schema#\u0022,\r\n \u0022type\u0022: \u0022object\u0022,\r\n \u0022properties\u0022: {\r\n \u0022name\u0022: {\r\n \u0022type\u0022: \u0022string\u0022,\r\n \u0022description\u0022: \u0022The full name of the person.\u0022\r\n },\r\n \u0022psychotherapyModality\u0022: {\r\n \u0022type\u0022: \u0022string\u0022,\r\n \u0022description\u0022: \u0022The psychotherapy modality the person practices.\u0022,\r\n \u0022enum\u0022: [\r\n \u0022ge\u0161talt terapija\u0022,\r\n \u0022integrativna terapija\u0022,\r\n \u0022integrativna relacijska interapija\u0022,\r\n \u0022sistemska terapija\u0022,\r\n \u0022dru\u017Einska terapija\u0022,\r\n \u0022transakcijska analiza\u0022,\r\n \u0022realitetna terapija\u0022,\r\n \u0022kognitivno vedenjska terapija\u0022,\r\n \u0022psihodinamska terapija\u0022,\r\n \u0022psihoanaliti\u010Dna terapija\u0022,\r\n \u0022jungovska terapija\u0022,\r\n \u0022telesna terapija\u0022,\r\n \u0022plesno gibalna terapija\u0022,\r\n \u0022psihodrama\u0022,\r\n \u0022logoterapija\u0022,\r\n \u0022telesno orientirana psihoterapija\u0022,\r\n \u0022drugo\u0022\r\n ]\r\n },\r\n \u0022email\u0022: {\r\n \u0022type\u0022: \u0022string\u0022,\r\n \u0022format\u0022: \u0022email\u0022,\r\n \u0022description\u0022: \u0022The email address of the person.\u0022\r\n },\r\n \u0022phone\u0022: {\r\n \u0022type\u0022: \u0022string\u0022,\r\n \u0022description\u0022: \u0022The phone number of the person.\u0022\r\n },\r\n \u0022address\u0022: {\r\n \u0022type\u0022: \u0022string\u0022,\r\n \u0022description\u0022: \u0022The address of the person.\u0022\r\n },\r\n \r\n \u0022rolesArray\u0022: {\r\n \u0022type\u0022: 
\u0022array\u0022,\r\n \u0022items\u0022: {\r\n \u0022type\u0022: \u0022string\u0022,\r\n \u0022enum\u0022: [\u0022psihoterapevt\u0022, \u0022specializant\u0022, \u0022terapevt sta\u017Eist ZDT\u0022, \u0022mentor ZDT\u0022, \u0022supervizor\u0022],\r\n \u0022description\u0022: \u0022The role of the person, either \u0027psihoterapevt\u0027 or \u0027specializant\u0027.\u0022\r\n }\r\n }\r\n }\r\n },\r\n \u0022required\u0022: [\u0022name\u0022, \u0022psychotherapyModality\u0022, \u0022email\u0022, \u0022phone\u0022, \u0022address\u0022, \u0022role\u0022],\r\n \u0022additionalProperties\u0022: false\r\n }","input_format":"html","provider":"gemini/gemini-2.0-flash-exp","api_token":"xxxx","api_key":"xxxx","instruction":"extract the structured data from this page","set_verbose":"True"}}}`
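For readability, here is a minimal Python sketch of the same kind of request body with the schema trimmed down. The helper names `build_payload` and `decode_escaped` are hypothetical and exist only for illustration. The field names are copied from the payload above, not from the crawl4ai documentation. The sketch also shows that `\u003C` and `\u003E` are ordinary JSON escapes for `<` and `>`, so after parsing the server sees a plain `raw:<html>...` string.

```python
import json


def build_payload(html: str, schema: str) -> dict:
    """Build a request body shaped like the one in this report.

    Hypothetical helper: the keys mirror the reported payload and are
    not taken from the crawl4ai API reference.
    """
    return {
        "urls": "raw:" + html,
        "extraction_config": {
            "type": "llm",
            "params": {
                # The report passes the JSON schema as a string, so we do too.
                "schema": schema,
                "provider": "gemini/gemini-2.0-flash-exp",
                "api_token": "xxxx",
                "instruction": "extract the structured data from this page",
            },
        },
    }


def decode_escaped(json_string_literal: str) -> str:
    """Parse a JSON string literal, resolving \\uXXXX escapes."""
    return json.loads(json_string_literal)


# The escaped and unescaped forms are the same string once parsed.
escaped = '"raw:\\u003Chtml\\u003E"'
print(decode_escaped(escaped))

payload = build_payload(
    "<html><body><h1>Hello, World!</h1></body></html>",
    '{"type": "object", "properties": {"name": {"type": "string"}}}',
)
print(json.dumps(payload, indent=2))
```

Pretty-printing the body this way makes it easier to compare field by field against a request that succeeds with a regular URL.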
Steps to Reproduce
Code snippets
OS
macOS, Docker
Python version
whatever is in docker image
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
2025-02-10 09:45:55 INFO: 172.17.0.1:60318 - "POST /crawl HTTP/1.1" 200 OK
2025-02-10 09:45:55 INFO: 172.17.0.1:60318 - "GET /task/54f4a9db-ff8c-4440-a299-6d18f2645c3f HTTP/1.1" 200 OK
2025-02-10 09:46:00 INFO: 172.17.0.1:64766 - "GET /task/54f4a9db-ff8c-4440-a299-6d18f2645c3f HTTP/1.1" 200 OK
2025-02-10 13:06:45 INFO: 172.17.0.1:56434 - "POST /crawl HTTP/1.1" 422 Unprocessable Entity
The same happens if I use a file:// URL like this: "file:///var/folders/pl/y8ls7xl173z4nj66m_v0r5t80000gn/T/tmpzJadgZ.tmp "
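The access log only shows the status line. The body of a 422 response may say which field was rejected, so it can help to capture it. Below is a minimal sketch using only the Python standard library. `post_json` is a hypothetical helper, and the URL in the usage comment is an assumption: adjust the host and port to match your container.

```python
import json
import urllib.error
import urllib.request


def post_json(url: str, payload: dict, timeout: float = 30.0):
    """POST a JSON body and return (status, body_text).

    Unlike a bare urlopen call, this also returns the body of 4xx/5xx
    responses, which is where a server would put validation details.
    """
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # HTTPError is raised for non-2xx statuses; its body is still readable.
        return err.code, err.read().decode("utf-8")


# Usage (assumed endpoint, not verified against this image):
# status, body = post_json("http://localhost:11235/crawl",
#                          {"urls": "raw:<html><body>Hello</body></html>"})
# print(status, body)
```

Posting the failing payload and a working one through the same helper gives two response bodies that can be compared directly.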
@clarity99 Does this behaviour happen only in Docker, or does it also happen when you use Crawl4AI directly (with the same inputs)?