
[Bug]: 422 Unprocessable Entity when trying to process raw HTML with Docker crawl4ai

Open clarity99 opened this issue 10 months ago • 2 comments

crawl4ai version

unclecode/crawl4ai:gpu-arm64 212ac5ff9bb1

Expected Behavior

I get back the structured data extracted from the crawled HTML.

Current Behavior

When I call the crawl4ai crawl API with a URL in the format "raw:<html .... ", I get back the error 422 Unprocessable Entity.

Crawling regular URLs works fine.
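
For anyone trying to reproduce this, here is a minimal sketch of the request using only the Python standard library. The `/crawl` path is taken from the server log in this issue; the helper names `build_payload` and `post_crawl` are hypothetical, and the base URL is whatever address your container is published on. It returns the response body even on an error status, on the assumption that the server explains a 422 in the body.

```python
import json
import urllib.error
import urllib.request


def build_payload(html: str) -> dict:
    """Wrap raw HTML in the "raw:" URL form used in this issue."""
    return {
        "urls": "raw:" + html,
        # The other fields from the issue (extraction_config, ...) would go here.
    }


def post_crawl(base_url: str, payload: dict) -> tuple[int, str]:
    """POST the payload to /crawl and return (status code, response body)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Return the error body instead of raising, so a 4xx response
        # can be inspected for details about the rejected request.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling `post_crawl(base_url, build_payload("<html><body><h1>Hello, World!</h1></body></html>"))` and printing the returned body should show whatever explanation the server attaches to the 422, which is more informative than the status line alone.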

Is this reproducible?

Yes

Inputs Causing the Bug

JSON POST data:
`{"urls":"raw:\u003Chtml\u003E\u003Cbody\u003E\u003Ch1\u003EHello, World!\u003C/h1\u003E\u003C/body\u003E\u003C/html\u003E","bypass_cache":"True","CacheMode":"CacheMode.DISABLED","extraction_config":{"type":"llm","extraction_type":"block","input_format":"html","params":{"schema":"\r\n                {\r\n                  \u0022$schema\u0022: \u0022http://json-schema.org/draft-07/schema#\u0022,\r\n                  \u0022type\u0022: \u0022object\u0022,\r\n                  \u0022properties\u0022: {\r\n                    \u0022name\u0022: {\r\n                      \u0022type\u0022: \u0022string\u0022,\r\n                      \u0022description\u0022: \u0022The full name of the person.\u0022\r\n                    },\r\n                   \u0022psychotherapyModality\u0022: {\r\n                        \u0022type\u0022: \u0022string\u0022,\r\n                        \u0022description\u0022: \u0022The psychotherapy modality the person practices.\u0022,\r\n                        \u0022enum\u0022: [\r\n                          \u0022ge\u0161talt terapija\u0022,\r\n                          \u0022integrativna terapija\u0022,\r\n                          \u0022integrativna relacijska interapija\u0022,\r\n                          \u0022sistemska terapija\u0022,\r\n                          \u0022dru\u017Einska terapija\u0022,\r\n                          \u0022transakcijska analiza\u0022,\r\n                          \u0022realitetna terapija\u0022,\r\n                          \u0022kognitivno vedenjska terapija\u0022,\r\n                          \u0022psihodinamska terapija\u0022,\r\n                          \u0022psihoanaliti\u010Dna terapija\u0022,\r\n                          \u0022jungovska terapija\u0022,\r\n                          \u0022telesna terapija\u0022,\r\n                          \u0022plesno gibalna terapija\u0022,\r\n                          \u0022psihodrama\u0022,\r\n                          \u0022logoterapija\u0022,\r\n                        
  \u0022telesno orientirana psihoterapija\u0022,\r\n                          \u0022drugo\u0022\r\n                        ]\r\n                      },\r\n                    \u0022email\u0022: {\r\n                      \u0022type\u0022: \u0022string\u0022,\r\n                      \u0022format\u0022: \u0022email\u0022,\r\n                      \u0022description\u0022: \u0022The email address of the person.\u0022\r\n                    },\r\n                    \u0022phone\u0022: {\r\n                      \u0022type\u0022: \u0022string\u0022,\r\n                      \u0022description\u0022: \u0022The phone number of the person.\u0022\r\n                    },\r\n                    \u0022address\u0022: {\r\n                      \u0022type\u0022: \u0022string\u0022,\r\n                      \u0022description\u0022: \u0022The address of the person.\u0022\r\n                    },\r\n                    \r\n                    \u0022rolesArray\u0022: {\r\n                      \u0022type\u0022: \u0022array\u0022,\r\n                      \u0022items\u0022: {\r\n                        \u0022type\u0022: \u0022string\u0022,\r\n                          \u0022enum\u0022: [\u0022psihoterapevt\u0022, \u0022specializant\u0022, \u0022terapevt sta\u017Eist ZDT\u0022, \u0022mentor ZDT\u0022, \u0022supervizor\u0022],\r\n                          \u0022description\u0022: \u0022The role of the person, either \u0027psihoterapevt\u0027 or \u0027specializant\u0027.\u0022\r\n                        }\r\n                      }\r\n                    }\r\n                  },\r\n                  \u0022required\u0022: [\u0022name\u0022, \u0022psychotherapyModality\u0022, \u0022email\u0022, \u0022phone\u0022, \u0022address\u0022, \u0022role\u0022],\r\n                  \u0022additionalProperties\u0022: false\r\n                }","input_format":"html","provider":"gemini/gemini-2.0-flash-exp","api_token":"xxxx","api_key":"xxxx","instruction":"extract the structured data from this 
page","set_verbose":"True"}}}`

Steps to Reproduce


Code snippets


OS

macOS, Docker

Python version

Whatever is in the Docker image

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

2025-02-10 09:45:55 INFO: 172.17.0.1:60318 - "POST /crawl HTTP/1.1" 200 OK
2025-02-10 09:45:55 INFO: 172.17.0.1:60318 - "GET /task/54f4a9db-ff8c-4440-a299-6d18f2645c3f HTTP/1.1" 200 OK
2025-02-10 09:46:00 INFO: 172.17.0.1:64766 - "GET /task/54f4a9db-ff8c-4440-a299-6d18f2645c3f HTTP/1.1" 200 OK
2025-02-10 13:06:45 INFO: 172.17.0.1:56434 - "POST /crawl HTTP/1.1" 422 Unprocessable Entity
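
The successful requests in the log follow a submit-then-poll pattern: a `POST /crawl` followed by repeated `GET /task/<id>` calls about five seconds apart. A rough sketch of the polling half is below. The `status` key and its `"completed"` / `"failed"` values are assumptions about the response body, not something the log confirms, and `http_get_json` and `poll_task` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable


def http_get_json(base_url: str) -> Callable[[str], dict]:
    """Return a function that GETs base_url + path and decodes the JSON body."""
    def fetch(path: str) -> dict:
        with urllib.request.urlopen(base_url.rstrip("/") + path) as response:
            return json.loads(response.read().decode("utf-8"))
    return fetch


def poll_task(fetch: Callable[[str], dict], task_id: str,
              interval: float = 5.0, max_attempts: int = 12) -> dict:
    """Poll /task/<task_id> until it reports a final status.

    The "status" key and the "completed"/"failed" values are assumptions.
    """
    for attempt in range(max_attempts):
        task = fetch("/task/" + task_id)
        if task.get("status") in ("completed", "failed"):
            return task
        if attempt < max_attempts - 1:
            time.sleep(interval)  # wait before asking again
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Passing the fetch function in as an argument keeps the polling loop independent of the HTTP layer, so it can be exercised with a stub before pointing it at a real container.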

clarity99 • Feb 10 '25 12:02

The same happens if I try to use a file:// URL like this: "file:///var/folders/pl/y8ls7xl173z4nj66m_v0r5t80000gn/T/tmpzJadgZ.tmp "

clarity99 • Feb 10 '25 12:02

@clarity99 Does this behaviour happen only in Docker, or does it also happen when you use Crawl4AI directly (with the same inputs)?

aravindkarnam • Feb 11 '25 06:02
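
To check the question above, the same raw HTML can be passed to the library directly, bypassing the Docker API. The sketch below assumes a locally installed crawl4ai release that exposes `AsyncWebCrawler`, `CrawlerRunConfig` and `CacheMode`; the helpers `to_raw_url` and `crawl_raw_html` are hypothetical names, and the LLM extraction settings from the issue are left out to keep the test minimal.

```python
import asyncio


def to_raw_url(html: str) -> str:
    """Prefix HTML with "raw:" so it is passed as inline content, as in the issue."""
    return "raw:" + html


async def crawl_raw_html(html: str):
    # Imported lazily so the helper above works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

    # Skip the cache, mirroring the bypass_cache intent of the original request.
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=to_raw_url(html), config=config)


# Usage (needs crawl4ai and its browser dependencies installed):
# result = asyncio.run(
#     crawl_raw_html("<html><body><h1>Hello, World!</h1></body></html>")
# )
# print(result.success, result.error_message)
```

If this succeeds locally while the Docker endpoint still answers 422, the problem is more likely in how the request body is validated by the API than in the raw HTML handling itself.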