
[Bug]: Crawl returning 'str' object has no attribute 'choices'

Open jacobshenn opened this issue 8 months ago · 20 comments

crawl4ai version

0.5.0

Expected Behavior

Return a normal crawl matching my schema.

Current Behavior

I am crawling a set of about 600 links. For some links, the crawl works perfectly, but for others, the crawler returns:

[
  {
    "index": 0,
    "error": true,
    "tags": ["error"],
    "content": "'str' object has no attribute 'choices'"
  }
]

There is no pattern to which links trigger this, which makes me wonder whether it is an API issue. Has anyone seen or encountered this bug?

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets


OS

macOS

Python version

3.13

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

jacobshenn avatar Apr 12 '25 21:04 jacobshenn

Same here. Using deepseek chat API

RaccoonOnion avatar Apr 15 '25 04:04 RaccoonOnion

Same here. Using deepseek chat API

Hey! Thanks for replying. Are you using the DeepSeek API inside crawl4ai, or are you using it standalone?

  • Thanks

jacobshenn avatar Apr 15 '25 18:04 jacobshenn

Same here. Using deepseek chat API

Hey! Thanks for replying. Are you using the DeepSeek API inside crawl4ai, or are you using it standalone?

  • Thanks

Inside LLMExtractionStrategy as:

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="deepseek/deepseek-chat", api_token=os.getenv("DEEPSEEK_API")),
    schema=LeaderboardEntry.model_json_schema(),
    extraction_type="schema",
    instruction=INSTRUCTION_TO_LLM,
    chunk_token_threshold=1000,
    overlap_rate=0.0,
    apply_chunking=True,
    input_format="markdown",
    extra_args={"temperature": 0.0, "max_tokens": 2048},
)

RaccoonOnion avatar Apr 15 '25 18:04 RaccoonOnion

I'm running my APIs through OpenRouter and getting this error.

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="deepseek/deepseek-chat",
        api_token=os.getenv("openrouter"),
        base_url="https://openrouter.ai/api/v1",
    ),
)

Output:

{'index': 0, 'error': True, 'tags': ['error'], 'content': 'litellm.BadRequestError: DeepseekException - {"error":{"message":"deepseek-chat is not a valid model ID","code":400},"user_id":"user_2th1C5iID3WInICREZPY1NCmXhb"}'}

Have you run into this at all?

jacobshenn avatar Apr 15 '25 19:04 jacobshenn
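As an aside, the 400 above points at model routing rather than the 'choices' bug: crawl4ai delegates the LLM call to LiteLLM (hence litellm.BadRequestError), and LiteLLM typically addresses OpenRouter models with an openrouter/ provider prefix rather than a base_url override. A hedged sketch (the env-var name is hypothetical):

import os
from crawl4ai import LLMConfig

# Assumption: LiteLLM's OpenRouter routing via the provider prefix; adjust the
# env-var name to wherever you store your key.
llm_config = LLMConfig(
    provider="openrouter/deepseek/deepseek-chat",
    api_token=os.getenv("OPENROUTER_API_KEY"),
)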

Yes, I'm facing the same issue with a DeepSeek model through the Groq API.

dhruvthak3r avatar Apr 16 '25 11:04 dhruvthak3r

Same issue on version 0.5.0.post8:

    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="gemini/gemini-2.0-flash",
            api_token=API_KEY,
        ),
        schema=SportOffersList.model_json_schema(),
        extraction_type="schema",
        instruction=SPORT_OFFER_DATA_PROMPT,
        # chunk_token_threshold=1000,
        # overlap_rate=0.0,
        # apply_chunking=False,
        input_format="markdown",
        extra_args={"temperature": 0},
        verbose=True,
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.DISABLED,
        # exclude_external_links=True,
        # remove_overlay_elements=True,
    )

It worked with 0.5.0.post4.

Blackvz avatar Apr 18 '25 09:04 Blackvz

Having the same issue. I switched to 0.5.0.post4 as mentioned by @Blackvz and it worked.

garfieldcoked avatar Apr 21 '25 13:04 garfieldcoked

I'm on 0.6.0 (Docker /crawl) and keep getting:

"extracted_content": 
"[\n    {\n        "index": 0,\n        "error": true,\n        "tags": [\n            "error"\n        ],\n        "content": "'str' object has no attribute 'choices'"\n    }\n]",

I've seen this with OpenRouter models and have now tried with Gemini (AI Studio), getting the same result.

This is my request body:

{
    "urls":  ["__url__"],
    "browser_config": {
        "type": "BrowserConfig",
        "params": {
            "headless": true, 
            "viewport": {
                "type": "dict",
                "value": {
                    "width": 1200,
                    "height": 800
                }
            }
        }
    },
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "css_selector": "main",
            "extraction_strategy": {
                "type": "LLMExtractionStrategy",
                "params": {
                   "input_format": "markdown",
                    "llm_config": {
                        "type": "LLMConfig",
                        "params": {
                            "provider": "gemini/gemini-2.5-flash-preview-04-17",
                            "api_token": "__key__"   
                        }
                    },
                    "schema": {
                        "title": "IndexPageLinks",
                        "type": "object",
                        "properties": {
                            "links": {
                                "title": "Links",
                                "type": "array",
                                "description": "List of links found on the page.",
                                "items": {
                                    "type": "string"
                                }
                            }
                        },
                        "required": [
                            "links"
                        ]
                    },
                    "extraction_type": "schema",
                    "instruction": "Scan the content of the page for interesting links. Extract the top 5 most important links found on the page and return them"
                }
            }
        }
    }
}

yoavniran avatar Apr 25 '25 12:04 yoavniran

Same for me.

SECVBulRep avatar Apr 27 '25 11:04 SECVBulRep

Same here. Using

ollama run deepseek-r1

nciefeiniu avatar Apr 28 '25 07:04 nciefeiniu

The same issue, but the reason is different: when I use api.deepseek.com (the official DeepSeek endpoint), the program returns the right result, but deepseek-r1:32b and qwq:32b return this error. My program is crawling data from a PDF URL. So, is the problem the prompt, or the LLM?

ROBODRILL avatar Apr 28 '25 07:04 ROBODRILL

The error occurs because some models return reasoning_content instead of content, which breaks crawl4ai/extraction_strategy.py (response = response.choices[0].message.content). Try using a different model to work around this.

quangvinh2080 avatar Apr 28 '25 11:04 quangvinh2080
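To make the mechanism concrete, here is a minimal runnable sketch of the failure mode (the classes are illustrative stand-ins for LiteLLM's response objects, not the actual crawl4ai code):

import json

# Illustrative stand-ins for LiteLLM's ModelResponse -> Choice -> Message chain.
class _Message:
    content = "some non-JSON reasoning text"

class _Choice:
    message = _Message()

class _ModelResponse:
    choices = [_Choice()]

response = _ModelResponse()                       # what the LLM call returns
response = response.choices[0].message.content    # redefinition: response is now a str

try:
    json.loads(response)                          # fails on malformed model output
except Exception:
    try:
        # The except handler still assumes the original response object:
        response.choices[0].message.content
    except AttributeError as e:
        print(e)                                  # 'str' object has no attribute 'choices'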

Looks like the response variable gets redefined, and that causes an issue with this try-except block.

I was able to resolve this issue for myself by changing the try-except block in .venv\Lib\site-packages\crawl4ai\extraction_strategy.py on line 657.

I just commented out the re-definition and refer to response.choices[0].message.content wherever the redefined response variable was referenced:

            try:
                # response = response.choices[0].message.content
                blocks = None

                if self.force_json_response:
                    blocks = json.loads(response.choices[0].message.content)
                    if isinstance(blocks, dict):
                        # If it has only one key whose value is a list, assign that list to blocks, e.g. {"news": [...]}
                        if len(blocks) == 1 and isinstance(list(blocks.values())[0], list):
                            blocks = list(blocks.values())[0]
                        else:
                            # If it has only one key whose value is not a list, wrap the dict itself, e.g. {"article_id": "1234", ...}
                            blocks = [blocks]
                    elif isinstance(blocks, list):
                        # Already a list; assign it to blocks as-is
                        blocks = blocks
                else: 
                    blocks = extract_xml_data(["blocks"], response.choices[0].message.content)["blocks"]
                    # blocks = extract_xml_data(["blocks"], response)["blocks"]
                    blocks = json.loads(blocks)

                for block in blocks:
                    block["error"] = False
            except Exception:
                parsed, unparsed = split_and_parse_json_objects(
                    response.choices[0].message.content
                )
                blocks = parsed
                if unparsed:
                    blocks.append(
                        {"index": 0, "error": True, "tags": ["error"], "content": unparsed}
                    )

Thanks @quangvinh2080 for the lead.

JWBWork avatar May 05 '25 17:05 JWBWork

Thanks @JWBWork, that was indeed the root cause of my errors too.

I think it should be fine to leave the redefinition of response, as long as the except block simply passes response as the argument to split_and_parse_json_objects:

try:
    response = response.choices[0].message.content
    blocks = None
    ...
except Exception:
    parsed, unparsed = split_and_parse_json_objects(response)
    ...

This scenario assumes that the exception is not raised as part of accessing response.choices[0].message.content. I'm not super familiar with LiteLLM, so feel free to correct me if I'm wrong here.

mbalasz avatar May 06 '25 13:05 mbalasz

@JWBWork's changes fixed the initial problems for me. Yes, if the except block also used response instead of assuming it's still a ModelResponse, that would prevent a second exception from being raised while handling the first.

While debugging my problems, I figured out that my local gemma3 model is returning XML instead of JSON inside <blocks>...</blocks> (despite the prompt saying not to do that), which the blocks = json.loads(blocks) line can't handle.

The way the errors are handled here makes it very difficult to realize that it's actually the LLM returning bad output.

chrisportela avatar May 07 '25 05:05 chrisportela

While debugging my problems, I figured out that my local gemma3 model is returning XML instead of JSON inside <blocks>...</blocks> (despite the prompt saying not to do that), which the blocks = json.loads(blocks) line can't handle.

The way the errors are handled here makes it very difficult to realize that it's actually the LLM returning bad output.

+1 on that. I had the exact same issue with the model's "thinking" response including the <blocks> tag, and only after debugging for a while did I find this root cause.

mbalasz avatar May 07 '25 20:05 mbalasz
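For anyone hitting the same wall, a hedged debugging sketch along these lines makes the root cause visible instead of burying it in the generic error block (parse_blocks is an illustrative helper, not part of crawl4ai):

import json
import re

def parse_blocks(raw: str):
    # Pull the payload out of <blocks>...</blocks>, falling back to the raw text.
    match = re.search(r"<blocks>(.*?)</blocks>", raw, re.DOTALL)
    payload = match.group(1).strip() if match else raw
    try:
        return json.loads(payload)
    except json.JSONDecodeError as exc:
        # Surface the offending text so it's obvious the LLM returned bad data.
        print(f"LLM returned non-JSON blocks ({exc}): {payload[:200]!r}")
        return [{"index": 0, "error": True, "tags": ["error"], "content": payload}]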

@JWBWork, thanks a ton for digging into this! 🙏
I can confirm that the bug reproduces 100% with the docs-example script below (only my OPENAI_API_KEY and the test URL were changed).

# reproduce.py
import os, asyncio
from dotenv import load_dotenv
from pydantic import BaseModel, Field

from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig, LLMConfig,
    CacheMode, BrowserConfig
)
from crawl4ai.extraction_strategy import LLMExtractionStrategy

load_dotenv(".env.txt")  # just holds OPENAI_API_KEY


class EmployeeNamesSchema(BaseModel):
    employee_names: list = Field(..., description="List of employee or owner names")


async def test():
    crawler_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80_000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(
                provider="openai/gpt-4o-mini",
                api_token=os.getenv("OPENAI_API_KEY"),
            ),
            schema=EmployeeNamesSchema.model_json_schema(),
            extraction_type="schema",
            instruction='Extract all employee names from the page and return {"employee_names":[...]}.',
            extra_args={"temperature": 0, "max_tokens": 512},
        ),
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        res = await crawler.arun(
            "https://2jtandartspraktijk.tandartsennet.nl/team/",
            config=crawler_cfg,
        )
        print(res.extracted_content)


if __name__ == "__main__":
    asyncio.run(test())

Result on main:

[
  {
    "index": 0,
    "error": true,
    "tags": ["error"],
    "content": "'str' object has no attribute 'choices'"
  }
]

Result after applying your one-liner from PR #980:

{"employee_names": ["Dr. El Zowini", "T. de Groot", "T. Vorstenbosch", "I. de Jong"]}

So the issue isn't in user code; it's in the current repo head. Could we merge the patch (and maybe cut a quick point-release) so new users don't hit the same wall?

/cc @unclecode for visibility 🚀

JaphetSt avatar May 09 '25 08:05 JaphetSt

Thank you all for your assistance in identifying the root cause of this issue. I will work on a fix, which will be included in our next release. I will most likely add the fix to the May bug-fix branch. I'll keep everyone updated on the progress here.

cc @aravindkarnam

ntohidi avatar May 09 '25 13:05 ntohidi

Chinese discussion forum with a fix for the same problem: https://linux.do/t/topic/561221

n000b3r avatar May 15 '25 03:05 n000b3r

I've resolved the issue. The fix is now available in the 2025-MAY-2 branch.

ntohidi avatar May 16 '25 07:05 ntohidi

@ntohidi, thanks for fixing! Is there a pre-release version I can install with the 2025-MAY-2 code? I'm struggling to get a git branch-based pip installation to work.

richardgirges avatar May 31 '25 13:05 richardgirges
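For reference, a branch-based pip install generally takes a git URL; a hedged one-liner (assuming the 2025-MAY-2 branch lives on the main unclecode/crawl4ai repo):

pip install "git+https://github.com/unclecode/crawl4ai.git@2025-MAY-2"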

@richardgirges This is now merged to the main branch in v0.7!

aravindkarnam avatar Jul 13 '25 13:07 aravindkarnam