
[Bug]: LLMExtractionStrategy rate limit results in 'no attribute usage' error

Open · stevenh opened this issue 7 months ago · 1 comment

crawl4ai version

0.5.0.post8

Expected Behavior

If the rate limit is hit, the user should be informed.

Current Behavior

When the rate limit is still being hit after all retries, perform_completion_with_backoff returns a list, which LLMExtractionStrategy.extract does not handle: it tries to access the usage field, which does not exist on a list, and extracted_content ends up as:

[
    {
        "index": 0,
        "error": true,
        "tags": [
            "error"
        ],
        "content": "\'list\' object has no attribute \'usage\'"
    }
]
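
Until the strategy propagates the original error, a caller can at least detect this error marker instead of treating it as extracted data. A minimal sketch based on the payload above (the helper name is ours, not part of crawl4ai):

import json

def raise_on_extraction_error(extracted_content: str) -> list[dict]:
    """Parse extracted_content and fail loudly if the strategy returned an error block."""
    blocks = json.loads(extracted_content)
    errors = [b for b in blocks if isinstance(b, dict) and b.get("error")]
    if errors:
        # Surfaces the underlying message, e.g. "'list' object has no attribute 'usage'".
        raise RuntimeError(f"LLM extraction failed: {errors[0].get('content')}")
    return blocks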

Is this reproducible?

Yes

Inputs Causing the Bug

Any request that hits the rate limit for more than 2 retries.

Steps to Reproduce

Perform a crawl using LLMExtractionStrategy.

Code snippets

"""Test LLM extraction strategy for job postings."""

import json
import logging
import os
import sys
from typing import TYPE_CHECKING, Any

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.async_configs import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import pytest

if TYPE_CHECKING:
    from crawl4ai.models import CrawlResult

_LOGGER: logging.Logger = logging.getLogger(__name__)


class JobRequirement(BaseModel):
    """Schema for job requirements."""

    category: str = Field(
        description="Category of the requirement (e.g., Technical, Soft Skills)",
    )
    items: list[str] = Field(
        description="List of specific requirements in this category",
    )
    priority: str = Field(
        description="Priority level (Required/Preferred) based on the HTML class or context",
    )


class JobPosting(BaseModel):
    """Schema for job postings."""

    title: str = Field(description="Job title")
    department: str = Field(description="Department or team")
    location: str = Field(description="Job location, including remote options")
    salary_range: str | None = Field(description="Salary range if specified")
    requirements: list[JobRequirement] = Field(
        description="Categorized job requirements",
    )
    application_deadline: str | None = Field(
        description="Application deadline if specified",
    )
    contact_info: dict | None = Field(
        description="Contact information from footer or contact section",
    )


@pytest.mark.asyncio
async def test_llm_extraction() -> None:
    """Crawl job postings and extract details."""
    api_key: str | None = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        msg: str = "OPENAI_API_KEY environment variable not set"
        raise ValueError(msg)

    browser_config: BrowserConfig = BrowserConfig(
        verbose=False,
        extra_args=[
            "--disable-gpu",
            "--disable-dev-shm-usage",
            "--no-sandbox",
        ],
    )

    extraction_strategy: LLMExtractionStrategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o",
            api_token=api_key,
        ),
        schema=JobPosting.model_json_schema(),
        extraction_type="schema",
        instruction="""
        Extract job posting details, using HTML structure to:
        1. Identify requirement priorities from CSS classes (e.g., 'required' vs 'preferred')
        2. Extract contact info from the page footer or dedicated contact section
        3. Parse salary information from specially formatted elements
        4. Determine application deadline from timestamp or date elements

        Use HTML attributes and classes to enhance extraction accuracy.
        """,
        input_format="html",
        # chunk_token_threshold=chunk_token_threshold,
    )

    config: CrawlerRunConfig = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result: CrawlResult
        async for result in await crawler.arun_many(
            urls=[
                "https://www.rocketscience.gg/careers/c77fbdec-fce6-44f1-a05e-8cd76325a1a0/",
            ],
            config=config,
        ):
            assert result.success
            assert result.extracted_content
            extracted_content: list[dict[str, Any]] = json.loads(result.extracted_content)
            assert len(extracted_content) == 1


if __name__ == "__main__":
    import subprocess

    sys.exit(subprocess.call(["pytest", *sys.argv[1:], sys.argv[0]]))  # noqa: S603, S607
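
The rate limit in this case is a per-request token limit (the logs below show 75303 tokens requested against a 30000 TPM limit), so one mitigation, independent of the bug itself, is the chunk_token_threshold option commented out above. A hedged sketch reusing the JobPosting schema from the test; the threshold value is an arbitrary assumption, not a tuned recommendation:

import os

from crawl4ai.async_configs import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Variant of the strategy above: chunk_token_threshold splits the page into
# smaller per-request chunks so each completion stays under the provider limit.
chunked_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o",
        api_token=os.environ.get("OPENAI_API_KEY"),
    ),
    schema=JobPosting.model_json_schema(),  # JobPosting defined in the test above
    extraction_type="schema",
    instruction="Extract job posting details.",
    input_format="html",
    chunk_token_threshold=2048,  # assumption: keep each chunk well under the TPM limit
)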

OS

macOS

Python version

3.12.9

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

platform darwin -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: xxx
configfile: pyproject.toml
plugins: anyio-4.9.0, logfire-3.12.0, pytest_httpserver-1.1.3, asyncio-0.26.0, mock-3.14.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 1 item

tests/test_extraction.py [FETCH]... ↓ http://localhost:51368/engineering-manager... | Status: True | Time: 0.74s
[SCRAPE].. ◆ http://localhost:51368/engineering-manager... | Time: 0.096s

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 2 seconds before retrying...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 4 seconds before retrying...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
[EXTRACT]. ■ Completed for http://localhost:51368/engineering-manager... | Time: 14.378983207978308s
[COMPLETE] ● http://localhost:51368/engineering-manager... | Status: True | Total: 15.22s

stevenh avatar Apr 15 '25 12:04 stevenh

RCA

The core problem is a mismatch between the error return type of perform_completion_with_backoff in crawl4ai/utils.py (a list) and the success type expected by LLMExtractionStrategy.extract in crawl4ai/extraction_strategy.py (a LiteLLM completion object). The error handling within LLMExtractionStrategy.extract is not robust enough to handle the error structure returned by the utility function, so a new error (AttributeError) obscures the original problem (the rate limit).
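
For illustration, a minimal sketch of the kind of type guard this implies before response.usage is touched (the helper name and error-list shape are assumptions about crawl4ai internals, not the actual fix):

from typing import Any

def _usage_or_raise(response: Any) -> Any:
    """Hypothetical guard; names are ours, not crawl4ai's.

    perform_completion_with_backoff returns a list describing the failure once
    retries are exhausted; accessing .usage on that list is what produces
    "'list' object has no attribute 'usage'". Checking the type first lets the
    original rate-limit error surface instead.
    """
    if isinstance(response, list):
        raise RuntimeError(f"LLM completion failed after retries: {response}")
    return response.usage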

cc @aravindkarnam

ntohidi avatar May 12 '25 10:05 ntohidi

@Ahmed-Tawfik94 you can use this link to test; the one mentioned above is not responding anymore:

https://www.rocketscience.gg/careers/69f33f5d-bab1-494e-9f04-1146d7944b49/

ntohidi avatar Aug 08 '25 08:08 ntohidi

Hi @stevenh, could you check whether you are still facing this issue on the latest build, v0.7.2?

Ahmed-Tawfik94 avatar Aug 08 '25 09:08 Ahmed-Tawfik94

Sorry @Ahmed-Tawfik94, we had to make the decision to migrate away from crawl4ai, as there were too many issues and all the work we put in to help fix them never got any traction.

If anyone wants to pick up the work, feel free; otherwise I'll close out all the PRs.

stevenh avatar Aug 08 '25 12:08 stevenh

Already fixed in the latest version (0.7.4).

ntohidi avatar Aug 18 '25 04:08 ntohidi

I'm using 0.7.4 and I still get this error:

 {'index': 0, 'error': True, 'tags': ['error'], 'content': "'list' object has no attribute 'usage'"}

Manually applying https://github.com/unclecode/crawl4ai/pull/990 resolves the problem. Please consider accepting that PR (I tried to poke some reviewers a few months ago).

lance6716 avatar Aug 28 '25 15:08 lance6716

Fixed in the newest release, 0.7.6.

ntohidi avatar Oct 22 '25 13:10 ntohidi