[Bug]: LLMExtractionStrategy ratelimit results in no attribute usage
### crawl4ai version

0.5.0.post8
### Expected Behavior

If the rate limit is hit, the user should be informed of the underlying error.
### Current Behavior

When the rate limit is still exceeded after all retries, `perform_completion_with_backoff` returns a list, which `LLMExtractionStrategy.extract` does not handle: it tries to access the `usage` field, which does not exist on a list, so `extracted_content` ends up being:
```json
[
    {
        "index": 0,
        "error": true,
        "tags": [
            "error"
        ],
        "content": "'list' object has no attribute 'usage'"
    }
]
```
### Is this reproducible?

Yes
### Inputs Causing the Bug

Any request that hits the rate limit for more than 2 retries.
### Steps to Reproduce

Perform a crawl using `LLMExtractionStrategy`.
### Code snippets

```python
"""Test LLM extraction strategy for job postings."""

import json
import logging
import os
import sys
from typing import TYPE_CHECKING, Any

import pytest
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.async_configs import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

if TYPE_CHECKING:
    from crawl4ai.models import CrawlResult

_LOGGER: logging.Logger = logging.getLogger(__name__)


class JobRequirement(BaseModel):
    """Schema for job requirements."""

    category: str = Field(
        description="Category of the requirement (e.g., Technical, Soft Skills)",
    )
    items: list[str] = Field(
        description="List of specific requirements in this category",
    )
    priority: str = Field(
        description="Priority level (Required/Preferred) based on the HTML class or context",
    )


class JobPosting(BaseModel):
    """Schema for job postings."""

    title: str = Field(description="Job title")
    department: str = Field(description="Department or team")
    location: str = Field(description="Job location, including remote options")
    salary_range: str | None = Field(description="Salary range if specified")
    requirements: list[JobRequirement] = Field(
        description="Categorized job requirements",
    )
    application_deadline: str | None = Field(
        description="Application deadline if specified",
    )
    contact_info: dict | None = Field(
        description="Contact information from footer or contact section",
    )


@pytest.mark.asyncio
async def test_llm_extraction() -> None:
    """Crawl job postings and extract details."""
    api_key: str | None = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        msg: str = "OPENAI_API_KEY environment variable not set"
        raise ValueError(msg)

    browser_config: BrowserConfig = BrowserConfig(
        verbose=False,
        extra_args=[
            "--disable-gpu",
            "--disable-dev-shm-usage",
            "--no-sandbox",
        ],
    )
    extraction_strategy: LLMExtractionStrategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o",
            api_token=api_key,
        ),
        schema=JobPosting.model_json_schema(),
        extraction_type="schema",
        instruction="""
        Extract job posting details, using HTML structure to:
        1. Identify requirement priorities from CSS classes (e.g., 'required' vs 'preferred')
        2. Extract contact info from the page footer or dedicated contact section
        3. Parse salary information from specially formatted elements
        4. Determine application deadline from timestamp or date elements
        Use HTML attributes and classes to enhance extraction accuracy.
        """,
        input_format="html",
        # chunk_token_threshold=chunk_token_threshold,
    )
    config: CrawlerRunConfig = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result: CrawlResult
        async for result in await crawler.arun_many(
            urls=[
                "https://www.rocketscience.gg/careers/c77fbdec-fce6-44f1-a05e-8cd76325a1a0/",
            ],
            config=config,
        ):
            assert result.success
            assert result.extracted_content
            extracted_content: list[dict[str, Any]] = json.loads(result.extracted_content)
            assert len(extracted_content) == 1


if __name__ == "__main__":
    import subprocess

    sys.exit(subprocess.call(["pytest", *sys.argv[1:], sys.argv[0]]))  # noqa: S603, S607
```
### OS

macOS

### Python version

3.12.9

### Browser

Chrome

### Browser version

No response
### Error logs & Screenshots (if applicable)

```
platform darwin -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: xxx
configfile: pyproject.toml
plugins: anyio-4.9.0, logfire-3.12.0, pytest_httpserver-1.1.3, asyncio-0.26.0, mock-3.14.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 1 item

tests/test_extraction.py [FETCH]... ↓ http://localhost:51368/engineering-manager... | Status: True | Time: 0.74s
[SCRAPE].. ◆ http://localhost:51368/engineering-manager... | Time: 0.096s

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 2 seconds before retrying...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.
Waiting for 4 seconds before retrying...

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

Rate limit error: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Requested 75303. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.

[EXTRACT]. ■ Completed for http://localhost:51368/engineering-manager... | Time: 14.378983207978308s
[COMPLETE] ● http://localhost:51368/engineering-manager... | Status: True | Total: 15.22s
```
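The log shows the backoff doubling its wait between retries (2 s, then 4 s) before giving up. A generic sketch of such a retry loop, under the illustrative assumption (this is not crawl4ai's actual code, and the names are hypothetical) that exhausted retries return a list-shaped error value:

```python
import time


def with_backoff(call, max_retries: int = 3, base_delay: float = 2.0):
    """Retry `call` with exponential backoff.

    Hypothetical sketch: on exhaustion it returns a list describing the
    error, mirroring the behavior this issue reports, which callers that
    expect a completion object then mishandle.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:  # e.g. a rate-limit error
            if attempt == max_retries - 1:
                # Final attempt failed: the error comes back as a list,
                # not as a completion object with a `.usage` attribute.
                return [{"error": True, "content": str(exc)}]
            time.sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, ...
```

With `max_retries=3` this produces exactly the pattern in the log: three rate-limit errors separated by 2 s and 4 s waits, then a list-shaped error that later surfaces as `'list' object has no attribute 'usage'`.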
### RCA

The core problem is a mismatch between the error return type of `perform_completion_with_backoff` in `crawl4ai/utils.py` (a list) and the success type expected by `LLMExtractionStrategy.extract` in `crawl4ai/extraction_strategy.py` (a LiteLLM completion object). The error handling in `LLMExtractionStrategy.extract` is not robust enough to handle the error structure returned by the utility function, so a new error (an `AttributeError`) obscures the original problem (the rate limit).
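A defensive check before touching `.usage` would surface the original rate-limit error instead of the misleading `AttributeError`. The helper below is a hypothetical sketch of that idea, not the actual code in `crawl4ai/extraction_strategy.py`:

```python
def read_usage(response):
    """Return the token-usage object from an LLM completion response.

    Hypothetical helper, not crawl4ai code: since the backoff utility
    returns a list describing the error once its retries are exhausted,
    guard against non-completion values before accessing `.usage`
    instead of letting an AttributeError escape.
    """
    usage = getattr(response, "usage", None)
    if usage is None:
        # Surface the original failure rather than the misleading
        # "'list' object has no attribute 'usage'" message.
        raise RuntimeError(f"LLM completion failed before extraction: {response!r}")
    return usage
```

The same shape of check (or raising the rate-limit error directly from the utility instead of returning a list) would let the caller report the real cause to the user.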
cc @aravindkarnam
@Ahmed-Tawfik94 you can use this link to test; the one mentioned above is not responding anymore:

https://www.rocketscience.gg/careers/69f33f5d-bab1-494e-9f04-1146d7944b49/
Hi @stevenh, could you check whether you are still facing this issue on the latest build, v0.7.2?
Sorry @Ahmed-Tawfik94, we had to take the decision to migrate away from crawl4ai, as there were too many issues and all the work we put in to help fix them never got any traction.
If anyone wants to pick up the work, feel free; otherwise I'll close out all the PRs.
Already fixed in the latest version (0.7.4).
I'm using 0.7.4 and still got this error:
```python
{'index': 0, 'error': True, 'tags': ['error'], 'content': "'list' object has no attribute 'usage'"}
```
Manually applying https://github.com/unclecode/crawl4ai/pull/990 resolves the problem. Please consider accepting that PR (I tried to poke some reviewers a few months ago).
Fixed in the newest release, 0.7.6.