[Bug]: arun_many and LLMExtractionStrategy with two URLs lead to 8 hallucinating requests

Open · kkarkos opened this issue 10 months ago ‱ 1 comment

crawl4ai version

0.4.248

Expected Behavior

Two URLs should lead to two LLM requests, each returning the object defined in the schema exactly once in extracted_content.

Current Behavior

Hi, I'm facing an issue with arun_many: I have two URLs for two product websites and realised that each request costs me 30 cents. It seems that extracted_content contains a list of objects, and each object triggered one call to, in the example below, Anthropic (I tested a few providers with the same outcome). I understand that the first URL fails with an Overloaded error. But the second holds seven objects. In any case, the objects contain hallucinated data, while the last one has some usable information.
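
To make the shape of extracted_content concrete: in the logs further down, it is a JSON list, and a failed LLM call shows up as an element with "error": true even though the crawl result itself reports success. A minimal sketch of separating the two kinds of element, assuming that shape holds in general (the helper name split_blocks is made up for illustration and is not part of crawl4ai):

```python
import json
from typing import Any, Dict, List, Tuple


def split_blocks(extracted_content: str) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split the JSON in extracted_content into (data_blocks, error_blocks)."""
    parsed = json.loads(extracted_content)
    # A single object is treated as a one-element list.
    blocks = parsed if isinstance(parsed, list) else [parsed]
    # Assumption: error blocks carry a truthy "error" key, as in the logged output.
    data = [b for b in blocks if not b.get("error")]
    errors = [b for b in blocks if b.get("error")]
    return data, errors


sample = json.dumps([
    {"index": 0, "error": True, "tags": ["error"], "content": "Overloaded"},
    {"product_name": "KitKat Pink Lemonade Limited Edition 42g", "price": 2.79, "currency": "EUR"},
])
data_blocks, error_blocks = split_blocks(sample)
print(len(data_blocks), len(error_blocks))
```

With this split, a result whose data list is empty can be treated as a failed extraction instead of being passed on as a product record.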

Anthropic Logs:

Time (GMT+1) ID Model Workspace InputTokens OutputTokens Type Request
2025-02-18 14:34:48 req_01VXcLZPhsjWDUaCuzgGprgA claude-3-haiku-20240307 Default 3561 726 HTTP  
2025-02-18 14:34:38 req_01YSgvNw5x6DQn6rGAaLZboB claude-3-haiku-20240307 Default 16398 472 HTTP  
2025-02-18 14:34:33 req_01ULaSn8autNLbG1RvKgFL7j claude-3-haiku-20240307 Default 71053 684 HTTP  
2025-02-18 14:34:38 req_012XHM7qfJMBx1vGJHejeoU5 claude-3-haiku-20240307 Default 13428 363 HTTP  
2025-02-18 14:34:33 req_012EzX9QFq47GCerfSxz2B3R claude-3-haiku-20240307 Default 10707 923 HTTP  
2025-02-18 14:34:33 req_018RdKhkQoLDBW8dSbMXp8ko claude-3-haiku-20240307 Default 4701 496 HTTP  
2025-02-18 14:34:33 req_01J5T8YcgkLjwXEcyD7YwC8w claude-3-haiku-20240307 Default 4563 539 HTTP  
2025-02-18 14:34:30 req_01LJH5PPPZouFZ9aaPVFjcJP claude-3-haiku-20240307 Default 4437 66 HTTP  
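
For reference, the token columns above can be added up. All eight requests fall between 14:34:30 and 14:34:48, so they count against the same one-minute window. The sketch below only sums the numbers from the table and compares them with the limit of 100,000 input tokens per minute quoted in the rate limit message in the error logs; the variable names are illustrative.

```python
# (input_tokens, output_tokens) per request, copied from the table above
usage = [
    (3561, 726), (16398, 472), (71053, 684), (13428, 363),
    (10707, 923), (4701, 496), (4563, 539), (4437, 66),
]

total_input = sum(inp for inp, _ in usage)
total_output = sum(out for _, out in usage)

# Limit taken from the 429 rate_limit_error message in the error logs
INPUT_TOKENS_PER_MINUTE_LIMIT = 100_000

print(total_input, total_output)                    # 128848 4269
print(total_input > INPUT_TOKENS_PER_MINUTE_LIMIT)  # True
```

So the two URLs together consumed more input tokens than the per-minute limit allows, which is consistent with the 429 responses in the logs.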

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

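import json
import os
from datetime import datetime
from typing import Dict, List

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# logger, EXTRACTION_PROMPT and PRODUCT_SCHEMA are defined elsewhere in the project.
# ANTHROPIC_MODEL is "anthropic/claude-3-haiku-20240307" in the run logged below.


async def _scrape_many_async(urls: List[str]) -> List[Dict]:
    # LLM-based extraction against the product schema
    llm_strategy = LLMExtractionStrategy(
        provider=os.getenv("ANTHROPIC_MODEL"),
        api_token=os.getenv("ANTHROPIC_API_KEY"),
        instruction=EXTRACTION_PROMPT,
        extraction_type="schema",
        schema=PRODUCT_SCHEMA,
        chunk_token_threshold=2000,
        overlap_rate=0.1,
    )

    # Per-run crawl settings: strip page chrome, bypass the cache, no streaming
    config = CrawlerRunConfig(
        remove_overlay_elements=True,
        exclude_external_links=True,
        exclude_social_media_links=True,
        excluded_tags=["nav", "header", "footer"],
        remove_forms=True,
        verbose=True,
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        magic=True,
        stream=False,
    )

    browser_config = BrowserConfig(
        headless=True,
    )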


    out = []
    try:               
        async with AsyncWebCrawler(config=browser_config, verbose=True) as crawler:
            logger.info("Starting crawler execution")

            results = await crawler.arun_many(
                urls, config=config
            )

            # print length of results
            print(f"Length of results: {len(results)}")

            # Process all results after completion
            print(f"Length of results: {len(results)}")
            print("Detailed results inspection:")
            for idx, r in enumerate(results):
                print(f"\nResult {idx + 1}:")
                print(f"Success status: {getattr(r, 'success', None)}")
                print(f"URL: {getattr(r, 'url', None)}")
                print(f"Error message: {getattr(r, 'error_message', None)}")

            for r in results:
                if r.success:
                    print(f"Just completed: {r.url}")
                    print(f"Extracted Data: {r.extracted_content}")
                    data = json.loads(r.extracted_content)
                    # Take only the first object if data is a list
                    first_item = data[0] if isinstance(data, list) else data
                    print(f"Using first item: {first_item}")

                    first_item["url"] = r.url
                    first_item["success"] = r.success
                    first_item["error_message"] = r.error_message
                    first_item["status_code"] = r.status_code
                    first_item["scraped_at"] = datetime.utcnow().isoformat()
                    out.append(first_item)
                else:
                    print(f"Failed to crawl {r.url}: {r.error_message}")

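            logger.info("Completed processing all results")
            return out
    except Exception:
        # The snippet as posted ends before its handler; log and re-raise so failures stay visible.
        logger.exception("Batch scrape failed")
        raise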

OS

macOS

Python version

3.11

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

Code logs: [2025-02-18 14:34:28,234: INFO/MainProcess] Task run_multi_scrape_task[68bb9b46-89b1-4395-b866-3b85147a44b2] received [2025-02-18 14:34:28,885: INFO/ForkPoolWorker-8] Starting batch scrape for 2 URLs [2025-02-18 14:34:28,886: INFO/ForkPoolWorker-8] URLs to scrape: [ "https://www.worldofsweets.de/KitKat-Pink-Lemonade-42g.330819.html", "https://www.steam-time.de/kitkat-pink-lemonade-limited-edition-42g" ] [2025-02-18 14:34:28,888: INFO/ForkPoolWorker-8] Initializing LLM strategy [2025-02-18 14:34:28,889: WARNING/ForkPoolWorker-8] Used Schema: {'type': 'object', 'properties': {'product_name': {'type': 'string', 'description': 'the product name'}, 'brand_name': {'type': 'string', 'description': 'the product brand name'}, 'product_code': {'type': 'string', 'description': 'the product code'}, 'categories': {'type': 'string', 'description': 'all product categories'}, 'description': {'type': 'string', 'description': 'a 500 character description of the product'}, 'features': {'type': 'string', 'description': 'the product features'}, 'price': {'type': 'number', 'description': 'the product price'}, 'currency': {'type': 'string', 'description': 'the price currency'}, 'on_sale': {'type': 'string', 'description': 'whether the product is on sale'}, 'strikethrough_price': {'type': 'string', 'description': 'the product strikethrough price'}, 'in_stock': {'type': 'string', 'description': 'whether the product is currently available'}, 'stock_level': {'type': 'string', 'description': 'the product stock level'}, 'price_per_unit': {'type': 'string', 'description': 'price per unit for comparable products'}, 'bulk_pricing': {'type': 'string', 'description': 'any bulk or quantity discount pricing'}, 'minimum_order_quantity': {'type': 'string', 'description': 'minimum quantity required for order'}, 'price_per_100_grams': {'type': 'string', 'description': 'price per 100 grams'}, 'country_of_origin': {'type': 'string', 'description': 'the country of origin of the product'}, 
'pre_order_available': {'type': 'string', 'description': 'whether the product is available for pre-order'}, 'pre_order_date': {'type': 'string', 'description': 'the product pre-order date'}, 'back_in_stock_date': {'type': 'string', 'description': 'the product back in stock date'}, 'shipping_cost': {'type': 'string', 'description': 'the product shipping cost'}, 'shipping_time': {'type': 'string', 'description': 'the product shipping time'}, 'delivery_time': {'type': 'string', 'description': 'the product delivery time'}, 'shipping_restrictions': {'type': 'string', 'description': 'any shipping restrictions or excluded locations'}, 'free_shipping_eligible': {'type': 'string', 'description': 'whether the product is eligible for free shipping'}, 'dimensions': {'type': 'string', 'description': 'the product dimensions'}, 'weight': {'type': 'string', 'description': 'the product weight'}, 'color': {'type': 'string', 'description': 'the product color'}, 'size': {'type': 'string', 'description': 'the product size'}, 'materials': {'type': 'string', 'description': 'materials used in the product'}, 'package_contents': {'type': 'string', 'description': "what's included in the package"}, 'variations': {'type': 'string', 'description': 'available product variations (colors, sizes, etc)'}, 'image_url': {'type': 'string', 'description': 'the product image url'}, 'rating': {'type': 'number', 'description': 'the product rating'}, 'reviews_count': {'type': 'number', 'description': 'the number of reviews for the product'}}, 'required': ['product_name', 'price', 'currency']} [2025-02-18 14:34:28,889: WARNING/ForkPoolWorker-8] Used Instuctions: You are an expert web-scraper and data extractor. Your goal is to retrieve product information from a given web page and return it in a strict JSON format according to a predefined schema. 
[2025-02-18 14:34:28,890: WARNING/ForkPoolWorker-8] Used Model: anthropic/claude-3-haiku-20240307 [2025-02-18 14:34:28,890: INFO/ForkPoolWorker-8] LLM strategy initialized [2025-02-18 14:34:28,891: INFO/ForkPoolWorker-8] Setting up rate limiter [2025-02-18 14:34:28,891: INFO/ForkPoolWorker-8] Configuring crawler monitor [2025-02-18 14:34:28,893: INFO/ForkPoolWorker-8] Setting up memory-adaptive dispatcher [2025-02-18 14:34:28,894: INFO/ForkPoolWorker-8] Configuring crawler [2025-02-18 14:34:29,673: WARNING/ForkPoolWorker-8] [INIT].... → Crawl4AI 0.4.248 [2025-02-18 14:34:29,674: INFO/ForkPoolWorker-8] Starting crawler execution [2025-02-18 14:34:32,334: WARNING/ForkPoolWorker-8] [FETCH]... ↓ https://www.worldofsweets.de/KitKat-Pink-Lemonade-... | Status: [2025-02-18 14:34:32,334: WARNING/ForkPoolWorker-8] True [2025-02-18 14:34:32,335: WARNING/ForkPoolWorker-8] | Time: 2.66s [2025-02-18 14:34:32,372: WARNING/ForkPoolWorker-8] [SCRAPE].. ◆ Processed https://www.worldofsweets.de/KitKat-Pink-Lemonade-... | Time: 37ms [2025-02-18 14:34:32,862: INFO/ForkPoolWorker-8] HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK" [2025-02-18 14:34:33,283: WARNING/ForkPoolWorker-8] 14:34:33 - LiteLLM:INFO [2025-02-18 14:34:33,283: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:33,283: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:35,173: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 529 " [2025-02-18 14:34:35,174: WARNING/ForkPoolWorker-8] Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new [2025-02-18 14:34:35,174: WARNING/ForkPoolWorker-8] LiteLLM.Info: If you need to debug this error, use litellm._turn_on_debug()'. [2025-02-18 14:34:35,182: WARNING/ForkPoolWorker-8] [EXTRACT]. 
■ Completed for https://www.worldofsweets.de/KitKat-Pink-Lemonade-... | Time: 2.808919875002175s [2025-02-18 14:34:35,183: WARNING/ForkPoolWorker-8] [COMPLETE] ● https://www.worldofsweets.de/KitKat-Pink-Lemonade-... | Status: [2025-02-18 14:34:35,183: WARNING/ForkPoolWorker-8] True [2025-02-18 14:34:35,184: WARNING/ForkPoolWorker-8] | Total: [2025-02-18 14:34:35,184: WARNING/ForkPoolWorker-8] 5.51s [2025-02-18 14:34:35,980: WARNING/ForkPoolWorker-8] [FETCH]... ↓ https://www.steam-time.de/kitkat-pink-lemonade-lim... | Status: [2025-02-18 14:34:35,980: WARNING/ForkPoolWorker-8] True [2025-02-18 14:34:35,981: WARNING/ForkPoolWorker-8] | Time: 6.23s [2025-02-18 14:34:36,525: WARNING/ForkPoolWorker-8] [SCRAPE].. ◆ Processed https://www.steam-time.de/kitkat-pink-lemonade-lim... | Time: 543ms [2025-02-18 14:34:36,529: WARNING/ForkPoolWorker-8] 14:34:36 - LiteLLM:INFO [2025-02-18 14:34:36,532: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:36,529: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:36,536: WARNING/ForkPoolWorker-8] 14:34:36 - LiteLLM:INFO [2025-02-18 14:34:36,536: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:36,532: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:36,537: WARNING/ForkPoolWorker-8] 14:34:36 - LiteLLM:INFO [2025-02-18 14:34:36,538: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:36,534: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:36,539: WARNING/ForkPoolWorker-8] 14:34:36 - LiteLLM:INFO [2025-02-18 14:34:36,540: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= 
claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:36,535: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:40,954: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" [2025-02-18 14:34:40,959: WARNING/ForkPoolWorker-8] 14:34:40 - LiteLLM:INFO [2025-02-18 14:34:40,960: WARNING/ForkPoolWorker-8] : utils.py:1120 - Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:40,959: INFO/ForkPoolWorker-8] Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:40,968: WARNING/ForkPoolWorker-8] 14:34:40 - LiteLLM:INFO [2025-02-18 14:34:40,968: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:40,968: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:41,021: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" [2025-02-18 14:34:41,024: WARNING/ForkPoolWorker-8] 14:34:41 - LiteLLM:INFO [2025-02-18 14:34:41,024: WARNING/ForkPoolWorker-8] : utils.py:1120 - Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:41,024: INFO/ForkPoolWorker-8] Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:41,027: WARNING/ForkPoolWorker-8] 14:34:41 - LiteLLM:INFO [2025-02-18 14:34:41,028: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:41,027: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:44,052: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" [2025-02-18 14:34:44,062: WARNING/ForkPoolWorker-8] 14:34:44 - LiteLLM:INFO [2025-02-18 14:34:44,063: WARNING/ForkPoolWorker-8] : utils.py:1120 - Wrapper: Completed Call, calling 
success_handler [2025-02-18 14:34:44,062: INFO/ForkPoolWorker-8] Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:44,072: WARNING/ForkPoolWorker-8] 14:34:44 - LiteLLM:INFO [2025-02-18 14:34:44,072: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:44,071: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:44,905: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 429 Too Many Requests" [2025-02-18 14:34:44,908: WARNING/ForkPoolWorker-8] Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new [2025-02-18 14:34:44,909: WARNING/ForkPoolWorker-8] LiteLLM.Info: If you need to debug this error, use litellm._turn_on_debug()'. [2025-02-18 14:34:44,922: WARNING/ForkPoolWorker-8] Rate limit error: [2025-02-18 14:34:44,922: WARNING/ForkPoolWorker-8]
[2025-02-18 14:34:44,923: WARNING/ForkPoolWorker-8] litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 100,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}} [2025-02-18 14:34:44,923: WARNING/ForkPoolWorker-8] Waiting for 2 seconds before retrying... [2025-02-18 14:34:45,492: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" [2025-02-18 14:34:45,499: WARNING/ForkPoolWorker-8] 14:34:45 - LiteLLM:INFO [2025-02-18 14:34:45,500: WARNING/ForkPoolWorker-8] : utils.py:1120 - Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:45,498: INFO/ForkPoolWorker-8] Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:46,933: WARNING/ForkPoolWorker-8] 14:34:46 - LiteLLM:INFO [2025-02-18 14:34:46,935: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" [2025-02-18 14:34:46,935: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:46,932: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:46,942: WARNING/ForkPoolWorker-8] 14:34:46 - LiteLLM:INFO [2025-02-18 14:34:46,943: WARNING/ForkPoolWorker-8] : utils.py:1120 - Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:46,942: INFO/ForkPoolWorker-8] Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:47,189: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 429 Too Many Requests" [2025-02-18 
14:34:47,192: WARNING/ForkPoolWorker-8] Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new [2025-02-18 14:34:47,193: WARNING/ForkPoolWorker-8] LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'. [2025-02-18 14:34:47,206: WARNING/ForkPoolWorker-8] Rate limit error: [2025-02-18 14:34:47,207: WARNING/ForkPoolWorker-8]
[2025-02-18 14:34:47,207: WARNING/ForkPoolWorker-8] litellm.RateLimitError: AnthropicException - {"type":"error","error":{"type":"rate_limit_error","message":"This request would exceed your organization’s rate limit of 100,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits; see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."}} [2025-02-18 14:34:47,208: WARNING/ForkPoolWorker-8] Waiting for 4 seconds before retrying... [2025-02-18 14:34:48,814: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" [2025-02-18 14:34:48,828: WARNING/ForkPoolWorker-8] 14:34:48 - LiteLLM:INFO [2025-02-18 14:34:48,829: WARNING/ForkPoolWorker-8] : utils.py:1120 - Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:48,828: INFO/ForkPoolWorker-8] Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:51,212: WARNING/ForkPoolWorker-8] 14:34:51 - LiteLLM:INFO [2025-02-18 14:34:51,213: WARNING/ForkPoolWorker-8] : utils.py:2944 - LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:51,212: INFO/ForkPoolWorker-8] LiteLLM completion() model= claude-3-haiku-20240307; provider = anthropic [2025-02-18 14:34:57,381: INFO/ForkPoolWorker-8] HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" [2025-02-18 14:34:57,384: WARNING/ForkPoolWorker-8] 14:34:57 - LiteLLM:INFO [2025-02-18 14:34:57,385: WARNING/ForkPoolWorker-8] : utils.py:1120 - Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:57,384: INFO/ForkPoolWorker-8] Wrapper: Completed Call, calling success_handler [2025-02-18 14:34:57,390: WARNING/ForkPoolWorker-8] [EXTRACT]. ■ Completed for https://www.steam-time.de/kitkat-pink-lemonade-lim... 
| Time: 20.86351729099988s [2025-02-18 14:34:57,391: WARNING/ForkPoolWorker-8] [COMPLETE] ● https://www.steam-time.de/kitkat-pink-lemonade-lim... | Status: [2025-02-18 14:34:57,392: WARNING/ForkPoolWorker-8] True [2025-02-18 14:34:57,392: WARNING/ForkPoolWorker-8] | Total: [2025-02-18 14:34:57,392: WARNING/ForkPoolWorker-8] 27.64s [2025-02-18 14:34:57,394: WARNING/ForkPoolWorker-8] Length of results: 2 [2025-02-18 14:34:57,395: WARNING/ForkPoolWorker-8] Length of results: 2 [2025-02-18 14:34:57,395: WARNING/ForkPoolWorker-8] Detailed results inspection: [2025-02-18 14:34:57,395: WARNING/ForkPoolWorker-8] Result 1: [2025-02-18 14:34:57,396: WARNING/ForkPoolWorker-8] Success status: True [2025-02-18 14:34:57,396: WARNING/ForkPoolWorker-8] URL: https://www.worldofsweets.de/KitKat-Pink-Lemonade-42g.330819.html [2025-02-18 14:34:57,396: WARNING/ForkPoolWorker-8] Error message: [2025-02-18 14:34:57,397: WARNING/ForkPoolWorker-8] Result 2: [2025-02-18 14:34:57,397: WARNING/ForkPoolWorker-8] Success status: True [2025-02-18 14:34:57,397: WARNING/ForkPoolWorker-8] URL: https://www.steam-time.de/kitkat-pink-lemonade-limited-edition-42g [2025-02-18 14:34:57,397: WARNING/ForkPoolWorker-8] Error message: [2025-02-18 14:34:57,398: WARNING/ForkPoolWorker-8] Just completed: https://www.worldofsweets.de/KitKat-Pink-Lemonade-42g.330819.html [2025-02-18 14:34:57,398: WARNING/ForkPoolWorker-8] Extracted Data: [ { "index": 0, "error": true, "tags": [ "error" ], "content": "litellm.InternalServerError: AnthropicError - {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}" } ] [2025-02-18 14:34:57,398: WARNING/ForkPoolWorker-8] Using first item: {'index': 0, 'error': True, 'tags': ['error'], 'content': 'litellm.InternalServerError: AnthropicError - {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}'} [2025-02-18 14:34:57,398: WARNING/ForkPoolWorker-8] Just completed: https://www.steam-time.de/kitkat-pink-lemonade-limited-edition-42g 
[2025-02-18 14:34:57,399: WARNING/ForkPoolWorker-8] Extracted Data: [ { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "42g", "categories": "Sweets & Candy", "description": "KitKat Pink Lemonade Limited Edition 42g", "features": "Limited Edition", "price": 0.0, "currency": "EUR", "on_sale": "false", "strikethrough_price": "0.0", "in_stock": "true", "stock_level": "In Stock", "price_per_unit": "0.0", "bulk_pricing": "false", "minimum_order_quantity": "1", "price_per_100_grams": "0.0", "country_of_origin": "unknown", "pre_order_available": "false", "pre_order_date": "unknown", "back_in_stock_date": "unknown", "shipping_cost": "unknown", "shipping_time": "unknown", "delivery_time": "unknown", "shipping_restrictions": "unknown", "free_shipping_eligible": "unknown", "dimensions": "unknown", "weight": "42g", "color": "unknown", "size": "unknown", "materials": "unknown", "package_contents": "unknown", "variations": "unknown", "image_url": "https://cdn.public.steam-time.de/images/info/MMs_Block_Crispy_165g.jpg", "rating": 0.0, "reviews_count": 0 }, { "product_name": "Schokolade − KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "ST-CT-KIKA-AIO10056", "categories": "Sweets & Candies, Schokolade, KitKat", "description": "KitKat Pink Lemonade Limited Edition 42g", "features": "Sofort versandfĂ€hig, ausreichende StĂŒckzahl, Lieferzeit 1-2 Werktage", "price": 2.79, "currency": "EUR", "on_sale": null, "strikethrough_price": null, "in_stock": "verfĂŒgbar", "stock_level": "ausreichende StĂŒckzahl", "price_per_unit": "66,43 EUR / 1 KG", "bulk_pricing": null, "minimum_order_quantity": null, "price_per_100_grams": "66,43 EUR / 1 KG", "country_of_origin": null, "pre_order_available": null, "pre_order_date": null, "back_in_stock_date": null, "shipping_cost": "zzgl. 
Versandkosten", "shipping_time": "1-2 Werktage", "delivery_time": null, "shipping_restrictions": null, "free_shipping_eligible": "Gratis Versand ab 34,99 €", "dimensions": null, "weight": "42g", "color": null, "size": null, "materials": null, "package_contents": null, "variations": null, "image_url": "https://cdn.public.steam-time.de/images/org/KitKat_Pink_Lemonade_Limited_Edition_42g.jpg", "rating": null, "reviews_count": null }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "42g", "categories": "Schokolade", "description": "Erleben Sie mit dem KitKat Pink Lemonade Limited Edition 42g die perfekte Kombination aus fruchtig-erfrischendem Geschmack und der unverwechselbaren Knusprigkeit eines KitKat-Riegels. Diese limitierte Sommeredition begeistert durch ihre zarte Zitronennote, die mit der knusprigen Waffel und der cremigen Schokoladenschicht harmoniert. Ob als sĂŒĂŸe Pause im Alltag oder als erfrischender Genuss an heißen Tagen – der KitKat Pink Lemonade Limited Edition 42g ist der perfekte Begleiter fĂŒr unvergessliche Sommertage.", "features": "Der KitKat Pink Lemonade Limited Edition 42g ĂŒberzeugt mit seiner innovativen Rezeptur, die den Geschmack von spritziger Zitronenlimonade in einen unwiderstehlichen Schokoriegel verwandelt. Die fruchtige Note der Pink Lemonade verleiht der cremigen Schokolade eine leichte, erfrischende SĂŒĂŸe, wĂ€hrend die knusprige Waffel im Inneren fĂŒr den typischen KitKat-Knuspermoment sorgt. 
Mit seinem zitronig-frischen Aroma und der handlichen GrĂ¶ĂŸe von 42g ist der KitKat Pink Lemonade Limited Edition 42g ideal fĂŒr unterwegs.", "price": 0, "currency": "EUR", "on_sale": "false", "strikethrough_price": "0", "in_stock": "true", "stock_level": "in stock", "price_per_unit": "0", "bulk_pricing": "0", "minimum_order_quantity": "0", "price_per_100_grams": "0", "country_of_origin": "USA", "pre_order_available": "false", "pre_order_date": "0", "back_in_stock_date": "0", "shipping_cost": "0", "shipping_time": "0", "delivery_time": "0", "shipping_restrictions": "0", "free_shipping_eligible": "false", "dimensions": "0", "weight": "42 g", "color": "0", "size": "0", "materials": "Zucker*, pflanzliches Fett (Palmöl, Sheabutter, Sonnenblumenöl, Palmkernöl u./o. Distelöl), Weizenmehl, Magermilch, Maissirupfeststoffe*, Laktose, Schokolade, Aroma, kĂŒnstliches Aroma, Soja-Lecithin*, Emulgator (E476), Salz, Hefe, Backtriebmittel (E500), Farbstoff (E129**)", "package_contents": "1 x KitKat Pink Lemonade Limited Edition 42g", "variations": "0", "image_url": "https://cdn.public.steam-time.de/images/info/KitKat_Pink_Lemonade_Limited_Edition_42g.jpg", "rating": 0, "reviews_count": 0 }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "42g", "categories": "Sweets & Candy", "description": "KitKat Pink Lemonade Limited Edition 42g", "features": "", "price": 0.0, "currency": "EUR", "on_sale": "", "strikethrough_price": "", "in_stock": "", "stock_level": "", "price_per_unit": "", "bulk_pricing": "", "minimum_order_quantity": "", "price_per_100_grams": "", "country_of_origin": "", "pre_order_available": "", "pre_order_date": "", "back_in_stock_date": "", "shipping_cost": "", "shipping_time": "", "delivery_time": "", "shipping_restrictions": "", "free_shipping_eligible": "", "dimensions": "", "weight": "", "color": "", "size": "", "materials": "", "package_contents": "", "variations": "", "image_url": "", "rating": 0.0, 
"reviews_count": 0, "error": false }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "42g", "categories": "Snacks", "description": "KitKat Pink Lemonade Limited Edition 42g", "features": "Limited Edition", "price": 0.0, "currency": "EUR", "on_sale": "false", "strikethrough_price": "0.0", "in_stock": "true", "stock_level": "In Stock", "price_per_unit": "0.0 EUR per 42g", "bulk_pricing": "No", "minimum_order_quantity": "1", "price_per_100_grams": "0.0 EUR per 100g", "country_of_origin": "Unknown", "pre_order_available": "false", "pre_order_date": "Unknown", "back_in_stock_date": "Unknown", "shipping_cost": "Depends on location", "shipping_time": "Depends on location", "delivery_time": "Depends on location", "shipping_restrictions": "Unknown", "free_shipping_eligible": "Unknown", "dimensions": "Unknown", "weight": "42g", "color": "Unknown", "size": "42g", "materials": "Unknown", "package_contents": "1 x KitKat Pink Lemonade Limited Edition 42g", "variations": "None", "image_url": "https://cdn.public.steam-time.de/images/info/target2.jpg", "rating": 0.0, "reviews_count": 0 }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "42g", "categories": "Chocolate, Candy, Snacks", "description": "The KitKat Pink Lemonade Limited Edition 42g combines the refreshingly fruity and fizzy taste of lemonade with the iconic crunchiness of a KitKat wafer bar - perfect for hot summer days! Enjoy the refreshing Pink Lemonade note that blends into the creamy chocolate and crispy wafer, creating a unique taste experience that elevates your breaks. With its convenient 42g size, the KitKat Pink Lemonade Limited Edition is the ideal on-the-go companion - great for the beach, picnics, or as a sweet snack anytime. This limited-edition summer treat is a must-have for KitKat fans and chocolate lovers alike. 
Secure your piece of summer now before it's gone!", "features": "Refreshing lemonade flavor, Crispy KitKat wafer, Creamy chocolate, Convenient 42g size, Limited-edition summer release", "price": 0.99, "currency": "EUR", "on_sale": "No", "strikethrough_price": "N/A", "in_stock": "Yes", "stock_level": "In Stock", "price_per_unit": "N/A", "bulk_pricing": "N/A", "minimum_order_quantity": "N/A", "price_per_100_grams": "2.36 EUR", "country_of_origin": "Germany", "pre_order_available": "No", "pre_order_date": "N/A", "back_in_stock_date": "N/A", "shipping_cost": "N/A", "shipping_time": "N/A", "delivery_time": "N/A", "shipping_restrictions": "N/A", "free_shipping_eligible": "N/A", "dimensions": "N/A", "weight": "42g", "color": "N/A", "size": "N/A", "materials": "N/A", "package_contents": "1 KitKat Pink Lemonade bar", "variations": "N/A", "image_url": "https://cdn.public.steam-time.de/images/info/KitKat_Pink_Lemonade_42g.jpg", "rating": "N/A", "reviews_count": "N/A" }, { "product_name": "KitKat Pink Lemonade Limited Edition", "brand_name": "KitKat", "product_code": "42g", "categories": "Sweets, Candy", "description": "Unser Sortiment umfasst E-Zigaretten, E-Liquids, Aromen und Verdampfer aber auch weitere Genussmittel wie Spirituosen, Barzubehör, Importierte SĂŒĂŸigkeiten und mehr. Bei uns finden Sie alles fĂŒr ein vollstĂ€ndiges Genusserlebnis.", "features": "QualitĂ€t muss nicht teuer sein. Wir bieten erstklassige Produkte zu fairen Preisen. Unser engagiertes Team steht Ihnen bei allen Fragen und Problemen rund um die angebotenen Produkte kompetent und freundlich zur Seite. Ihre Zufriedenheit ist unser Ziel. Bei Bedarf bieten wir Ihnen eine kostenlose RĂŒcksendung an. Entdecken Sie regelmĂ€ĂŸig reduzierte Sonderangebote im Outlet-Bereich. 
Ab einem Bestellwert von 34,99 Euro liefern wir Ihre Produkte kostenlos zu Ihnen nach Hause.", "price": 0, "currency": "EUR", "on_sale": "No", "strikethrough_price": "No", "in_stock": "Yes", "stock_level": "Unknown", "price_per_unit": "Unknown", "bulk_pricing": "No", "minimum_order_quantity": "1", "price_per_100_grams": "Unknown", "country_of_origin": "Unknown", "pre_order_available": "No", "pre_order_date": "Unknown", "back_in_stock_date": "Unknown", "shipping_cost": "Free for orders over 34.99 EUR", "shipping_time": "Unknown", "delivery_time": "Unknown", "shipping_restrictions": "None", "free_shipping_eligible": "Yes", "dimensions": "Unknown", "weight": "42g", "color": "Unknown", "size": "Unknown", "materials": "Unknown", "package_contents": "Unknown", "variations": "Unknown", "image_url": "https://cdn.templates.steam-time.de/steam-time-reloaded-19/images/index_blocks/logo-21.png", "rating": 4.85, "reviews_count": 0 } ] [2025-02-18 14:34:57,400: WARNING/ForkPoolWorker-8] Using first item: {'product_name': 'KitKat Pink Lemonade Limited Edition 42g', 'brand_name': 'KitKat', 'product_code': '42g', 'categories': 'Sweets & Candy', 'description': 'KitKat Pink Lemonade Limited Edition 42g', 'features': 'Limited Edition', 'price': 0.0, 'currency': 'EUR', 'on_sale': 'false', 'strikethrough_price': '0.0', 'in_stock': 'true', 'stock_level': 'In Stock', 'price_per_unit': '0.0', 'bulk_pricing': 'false', 'minimum_order_quantity': '1', 'price_per_100_grams': '0.0', 'country_of_origin': 'unknown', 'pre_order_available': 'false', 'pre_order_date': 'unknown', 'back_in_stock_date': 'unknown', 'shipping_cost': 'unknown', 'shipping_time': 'unknown', 'delivery_time': 'unknown', 'shipping_restrictions': 'unknown', 'free_shipping_eligible': 'unknown', 'dimensions': 'unknown', 'weight': '42g', 'color': 'unknown', 'size': 'unknown', 'materials': 'unknown', 'package_contents': 'unknown', 'variations': 'unknown', 'image_url': 
'https://cdn.public.steam-time.de/images/info/MMs_Block_Crispy_165g.jpg', 'rating': 0.0, 'reviews_count': 0} [2025-02-18 14:34:57,400: INFO/ForkPoolWorker-8] Completed processing all results [2025-02-18 14:34:57,503: WARNING/ForkPoolWorker-8] Results: [{'index': 0, 'error': True, 'tags': ['error'], 'content': 'litellm.InternalServerError: AnthropicError - {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}', 'url': 'https://www.worldofsweets.de/KitKat-Pink-Lemonade-42g.330819.html', 'success': True, 'error_message': '', 'status_code': 200, 'scraped_at': '2025-02-18T13:34:57.398961'}, {'product_name': 'KitKat Pink Lemonade Limited Edition 42g', 'brand_name': 'KitKat', 'product_code': '42g', 'categories': 'Sweets & Candy', 'description': 'KitKat Pink Lemonade Limited Edition 42g', 'features': 'Limited Edition', 'price': 0.0, 'currency': 'EUR', 'on_sale': 'false', 'strikethrough_price': '0.0', 'in_stock': 'true', 'stock_level': 'In Stock', 'price_per_unit': '0.0', 'bulk_pricing': 'false', 'minimum_order_quantity': '1', 'price_per_100_grams': '0.0', 'country_of_origin': 'unknown', 'pre_order_available': 'false', 'pre_order_date': 'unknown', 'back_in_stock_date': 'unknown', 'shipping_cost': 'unknown', 'shipping_time': 'unknown', 'delivery_time': 'unknown', 'shipping_restrictions': 'unknown', 'free_shipping_eligible': 'unknown', 'dimensions': 'unknown', 'weight': '42g', 'color': 'unknown', 'size': 'unknown', 'materials': 'unknown', 'package_contents': 'unknown', 'variations': 'unknown', 'image_url': 'https://cdn.public.steam-time.de/images/info/MMs_Block_Crispy_165g.jpg', 'rating': 0.0, 'reviews_count': 0, 'url': 'https://www.steam-time.de/kitkat-pink-lemonade-limited-edition-42g', 'success': True, 'error_message': '', 'status_code': 200, 'scraped_at': '2025-02-18T13:34:57.400937'}] [2025-02-18 14:34:57,503: WARNING/ForkPoolWorker-8] Successfully scraped https://www.worldofsweets.de/KitKat-Pink-Lemonade-42g.330819.html [2025-02-18 14:34:57,682: 
WARNING/ForkPoolWorker-8] Found ProductCompetitor: <db.models.ProductCompetitor object at 0x12031ba90> [2025-02-18 14:34:57,788: INFO/ForkPoolWorker-8] Task run_multi_scrape_task[68bb9b46-89b1-4395-b866-3b85147a44b2] succeeded in 29.545984874999704s: {'status': 'FAILED', 'error': ''price''}

kkarkos avatar Feb 18 '25 13:02 kkarkos

OK, I found out what the issue is. The page shows the main product and then, below it, a list of "recommended" products, so the scrape returns information about ALL products instead of only the one I am after. Probably a matter of tweaking the prompt.

EDIT: I did some more testing. I think LLMExtractionStrategy works on the markdown under the hood and splits it into blocks or similar. So if there are 8 blocks about the product, it returns one object for each. If, for example, the price appears in one block, only the object for that block gets a price; a block that does not mention a price yields an object without one.
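
To illustrate the suspected mechanism, here is a minimal sketch of threshold-based chunking. It is not crawl4ai's actual code: `split_into_chunks` and the word-based token estimate are hypothetical and only show why one long page can turn into several LLM calls, each returning its own object. The crawl4ai documentation describes chunking options on `LLMExtractionStrategy` (such as `chunk_token_threshold` and `apply_chunking`); whether they fully explain the behavior reported here is an assumption.

```python
from typing import List


def split_into_chunks(markdown: str, max_tokens: int,
                      words_per_token: float = 0.75) -> List[str]:
    """Hypothetical chunker: greedily pack paragraphs into a chunk until
    the estimated token count would exceed max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0.0
    for paragraph in markdown.split("\n\n"):
        if not paragraph.strip():
            continue
        # Rough estimate only; real tokenizers count differently.
        tokens = len(paragraph.split()) / words_per_token
        if current and current_tokens + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0.0
        current.append(paragraph)
        current_tokens += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Eight paragraphs of about 43 estimated tokens each, 100 tokens per chunk.
page = "\n\n".join(f"paragraph {i} " + "word " * 30 for i in range(8))
chunks = split_into_chunks(page, max_tokens=100)
print(f"{len(chunks)} chunks, so {len(chunks)} LLM calls in this sketch")
```

If each chunk is sent to the model with the full schema, a chunk that only contains the shop footer or the recommended products still produces an object, which would match the mostly empty or off-topic entries in the logs.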

Maybe it's a matter of tweaking the prompt further.

This addition to the prompt helped remove the unwanted products: "We are only interested in the values for the product called {product_name} with a price higher than 0 and only if a price is found. Do not include any other products in your response."
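
The same rule can also be enforced after extraction, in case the model ignores the instruction. A small sketch, assuming the extracted content has already been parsed into a list of dicts with `product_name` and `price` keys as in the logs above; `build_instruction` and `pick_product` are hypothetical helper names, not part of crawl4ai:

```python
from typing import Any, Dict, List, Optional

INSTRUCTION_TEMPLATE = (
    "We are only interested in the values for the product called {product_name} "
    "with a price higher than 0 and only if a price is found. "
    "Do not include any other products in your response."
)


def build_instruction(product_name: str) -> str:
    """Fill the product name into the prompt addition."""
    return INSTRUCTION_TEMPLATE.format(product_name=product_name)


def pick_product(items: List[Dict[str, Any]], product_name: str) -> Optional[Dict[str, Any]]:
    """Return the first item whose name contains product_name (ignoring case
    and spaces, so 'Kit Kat' matches 'KitKat') and whose price is a number > 0."""
    wanted = product_name.replace(" ", "").lower()
    for item in items:
        name = str(item.get("product_name", "")).replace(" ", "").lower()
        price = item.get("price")
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        if wanted in name and is_number and price > 0:
            return item
    return None


items = [
    {"index": 0, "error": True, "content": "Overloaded"},
    {"product_name": "Kit Kat Pink Lemonade Limited Edition 42g", "price": 0},
    {"product_name": "KitKat Pink Lemonade Limited Edition 42g", "price": 2.79},
]
best = pick_product(items, "KitKat Pink Lemonade")
print(best)
```

This does not reduce the number of LLM calls; it only avoids blindly taking the first item, which in the logs above was an entry with a price of 0.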

At the moment each block means a separate call to the LLM, which is pretty expensive. Maybe I will use Crawl4AI only to create the markdown and then call an LLM myself to create the JSON instead.
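
A rough sketch of that alternative: obtain the markdown once, then make a single LLM call for the JSON. `call_llm` is a hypothetical stand-in for whatever provider client is used, and fetching the markdown from the crawler is left out; the stub below exists only to make the sketch runnable.

```python
import json
from typing import Any, Callable, Dict, List


def extract_product(markdown: str, product_name: str,
                    call_llm: Callable[[str], str]) -> Dict[str, Any]:
    """Send the whole page in one prompt and parse a single JSON object."""
    prompt = (
        f"Extract the product called {product_name!r} from the page below. "
        "Reply with a single JSON object and nothing else.\n\n" + markdown
    )
    data = json.loads(call_llm(prompt))
    if isinstance(data, list):
        # Tolerate a model that wraps the object in a list.
        data = data[0] if data else {}
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object from the model")
    return data


# Stub standing in for a real provider call (assumption for illustration).
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return ('{"product_name": "KitKat Pink Lemonade Limited Edition 42g", '
            '"price": 2.79, "currency": "EUR"}')


result = extract_product("# KitKat Pink Lemonade Limited Edition 42g",
                         "KitKat Pink Lemonade", fake_llm)
print(result["price"], len(calls))
```

The trade-off is that the whole page goes into one request, so a very large page (one of the logged requests used 71053 input tokens) may need trimming before it is sent.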

{ "product_name": "KitKat Pink Lemonade 42g", "brand_name": "KitKat", "product_code": "330819", "description": "Die KitKat Pink Lemonade vereint eine knusprige Waffel mit einer HĂŒlle aus weißer Schokolade, die mit einem Pink-Lemonade-Geschmack verfeinert ist. Diese limitierte Edition sorgt fĂŒr ein sĂŒĂŸes und fruchtiges Geschmackserlebnis, das an den Sommer erinnert. Statt einer intensiven Zitronennote dominiert hier ein cremiger Erdbeer- und Zitrusgeschmack, der an fruchtige Limonade erinnert und durch die Knusperwaffel eine perfekte Textur erhĂ€lt.", "price": 299, "currency": "EUR", "on_sale": false, "in_stock": false, "price_per_100_grams": 71.19, "country_of_origin": "USA", "weight": 0.04 } ] [2025-02-19 14:46:48,196: WARNING/ForkPoolWorker-8] Just completed: https://www.steam-time.de/kitkat-pink-lemonade-limited-edition-42g [2025-02-19 14:46:48,197: WARNING/ForkPoolWorker-8] Extracted Data: [ { "product_name": "Kit Kat Pink Lemonade Limited Edition 42g", "brand_name": "Kit Kat", "product_code": "42g", "price": 0, "currency": "EUR" }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "42g", "categories": [ "Schokolade" ], "description": "Erleben Sie mit dem KitKat Pink Lemonade Limited Edition 42g die perfekte Kombination aus fruchtig-erfrischendem Geschmack und der unverwechselbaren Knusprigkeit eines KitKat-Riegels. Diese limitierte Sommeredition begeistert durch ihre zarte Zitronennote, die mit der knusprigen Waffel und der cremigen Schokoladenschicht harmoniert. 
Ob als sĂŒĂŸe Pause im Alltag oder als erfrischender Genuss an heißen Tagen – der KitKat Pink Lemonade Limited Edition 42g ist der perfekte Begleiter fĂŒr unvergessliche Sommertage.", "features": [ "Erfrischendes Geschmackserlebnis", "Perfekt fĂŒr den Sommer", "Limitierte VerfĂŒgbarkeit" ], "price": 0, "currency": "EUR", "on_sale": false, "in_stock": true, "stock_level": 1, "price_per_100_grams": 4.76, "country_of_origin": "USA", "dimensions": "42 g", "weight": 0.042, "image_url": "https://cdn.public.steam-time.de/images/info/KitKat_Pink_Lemonade_Limited_Edition_42g.jpg" }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "price": 2.79, "currency": "EUR" }, { "product_name": "Kit Kat Pink Lemonade Limited Edition 42g", "brand_name": "Kit Kat", "product_code": "42g", "price": 0, "currency": "EUR" }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "ST-CT-KIKA-AIO10056", "categories": [ "Sweets & Candies", "Schokolade", "KitKat" ], "description": "Schokolade − KitKat Pink Lemonade Limited Edition 42g", "price": 279, "currency": "EUR", "on_sale": false, "in_stock": true, "stock_level": null, "price_per_unit": 66.43, "price_per_100_grams": 66.43, "weight": 42, "image_url": "https://cdn.public.steam-time.de/images/org/KitKat_Pink_Lemonade_Limited_Edition_42g.jpg" }, { "product_name": "KitKat Pink Lemonade Limited Edition 42g", "brand_name": "KitKat", "product_code": "42g", "description": "Der KitKat Pink Lemonade Limited Edition 42g kombiniert den fruchtig-spritzigen Geschmack von Zitronenlimonade mit der unverwechselbaren Knusprigkeit eines KitKat-Riegels – perfekt fĂŒr heiße Sommertage! Genießen Sie die erfrischende Pink Lemonade-Note, die in der cremigen Schokolade mit knuspriger Waffel verschmilzt – ein einzigartiges Geschmackserlebnis, das Ihre Pausen aufwertet. 
Mit seiner handlichen GrĂ¶ĂŸe von 42g ist der KitKat Pink Lemonade Limited Edition 42g der perfekte Begleiter fĂŒr unterwegs – ideal fĂŒr den Strand, das Picknick oder einfach als sĂŒĂŸer Snack zwischendurch. Diese sommerliche Edition ist nur fĂŒr kurze Zeit erhĂ€ltlich, sodass sie ein echtes SammlerstĂŒck fĂŒr KitKat-Fans und Schokoladenliebhaber darstellt. Lassen Sie sich von der Kombination aus Pink Lemonade und KitKat-Waffel verfĂŒhren. Der KitKat Pink Lemonade Limited Edition 42g sorgt fĂŒr unvergessliche Genussmomente und bringt den Sommer in jede Pause.", "features": [ "Erfrischender Sommergenuss", "Innovative Rezeptur", "Ideal fĂŒr unterwegs", "Exklusiv und limitiert", "Sommerliche Köstlichkeit" ], "price": 0, "currency": "EUR", "on_sale": false, "in_stock": true, "stock_level": 0, "weight": 42 }

kkarkos avatar Feb 19 '25 11:02 kkarkos

Hi @kkarkos, I just want to follow up with you on this issue. Are you still facing the same issue with our new release?

Ahmed-Tawfik94 avatar Aug 04 '25 09:08 Ahmed-Tawfik94

I'll close this issue, but feel free to continue the conversation and tag me if the issue persists with our latest version: 0.7.7.

ntohidi avatar Nov 14 '25 10:11 ntohidi