Input Length Exceeds Maximum Limit in Llama 8B Model API (Deep Infra)
Hi,
I am using Deep Infra's model API, specifically the Llama 8B model, to scrape product data from e-commerce websites. However, for certain websites, such as Amazon, I encounter the following error:
```json
{
  "index": 0,
  "error": true,
  "tags": ["error"],
  "content": "litellm.APIError: APIError: DeepinfraException - Error code: 500 - {'error': {'message': 'Requested input length 10452 exceeds maximum input length 8191'}}"
}
```
Is there a way to increase the input length, or to otherwise handle longer inputs? If not, do you recommend any strategies for managing this limitation?
@sanchitsingh001 I assume you are using the LLM extraction strategy; such limits relate to the model itself. However, you can work around the issue. The LLM extraction strategy can chunk the content into smaller pieces, send each chunk to the LLM in parallel, and then combine the results. You can't adjust that threshold size yourself. Share your code snippet and the URL with me, and I will show you how to do that.
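Conceptually, the idea looks roughly like this (a simplified sketch, not the actual implementation inside the strategy; `call_llm` and the chunk size are placeholders for whatever client and budget you end up using):

```python
import asyncio

CHUNK_SIZE = 4000  # illustrative character budget, kept well under the model's input limit


def split_into_chunks(text: str, size: int = CHUNK_SIZE) -> list[str]:
    # Naive fixed-size split; the real strategy is smarter about chunk boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]


async def extract_from_chunks(text: str, call_llm) -> list[dict]:
    # Send every chunk to the LLM in parallel, then merge the partial results.
    chunks = split_into_chunks(text)
    partial_results = await asyncio.gather(*(call_llm(chunk) for chunk in chunks))
    merged: list[dict] = []
    for result in partial_results:
        merged.extend(result)
    return merged
```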
I am working on a set of new documents where I explain the different strategies you can use. It is currently in draft mode. I will give you the links so you can check them and get some ideas.
https://github.com/unclecode/crawl4ai/blob/main/docs/md_v3/tutorials/json-extraction-basic.md
https://github.com/unclecode/crawl4ai/blob/main/docs/md_v3/tutorials/json-extraction-llm.md
Thank you for your detailed response and for sharing the helpful documentation links. I have attached the requested code snippet and the product page URL for reference.
To provide more context, I primarily scrape e-commerce websites like Amazon and eBay to extract product details. However, I have encountered some challenges:
- Hallucinated responses: The structured data returned often includes hallucinated entries. For example, if I provide a page containing a single product, the response may include a list of products that do not exist.
- Performance requirements: I need to scrape and process approximately 90-100 product pages at a time, converting the content into structured data within 5 seconds. Achieving this level of performance has been challenging.

Given these constraints, I have the following questions:
1. Is using a more advanced LLM the only way to ensure highly accurate and reliable structured data?
2. For the performance bottleneck, is hardware the primary limitation, or are there additional optimizations I could consider? I know tools like Perplexity and some ChatGPT versions can retrieve and process web data quickly, so I believe this level of efficiency is achievable.

Any guidance or resources you could provide to address these challenges would be greatly appreciated.
Here's my code:

```python
import os
import json
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field


# Define the schema for product details
class ProductDetails(BaseModel):
    product_name: str = Field(..., description="Name of the product.")
    price: str = Field(..., description="Price of the product.")
    rating: str = Field(..., description="Rating of the product.")
    reviews_count: str = Field(..., description="Number of reviews for the product.")
    availability: str = Field(..., description="Availability status of the product.")
    product_description: str = Field(..., description="Detailed description of the product.")
    features: list[str] = Field(..., description="List of features or specifications of the product.")


# Function to extract product details from the webpage
async def extract_product_details():
    url = 'https://www.amazon.com/s?k=shoes&crid=1M0PZKQQQ7OYT&sprefix=shoe%2Caps%2C191&ref=nb_sb_noss_2'

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
                api_token="",
                base_url="https://api.deepinfra.com/v1/openai",
                schema=ProductDetails.model_json_schema(),
                extraction_type="schema",
                instruction=(
                    "From the crawled content, extract the product's name, price, rating, number of reviews, "
                    "availability status, detailed description, and list of features or specifications. "
                    "Ensure all information is accurate and comprehensive. "
                    'An example JSON format for a single product: '
                    '{ "product_name": "PUMA Tazon Running Shoe", "price": "$50.00", "rating": "4.5 stars", '
                    '"reviews_count": "1,200", "availability": "In Stock", '
                    '"product_description": "Durable and comfortable running shoes.", '
                    '"features": ["Rubber sole", "Mesh upper for breathability", "Padded collar and tongue"] }'
                )
            ),
            bypass_cache=True,
        )

        product_details = json.loads(result.extracted_content)
        print(f"Extracted product details: {product_details}")

        # Make sure the output directory exists before writing the results
        os.makedirs(".data", exist_ok=True)
        with open(".data/product_details.json", "w", encoding="utf-8") as f:
            json.dump(product_details, f, indent=2)


asyncio.run(extract_product_details())
```
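To make the performance question concrete, the kind of batching I have in mind looks roughly like this (a sketch only; `urls` and `make_strategy` are placeholders for my real list of product pages and the extraction strategy shown above):

```python
import asyncio
from crawl4ai import AsyncWebCrawler


async def crawl_all(urls, make_strategy, max_concurrency=10):
    # Cap how many pages are crawled and extracted at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async with AsyncWebCrawler(verbose=True) as crawler:
        async def crawl_one(url):
            async with semaphore:
                return await crawler.arun(
                    url=url,
                    word_count_threshold=1,
                    extraction_strategy=make_strategy(),
                    bypass_cache=True,
                )

        # Fan out all pages concurrently and wait for every result.
        return await asyncio.gather(*(crawl_one(u) for u in urls))
```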
@sanchitsingh001 You're welcome. Sure, I will take a look at this over the coming weekend.
Thank You