[Bug]: AdaptiveCrawler - Embedding strategy broken: query expansion uses hardcoded mock data instead of LLM
crawl4ai version
0.7.7
Expected Behavior
When using the embedding strategy in AdaptiveCrawler, the system should:
- Take the user's query (e.g., "who is on the board of directors")
- Use an LLM to generate semantic variations of that query (e.g., "executive team members", "leadership board composition", "directors and executives")
- Create embeddings for these query variations to represent the semantic space of what the user is looking for
- Use these embeddings to:
  - Score and select relevant links to crawl
  - Measure coverage of the query space
  - Determine when sufficient information has been gathered
- Return meaningful confidence scores based on actual semantic similarity between the crawled content and the query (see the sketch after this list)
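For illustration, here is a minimal sketch of that intended flow. It assumes the perform_completion_with_backoff helper referenced in the disabled code shown below (its import path is a guess) and a standard sentence-transformers setup; the expand_query_with_llm name, the prompt wording, the response parsing, and the provider/API-key values are hypothetical, not crawl4ai's actual internals.

import json

from sentence_transformers import SentenceTransformer, util
# perform_completion_with_backoff is the helper referenced in the commented-out
# code in adaptive_crawler.py; the import path here is assumed.
from crawl4ai.utils import perform_completion_with_backoff


def expand_query_with_llm(query: str, provider: str, api_token: str) -> list[str]:
    # Hypothetical helper: ask the LLM for semantic variations of the user's query.
    prompt = (
        f'Generate 8 diverse rephrasings of the query "{query}". '
        'Respond with JSON of the form {"queries": [...]}.'
    )
    response = perform_completion_with_backoff(
        provider=provider,
        prompt_with_variables=prompt,
        api_token=api_token,
        json_response=True,
    )
    # Response parsing is an assumption about the provider's return shape.
    variations = json.loads(response.choices[0].message.content)
    return [query] + variations["queries"]


# Embed the original query plus its variations to span the query's semantic space,
# then score crawled content by cosine similarity against that space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
queries = expand_query_with_llm("who is on the board of directors", "openai/gpt-4o-mini", "YOUR_API_KEY")
query_space = model.encode(queries)
page_embedding = model.encode("Our board of directors consists of ...")
coverage = util.cos_sim(page_embedding, query_space).max().item()

Presumably the similarity of crawled pages against these variation embeddings is what should drive link scoring and the stopping confidence; with the mock data, that similarity is computed against an unrelated topic.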
Current Behavior
The map_query_semantic_space function (line 731 in adaptive_crawler.py) has the LLM call commented out and uses hardcoded mock data instead:
# The actual LLM call is disabled:
# response = perform_completion_with_backoff(
#     provider=provider,
#     prompt_with_variables=prompt,
#     api_token=api_token,
#     json_response=True
# )

# Hardcoded mock data is used instead:
variations = {'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', ...]}
This causes:
- Every query is compared against embeddings about "vegetable fried rice", regardless of the actual user query
- There is zero semantic similarity between the user's actual query and the mock query variations
- The crawler immediately stops with 0% confidence, concluding the content is completely irrelevant
- Stats show: Unique Terms: 0, Total Terms: 0, Pages Crawled: 1, Confidence: 0.00%
- The embedding strategy is completely non-functional for any real-world use case
Example:
# User's query: "who is on the board of directors"
# System actually searches for: "vegetable fried rice recipes"
# Result: 0% match, stops immediately
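A likely fix is simply to re-enable the disabled call and parse its JSON response instead of returning the hardcoded dictionary. A sketch only, assuming the commented-out signature still matches the current helper and that the response exposes the usual choices[0].message.content shape:

# Inside map_query_semantic_space (sketch, not a tested patch):
response = perform_completion_with_backoff(
    provider=provider,
    prompt_with_variables=prompt,
    api_token=api_token,
    json_response=True,
)
variations = json.loads(response.choices[0].message.content)  # assumed response shape
# ...then continue with variations['queries'] derived from the user's real query,
# instead of the hardcoded fried-rice examples.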
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
import asyncio
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.adaptive_crawler import AdaptiveConfig, AdaptiveCrawler
from crawl4ai.async_configs import LLMConfig
import os


async def run_adaptive_crawl():
    start_url = "https://danica.dk/en/personal"
    query = "who is on the board of directors"

    adaptive_config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.8,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    )

    async with AsyncWebCrawler() as crawler:
        adaptive_crawler = AdaptiveCrawler(
            crawler=crawler,
            config=adaptive_config
        )

        print(f"Starting adaptive crawl for URL: {start_url} with query: '{query}'")
        print(f"Using strategy: {adaptive_config.strategy}")

        crawl_state = await adaptive_crawler.digest(
            start_url=start_url,
            query=query
        )

        print("\nAdaptive Crawl Completed.")
        adaptive_crawler.print_stats(detailed=True)


if __name__ == "__main__":
    asyncio.run(run_adaptive_crawl())
OS
macOS
Python version
3.12
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response