crawl4ai icon indicating copy to clipboard operation
crawl4ai copied to clipboard

[Bug]: AdaptiveCrawler - Embedding strategy broken: query expansion uses hardcoded mock data instead of LLM

Open ntohidi opened this issue 3 weeks ago • 1 comments

crawl4ai version

0.7.7

Expected Behavior

When using the embedding strategy in AdaptiveCrawler, the system should:

  1. Take the user's query (e.g., "who is on the board of directors")
  2. Use an LLM to generate semantic variations of that query (e.g., "executive team members", "leadership board composition", "directors and executives")
  3. Create embeddings for these query variations to represent the semantic space of what the user is looking for
  4. Use these embeddings to: - Score and select relevant links to crawl - Measure coverage of the query space - Determine when sufficient information has been gathered
  5. Return meaningful confidence scores based on actual semantic similarity between crawled content and the query

Current Behavior

The map_query_semantic_space function (line 731 in adaptive_crawler.py) has the LLM call commented out and uses hardcoded mock data instead:

  # The actual LLM call is disabled:
  # response = perform_completion_with_backoff(
  #     provider=provider,
  #     prompt_with_variables=prompt,
  #     api_token=api_token,
  #     json_response=True
  # )

  # Hardcoded mock data is used instead:
  variations = {'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', ...]}

This causes:

  • All queries are compared against embeddings about "vegetable fried rice" regardless of actual user query
  • Zero semantic similarity between user's actual query and the mock query variations
  • Crawler immediately stops with 0% confidence, thinking content is completely irrelevant
  • Stats show: Unique Terms: 0, Total Terms: 0, Pages Crawled: 1, Confidence: 0.00%
  • The embedding strategy is completely non-functional for any real-world use case

Example:

  # User's query: "who is on the board of directors"
  # System actually searches for: "vegetable fried rice recipes"
  # Result: 0% match, stops immediately

Is this reproducible?

Yes

Inputs Causing the Bug


Steps to Reproduce


Code snippets

import asyncio
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.adaptive_crawler import AdaptiveConfig, AdaptiveCrawler
from crawl4ai.async_configs import LLMConfig
import os

async def run_adaptive_crawl():
    start_url = "https://danica.dk/en/personal"
    query = "who is on the board of directors"

    adaptive_config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.8,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",

    )

    async with AsyncWebCrawler() as crawler:
        adaptive_crawler = AdaptiveCrawler(
            crawler=crawler,
            config=adaptive_config
        )

        print(f"Starting adaptive crawl for URL: {start_url} with query: '{query}'")
        print(f"Using strategy: {adaptive_config.strategy}")

        crawl_state = await adaptive_crawler.digest(
            start_url=start_url,
            query=query
        )

        print("\nAdaptive Crawl Completed.")
        adaptive_crawler.print_stats(detailed=True)

if __name__ == "__main__":
    asyncio.run(run_adaptive_crawl())

OS

macOS

Python version

3.12

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

ntohidi avatar Nov 17 '25 15:11 ntohidi