[Bug]: AdaptiveCrawler - Embedding strategy broken: query expansion uses hardcoded mock data instead of LLM
crawl4ai version
0.7.7
Expected Behavior
When using the embedding strategy in AdaptiveCrawler, the system should:
- Take the user's query (e.g., "who is on the board of directors")
- Use an LLM to generate semantic variations of that query (e.g., "executive team members", "leadership board composition", "directors and executives")
- Create embeddings for these query variations to represent the semantic space of what the user is looking for
- Use these embeddings to:
  - Score and select relevant links to crawl
  - Measure coverage of the query space
  - Determine when sufficient information has been gathered
- Return meaningful confidence scores based on actual semantic similarity between the crawled content and the query (see the sketch after this list)
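For illustration, here is a minimal sketch of that intended flow. It assumes the perform_completion_with_backoff helper referenced in the disabled code shown below (its import path is a guess) and a standard sentence-transformers setup; the expand_query_with_llm name, the prompt wording, the response parsing, and the provider/API-key values are hypothetical, not crawl4ai's actual internals.

import json

from sentence_transformers import SentenceTransformer, util
# perform_completion_with_backoff is the helper referenced in the commented-out
# code in adaptive_crawler.py; the import path here is assumed.
from crawl4ai.utils import perform_completion_with_backoff


def expand_query_with_llm(query: str, provider: str, api_token: str) -> list[str]:
    # Hypothetical helper: ask the LLM for semantic variations of the user's query.
    prompt = (
        f'Generate 8 diverse rephrasings of the query "{query}". '
        'Respond with JSON of the form {"queries": [...]}.'
    )
    response = perform_completion_with_backoff(
        provider=provider,
        prompt_with_variables=prompt,
        api_token=api_token,
        json_response=True,
    )
    # Response parsing is an assumption about the provider's return shape.
    variations = json.loads(response.choices[0].message.content)
    return [query] + variations["queries"]


# Embed the original query plus its variations to span the query's semantic space,
# then score crawled content by cosine similarity against that space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
queries = expand_query_with_llm("who is on the board of directors", "openai/gpt-4o-mini", "YOUR_API_KEY")
query_space = model.encode(queries)
page_embedding = model.encode("Our board of directors consists of ...")
coverage = util.cos_sim(page_embedding, query_space).max().item()

Presumably the similarity of crawled pages against these variation embeddings is what should drive link scoring and the stopping confidence; with the mock data, that similarity is computed against an unrelated topic.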
Current Behavior
The map_query_semantic_space function (line 731 in adaptive_crawler.py) has the LLM call commented out and uses hardcoded mock data instead:
# The actual LLM call is disabled:
# response = perform_completion_with_backoff(
#     provider=provider,
#     prompt_with_variables=prompt,
#     api_token=api_token,
#     json_response=True
# )

# Hardcoded mock data is used instead:
variations = {'queries': ['what are the best vegetables to use in fried rice?', 'how do I make vegetable fried rice from scratch?', ...]}
This causes:
- Every query is compared against embeddings about "vegetable fried rice", regardless of the actual user query
- There is zero semantic similarity between the user's actual query and the mock query variations
- The crawler immediately stops with 0% confidence, concluding the content is completely irrelevant
- Stats show: Unique Terms: 0, Total Terms: 0, Pages Crawled: 1, Confidence: 0.00%
- The embedding strategy is completely non-functional for any real-world use case
Example:
# User's query: "who is on the board of directors"
# System actually searches for: "vegetable fried rice recipes"
# Result: 0% match, stops immediately
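A likely fix is simply to re-enable the disabled call and parse its JSON response instead of returning the hardcoded dictionary. A sketch only, assuming the commented-out signature still matches the current helper and that the response exposes the usual choices[0].message.content shape:

# Inside map_query_semantic_space (sketch, not a tested patch):
response = perform_completion_with_backoff(
    provider=provider,
    prompt_with_variables=prompt,
    api_token=api_token,
    json_response=True,
)
variations = json.loads(response.choices[0].message.content)  # assumed response shape
# ...then continue with variations['queries'] derived from the user's real query,
# instead of the hardcoded fried-rice examples.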
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
import asyncio
from crawl4ai.async_webcrawler import AsyncWebCrawler
from crawl4ai.adaptive_crawler import AdaptiveConfig, AdaptiveCrawler
from crawl4ai.async_configs import LLMConfig
import os


async def run_adaptive_crawl():
    start_url = "https://danica.dk/en/personal"
    query = "who is on the board of directors"

    adaptive_config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.8,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    )

    async with AsyncWebCrawler() as crawler:
        adaptive_crawler = AdaptiveCrawler(
            crawler=crawler,
            config=adaptive_config
        )

        print(f"Starting adaptive crawl for URL: {start_url} with query: '{query}'")
        print(f"Using strategy: {adaptive_config.strategy}")

        crawl_state = await adaptive_crawler.digest(
            start_url=start_url,
            query=query
        )

        print("\nAdaptive Crawl Completed.")
        adaptive_crawler.print_stats(detailed=True)


if __name__ == "__main__":
    asyncio.run(run_adaptive_crawl())
OS
macOS
Python version
3.12
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response