[Bug]: EmbeddingStrategy mixes generator & embedding configs + leftover mock causes “fried rice” variations and 403s

Open • NickKotte opened this issue 1 month ago • 1 comment

crawl4ai version

0.7.6

Expected Behavior

Adaptive crawler configured with:
`strategy="embedding",
embedding_model="openai/text-embedding-3-small",
embedding_llm_config=LLMConfig(
  provider="openai/gpt-4o-mini",  # openai/text-embedding-3-small fails here too
  api_token=OPENAI_API_KEY,
),`

Query: 'capabilities, services, certifications, description of work, products'

Query Expansion: Original query expanded to 12 variations

  1. Where can I compare prices for various products?
  2. What new products have been launched this year?
  3. What products are recommended for pet owners?
  4. What are the must-have products for outdoor activities?

Current Behavior

•	Query variations are replaced by a hard-coded “fried rice” list.
•	embedding_llm_config is reused for both generation and embeddings, so the wrong provider/model can hit the wrong API:
	•	Chat model sent to embeddings endpoint → 403.
	•	Embedding model used as a “provider” for text generation → failures or zero variations.
•	Embedding dimension sometimes mismatches the configured embedding_model.
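
One way to picture the separation this report is asking for is to resolve the generation model and the embedding model from separate settings, and to reject a setup that would route a chat model to the embeddings endpoint (or the reverse). The sketch below is illustrative only: `ModelRoles`, `resolve_model_roles`, and the name-based `looks_like_embedding_model` heuristic are hypothetical and are not part of crawl4ai.

```python
from dataclasses import dataclass


def looks_like_embedding_model(provider: str) -> bool:
    """Crude, name-based heuristic (an assumption, not a crawl4ai API)."""
    return "embedding" in provider.lower()


@dataclass(frozen=True)
class ModelRoles:
    generator: str  # chat model used for query expansion
    embedder: str   # embedding model used to produce vectors


def resolve_model_roles(embedding_model: str, llm_provider: str) -> ModelRoles:
    """Keep the two roles separate and fail fast on an obvious mix-up."""
    if not looks_like_embedding_model(embedding_model):
        raise ValueError(f"{embedding_model!r} does not look like an embedding model")
    if looks_like_embedding_model(llm_provider):
        raise ValueError(f"{llm_provider!r} looks like an embedding model, not a chat model")
    return ModelRoles(generator=llm_provider, embedder=embedding_model)


roles = resolve_model_roles(
    embedding_model="openai/text-embedding-3-small",
    llm_provider="openai/gpt-4o-mini",
)
print(roles)
```

With roles resolved up front, passing `openai/gpt-4o-mini` as the embedding model would raise a local `ValueError` instead of surfacing later as a 403 from the provider.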

Is this reproducible?

Yes

Inputs Causing the Bug

Line 700: the map_query_semantic_space function uses leftover mock data.
It also doesn't use the correct model for query expansion or for embedding.
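
A leftover mock of this kind can be caught by a cheap regression check that the generated variations share at least some vocabulary with the query. The helper below is a hypothetical sketch based on plain token overlap; it is not how crawl4ai validates anything, and the function names are made up for illustration. The sample strings are taken from the output quoted in this report.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased words of 4+ letters; a rough stand-in for stopword removal."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 4}


def related_fraction(query: str, variations: list[str]) -> float:
    """Fraction of variations sharing at least one content word with the query."""
    if not variations:
        return 0.0
    query_words = content_words(query)
    hits = sum(1 for v in variations if content_words(v) & query_words)
    return hits / len(variations)


query = "capabilities, services, certifications, description of work, products"

# Variations of the kind the reporter expected to see.
expected = [
    "Where can I compare prices for various products?",
    "What new products have been launched this year?",
]
# Variations actually observed in the logs.
observed = [
    "how to add flavor to vegetable fried rice?",
    "what are the best vegetables to use in fried rice?",
]

print(related_fraction(query, expected))
print(related_fraction(query, observed))
```

A score near zero for every query is a strong hint that the variations are not derived from the query at all.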

Steps to Reproduce

A) Hard-coded query variations:
1.	Use strategy="embedding" and call AdaptiveCrawler.digest(...).
2.	Observe that the variations list is always food-related (“fried rice…”), regardless of the query.
B) 403 when embeddings are requested:
1.	Create the config with a chat model as the embedding_llm_config provider:
`AdaptiveConfig(
  strategy="embedding",
  embedding_model="openai/text-embedding-3-small",
  embedding_llm_config=LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token=OPENAI_API_KEY,
  ),
  n_query_variations=12,
)`
2.	Run digest(...).
3.	Intermittently, the run either fails with `403 - You are not allowed to generate embeddings from this model` or ends with variations: 0. The embedding dimensions and behavior are also inconsistent with the configured model.
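
The dimension mismatch can be made visible with a small guard run on the returned vectors. This is a sketch under assumptions: `EXPECTED_DIMS` and `check_embedding_dim` are hypothetical names, and the only entry listed is the default output size of 1536 that OpenAI documents for text-embedding-3-small; other models would need their own entries.

```python
EXPECTED_DIMS = {
    # Default output size documented by OpenAI for this model.
    "openai/text-embedding-3-small": 1536,
}


def check_embedding_dim(embedding_model: str, vector: list[float]) -> None:
    """Raise if a vector's length disagrees with the configured model."""
    expected = EXPECTED_DIMS.get(embedding_model)
    if expected is None:
        return  # unknown model: nothing to compare against
    if len(vector) != expected:
        raise ValueError(
            f"{embedding_model} should return {expected} dims, got {len(vector)}"
        )


# A vector of any other length suggests some other model produced it.
try:
    check_embedding_dim("openai/text-embedding-3-small", [0.0] * 384)
except ValueError as exc:
    print(exc)
```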

Code snippets

adaptive = AdaptiveCrawler(crawler, adaptive_cfg)
result = await adaptive.digest(start_url=start_url, query=query)

Or simply run the adaptive crawler example available in the crawl4ai repository.

OS

macOS

Python version

3.13.5

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

litellm.exceptions.BadRequestError: litellm.BadRequestError: OpenAIException - Error code: 403 - {'error': {'message': 'You are not allowed to generate embeddings from this model', 'type': 'invalid_request_error', 'param': None, 'code': None}}

and

Adaptive Crawl Stats - Query: 'capabilities, services, certifications, description of work, products'
Query Expansion: Original query expanded to 4 variations

  1. how to add flavor to vegetable fried rice?
  2. what are the best vegetables to use in fried rice?
  3. are there any tips for making healthy fried rice with vegetables?
  4. how do I make vegetable fried rice from scratch? ...

NickKotte, Oct 29 '25 15:10