[BUG] Knowledge source metadata generation doesn't work (and possibly the knowledge store as a whole)
### Description
So, I've been trying to use the crewAI docs as the knowledge source for a crew.
Along the way, I found the following errors:
- `StringKnowledgeSource` accepts a single string as a knowledge source and breaks it into chunks, but it does not generate metadata for the chunks, even though the metadata is required. This produces errors like:
```
[WARNING]: Failed to init knowledge: Unequal lengths for fields: ids: 487, metadatas: 1, documents: 487 in upsert.
```
when used as:
```python
StringKnowledgeSource(
    content=load_crewai_docs(),  # load_crewai_docs returns a single-line string
)
```
- There is a typo in the docs (knowledge section, sample custom knowledge source):
```python
def add(self) -> None:
    """Process and store the articles."""
    content = self.load_content()
    for _, text in content.items():
        chunks = self._chunk_text(text)
        self.chunks.extend(chunks)
    self._save_documents()
```
The parent class has a `save_documents()` method, without the `_` prefix.
- I made a custom `LocalTxTFileKnowledgeSource` where I included dummy metadata generation:
```python
import uuid
from typing import Dict

from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from pydantic import Field


class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")

    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()
        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)
        chunks_metadata = [
            {
                "chunk_id": str(uuid.uuid4()),
                "source": self.file_path,
                "description": f"Chunk {i + 1} from file {self.file_path}"
            }
            for i in range(len(self.chunks))  # one entry per stored chunk
        ]
        self.save_documents(metadata=chunks_metadata)
```
(My guess is that the metadata entries must contain some summary headers instead?)
When used in the crew settings:
```python
knowledge_sources=[LocalTxTFileKnowledgeSource(
    file_path="docs_crewai/singlefile.txt",
)]
```
I didn't get any errors, and it seems the documents were saved. However, the agents within the crew behaved as if the data was not accessed, so maybe it wasn't loaded somehow? Or was loaded in some incorrect format?
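For what it's worth, the "Unequal lengths" upsert error above boils down to a simple invariant: the vector store expects exactly one metadata entry per document. A framework-free sketch of per-chunk metadata generation (the helper name and fields here are illustrative, not crewAI API):

```python
import uuid

def build_chunk_metadata(chunks, source):
    """Return one metadata dict per chunk, so that the ids, metadatas,
    and documents lists all have equal lengths at upsert time."""
    return [
        {
            "chunk_id": str(uuid.uuid4()),
            "source": source,
            "description": f"Chunk {i + 1} from file {source}",
        }
        for i in range(len(chunks))
    ]

chunks = ["first chunk", "second chunk", "third chunk"]
metadata = build_chunk_metadata(chunks, "docs_crewai/singlefile.txt")
assert len(metadata) == len(chunks)  # the invariant the upsert enforces
```

Passing a single dict (one metadata entry for 487 chunks) is exactly what triggers `ids: 487, metadatas: 1`.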
### Steps to Reproduce
Make a custom crew:
```python
@CrewBase
class CustomAgents:
    agents_config = 'config/agents.yaml'
    tasks_config = 'config/tasks.yaml'

    gpt4o_mini = LLM(
        model="gpt-4o-mini",
    )
    gpt4o = LLM(
        model="gpt-4o",
    )

    @agent
    def agent_qa(self):
        return Agent(
            config=self.agents_config['agent_qa'],
            verbose=True,
            llm=self.gpt4o_mini,
            max_iter=3
        )

    @task
    def test_crew(self) -> Task:
        return Task(
            config=self.tasks_config['test_crew'],
        )

    @crew
    def crew(self) -> Crew:
        return Crew(
            agents=self.agents,
            tasks=[self.test_crew()],
            process=Process.sequential,
            memory=True,
            verbose=True,
            knowledge_sources=[StringKnowledgeSource(
                content="long string",
            )],
        )
```
and a custom knowledge source:
```python
class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")

    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()
        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)
        chunks_metadata = [
            {
                "chunk_id": str(uuid.uuid4()),
                "source": self.file_path,
                "description": f"Chunk {i + 1} from file {self.file_path}"
            }
            for i in range(len(self.chunks))  # one entry per stored chunk
        ]
        self.save_documents(metadata=chunks_metadata)
```
Then try running it, first with the string source, then with the `LocalTxTFileKnowledgeSource` (remove the metadata generation).
Then run the crew with a task that expects uploaded source knowledge from the agent.
### Expected behavior
1. In the case of the string knowledge source, it is saved without the metadata error.
2. In the case of the custom source, it is saved without the metadata error (supposing I don't provide the dummy one).
Afterwards, on `crew.kickoff` the agent is expected to be aware of the uploaded data.
### Screenshots/Code snippets
Provided above.
### Operating System
macOS Sonoma
### Python Version
3.12
### crewAI Version
crewai==0.86.0
### crewAI Tools Version
crewai-tools==0.17.0
### Virtual Environment
Venv
### Evidence
```
(.venv) *@*-MacBook-Pro SomeCrew % python main.py
## Welcome to CrewAI!
-------------------------------
[2024-12-13 12:36:44][ERROR]: Failed to upsert documents: Unequal lengths for fields: ids: 487, metadatas: 1, documents: 487 in upsert.
[2024-12-13 12:36:44][WARNING]: Failed to init knowledge: Unequal lengths for fields: ids: 487, metadatas: 1, documents: 487 in upsert.
```
### Possible Solution
1. Add metadata generation when chunking the text, or remove the metadata requirement.
2. Check whether agents are accessing the stored knowledge sources.
### Additional context
Single-string data used: https://docs.crewai.com/llms-full.txt
Thanks for reporting this @opahopa. We are waiting on the new version cut that will remove the metadata requirement. For now, you can pass a simple metadata param, but it won't do anything in the grand scheme of the extraction.
Sure, thanks for the answer! Can you suggest how I can debug whether the data from the knowledge source was passed to the crew? Because at the moment, within my setup, it seems like it isn't accessible by the agents.
Update: found `crew.query_knowledge`.
Hi, I just want to recognise @opahopa for their temporary solution, and let them know that it helped me get my crew working.
Sharing my working implementation for local text file knowledge:
```python
from crewai import Agent, Crew, Process, Task, LLM
from crewai.project import CrewBase, agent, crew, task, before_kickoff, after_kickoff
# from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
# from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource

# Knowledge temporary fix
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from pydantic import Field
from typing import Dict
import uuid

# AgentOps agent observability
import agentops

# Start AgentOps monitoring
agentops.init()

# Create an LLM with a temperature of 0 to ensure deterministic outputs
# llm_reviewer = LLM(model="gpt-4o-mini", temperature=0)
# Create an LLM with a temperature of 0.1 to allow some creativity.
# llm = LLM(model="gpt-4o-mini", temperature=.1)


class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")

    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()
        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)
        chunks_metadata = [
            {
                "chunk_id": str(uuid.uuid4()),
                "source": self.file_path,
                "description": f"Chunk {i + 1} from file {self.file_path}"
            }
            for i in range(len(self.chunks))  # one entry per stored chunk
        ]
        self.save_documents(metadata=chunks_metadata)


@CrewBase
class Demo2():
    """Demo2 crew"""
    agents_config = 'config/agents.yaml'
    tasks_config = 'config/tasks.yaml'

    @before_kickoff  # Optional hook to be executed before the crew starts
    def pull_data_example(self, inputs):
        # Example of pulling data from an external API, dynamically changing the inputs
        # inputs['extra_data'] = "This is extra data"
        # return inputs
        return

    @after_kickoff  # Optional hook to be executed after the crew has finished
    def log_results(self, output):
        # Example of logging results, dynamically changing the output
        print(f"Results: {output}")
        return output

    @agent
    def reviewer(self) -> Agent:
        return Agent(
            config=self.agents_config['reviewer'],
            memory=True,
            verbose=True,
            max_rpm=10  # Limit API calls
            # llm=llm_reviewer
        )

    @agent
    def lesson_planner(self) -> Agent:
        return Agent(
            config=self.agents_config['lesson_planner'],
            memory=True,
            verbose=True,
            max_rpm=10  # Limit API calls
        )

    @agent
    def content_writer(self) -> Agent:
        return Agent(
            config=self.agents_config['content_writer'],
            memory=True,
            verbose=True,
            max_rpm=10  # Limit API calls
        )

    @agent
    def communicator(self) -> Agent:
        return Agent(
            config=self.agents_config['communicator'],
            memory=True,
            verbose=True,
            max_rpm=10  # Limit API calls
        )

    @task
    def documentation_review_task(self) -> Task:
        return Task(
            config=self.tasks_config['documentation_review_task'],
            output_file='outputs/1_documentation_review_task.md'
        )

    @task
    def lesson_plan_task(self) -> Task:
        return Task(
            config=self.tasks_config['lesson_plan_task'],
            output_file='outputs/2_lesson_plan_task.md'
        )

    @task
    def lesson_content_creation_task(self) -> Task:
        return Task(
            config=self.tasks_config['lesson_content_creation_task'],
            output_file='outputs/3_lesson_content_creation_task.md'
        )

    @task
    def learning_path_description_task(self) -> Task:
        return Task(
            config=self.tasks_config['learning_path_description_task'],
            output_file='outputs/4_learning_path_description_task.md'
        )

    @task
    def announcement_post_task(self) -> Task:
        return Task(
            config=self.tasks_config['announcement_post_task'],
            output_file='outputs/5_announcement_post_task.md'
        )

    @crew
    def crew(self) -> Crew:
        """Creates the Demo2 crew"""
        # pdf_source = PDFKnowledgeSource(file_path="filename.pdf")
        # txt_source = TextFileKnowledgeSource(file_path="filename.txt")
        local_txt_source = LocalTxTFileKnowledgeSource(file_path="knowledge/filename.txt")
        return Crew(
            agents=self.agents,  # Automatically created by the @agent decorator
            tasks=self.tasks,  # Automatically created by the @task decorator
            process=Process.sequential,
            verbose=True,
            knowledge_sources=[local_txt_source],
            output_file='outputs/0_crew_output.md'
        )


# End AgentOps monitoring (gracefully)
agentops.end_session('Success')
```
Hello @opahopa! We're tracking this bug and we've introduced a new fix that should resolve this issue.
When our new version, 0.86.1, gets released, this should go away.
Super sorry about any inconvenience this caused! Joao will announce on his X account right when the new version goes live.
Hey @tonykipkemboi, what can I provide in `save_documents` when making an API call?
```python
from crewai import Agent, Task, Crew, Process, LLM
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
import requests
from datetime import datetime
from typing import Dict, Any
from pydantic import BaseModel, Field
import uuid


class SpaceNewsKnowledgeSource(BaseKnowledgeSource):
    """Knowledge source that fetches data from Space News API."""

    api_endpoint: str = Field(description="API endpoint URL")
    limit: int = Field(default=10, description="Number of articles to fetch")

    def load_content(self) -> Dict[Any, str]:
        """Fetch and format space news articles."""
        try:
            response = requests.get(
                f"{self.api_endpoint}?limit={self.limit}"
            )
            response.raise_for_status()

            data = response.json()
            articles = data.get('results', [])

            formatted_data = self._format_articles(articles)
            return {self.api_endpoint: formatted_data}
        except Exception as e:
            raise ValueError(f"Failed to fetch space news: {str(e)}")

    def _format_articles(self, articles: list) -> str:
        """Format articles into readable text."""
        formatted = "Space News Articles:\n\n"
        for article in articles:
            formatted += f"""
Title: {article['title']}
Published: {article['published_at']}
Summary: {article['summary']}
News Site: {article['news_site']}
URL: {article['url']}
-------------------"""
        return formatted

    def add(self) -> None:
        """Process and store the articles."""
        content = self.load_content()
        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)
        # Save documents with metadata
        self.save_documents(metadata={"source": "Space News API"})


# Create knowledge source
recent_news = SpaceNewsKnowledgeSource(
    api_endpoint="https://api.spaceflightnewsapi.net/v4/articles",
    limit=10,
)

# Create specialized agent
space_analyst = Agent(
    role="Space News Analyst",
    goal="Answer questions about space news accurately and comprehensively",
    backstory="""You are a space industry analyst with expertise in space exploration,
    satellite technology, and space industry trends. You excel at answering questions
    about space news and providing detailed, accurate information.""",
    knowledge_sources=[recent_news],
    llm=LLM(model="gpt-4", temperature=0.0)
)

# Create task that handles user questions
analysis_task = Task(
    description="Answer this question about space news: {user_question}",
    expected_output="A detailed answer based on the recent space news articles",
    agent=space_analyst
)

# Create and run the crew
crew = Crew(
    agents=[space_analyst],
    tasks=[analysis_task],
    verbose=True,
    process=Process.sequential
)

# Example usage
result = crew.kickoff(
    inputs={"user_question": "What are the latest developments in space exploration?"}
)
print(result)
```
I just wanted to add that this error is still present, also with the `CrewDoclingSource`. Error: `NameError: name 'DoclingDocument' is not defined`!
Any suggestions on how to fix it?
> I just wanted to add that this error is still present, also with the `CrewDoclingSource`. Error: `NameError: name 'DoclingDocument' is not defined`!
> Any suggestions on how to fix it?
To solve this you can install the docling package. Either do:
```shell
pip install docling
```
or
```shell
uv add docling
```
But even after doing this I am facing the following error:
```
raise ConversionError(error_message)
docling.exceptions.ConversionError: File format not allowed: knowledge\abcd.txt
```
Yeah, after the last update the metadata issues were fixed and the new knowledge source was introduced. I also got those import errors.
So, there are a few issues:
- You need to install the docling package; this exception is not raised for some reason:
```python
if not DOCLING_AVAILABLE:
    raise ImportError(
        "The docling package is required to use CrewDoclingSource. "
        "Please install it using: uv add docling"
    )
```
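For context, the usual shape of such an optional-dependency guard is sketched below; if the `DOCLING_AVAILABLE` flag is only consulted in some code paths, a missing package surfaces later as a `NameError` rather than this friendly `ImportError`. This is a generic illustration, not crewAI's actual module:

```python
# Generic optional-dependency guard pattern (illustrative, not crewAI source).
try:
    import docling  # noqa: F401
    DOCLING_AVAILABLE = True
except ImportError:
    DOCLING_AVAILABLE = False

def require_docling() -> None:
    """Call this early (e.g. in __init__), so users get the helpful
    ImportError instead of a NameError deep inside the class."""
    if not DOCLING_AVAILABLE:
        raise ImportError(
            "The docling package is required to use CrewDoclingSource. "
            "Please install it using: uv add docling"
        )
```

If the guard is never called before the first use of a docling name, the `if not DOCLING_AVAILABLE` check is effectively dead code, which would explain the behaviour reported above.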
- Note that the supported formats are:
```python
default_factory=lambda: DocumentConverter(
    allowed_formats=[
        InputFormat.MD,
        InputFormat.ASCIIDOC,
        InputFormat.PDF,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.IMAGE,
        InputFormat.XLSX,
        InputFormat.PPTX,
    ]
)
```
So the docs are incorrect when mentioning a .txt source:
```python
# Create a text file knowledge source
text_source = CrewDoclingSource(
    file_paths=["document.txt", "another.txt"]
)
```
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.