
[BUG] Knowledge Source metadata generation doesn't work (and possibly the knowledge store altogether)

Open opahopa opened this issue 1 year ago • 5 comments

Description

So, I've been trying to use the crewAI docs as a knowledge source for my crew.

Along the way, I found the following errors:

  1. `StringKnowledgeSource` accepts a single string as the knowledge source and breaks it into chunks, but it does not generate metadata for those chunks even though metadata is required, producing errors like:
[WARNING]: Failed to init knowledge: Unequal lengths for fields: ids: 487, metadatas: 1, documents: 487 in upsert.

when used as:

            StringKnowledgeSource(
                content=load_crewai_docs(),   #load_crewai_docs returns single-line string
            )
  2. There is a typo in the docs (Knowledge section, sample custom knowledge source):
    def add(self) -> None:
        """Process and store the articles."""
        content = self.load_content()
        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)

        self._save_documents()

-> the parent class defines `save_documents()` without the `_` prefix, so the `self._save_documents()` call in the sample is a typo

  3. I made a custom `LocalTxTFileKnowledgeSource` that includes dummy metadata generation:
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from pydantic import Field
from typing import Dict
import uuid

class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")

    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()

        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)

            chunks_metadata = [
                {
                    "chunk_id": str(uuid.uuid4()),
                    "source": self.file_path,
                    "description": f"Chunk {i + 1} from file {self.file_path}"
                }
                for i in range(len(chunks))
            ]

        self.save_documents(metadata=chunks_metadata)

(my guess is that the metadata entries are supposed to contain some kind of summary headers instead?)

when used in the crew settings:

            knowledge_sources=[LocalTxTFileKnowledgeSource(
                file_path="docs_crewai/singlefile.txt",
            )]

I didn't get any errors, and it seems the documents were saved. However, the agents in the crew behaved as if the data was not accessible, so maybe it wasn't loaded, or was loaded in an incorrect format?
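For what it's worth, the "Unequal lengths" failure in item 1 comes down to the three upsert lists going out of sync: 487 ids and documents, but only one metadata dict. A minimal, stdlib-only sketch of producing one metadata entry per chunk so the lists stay aligned (no crewAI imports; `simple_chunk` is an illustrative stand-in for `_chunk_text`):

```python
import uuid

def simple_chunk(text: str, size: int = 100) -> list[str]:
    """Split text into fixed-size chunks (illustrative stand-in for _chunk_text)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_upsert_args(text: str, source: str):
    """Build ids, metadatas, and documents lists of equal length,
    which is what the vector store's upsert requires."""
    documents = simple_chunk(text)
    ids = [str(uuid.uuid4()) for _ in documents]
    metadatas = [{"source": source, "chunk": i} for i in range(len(documents))]
    return ids, metadatas, documents
```

Passing a single metadata dict instead of a per-chunk list is exactly what reproduces the `ids: 487, metadatas: 1, documents: 487` condition above.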

Steps to Reproduce

Make custom crew:

@CrewBase
class CustomAgents:
    agents_config = 'config/agents.yaml'
    tasks_config = 'config/tasks.yaml'

    gpt4o_mini = LLM(
        model="gpt-4o-mini",
    )
    gpt4o = LLM(
        model="gpt-4o",
    )
    @agent
    def agent_qa(self):
        return Agent(
            config=self.agents_config['agent_qa'],
            verbose=True,
            llm=self.gpt4o_mini,
            max_iter=3
        )

    @task
    def test_crew(self) -> Task:
        return Task(
            config=self.tasks_config['test_crew'],
        )


    @crew
    def crew(self) -> Crew:
        return Crew(
            agents=self.agents,
            tasks=[self.test_crew() ],
            process=Process.sequential,
            memory=True,
            verbose=True,
            knowledge_sources=[StringKnowledgeSource(
                content="long string",
            )],
        )

and custom knowledge source:

class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")

    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()

        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)

            chunks_metadata = [
                {
                    "chunk_id": str(uuid.uuid4()),
                    "source": self.file_path,
                    "description": f"Chunk {i + 1} from file {self.file_path}"
                }
                for i in range(len(chunks))
            ]

        self.save_documents(metadata=chunks_metadata)

Then try running first with the String source, then with the `LocalTxTFileKnowledgeSource` (with the metadata generation removed).

Then run the crew with a task that expects the uploaded knowledge source to be available to the agent.

### Expected behavior

1. In the case of the string knowledge source: it is saved without the metadata error.
2. In the case of the custom source: it is saved without the metadata error (assuming I don't provide the dummy metadata).

Afterwards, on `crew.kickoff`, the agent is expected to be aware of the uploaded data.

### Screenshots/Code snippets

provided above

### Operating System

macOS Sonoma

### Python Version

3.12

### crewAI Version

crewai==0.86.0

### crewAI Tools Version

crewai-tools==0.17.0

### Virtual Environment

Venv

### Evidence

(.venv) *@*-MacBook-Pro SomeCrew % 
python main.py
## Welcome to CrewAI!
-------------------------------

[2024-12-13 12:36:44][ERROR]: Failed to upsert documents: Unequal lengths for fields: ids: 487, metadatas: 1, documents: 487 in upsert.

[2024-12-13 12:36:44][WARNING]: Failed to init knowledge: Unequal lengths for fields: ids: 487, metadatas: 1, documents: 487 in upsert.


### Possible Solution

1. Add metadata generation when chunking the text, or remove the metadata requirement.
2. Check whether agents are actually accessing the stored knowledge sources.

### Additional context

single string data used: https://docs.crewai.com/llms-full.txt

opahopa avatar Dec 13 '24 10:12 opahopa

Thanks for reporting this @opahopa. We are waiting on the new version cut that will remove the metadata requirement. For now, you can pass a simple metadata param, but it won't do anything in the grand scheme of the extraction.

tonykipkemboi avatar Dec 13 '24 15:12 tonykipkemboi

Sure, thanks for the answer! Can you suggest how I can debug whether the data from the knowledge source was passed to the crew? At the moment, within my setup, it seems like it isn't accessible to the agents.

Update: found `crew.query_knowledge`.

opahopa avatar Dec 13 '24 16:12 opahopa

Hi, I just want to recognise @opahopa for their temporary solution, and let them know that it helped me get my crew working.

Sharing my working implementation for a local text file knowledge source:

from crewai import Agent, Crew, Process, Task, LLM
from crewai.project import CrewBase, agent, crew, task, before_kickoff, after_kickoff
# from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
# from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource

# Knowledge temporary fix
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from pydantic import Field
from typing import Dict
import uuid

# AgentOps Agent Observability
import agentops

# Start AgentOps Monitoring 
agentops.init()

# Create an LLM with a temperature of 0 to ensure deterministic outputs
# llm_reviewer = LLM(model="gpt-4o-mini", temperature=0)
# Create an LLM with a temperature of 0.1 to allow some creativity.
# llm = LLM(model="gpt-4o-mini", temperature=.1)

class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")
    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()

        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)

            chunks_metadata = [
                {
                    "chunk_id": str(uuid.uuid4()),
                    "source": self.file_path,
                    "description": f"Chunk {i + 1} from file {self.file_path}"
                }
                for i in range(len(chunks))
            ]

        self.save_documents(metadata=chunks_metadata)

@CrewBase
class Demo2():
	"""Demo2 crew"""

	agents_config = 'config/agents.yaml'
	tasks_config = 'config/tasks.yaml'

	@before_kickoff # Optional hook to be executed before the crew starts
	def pull_data_example(self, inputs):
		# Example of pulling data from an external API, dynamically changing the inputs
		# inputs['extra_data'] = "This is extra data"
		# return inputs
		return

	@after_kickoff # Optional hook to be executed after the crew has finished
	def log_results(self, output):
		# Example of logging results, dynamically changing the output
		print(f"Results: {output}")
		return output

	@agent
	def reviewer(self) -> Agent:
		return Agent(
			config=self.agents_config['reviewer'],
			memory=True,
			verbose=True,
			max_rpm=10# ,  # Limit API calls
			# llm=llm_reviewer
		)

	@agent
	def lesson_planner(self) -> Agent:
		return Agent(
			config=self.agents_config['lesson_planner'],
			memory=True,
			verbose=True,
			max_rpm=10  # Limit API calls
		)

	@agent
	def content_writer(self) -> Agent:
		return Agent(
			config=self.agents_config['content_writer'],
			memory=True,
			verbose=True,
			max_rpm=10  # Limit API calls
		)

	@agent
	def communicator(self) -> Agent:
		return Agent(
			config=self.agents_config['communicator'],
			memory=True,
			verbose=True,
			max_rpm=10  # Limit API calls
		)

	@task
	def documentation_review_task(self) -> Task:
		return Task(
			config=self.tasks_config['documentation_review_task'],
			output_file='outputs/1_documentation_review_task.md'
		)

	@task
	def lesson_plan_task(self) -> Task:
		return Task(
			config=self.tasks_config['lesson_plan_task'],
			output_file='outputs/2_lesson_plan_task.md'
		)

	@task
	def lesson_content_creation_task(self) -> Task:
		return Task(
			config=self.tasks_config['lesson_content_creation_task'],
			output_file='outputs/3_lesson_content_creation_task.md'
		)
	
	@task
	def learning_path_description_task(self) -> Task:
		return Task(
			config=self.tasks_config['learning_path_description_task'],
			output_file='outputs/4_learning_path_description_task.md'
		)

	@task
	def announcement_post_task(self) -> Task:
		return Task(
			config=self.tasks_config['announcement_post_task'],
			output_file='outputs/5_announcement_post_task.md'
		)

	@crew
	def crew(self) -> Crew:
		"""Creates the Demo2 crew"""
		# pdf_source = PDFKnowledgeSource(file_path="filename.pdf")
		# pdf_source = PDFKnowledgeSource(file_path="filename.pdf")
		# txt_source = TextFileKnowledgeSource(file_path="filename.txt")
		local_txt_source = LocalTxTFileKnowledgeSource(file_path="knowledge/filename.txt")
		return Crew(
			agents=self.agents, # Automatically created by the @agent decorator
			tasks=self.tasks, # Automatically created by the @task decorator
			process=Process.sequential,
			verbose=True,
			knowledge_sources=[local_txt_source],
			output_file='outputs/0_crew_output.md'
		)

# End AgentOps Monitoring (gracefully)
agentops.end_session('Success')

ATAD4NRY4N avatar Dec 16 '24 12:12 ATAD4NRY4N

Hello @opahopa! We're tracking this bug and we've introduced a new fix that should resolve this issue.

When our new version, 0.86.1, gets released, this should go away.

Super sorry about any inconvenience this caused! Joao will announce on his X account right when the new version goes live.

bhancockio avatar Dec 16 '24 21:12 bhancockio

Hey @tonykipkemboi, what can I provide in `save_documents` when making an API call?

from crewai import Agent, Task, Crew, Process, LLM
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
import requests
from datetime import datetime
from typing import Dict, Any
from pydantic import BaseModel, Field
import uuid

class SpaceNewsKnowledgeSource(BaseKnowledgeSource):
    """Knowledge source that fetches data from Space News API."""

    api_endpoint: str = Field(description="API endpoint URL")
    limit: int = Field(default=10, description="Number of articles to fetch")

    def load_content(self) -> Dict[Any, str]:
        """Fetch and format space news articles."""
        try:
            response = requests.get(
                f"{self.api_endpoint}?limit={self.limit}"
            )
            response.raise_for_status()

            data = response.json()
            articles = data.get('results', [])

            formatted_data = self._format_articles(articles)
            return {self.api_endpoint: formatted_data}
        except Exception as e:
            raise ValueError(f"Failed to fetch space news: {str(e)}")

    def _format_articles(self, articles: list) -> str:
        """Format articles into readable text."""
        formatted = "Space News Articles:\n\n"
        for article in articles:
            formatted += f"""
                Title: {article['title']}
                Published: {article['published_at']}
                Summary: {article['summary']}
                News Site: {article['news_site']}
                URL: {article['url']}
                -------------------"""
        return formatted

    def add(self) -> None:
        """Process and store the articles."""
        content = self.load_content()
        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)

        # Save documents with metadata
        self.save_documents(metadata={"source": "Space News API"})

# Create knowledge source
recent_news = SpaceNewsKnowledgeSource(
    api_endpoint="https://api.spaceflightnewsapi.net/v4/articles",
    limit=10,
)

# Create specialized agent
space_analyst = Agent(
    role="Space News Analyst",
    goal="Answer questions about space news accurately and comprehensively",
    backstory="""You are a space industry analyst with expertise in space exploration,
    satellite technology, and space industry trends. You excel at answering questions
    about space news and providing detailed, accurate information.""",
    knowledge_sources=[recent_news],
    llm=LLM(model="gpt-4", temperature=0.0)
)

# Create task that handles user questions
analysis_task = Task(
    description="Answer this question about space news: {user_question}",
    expected_output="A detailed answer based on the recent space news articles",
    agent=space_analyst
)

# Create and run the crew
crew = Crew(
    agents=[space_analyst],
    tasks=[analysis_task],
    verbose=True,
    process=Process.sequential
)

# Example usage
result = crew.kickoff(
    inputs={"user_question": "What are the latest developments in space exploration?"}
)

print(result)
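For comparison, the snippet above passes a single dict to `save_documents`, while opahopa's earlier workaround builds one dict per chunk. A stdlib-only sketch of reconciling the two shapes (the `normalize_metadata` helper is hypothetical, not part of crewAI):

```python
def normalize_metadata(metadata, chunks):
    """Broadcast a single metadata dict to every chunk, or validate that
    a per-chunk list already matches the number of chunks."""
    if isinstance(metadata, dict):
        # One dict for all chunks: copy it once per chunk.
        return [dict(metadata) for _ in chunks]
    if len(metadata) != len(chunks):
        # Mirrors the store's "Unequal lengths" failure mode.
        raise ValueError(
            f"Unequal lengths: metadatas: {len(metadata)}, documents: {len(chunks)}"
        )
    return list(metadata)
```

Under this assumption, either shape ends up as a list aligned with `self.chunks` before the upsert.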

Highonswift avatar Dec 23 '24 09:12 Highonswift

I just wanted to add that this error is still present, also with the CrewDoclingSource. Error: `NameError: name 'DoclingDocument' is not defined`!

Any suggestions how to fix it?

damirkusar avatar Jan 13 '25 15:01 damirkusar

I just wanted to add that this error is still present, also with the CrewDoclingSource. Error: `NameError: name 'DoclingDocument' is not defined`!

Any suggestions how to fix it?

To solve this, you can install the docling package: either `pip install docling` or `uv add docling`.

But even after doing this, I am facing the following error:

raise ConversionError(error_message)
docling.exceptions.ConversionError: File format not allowed: knowledge\abcd.txt

Sanjaya-005 avatar Jan 15 '25 05:01 Sanjaya-005

Yeah, after the last update the metadata issues were fixed and a new knowledge source was introduced. I also got those import errors.

So, there are a few issues:

You need to install the docling package; this exception is not raised for some reason:

        if not DOCLING_AVAILABLE:
            raise ImportError(
                "The docling package is required to use CrewDoclingSource. "
                "Please install it using: uv add docling"
            )
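For context, a flag like `DOCLING_AVAILABLE` is typically set by a guarded import at module load time. A stdlib-only sketch of the pattern (the `optional_import` helper is illustrative, not crewAI's actual code):

```python
def optional_import(name: str) -> bool:
    """Return True if the named module can be imported, mirroring how an
    availability flag such as DOCLING_AVAILABLE is usually computed."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False

# The flag only helps if the code path that needs docling actually checks it.
DOCLING_AVAILABLE = optional_import("docling")
```

If the flag is computed but the check above is skipped (or the bare import happens first), you get the raw `NameError` instead of the friendly `ImportError` message.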

Note that the supported formats are:

        default_factory=lambda: DocumentConverter(
            allowed_formats=[
                InputFormat.MD,
                InputFormat.ASCIIDOC,
                InputFormat.PDF,
                InputFormat.DOCX,
                InputFormat.HTML,
                InputFormat.IMAGE,
                InputFormat.XLSX,
                InputFormat.PPTX,
            ]
        )

so the docs are incorrect when they mention a .txt source:

# Create a text file knowledge source

text_source = CrewDoclingSource(
    file_paths=["document.txt", "another.txt"]
)
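Until the docs and the format list line up, one possible workaround for plain-text files is to copy them to a `.md` path, since `InputFormat.MD` is in the allowed list while `.txt` is not. A stdlib-only sketch (the helper is hypothetical and untested against docling itself):

```python
import shutil
from pathlib import Path

def as_markdown_copy(txt_path: str) -> str:
    """Copy a .txt file to a sibling .md path so a converter that rejects
    .txt but accepts Markdown can ingest the same content."""
    src = Path(txt_path)
    dst = src.with_suffix(".md")
    shutil.copyfile(src, dst)
    return str(dst)
```

The resulting `.md` path could then be passed to `CrewDoclingSource(file_paths=[...])` instead of the original `.txt`.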

opahopa avatar Jan 15 '25 07:01 opahopa

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Feb 14 '25 12:02 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Feb 20 '25 12:02 github-actions[bot]