LLPhant Enhanced Metadata Handling for Documents and Data Readers

This Pull Request introduces extensible metadata handling for Document objects and updates the DataReader interface and its implementation (FileDataReader) to support metadata extraction and management. The changes improve interoperability with Retrieval-Augmented Generation (RAG) workflows and provide a modular approach to embedding metadata into documents.

Key Changes

Document Class Enhancements:
- Added a metadata property to store key-value pairs of extensible metadata.
- Introduced addMetadata and toArray methods to manage and serialize metadata.
DataReader Interface:
- Added an extractMetadata method to standardize metadata extraction from document content.
FileDataReader:
- Implemented the extractMetadata method to parse and populate metadata fields from the content.
- Automatically populates metadata during the creation of Document objects.
DocumentUtils Enhancements:
- Updated utility functions to support metadata when creating documents from arrays.
- Ensured compatibility with new and existing functionality.
Tests:
- Added tests to validate metadata extraction, assignment, and serialization.
- Ensured all existing tests remain functional to maintain backward compatibility.
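
To make the Document changes concrete, here is a minimal, self-contained sketch of the shape described above. The class name SketchDocument, the property types, and the keys returned by toArray are assumptions made for illustration; the actual implementation in this PR may differ.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the Document class; not the library's actual code.
class SketchDocument
{
    public string $content = '';

    /** @var array<string, mixed> Extensible key-value metadata. */
    public array $metadata = [];

    // Adds a single metadata entry, overwriting any existing value for the key.
    public function addMetadata(string $key, mixed $value): void
    {
        $this->metadata[$key] = $value;
    }

    /**
     * Serializes the document together with its metadata.
     * The key names used here are assumed.
     *
     * @return array<string, mixed>
     */
    public function toArray(): array
    {
        return [
            'content' => $this->content,
            'metadata' => $this->metadata,
        ];
    }
}

$document = new SketchDocument();
$document->content = 'Sample document content';
$document->addMetadata('title', 'User Guide');
$document->addMetadata('tags', ['setup', 'install']);

$serialized = $document->toArray();
```

Keeping metadata as a plain associative array means new fields need no schema change, at the cost of no type checking on individual keys.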

Benefits of This Contribution

Enhanced Metadata Support:
- Metadata (e.g., titles, categories, tags) can now be embedded into Document objects, providing rich context for document retrieval and organization.
Improved RAG Workflows:
- Metadata supports RAG workflows by enabling:
  - Better search and filtering: Use metadata to refine searches in vector stores.
  - Increased accuracy: Improve relevance by aligning metadata with document embeddings.
  - Custom queries: Leverage metadata fields for fine-grained information retrieval.
Extensibility:
- Developers can easily add new metadata fields without modifying the core library.
- The extractMetadata method allows custom data readers to parse metadata from diverse file formats.
Backward Compatibility:
- All existing functionality remains unchanged, ensuring no breaking changes for current users.
Ease of Integration:
- Standardized metadata handling ensures smooth integration with vector stores, such as Qdrant or Pinecone.

How to Use the Enhanced Features

1. Add Metadata in a Custom Data Reader

Implement the extractMetadata method in a custom data reader to define how metadata is parsed:

use LLPhant\Embeddings\DataReader\DataReader;
use LLPhant\Embeddings\Document;

class MyCustomDataReader implements DataReader
{
    public function getDocuments(): array
    {
        $content = "Sample document content";
        $document = new Document();
        $document->content = $content;

        // Extract and add metadata
        $metadata = $this->extractMetadata($content);
        foreach ($metadata as $key => $value) {
            $document->addMetadata($key, $value);
        }

        return [$document];
    }

    public function extractMetadata(string $content): array
    {
        // Custom metadata extraction logic
        return [
            'title' => 'Extracted Title',
            'category' => 'Extracted Category',
            // Add more metadata fields as needed
        ];
    }
}

2. Retrieve Metadata in a RAG Workflow

Read the metadata property on each document returned by the reader (toArray can be used instead when the whole document needs to be serialized):

$documents = $dataReader->getDocuments();
foreach ($documents as $document) {
    $metadata = $document->metadata;
    print_r($metadata); // Prints the array with keys title => 'Extracted Title' and category => 'Extracted Category'
}

3. Store Metadata in Vector Stores

Combine content and metadata for embedding in vector stores:

foreach ($documents as $document) {
    $vectorStore->upsert([
        'id' => DocumentUtils::getUniqueId($document),
        'embedding' => $document->embedding,
        'metadata' => $document->metadata,
    ]);
}

Use Cases with RAG Workflows

1. Metadata-Driven Retrieval

Scenario: Search documents by category or tags before applying semantic search on embeddings.

Query Example:

{
  "filter": {
    "category": "User Guide"
  },
  "vector": [0.12, 0.34, 0.56],
  "top_k": 5
}
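
The query above can be read as "filter first, then rank by similarity". A minimal in-memory sketch of that idea follows, assuming exact-match filters and cosine similarity. The functions cosineSimilarity and filteredSearch and the record layout are hypothetical and do not correspond to any particular vector store's API.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity between two equal-length vectors.
 *
 * @param float[] $a
 * @param float[] $b
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // Treat a zero vector as dissimilar to everything.
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Keeps records whose metadata matches every filter entry exactly,
 * then returns the $topK most similar records to $vector.
 *
 * @param array<int, array<string, mixed>> $records
 * @param array<string, mixed> $filter
 * @param float[] $vector
 * @return array<int, array<string, mixed>>
 */
function filteredSearch(array $records, array $filter, array $vector, int $topK): array
{
    $candidates = array_filter($records, function (array $record) use ($filter): bool {
        foreach ($filter as $key => $expected) {
            if (($record['metadata'][$key] ?? null) !== $expected) {
                return false;
            }
        }
        return true;
    });

    // Sort by descending similarity; usort also reindexes the array.
    usort($candidates, fn (array $x, array $y): int =>
        cosineSimilarity($y['embedding'], $vector) <=> cosineSimilarity($x['embedding'], $vector));

    return array_slice($candidates, 0, $topK);
}

$records = [
    ['id' => 'a', 'embedding' => [0.1, 0.3, 0.6], 'metadata' => ['category' => 'User Guide']],
    ['id' => 'b', 'embedding' => [0.9, 0.1, 0.0], 'metadata' => ['category' => 'User Guide']],
    ['id' => 'c', 'embedding' => [0.12, 0.34, 0.56], 'metadata' => ['category' => 'Changelog']],
];

$results = filteredSearch($records, ['category' => 'User Guide'], [0.12, 0.34, 0.56], 5);
$ids = array_column($results, 'id'); // ['a', 'b']; 'c' is excluded by the filter
```

Real vector stores usually apply the filter server-side and in their own query syntax, so this sketch only illustrates the order of operations.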

2. Context-Aware Augmented Responses

Scenario: Include metadata (e.g., sourceType, title) in AI responses to provide additional context.
Example:

"The information comes from the document titled 'User Guide for Product X'."

3. Chunk-Based Metadata

Scenario: Manage individual chunks of large documents using metadata.
Implementation:
- Add a chunkNumber to metadata for better traceability and reconstruction.
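
As a rough illustration of the chunkNumber idea, the sketch below splits a string into fixed-size chunks, tags each with its position, and uses that position to rebuild the original text. The helper names chunkWithMetadata and reassemble and the array layout are hypothetical, and the byte-based splitting is a simplification (multibyte text would need mb_str_split).

```php
<?php

declare(strict_types=1);

/**
 * Splits text into fixed-size chunks and tags each with its position.
 *
 * @param array<string, mixed> $baseMetadata Metadata shared by every chunk.
 * @return array<int, array{content: string, metadata: array<string, mixed>}>
 */
function chunkWithMetadata(string $content, int $maxLength, array $baseMetadata = []): array
{
    if ($maxLength < 1) {
        throw new InvalidArgumentException('maxLength must be at least 1');
    }

    $chunks = [];
    // str_split works on bytes, which is fine for this ASCII example.
    foreach (str_split($content, $maxLength) as $index => $piece) {
        $chunks[] = [
            'content' => $piece,
            // array_merge lets chunkNumber override any same-named base key.
            'metadata' => array_merge($baseMetadata, ['chunkNumber' => $index]),
        ];
    }

    return $chunks;
}

/**
 * Rebuilds the original text from chunks in any order.
 *
 * @param array<int, array{content: string, metadata: array<string, mixed>}> $chunks
 */
function reassemble(array $chunks): string
{
    usort($chunks, fn (array $a, array $b): int =>
        $a['metadata']['chunkNumber'] <=> $b['metadata']['chunkNumber']);

    return implode('', array_column($chunks, 'content'));
}

$chunks = chunkWithMetadata('abcdefghij', 4, ['title' => 'User Guide']);
$restored = reassemble(array_reverse($chunks));
```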

Potential Enhancements

User-Defined Metadata Parsers:
- Allow users to plug in custom metadata parsing logic for different file types.
Utility Methods for Metadata Queries:
- Provide methods like findDocumentsByMetadata to simplify metadata-based retrieval.
Examples for Specific Vector Stores:
- Add documentation or examples showing integration with vector stores like Qdrant and Pinecone.

Checklist

[x] Added metadata support to Document.
[x] Updated DataReader and FileDataReader for metadata handling.
[x] Enhanced DocumentUtils for metadata compatibility.
[x] Included comprehensive tests.
[x] Validated backward compatibility.

Let me know if you need any further adjustments or additional information! 🚀

Nov 25 '24 11:11 raihan-js

Hey @raihan-js, thanks a lot for that, and sorry for the late reply. This is really good. How can I help you finish it?

Dec 31 '24 13:12 MaximeThoonsen

Hey @MaximeThoonsen, thanks for getting back to me. This is complete, but the tests are failing here. If you could look into it, maybe you could help me get them passing? Or do you have any suggestions?

Dec 31 '24 13:12 raihan-js

@raihan-js You can use composer lint to fix the linting problem. As for the unit test, I suggested a change in a comment.

Jan 02 '25 15:01 MaximeThoonsen

please merge and release

Feb 21 '25 14:02 pslxx

please merge and release

Sorry, but we cannot merge this PR as it is, since it would break many vector store implementations, as you can see by running

composer test:types

Mar 25 '25 21:03 f-lombardo