Enhanced Metadata Handling for Documents and Data Readers
This Pull Request introduces extensible metadata handling for Document objects and updates the DataReader interface and its implementation (FileDataReader) to support metadata extraction and management. The changes improve interoperability with Retrieval-Augmented Generation (RAG) workflows and provide a modular approach to embedding metadata into documents.
Key Changes
- Document Class Enhancements:
  - Added a `metadata` property to store key-value pairs of extensible metadata.
  - Introduced `addMetadata` and `toArray` methods to manage and serialize metadata.
- DataReader Interface:
  - Added an `extractMetadata` method to standardize metadata extraction from document content.
- FileDataReader:
  - Implemented the `extractMetadata` method to parse and populate metadata fields from the content.
  - Automatically populates metadata during the creation of `Document` objects.
- DocumentUtils Enhancements:
  - Updated utility functions to support metadata when creating documents from arrays.
  - Ensured compatibility with new and existing functionality.
- Tests:
  - Added tests to validate metadata extraction, assignment, and serialization.
  - Ensured all existing tests remain functional to maintain backward compatibility.
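To make the `Document` changes listed above concrete, here is a minimal sketch of what a metadata-aware document could look like. `SketchDocument` is a hypothetical stand-in, not the library's `Document` class, and the shape returned by `toArray` here is an assumption rather than what the PR actually emits.

```php
<?php

// Hypothetical stand-in for the library's Document, reduced to what this sketch needs.
class SketchDocument
{
    public string $content = '';

    /** @var array<string, mixed> Extensible key-value metadata. */
    public array $metadata = [];

    // Adds or overwrites a single metadata field.
    public function addMetadata(string $key, mixed $value): void
    {
        $this->metadata[$key] = $value;
    }

    /**
     * Serializes the document together with its metadata.
     * The exact output shape is an assumption for illustration.
     *
     * @return array<string, mixed>
     */
    public function toArray(): array
    {
        return [
            'content' => $this->content,
            'metadata' => $this->metadata,
        ];
    }
}

$document = new SketchDocument();
$document->content = 'Sample document content';
$document->addMetadata('title', 'User Guide');
$document->addMetadata('tags', ['setup', 'install']);

echo json_encode($document->toArray(), JSON_PRETTY_PRINT), PHP_EOL;
```

Keeping `metadata` as a plain associative array means new fields can be added without changing the class, which is the extensibility the PR is aiming for.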
Benefits of This Contribution
- Enhanced Metadata Support:
  - Metadata (e.g., titles, categories, tags) can now be embedded into `Document` objects, providing rich context for document retrieval and organization.
- Improved RAG Workflows:
  - Metadata is critical for Retrieval-Augmented Generation (RAG) workflows, enabling:
    - Better search and filtering: Use metadata to refine searches in vector stores.
    - Increased accuracy: Improve relevance by aligning metadata with document embeddings.
    - Custom queries: Leverage metadata fields for fine-grained information retrieval.
- Extensibility:
  - Developers can easily add new metadata fields without modifying the core library.
  - The `extractMetadata` method allows custom data readers to parse metadata from diverse file formats.
- Backward Compatibility:
  - All existing functionality remains unchanged, ensuring no breaking changes for current users.
- Ease of Integration:
  - Standardized metadata handling ensures smooth integration with vector stores, such as Qdrant or Pinecone.
How to Use the Enhanced Features
1. Add Metadata in a Custom Data Reader
Implement the extractMetadata method in a custom data reader to define how metadata is parsed:
    use LLPhant\Embeddings\DataReader\DataReader;
    use LLPhant\Embeddings\Document;

    class MyCustomDataReader implements DataReader
    {
        public function getDocuments(): array
        {
            $content = "Sample document content";
            $document = new Document();
            $document->content = $content;

            // Extract and add metadata
            $metadata = $this->extractMetadata($content);
            foreach ($metadata as $key => $value) {
                $document->addMetadata($key, $value);
            }

            return [$document];
        }

        public function extractMetadata(string $content): array
        {
            // Custom metadata extraction logic
            return [
                'title' => 'Extracted Title',
                'category' => 'Extracted Category',
                // Add more metadata fields as needed
            ];
        }
    }
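The reader above returns hard-coded values where real extraction logic would go. As one possible shape for that logic, the sketch below assumes the content starts with `Key: value` header lines followed by a blank line. That header format and the function name `extractHeaderMetadata` are assumptions for illustration, not something the PR defines.

```php
<?php

/**
 * Parses leading "Key: value" lines into a metadata array.
 * Parsing stops at the first line that is not a header (a blank line included),
 * so body text is never mistaken for metadata.
 *
 * @return array<string, string>
 */
function extractHeaderMetadata(string $content): array
{
    $metadata = [];

    // \R matches any line-break sequence (\n, \r\n, \r).
    foreach (preg_split('/\R/', $content) as $line) {
        if (preg_match('/^([A-Za-z][\w-]*):\s*(.+)$/', $line, $matches) !== 1) {
            break;
        }
        // Keys are lower-cased so "Title" and "title" map to the same field.
        $metadata[strtolower($matches[1])] = trim($matches[2]);
    }

    return $metadata;
}

$content = "Title: User Guide for Product X\nCategory: User Guide\n\nStep 1: plug it in.";
print_r(extractHeaderMetadata($content));
```

A function like this could serve as the body of `extractMetadata` in a custom reader. Other formats (YAML front matter, PDF properties, HTML meta tags) would need their own parsers.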
2. Retrieve Metadata in a RAG Workflow
Read the `metadata` property of each document, or call `toArray` to serialize a document together with its metadata:
    $documents = $dataReader->getDocuments();
    foreach ($documents as $document) {
        $metadata = $document->metadata;
        // Prints the metadata array, e.g. [title] => Extracted Title, [category] => Extracted Category
        print_r($metadata);
    }
3. Store Metadata in Vector Stores
Combine content and metadata for embedding in vector stores:
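    use LLPhant\Embeddings\DocumentUtils;

    foreach ($documents as $document) {
        // upsert() stands in for your vector store client's write method.
        $vectorStore->upsert([
            'id' => DocumentUtils::getUniqueId($document),
            'embedding' => $document->embedding,
            'metadata' => $document->metadata,
        ]);
    }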
Use Cases with RAG Workflows
1. Metadata-Driven Retrieval
- Scenario: Search documents by `category` or `tags` before applying semantic search on embeddings.
- Query Example:

      {
        "filter": { "category": "User Guide" },
        "vector": [0.12, 0.34, 0.56],
        "top_k": 5
      }
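A vector store would normally run a query like this on the server. To show what it means, here is a minimal in-memory sketch that first filters on metadata and then ranks the survivors by cosine similarity. The array-based document shape and the names `cosineSimilarity` and `filteredSearch` are assumptions for illustration, not library APIs.

```php
<?php

/** Cosine similarity between two equal-length numeric vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // A zero vector has no direction; treat it as dissimilar.
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/** Keeps documents whose metadata matches every filter field, then returns the top K by similarity. */
function filteredSearch(array $documents, array $filter, array $queryVector, int $topK): array
{
    $candidates = array_filter($documents, function (array $doc) use ($filter): bool {
        foreach ($filter as $key => $expected) {
            if (($doc['metadata'][$key] ?? null) !== $expected) {
                return false;
            }
        }
        return true;
    });

    // Sort descending by similarity to the query vector.
    usort(
        $candidates,
        fn (array $x, array $y): int =>
            cosineSimilarity($y['embedding'], $queryVector) <=> cosineSimilarity($x['embedding'], $queryVector)
    );

    return array_slice($candidates, 0, $topK);
}

$documents = [
    ['content' => 'Install steps', 'metadata' => ['category' => 'User Guide'], 'embedding' => [0.1, 0.3, 0.6]],
    ['content' => 'Release notes', 'metadata' => ['category' => 'Changelog'], 'embedding' => [0.12, 0.34, 0.56]],
    ['content' => 'Troubleshooting', 'metadata' => ['category' => 'User Guide'], 'embedding' => [0.9, 0.1, 0.0]],
];

$results = filteredSearch($documents, ['category' => 'User Guide'], [0.12, 0.34, 0.56], 5);
print_r(array_column($results, 'content'));
```

Note that the changelog entry is excluded even though its embedding is identical to the query vector. That is the point of filtering before ranking.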
2. Context-Aware Augmented Responses
- Scenario: Include metadata (e.g., `sourceType`, `title`) in AI responses to provide additional context.
- Example: "The information comes from the document titled 'User Guide for Product X'."
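One way to get attributions like this is to label each retrieved passage with its metadata before it goes into the prompt. The sketch below assumes documents are plain arrays with `content` and `metadata` keys. The `buildContext` helper and the label format are hypothetical.

```php
<?php

/**
 * Builds a prompt context in which every passage is prefixed with its source,
 * so the model can cite the title when it answers.
 */
function buildContext(array $documents): string
{
    $blocks = [];
    foreach ($documents as $doc) {
        // Fall back to neutral labels when a field is missing.
        $title = $doc['metadata']['title'] ?? 'Untitled';
        $sourceType = $doc['metadata']['sourceType'] ?? 'unknown';
        $blocks[] = sprintf("[Source: %s (%s)]\n%s", $title, $sourceType, $doc['content']);
    }

    return implode("\n\n", $blocks);
}

$retrieved = [
    [
        'content' => 'Hold the power button for three seconds to reset the device.',
        'metadata' => ['title' => 'User Guide for Product X', 'sourceType' => 'files'],
    ],
    [
        'content' => 'Firmware updates are released quarterly.',
        'metadata' => [],
    ],
];

$context = buildContext($retrieved);
echo $context, PHP_EOL;
```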
3. Chunk-Based Metadata
- Scenario: Manage individual chunks of large documents using metadata.
- Implementation:
  - Add a `chunkNumber` to metadata for better traceability and reconstruction.
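As a rough illustration of chunk-level metadata, the sketch below splits a text into fixed-size chunks, tags each one with a `chunkNumber`, and reassembles the text from chunks that arrive out of order. The array-based chunk shape and both function names are assumptions. `str_split` works on bytes, so multibyte text would need `mb_str_split` instead.

```php
<?php

/** Splits content into fixed-size chunks, each carrying the shared metadata plus its position. */
function splitIntoChunks(string $content, int $chunkSize, array $baseMetadata): array
{
    $chunks = [];
    foreach (str_split($content, $chunkSize) as $index => $piece) {
        $chunks[] = [
            'content' => $piece,
            // Array union keeps the base fields and adds the chunk position.
            'metadata' => $baseMetadata + ['chunkNumber' => $index],
        ];
    }

    return $chunks;
}

/** Restores the original text by ordering chunks on their chunkNumber. */
function reconstruct(array $chunks): string
{
    usort(
        $chunks,
        fn (array $a, array $b): int => $a['metadata']['chunkNumber'] <=> $b['metadata']['chunkNumber']
    );

    return implode('', array_column($chunks, 'content'));
}

$original = 'Metadata keeps every chunk traceable to its source document.';
$chunks = splitIntoChunks($original, 16, ['title' => 'User Guide']);

// Simulate retrieval returning chunks in arbitrary order.
$restored = reconstruct(array_reverse($chunks));
echo $restored, PHP_EOL;
```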
Potential Enhancements
- User-Defined Metadata Parsers:
  - Allow users to plug in custom metadata parsing logic for different file types.
- Utility Methods for Metadata Queries:
  - Provide methods like `findDocumentsByMetadata` to simplify metadata-based retrieval.
- Examples for Specific Vector Stores:
  - Add documentation or examples showing integration with vector stores like Qdrant and Pinecone.
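Since `findDocumentsByMetadata` is only proposed here, the following is a speculative sketch of how such a helper might behave, not an existing API. It assumes documents are plain arrays with a `metadata` key and treats list-valued fields such as `tags` as a match when they contain the requested value.

```php
<?php

/**
 * Returns the documents whose metadata field equals the expected value,
 * or, for list-valued fields, contains it.
 */
function findDocumentsByMetadata(array $documents, string $key, mixed $expected): array
{
    return array_values(array_filter(
        $documents,
        function (array $doc) use ($key, $expected): bool {
            $actual = $doc['metadata'][$key] ?? null;

            return is_array($actual)
                ? in_array($expected, $actual, true)
                : $actual === $expected;
        }
    ));
}

$documents = [
    ['content' => 'Install steps', 'metadata' => ['category' => 'User Guide', 'tags' => ['setup', 'install']]],
    ['content' => 'Release notes', 'metadata' => ['category' => 'Changelog', 'tags' => ['release']]],
    ['content' => 'Untagged note', 'metadata' => []],
];

$byCategory = findDocumentsByMetadata($documents, 'category', 'User Guide');
$byTag = findDocumentsByMetadata($documents, 'tags', 'release');
print_r(array_column($byTag, 'content'));
```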
Checklist
- [x] Added metadata support to `Document`.
- [x] Updated `DataReader` and `FileDataReader` for metadata handling.
- [x] Enhanced `DocumentUtils` for metadata compatibility.
- [x] Included comprehensive tests.
- [x] Validated backward compatibility.
Let me know if you need any further adjustments or additional information! 🚀
Hey @raihan-js, thanks a lot for that, and sorry for the late reply. This is really good. How can I help you finish it?
Hey @MaximeThoonsen, thanks for getting back to me. This is complete, but the tests are failing here. Could you take a look and help me get them passing? Or do you have any suggestions?
@raihan-js You can use `composer lint` to fix the linting problem. And for the unit test, I suggested a change in a comment.
please merge and release
please merge and release
Sorry, but we cannot merge this PR as it is, since it would break many vector store implementations, as you can see by running `composer test:types`.