🐛 Bug Report: Inconsistency in recorded data across different vector databases
Which component is this bug for?
All Packages
📜 Description
Tried Traceloop version 0.26.4 with different vector databases while running a RAG application (watsonx + LangChain) and observed differences in behaviour. I expected the span information to be uniform across all vector DBs. I tested Milvus, Pinecone and Chroma; Milvus and Chroma were both tested using the in-memory option with LangChain, and for Pinecone I used a managed instance. Observations:
- Chroma - captures less information in the vector-db-related spans: it captures the embedding count and gives a similarity value, but does not include all 4 retrieved chunks. It returns just one chunk, and specifying parameters in the as_retriever() method does not seem to affect the span information collected. Reference for as_retriever: https://api.python.langchain.com/en/v0.1/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html#langchain_astradb.[…]Store.as_retriever
- Pinecone - does not capture the embedding count or a similarity value, but I could see the top 4 retrieved documents as part of another span.
- Milvus - does not capture the embedding count or a similarity value, and there also seems to be some problem with the retrieved context.
👟 Reproduction steps
Steps can be reproduced by running the RAG sample from LangChain with different vector databases. The LLM used is from watsonx, via the LangChain framework.
https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/
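For reference, here is a minimal repro sketch assuming the quickstart structure; the real runs used watsonx embeddings and an LLM, which are replaced with placeholders here so the snippet stays self-contained:

```python
# Minimal repro sketch, assuming the LangChain quickstart structure. The real run
# used watsonx embeddings/LLM; placeholders are used here so the snippet runs.
from traceloop.sdk import Traceloop
from langchain_core.documents import Document
from langchain_community.embeddings import FakeEmbeddings
from langchain_community.vectorstores import Chroma

Traceloop.init(app_name="rag-vector-db-comparison")

docs = [Document(page_content=f"Task decomposition chunk {i}") for i in range(8)]
embeddings = FakeEmbeddings(size=384)  # placeholder for the watsonx embedding model

# Swap Chroma for Milvus / Pinecone here to compare the spans each backend emits.
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# This retrieval call is what produces the vector-db spans being compared.
retrieved_docs = retriever.invoke("What is Task Decomposition?")
```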
👍 Expected behavior
Ideally, the following information should be captured consistently across all vector DBs (an illustrative sketch follows the list):
- embedding details, such as the count (for the stored knowledge base) and any additional information
- query embeddings and other details
- retrieved context information: the number of chunks matched; all matched chunks should be returned as per the configuration parameters set for the retriever (see the next point)
- the retrieval parameters configured should influence the actual results generated, e.g. the similarity algorithm to use when searching the query against the stored docs, the number of documents to retrieve, the similarity threshold, etc.
- any insights on the chunk(s) used for the final answer generation
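To make the expectation concrete, here is a sketch of the kind of span data I would expect every backend to produce; the attribute and event names below are hypothetical placeholders, not existing Traceloop semantic conventions:

```python
# Hypothetical illustration only; these attribute/event names are placeholders,
# not existing Traceloop/OpenLLMetry semantic conventions.
expected_span_data = {
    "attributes": {
        "vector_db.embeddings.count": 8,             # size of the stored knowledge base
        "vector_db.embeddings.model": "<model-id>",  # embedding model used
        "vector_db.query.top_k": 4,                  # retriever configuration actually applied
        "vector_db.query.similarity_metric": "cosine",
    },
    "events": [
        # one event per matched chunk, with its similarity/distance
        {"name": "db.query.result", "attributes": {"document": "...", "score": 0.83}},
    ],
}
```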
👎 Actual Behavior with Screenshots
Most information is missing and the behaviour is definitely not consistent.
🤖 Python Version
3.10
📃 Provide any additional context for the Bug.
No response
👀 Have you spent some time to check if this bug has been raised before?
- [X] I checked and didn't find a similar issue
Are you willing to submit PR?
None
An update to Expected Behaviour point 1: we should be able to capture the embedding model information as well. Thank you.
@nirga is this issue open? I would like to work on it.
Yes @cu8code!
Hey @cu8code, are you still interested in working on it, or can I take it from here?
@nirga is this issue open? I would like to work on it.
Hello
🔧 Proposed Solution for Issue #1870
Hi @traceloop team! 👋
I've analyzed the inconsistency problem across vector databases and have developed a comprehensive solution. I'd love to get your feedback before submitting a full PR.
🔍 Root Cause Analysis
After examining the codebase, I identified the core issues:
- Chroma: `wrapper.py` only records 1 chunk instead of all requested (`n_results` is ignored); see the sketch after this list
- Pinecone: `query_handlers.py` is missing embedding count capture and similarity values in the main span
- Milvus: `wrapper.py` is missing embedding count extraction and similarity value recording
- All: no standardization framework across vector databases
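To make the Chroma gap concrete (the other two follow the same pattern), this is the shape of the problem, illustrated with placeholder names rather than the actual code in `wrapper.py`:

```python
# Illustrative only; NOT the actual Chroma wrapper code, and the event name is a
# placeholder. It shows the pattern behind the symptom: only the first returned
# document becomes a span event, so the remaining n_results - 1 chunks never
# show up in the trace.
def add_first_result_only(span, results):
    # Chroma query results use nested lists: one inner list per query.
    documents = results.get("documents", [[]])[0]
    if documents:
        span.add_event("db.query.result", {"document": documents[0]})
```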
✨ Proposed Solution Architecture
I've created a standardized approach with:
1. Base Framework
```python
class BaseVectorDBInstrumentor:
    def record_query_start(self, span, query_params): ...      # standardized query info
    def record_query_embeddings(self, span, embeddings): ...   # consistent embedding events
    def record_retrieval_results(self, span, results): ...     # complete result recording
    def _normalize_results(self, results): ...                  # handle all DB formats
```
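For example, `_normalize_results` could map each backend's native result shape onto one common structure. The sketch below is one possible version, assuming Chroma/Pinecone responses have already been converted to plain dicts and Milvus results arrive as an iterable of hit lists:

```python
# One possible _normalize_results; a sketch under the assumptions stated above,
# not a definitive implementation.
def _normalize_results(self, results):
    """Map native result shapes onto a common [{"text": ..., "score": ...}] list."""
    if isinstance(results, dict) and "documents" in results:  # Chroma-style
        documents = results["documents"][0]
        scores = (results.get("distances") or [[None] * len(documents)])[0]
        return [{"text": d, "score": s} for d, s in zip(documents, scores)]
    if isinstance(results, dict) and "matches" in results:  # Pinecone-style
        return [
            {"text": (m.get("metadata") or {}).get("text"), "score": m.get("score")}
            for m in results["matches"]
        ]
    # Milvus-style fallback: hits exposing .distance (and an entity payload)
    return [
        {"text": str(getattr(hit, "entity", "")), "score": getattr(hit, "distance", None)}
        for hits in results
        for hit in hits
    ]
```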
2. Enhanced Database Wrappers
- Chroma: enhanced `_add_query_result_events()` to record ALL chunks (not just one); see the sketch after this list
- Pinecone: enhanced `set_query_result_attributes()` to capture the embedding count and consolidate results
- Milvus: enhanced `_set_search_attributes()` to extract the embedding count and record similarities
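As a concrete illustration of the Chroma change, recording all chunks could look roughly like this; the function and attribute names are placeholders, not the actual `_add_query_result_events()` implementation:

```python
# Hypothetical sketch: record every returned chunk as its own span event, plus a
# count attribute, instead of only the first match. Names are placeholders.
def add_all_query_result_events(span, results):
    documents = results.get("documents", [[]])[0]
    distances = results.get("distances", [[]])[0] or [0.0] * len(documents)
    span.set_attribute("db.query.returned_count", len(documents))
    for index, (document, distance) in enumerate(zip(documents, distances)):
        span.add_event(
            "db.query.result",
            {"index": index, "document": document, "distance": distance},
        )
```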
🎯 Key Improvements
Before (Current Issues):
- Chroma: shows 1/5 chunks ❌, no similarity scores ❌
- Pinecone: missing embedding count ❌, results in a separate span ❌
- Milvus: missing embedding count ❌, no similarity values ❌

After (With Solution):
- All DBs: show 5/5 chunks ✅, with similarity scores ✅
- All DBs: embedding count captured ✅, consistent events ✅
- All DBs: standardized attributes ✅, backward compatible ✅
🔄 Backward Compatibility
✅ Zero Breaking Changes: All existing attributes preserved
✅ Additive Only: New standardized attributes added alongside existing ones
✅ Gradual Migration: Users can adopt new attributes at their own pace
📊 Implementation Scope
Files to Modify:
- `packages/opentelemetry-instrumentation-chromadb/opentelemetry/instrumentation/chromadb/wrapper.py`
- `packages/opentelemetry-instrumentation-pinecone/opentelemetry/instrumentation/pinecone/query_handlers.py`
- `packages/opentelemetry-instrumentation-milvus/opentelemetry/instrumentation/milvus/wrapper.py`
New Files to Add:
- `packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/vector_db.py` (standardized attributes)
- Comprehensive test suite for cross-database consistency
✅ Validation Approach
I've created a complete test suite that validates the following (a sketch of one such check follows the list):
- All databases provide identical span structure
- Complete chunk retrieval (not partial results)
- Embedding counts captured across all DBs
- Similarity values/distances recorded
- Configuration parameters properly respected
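Here is a sketch of what one of those consistency checks could look like; the attribute names reuse the hypothetical placeholders from above, and the span object is assumed to be an exported OpenTelemetry `ReadableSpan`:

```python
# Sketch of a cross-database consistency assertion; attribute names are the same
# hypothetical placeholders used earlier, not existing semantic conventions.
REQUIRED_ATTRIBUTES = {"vector_db.embeddings.count", "vector_db.query.top_k"}

def assert_consistent_retrieval_span(span, expected_chunks: int) -> None:
    attributes = dict(span.attributes)
    missing = REQUIRED_ATTRIBUTES - attributes.keys()
    assert not missing, f"missing standardized attributes: {missing}"

    result_events = [e for e in span.events if e.name == "db.query.result"]
    assert len(result_events) == expected_chunks, "not all retrieved chunks were recorded"
    assert all("distance" in dict(e.attributes) for e in result_events)
```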
❓ Questions for Maintainers
- Approach Approval: Does this standardization approach align with your vision?
- Implementation Preference: Would you prefer this as a single large PR or split by database?
- Attribute Naming: Any preferences for the standardized attribute names?
- Testing Strategy: Should I include the cross-database consistency tests in the PR?
📋 Next Steps
If this approach looks good, I can:
- Submit complete PR with all enhanced implementations
- Include comprehensive tests validating the fixes
- Provide migration documentation for users
- Add performance benchmarks if needed
🔗 Solution Preview
I have complete implementations ready including:
- Enhanced instrumentors for all 3 databases
- Standardized base framework
- Comprehensive test suite
- Migration guide and documentation
Would love to hear your thoughts and get approval to proceed with the full PR! 🚀
Related: This directly addresses all issues mentioned in the original bug report and provides a foundation for consistent vector database instrumentation going forward.