
🐛 Bug Report: Inconsistency in recorded data across different vector databases

Open · LakshmiN5 opened this issue 1 year ago · 6 comments

Which component is this bug for?

All Packages

📜 Description

I tried Traceloop version 0.26.4 with different vector databases while running a RAG application (watsonx + LangChain) and observed differences in behaviour. I expected the span information to be uniform across all vector DBs. I tested Milvus, Pinecone and Chroma; Milvus and Chroma were both used with the in-memory option via LangChain, while for Pinecone I used a managed instance. Observations:

  • Chroma - captures less information in the vector-DB-related spans: it records the embedding count and a similarity value, but not all 4 retrieved chunks. Only one chunk is returned, and setting parameters in the as_retriever() method does not seem to affect the span information collected. Reference for as_retriever: https://api.python.langchain.com/en/v0.1/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html#langchain_astradb.[…]Store.as_retriever
  • Pinecone - does not capture the embedding count or the similarity value, but the top 4 retrieved documents are visible in a separate span.
  • Milvus - does not capture the embedding count or the similarity value, and there also appears to be a problem with the retrieved context.

👟 Reproduction steps

The issue can be reproduced by running the RAG quickstart from LangChain with different vector databases. The LLM is from watsonx, used through the LangChain framework.

https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/
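For reference, a minimal sketch of the kind of setup used (the sample texts and the embedding model here are placeholders, and the watsonx generation step is omitted since only the retrieval spans differ between databases):

from traceloop.sdk import Traceloop
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

Traceloop.init(app_name="rag-vector-db-comparison")

# Build a small in-memory Chroma store; swap in Pinecone or Milvus to compare spans.
texts = ["chunk one", "chunk two", "chunk three", "chunk four", "chunk five"]
vectorstore = Chroma.from_texts(texts, HuggingFaceEmbeddings())

# Ask for 4 chunks; the retrieval span should reflect all 4.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke("what is in the knowledge base?")
print(len(docs))  # 4 documents come back, but the Chroma span records only one chunk

The same query is then repeated with the Pinecone and Milvus vector stores to compare the emitted spans.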

👍 Expected behavior

Ideally, the following information should be captured consistently across all vector DBs (a hypothetical example of such a span is sketched after this list):

  1. embedding details - such as the count (for the stored knowledge base) and any additional information
  2. query embeddings and other query details
  3. retrieved context information - the number of chunks matched; all matched chunks should be returned, as per the configuration parameters set for the retriever (see point 4)
  4. the configured retrieval parameters should influence the actual results generated, e.g. the similarity algorithm used to match the query against the stored docs, the number of documents to retrieve, the similarity threshold, etc.
  5. any insights on the chunk(s) used for the final answer generation.
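To make this concrete, here is a purely hypothetical example of what a retrieval span could contain (the attribute and event names below are invented for illustration, not existing OpenLLMetry conventions):

# Hypothetical content of a single retrieval span (illustration only)
expected_retrieval_span = {
    "attributes": {
        "vector_db.vendor": "chroma",                  # or "pinecone" / "milvus"
        "vector_db.embeddings.count": 5,               # point 1: size of the stored knowledge base
        "vector_db.query.top_k": 4,                    # point 4: configured retrieval parameters
        "vector_db.query.similarity_threshold": 0.7,   # point 4
    },
    "events": [
        # point 3: one event per matched chunk, each with its similarity score
        {"name": "vector_db.query.result", "attributes": {"chunk.id": "doc-1", "chunk.score": 0.91}},
        {"name": "vector_db.query.result", "attributes": {"chunk.id": "doc-2", "chunk.score": 0.85}},
        # ...and so on for every matched chunk, not just the first one
    ],
}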

👎 Actual Behavior with Screenshots

Most of this information is missing, and the behaviour is clearly not consistent across the three databases.

🤖 Python Version

3.10

📃 Provide any additional context for the Bug.

No response

👀 Have you spent some time to check if this bug has been raised before?

  • [X] I checked and didn't find similar issue

Are you willing to submit PR?

None

LakshmiN5 · Aug 19 '24 12:08

An update to Expected Behaviour point 1: we should be able to capture the embedding model information as well. Thank you.

LakshmiN5 · Aug 19 '24 14:08

@nirga is this issue still open? I would like to work on it.

0hmX · Sep 22 '24 14:09

Yes @cu8code!

nirga · Sep 22 '24 14:09

Hey @cu8code, are you still interested in working on this, or can I take it from here?

> @nirga is this issue still open? I would like to work on it.

Sh950 · Jan 13 '25 02:01

Hello

Prashant-cyber394 · Mar 27 '25 15:03

🔧 Proposed Solution for Issue #1870

Hi @traceloop team! 👋

I've analyzed the inconsistency problem across vector databases and have developed a comprehensive solution. I'd love to get your feedback before submitting a full PR.

🔍 Root Cause Analysis

After examining the codebase, I identified the core issues:

  1. Chroma: wrapper.py only records 1 chunk instead of all requested (n_results ignored)
  2. Pinecone: query_handlers.py missing embedding count capture and similarity values in main span
  3. Milvus: wrapper.py missing embedding count extraction and similarity value recording
  4. All: No standardization framework across vector databases

✨ Proposed Solution Architecture

I've created a standardized approach with:

1. Base Framework

class BaseVectorDBInstrumentor:
    def record_query_start(self, span, query_params):      # standardized query info
        ...
    def record_query_embeddings(self, span, embeddings):    # consistent embedding events
        ...
    def record_retrieval_results(self, span, results):      # complete result recording
        ...
    def _normalize_results(self, results):                   # handle all DB formats
        ...
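To make _normalize_results concrete, here is a rough sketch of what it could do (the raw result shapes and field names are assumptions based on typical Chroma and Pinecone response formats, not final code):

def normalize_results(results):
    """Coerce DB-specific result payloads into a common list of
    {"id", "score", "document"} dicts (illustrative sketch only)."""
    normalized = []
    if isinstance(results, dict) and "ids" in results:
        # Chroma-style response: dict of parallel lists, one inner list per query
        distances = (results.get("distances") or [[]])[0]
        documents = (results.get("documents") or [[]])[0]
        for i, doc_id in enumerate(results["ids"][0]):
            normalized.append({
                "id": doc_id,
                "score": distances[i] if i < len(distances) else None,
                "document": documents[i] if i < len(documents) else None,
            })
    elif isinstance(results, dict) and "matches" in results:
        # Pinecone-style response: list of dict-like matches with id/score/metadata
        for match in results["matches"]:
            normalized.append({
                "id": match.get("id"),
                "score": match.get("score"),
                "document": (match.get("metadata") or {}).get("text"),
            })
    # A Milvus branch would follow the same pattern.
    return normalized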

2. Enhanced Database Wrappers

  • Chroma: Enhanced _add_query_result_events() to record ALL chunks (not just one) - see the sketch after this list
  • Pinecone: Enhanced set_query_result_attributes() to capture embedding count + consolidate results
  • Milvus: Enhanced _set_search_attributes() to extract embedding count + record similarities
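A rough sketch of the Chroma side of this change (the event and attribute names are placeholders, and I am not reproducing the current wrapper code verbatim); the point is simply to emit one span event per returned chunk rather than stopping at the first:

def add_query_result_events(span, results):
    """Record every retrieved chunk on the span (sketch; names are placeholders)."""
    ids = (results.get("ids") or [[]])[0]
    documents = (results.get("documents") or [[]])[0]
    distances = (results.get("distances") or [[]])[0]
    for i, doc_id in enumerate(ids):
        span.add_event(
            "db.query.result",
            attributes={
                "db.query.result.id": doc_id,
                "db.query.result.document": documents[i] if i < len(documents) else "",
                "db.query.result.distance": distances[i] if i < len(distances) else -1.0,
            },
        )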

🎯 Key Improvements

Before (Current Issues):

Chroma:   Shows 1/5 chunks ❌, no similarity scores ❌
Pinecone: Missing embedding count ❌, results in separate span ❌  
Milvus:   Missing embedding count ❌, no similarity values ❌

After (With Solution):

All DBs:  Shows 5/5 chunks ✅, with similarity scores ✅
All DBs:  Embedding count captured ✅, consistent events ✅
All DBs:  Standardized attributes ✅, backward compatible ✅

🔄 Backward Compatibility

Zero Breaking Changes: All existing attributes preserved
Additive Only: New standardized attributes are added alongside existing ones (illustrated in the sketch after this list)
Gradual Migration: Users can adopt new attributes at their own pace
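As a tiny illustration of the additive approach (both attribute names below are placeholders, the first standing in for whatever the wrapper sets today):

def set_result_count(span, count):
    # Existing attribute stays exactly as it is today (placeholder name here).
    span.set_attribute("db.chroma.query.results_count", count)
    # The standardized attribute is added alongside it, never replacing it.
    span.set_attribute("vector_db.query.result_count", count)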

📊 Implementation Scope

Files to Modify:

  1. packages/opentelemetry-instrumentation-chromadb/opentelemetry/instrumentation/chromadb/wrapper.py
  2. packages/opentelemetry-instrumentation-pinecone/opentelemetry/instrumentation/pinecone/query_handlers.py
  3. packages/opentelemetry-instrumentation-milvus/opentelemetry/instrumentation/milvus/wrapper.py

New Files to Add:

  1. packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/vector_db.py (standardized attributes - a naming sketch follows this list)
  2. Comprehensive test suite for cross-database consistency
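To ground question 3 under "Questions for Maintainers", here is a naming sketch for the proposed vector_db.py module (these constants are proposals, not names that exist in semconv_ai today):

# Proposed constants for semconv_ai/vector_db.py (naming open for discussion)
class VectorDBSpanAttributes:
    VECTOR_DB_VENDOR = "vector_db.vendor"
    VECTOR_DB_EMBEDDINGS_COUNT = "vector_db.embeddings.count"
    VECTOR_DB_EMBEDDINGS_MODEL = "vector_db.embeddings.model"
    VECTOR_DB_QUERY_TOP_K = "vector_db.query.top_k"
    VECTOR_DB_QUERY_SIMILARITY_THRESHOLD = "vector_db.query.similarity_threshold"

class VectorDBEvents:
    QUERY_RESULT = "vector_db.query.result"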

✅ Validation Approach

I've created a complete test suite that validates the following (a condensed example is sketched after this list):

  • All databases provide identical span structure
  • Complete chunk retrieval (not partial results)
  • Embedding counts captured across all DBs
  • Similarity values/distances recorded
  • Configuration parameters properly respected
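A condensed, illustrative version of one such test (the retrieval_span_for fixture is assumed rather than shown, and the attribute/event names are the proposed ones from above, not current conventions):

import pytest

EXPECTED_TOP_K = 4

@pytest.mark.parametrize("db_name", ["chroma", "pinecone", "milvus"])
def test_retrieval_span_consistency(db_name, retrieval_span_for):
    # retrieval_span_for is an assumed fixture that runs an instrumented top-k=4
    # query against the given database and returns the finished retrieval span.
    span = retrieval_span_for(db_name)
    chunk_events = [e for e in span.events if e.name == "vector_db.query.result"]
    assert len(chunk_events) == EXPECTED_TOP_K                      # all chunks, not just one
    assert span.attributes.get("vector_db.embeddings.count") is not None
    assert all("chunk.score" in e.attributes for e in chunk_events)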

❓ Questions for Maintainers

  1. Approach Approval: Does this standardization approach align with your vision?
  2. Implementation Preference: Would you prefer this as a single large PR or split by database?
  3. Attribute Naming: Any preferences for the standardized attribute names?
  4. Testing Strategy: Should I include the cross-database consistency tests in the PR?

📋 Next Steps

If this approach looks good, I can:

  1. Submit complete PR with all enhanced implementations
  2. Include comprehensive tests validating the fixes
  3. Provide migration documentation for users
  4. Add performance benchmarks if needed

🔗 Solution Preview

I have complete implementations ready including:

  • Enhanced instrumentors for all 3 databases
  • Standardized base framework
  • Comprehensive test suite
  • Migration guide and documentation

Would love to hear your thoughts and get approval to proceed with the full PR! 🚀


Related: This directly addresses all issues mentioned in the original bug report and provides a foundation for consistent vector database instrumentation going forward.

ankan288 · Oct 14 '25 16:10