
🐛 Bug Report: Inconsistency in recorded data across different vector databases

Open · LakshmiN5 opened this issue 1 year ago · 6 comments

Which component is this bug for?

All Packages

📜 Description

I tried Traceloop version 0.26.4 with different vector databases while running a RAG application (watsonx + LangChain) and observed differences in behaviour. I expected the span information to be uniform across all vector DBs. I tested Milvus, Pinecone and Chroma; Milvus and Chroma were both used with the in-memory option via LangChain, while for Pinecone I used a managed instance. Observations:

  • Chroma - captures less information in the vector-DB-related spans: it records the embedding count and a similarity value, but not all 4 retrieved chunks. Only one chunk is returned, and setting parameters in the as_retriever() method does not seem to affect the span information collected. Reference for as_retriever: https://api.python.langchain.com/en/v0.1/vectorstores/langchain_astradb.vectorstores.AstraDBVectorStore.html#langchain_astradb.[…]Store.as_retriever
  • Pinecone - does not capture the embedding count or the similarity value, but the top 4 retrieved documents are visible in a separate span.
  • Milvus - does not capture the embedding count or the similarity value, and there also appears to be a problem with the retrieved context.

👟 Reproduction steps

The issue can be reproduced by running the RAG quickstart from LangChain with different vector databases. The LLM is from watsonx, used through the LangChain framework.

https://python.langchain.com/v0.1/docs/use_cases/question_answering/quickstart/
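For reference, a minimal sketch of the kind of setup used (the sample texts and the embedding model here are placeholders, and the watsonx generation step is omitted since only the retrieval spans differ between databases):

from traceloop.sdk import Traceloop
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

Traceloop.init(app_name="rag-vector-db-comparison")

# Build a small in-memory Chroma store; swap in Pinecone or Milvus to compare spans.
texts = ["chunk one", "chunk two", "chunk three", "chunk four", "chunk five"]
vectorstore = Chroma.from_texts(texts, HuggingFaceEmbeddings())

# Ask for 4 chunks; the retrieval span should reflect all 4.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke("what is in the knowledge base?")
print(len(docs))  # 4 documents come back, but the Chroma span records only one chunk

The same query is then repeated with the Pinecone and Milvus vector stores to compare the emitted spans.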

👍 Expected behavior

Ideally, the following information should be captured consistently across all vector DBs (a hypothetical example of such a span is sketched after this list):

  1. embedding details - such as the count (for the stored knowledge base) and any additional information
  2. query embeddings and other query details
  3. retrieved context information - the number of chunks matched; all matched chunks should be returned, as per the configuration parameters set for the retriever (see point 4)
  4. the configured retrieval parameters should influence the actual results generated, e.g. the similarity algorithm used to match the query against the stored docs, the number of documents to retrieve, the similarity threshold, etc.
  5. any insights on the chunk(s) used for the final answer generation.
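To make this concrete, here is a purely hypothetical example of what a retrieval span could contain (the attribute and event names below are invented for illustration, not existing OpenLLMetry conventions):

# Hypothetical content of a single retrieval span (illustration only)
expected_retrieval_span = {
    "attributes": {
        "vector_db.vendor": "chroma",                  # or "pinecone" / "milvus"
        "vector_db.embeddings.count": 5,               # point 1: size of the stored knowledge base
        "vector_db.query.top_k": 4,                    # point 4: configured retrieval parameters
        "vector_db.query.similarity_threshold": 0.7,   # point 4
    },
    "events": [
        # point 3: one event per matched chunk, each with its similarity score
        {"name": "vector_db.query.result", "attributes": {"chunk.id": "doc-1", "chunk.score": 0.91}},
        {"name": "vector_db.query.result", "attributes": {"chunk.id": "doc-2", "chunk.score": 0.85}},
        # ...and so on for every matched chunk, not just the first one
    ],
}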

👎 Actual Behavior with Screenshots

Most of this information is missing, and the behaviour is clearly not consistent across the three databases.

🤖 Python Version

3.10

📃 Provide any additional context for the Bug.

No response

👀 Have you spent some time to check if this bug has been raised before?

  • [X] I checked and didn't find similar issue

Are you willing to submit PR?

None

LakshmiN5 · Aug 19 '24 12:08

An update to Expected Behaviour point 1: we should be able to capture the embedding model information as well. Thank you.

LakshmiN5 · Aug 19 '24 14:08

@nirga is this issue still open? I would like to work on it.

0hmX · Sep 22 '24 14:09

Yes @cu8code!

nirga · Sep 22 '24 14:09

Hey @cu8code, are you still interested in working on this, or can I take it from here?

> @nirga is this issue still open? I would like to work on it.

Sh950 · Jan 13 '25 02:01

Hello

Prashant-cyber394 · Mar 27 '25 15:03

🔧 Proposed Solution for Issue #1870

Hi @traceloop team! 👋

I've analyzed the inconsistency problem across vector databases and have developed a comprehensive solution. I'd love to get your feedback before submitting a full PR.

🔍 Root Cause Analysis

After examining the codebase, I identified the core issues:

  1. Chroma: wrapper.py only records 1 chunk instead of all requested (n_results ignored)
  2. Pinecone: query_handlers.py missing embedding count capture and similarity values in main span
  3. Milvus: wrapper.py missing embedding count extraction and similarity value recording
  4. All: No standardization framework across vector databases

✨ Proposed Solution Architecture

I've created a standardized approach with:

1. Base Framework

class BaseVectorDBInstrumentor:
    def record_query_start(self, span, query_params):      # standardized query info
        ...
    def record_query_embeddings(self, span, embeddings):    # consistent embedding events
        ...
    def record_retrieval_results(self, span, results):      # complete result recording
        ...
    def _normalize_results(self, results):                   # handle all DB formats
        ...
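To make _normalize_results concrete, here is a rough sketch of what it could do (the raw result shapes and field names are assumptions based on typical Chroma and Pinecone response formats, not final code):

def normalize_results(results):
    """Coerce DB-specific result payloads into a common list of
    {"id", "score", "document"} dicts (illustrative sketch only)."""
    normalized = []
    if isinstance(results, dict) and "ids" in results:
        # Chroma-style response: dict of parallel lists, one inner list per query
        distances = (results.get("distances") or [[]])[0]
        documents = (results.get("documents") or [[]])[0]
        for i, doc_id in enumerate(results["ids"][0]):
            normalized.append({
                "id": doc_id,
                "score": distances[i] if i < len(distances) else None,
                "document": documents[i] if i < len(documents) else None,
            })
    elif isinstance(results, dict) and "matches" in results:
        # Pinecone-style response: list of dict-like matches with id/score/metadata
        for match in results["matches"]:
            normalized.append({
                "id": match.get("id"),
                "score": match.get("score"),
                "document": (match.get("metadata") or {}).get("text"),
            })
    # A Milvus branch would follow the same pattern.
    return normalized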

2. Enhanced Database Wrappers

  • Chroma: Enhanced _add_query_result_events() to record ALL chunks (not just one) - see the sketch after this list
  • Pinecone: Enhanced set_query_result_attributes() to capture embedding count + consolidate results
  • Milvus: Enhanced _set_search_attributes() to extract embedding count + record similarities
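A rough sketch of the Chroma side of this change (the event and attribute names are placeholders, and I am not reproducing the current wrapper code verbatim); the point is simply to emit one span event per returned chunk rather than stopping at the first:

def add_query_result_events(span, results):
    """Record every retrieved chunk on the span (sketch; names are placeholders)."""
    ids = (results.get("ids") or [[]])[0]
    documents = (results.get("documents") or [[]])[0]
    distances = (results.get("distances") or [[]])[0]
    for i, doc_id in enumerate(ids):
        span.add_event(
            "db.query.result",
            attributes={
                "db.query.result.id": doc_id,
                "db.query.result.document": documents[i] if i < len(documents) else "",
                "db.query.result.distance": distances[i] if i < len(distances) else -1.0,
            },
        )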

🎯 Key Improvements

Before (Current Issues):

Chroma:   Shows 1/5 chunks ❌, no similarity scores ❌
Pinecone: Missing embedding count ❌, results in separate span ❌  
Milvus:   Missing embedding count ❌, no similarity values ❌

After (With Solution):

All DBs:  Shows 5/5 chunks ✅, with similarity scores ✅
All DBs:  Embedding count captured ✅, consistent events ✅
All DBs:  Standardized attributes ✅, backward compatible ✅

🔄 Backward Compatibility

Zero Breaking Changes: All existing attributes preserved
Additive Only: New standardized attributes are added alongside existing ones (illustrated in the sketch after this list)
Gradual Migration: Users can adopt new attributes at their own pace
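As a tiny illustration of the additive approach (both attribute names below are placeholders, the first standing in for whatever the wrapper sets today):

def set_result_count(span, count):
    # Existing attribute stays exactly as it is today (placeholder name here).
    span.set_attribute("db.chroma.query.results_count", count)
    # The standardized attribute is added alongside it, never replacing it.
    span.set_attribute("vector_db.query.result_count", count)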

📊 Implementation Scope

Files to Modify:

  1. packages/opentelemetry-instrumentation-chromadb/opentelemetry/instrumentation/chromadb/wrapper.py
  2. packages/opentelemetry-instrumentation-pinecone/opentelemetry/instrumentation/pinecone/query_handlers.py
  3. packages/opentelemetry-instrumentation-milvus/opentelemetry/instrumentation/milvus/wrapper.py

New Files to Add:

  1. packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/vector_db.py (standardized attributes - a naming sketch follows this list)
  2. Comprehensive test suite for cross-database consistency
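To ground question 3 under "Questions for Maintainers", here is a naming sketch for the proposed vector_db.py module (these constants are proposals, not names that exist in semconv_ai today):

# Proposed constants for semconv_ai/vector_db.py (naming open for discussion)
class VectorDBSpanAttributes:
    VECTOR_DB_VENDOR = "vector_db.vendor"
    VECTOR_DB_EMBEDDINGS_COUNT = "vector_db.embeddings.count"
    VECTOR_DB_EMBEDDINGS_MODEL = "vector_db.embeddings.model"
    VECTOR_DB_QUERY_TOP_K = "vector_db.query.top_k"
    VECTOR_DB_QUERY_SIMILARITY_THRESHOLD = "vector_db.query.similarity_threshold"

class VectorDBEvents:
    QUERY_RESULT = "vector_db.query.result"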

✅ Validation Approach

I've created a complete test suite that validates the following (a condensed example is sketched after this list):

  • All databases provide identical span structure
  • Complete chunk retrieval (not partial results)
  • Embedding counts captured across all DBs
  • Similarity values/distances recorded
  • Configuration parameters properly respected
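A condensed, illustrative version of one such test (the retrieval_span_for fixture is assumed rather than shown, and the attribute/event names are the proposed ones from above, not current conventions):

import pytest

EXPECTED_TOP_K = 4

@pytest.mark.parametrize("db_name", ["chroma", "pinecone", "milvus"])
def test_retrieval_span_consistency(db_name, retrieval_span_for):
    # retrieval_span_for is an assumed fixture that runs an instrumented top-k=4
    # query against the given database and returns the finished retrieval span.
    span = retrieval_span_for(db_name)
    chunk_events = [e for e in span.events if e.name == "vector_db.query.result"]
    assert len(chunk_events) == EXPECTED_TOP_K                      # all chunks, not just one
    assert span.attributes.get("vector_db.embeddings.count") is not None
    assert all("chunk.score" in e.attributes for e in chunk_events)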

❓ Questions for Maintainers

  1. Approach Approval: Does this standardization approach align with your vision?
  2. Implementation Preference: Would you prefer this as a single large PR or split by database?
  3. Attribute Naming: Any preferences for the standardized attribute names?
  4. Testing Strategy: Should I include the cross-database consistency tests in the PR?

📋 Next Steps

If this approach looks good, I can:

  1. Submit complete PR with all enhanced implementations
  2. Include comprehensive tests validating the fixes
  3. Provide migration documentation for users
  4. Add performance benchmarks if needed

🔗 Solution Preview

I have complete implementations ready including:

  • Enhanced instrumentors for all 3 databases
  • Standardized base framework
  • Comprehensive test suite
  • Migration guide and documentation

Would love to hear your thoughts and get approval to proceed with the full PR! 🚀


Related: This directly addresses all issues mentioned in the original bug report and provides a foundation for consistent vector database instrumentation going forward.

ankan288 · Oct 14 '25 16:10