lancedb bug(python): hybrid search crashes when combining fts and vector results

LanceDB version

v0.13.0

What happened?

(table
    .search(query_type='hybrid', vector_column_name='vector')
    .vector(phrase_embedding)
    .text(phrase)
    .where(condition_string)
    .limit(10)
    .to_arrow()
)

Traceback (most recent call last):
  File "/Users/james/exp/2024-10-04/run.py", line 99, in <module>
    hybrid_results = hybrid_search(phrase, phrase_embedding, condition_string)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/exp/2024-10-04/run.py", line 78, in hybrid_search
    .to_arrow()
     ^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/centauri-ai/lib/python3.12/site-packages/lancedb/query.py", line 1059, in to_arrow
    results = self._reranker.rerank_hybrid(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/centauri-ai/lib/python3.12/site-packages/lancedb/rerankers/rrf.py", line 64, in rerank_hybrid
    combined_results = self.merge_results(vector_results, fts_results)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/anaconda3/envs/centauri-ai/lib/python3.12/site-packages/lancedb/rerankers/base.py", line 147, in merge_results
    combined = pa.concat_tables(
               ^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 5309, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bounds: list<item: list<item: int64>>, file_name: string> output fields: struct<file_name: string, bounds: list<item: list<item: int64>>>

expected: no exception when performing hybrid search

Are there known steps to reproduce?

source code

import lancedb
import openai
import pandas as pd
import pyarrow as pa

openai_client = openai.OpenAI(
    api_key = "redacted",
    organization='org-redacted'
)

def get_embedding(content):
    result = openai_client.embeddings.create(
        input=content,
        model='text-embedding-3-small',
    )
    return result.data[0].embedding


contents = [
    'Water Margin',
    'Journey to the West',
    'Romance of the Three Kingdoms',
    'Dream of the Red Chamber',
]

embeddings = [{
    'content': content,
    'vector': get_embedding(content),
    'metadata': {
        'file_name': 'dummy.pdf',
        'bounds': [[0, 0], [1, 1]],
    },
} for content in contents]

embeddings_df = pd.DataFrame(embeddings)

vdb = lancedb.connect('.lancedb')
table_name = 'my_table'
if table_name in vdb.table_names():
    vdb.drop_table(table_name)
table = vdb.create_table(table_name, embeddings_df)
table.create_fts_index('content')
phrase = "computation of interest"
phrase_embedding = get_embedding(phrase)
condition_string = "metadata.file_name = 'dummy.pdf'"

def full_text_search(phrase, condition_string):
    return (table
        .search(phrase)
        .where(condition_string)
        .limit(10)
        .to_arrow()
    )

def vector_search(phrase_embedding, condition_string):
    return (table
        .search(phrase_embedding)
        .where(condition_string)
        .limit(10)
        .to_arrow()
    )

def hybrid_search(phrase, phrase_embedding, condition_string):
    return (table
        .search(query_type='hybrid', vector_column_name='vector')
        .vector(phrase_embedding)
        .text(phrase)
        .where(condition_string)
        .limit(10)
        .to_arrow()
    )

fts_results = full_text_search(phrase, condition_string)
vs_results = vector_search(phrase_embedding, condition_string)

print('Full Text Search Schema:')
print(fts_results.schema)
print('---')
print('Vector Search Schema:')
print(vs_results.schema)

# pa.concat_tables([fts_results, vs_results], promote_options='default')

hybrid_results = hybrid_search(phrase, phrase_embedding, condition_string)
print(hybrid_results)

I did some testing, and this is only reproducible under these conditions:

- hybrid search
- lancedb version >= 0.11.0 (I tested with 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, and 0.13.0); I believe the original change was https://github.com/lancedb/lancedb/pull/1456
- in the table schema, file_name is nested under metadata (metadata.file_name); I can't repro if file_name is a root-level table column

these are the fts / vector search result schemas:

$ python run.py 
Full Text Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
  child 0, item: float
metadata: struct<bounds: list<item: list<item: int64>>, file_name: string>
  child 0, bounds: list<item: list<item: int64>>
      child 0, item: list<item: int64>
          child 0, item: int64
  child 1, file_name: string
_score: double
---
Vector Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
  child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
  child 0, file_name: string
  child 1, bounds: list<item: list<item: int64>>
      child 0, item: list<item: int64>
          child 0, item: int64
_distance: float

As you can see, FTS yields the metadata struct fields in the order [bounds, file_name], whereas vector search yields [file_name, bounds]. Prior to v0.11.0, both searches returned the same order, [bounds, file_name].

Oct 05 '24 02:10 jameswu1991

I have also tried manually specifying a schema via:

vdb.create_table('my_table', embeddings_df, schema=my_schema)

to attempt to enforce correct ordering of schema fields

my_schema

from pyarrow import field, float32, float64, int64, list_, schema, string, struct

my_schema = schema(
    [
        field(
            'metadata',
            struct(
                [
                    field('bounds', list_(list_(float64()))),
                    field('file_name', string()),
                ]
            ),
        ),
        field('content', string()),
        field('vector', list_(float32(), list_size=1536)),  # text-embedding-3-small
    ]
)

to no avail

Oct 05 '24 03:10 jameswu1991

Your repro script runs fine on the main branch (though the error does reproduce on the latest stable release), so I think it may already have been fixed. Can you try again after building from source? Can you also specify which OS you're using? I tested it on macOS.

Oct 05 '24 04:10 AyushExel

I can confirm the issue is no longer reproducible on the latest main branch (I'm on e61ba7f at time of writing). I built on macOS and installed the wheel into my testing virtualenv.

(lancedb) JamesMPB:python james$ pwd
/Users/james/src/lancedb/python
(lancedb) JamesMPB:python james$ maturin develop
...
(lancedb) JamesMPB:python james$ readlink -f ../target/wheels/lancedb-0.14.0b0-cp38-abi3-macosx_11_0_arm64.whl 
/Users/james/src/lancedb/target/wheels/lancedb-0.14.0b0-cp38-abi3-macosx_11_0_arm64.whl

(2024-10-01) JamesMPB:2024-10-04 james$ pip install /Users/james/src/lancedb/target/wheels/lancedb-0.14.0b0-cp38-abi3-macosx_11_0_arm64.whl
(2024-10-01) JamesMPB:2024-10-04 james$ python run.py

stdout

Full Text Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
  child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
  child 0, file_name: string
  child 1, bounds: list<item: list<item: int64>>
      child 0, item: list<item: int64>
          child 0, item: int64
_score: double
---
Vector Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
  child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
  child 0, file_name: string
  child 1, bounds: list<item: list<item: int64>>
      child 0, item: list<item: int64>
          child 0, item: int64
_distance: float
pyarrow.Table
content: string
vector: fixed_size_list<item: float>[1536]
  child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
  child 0, file_name: string
  child 1, bounds: list<item: list<item: int64>>
      child 0, item: list<item: int64>
          child 0, item: int64
_relevance_score: float
----
content: [["红楼梦 (Dream of the Red Chamber)","三国演义 (Romance of the Three Kingdoms)","水浒传 (Water Margin)","西游记 (Journey to the West)"]]
vector: [[[-0.023994146,-0.007690431,-0.017557846,0.029957188,0.00567613,...,-0.001564707,0.015783131,-0.030170154,0.0018752821,-0.035565287],[-0.018358452,0.029828534,-0.0016523133,-0.026921516,-0.008531466,...,-0.01299732,0.02709004,0.0059562274,0.029070182,-0.008262883],[0.037150122,0.05927541,0.0033321213,0.014491699,-0.019689808,...,0.00902702,0.012001695,-0.0043256995,0.027408212,-0.012649944],[0.01638289,0.0028170177,0.009147022,-0.0048110243,-0.060096376,...,0.016791634,0.007092256,-0.0031208135,0.02439206,-0.012571632]]]
metadata: [
  -- is_valid: all not null
  -- child 0 type: string
["dummy.pdf","dummy.pdf","dummy.pdf","dummy.pdf"]
  -- child 1 type: list<item: list<item: int64>>
[[[0,0],[1,1]],[[0,0],[1,1]],[[0,0],[1,1]],[[0,0],[1,1]]]]
_relevance_score: [[0.032522473,0.032522473,0.015873017,0.015625]]

Oct 07 '24 21:10 jameswu1991

I did some additional testing, and it looks like the change that fixed the issue is the upgrade to pylance==0.18.0. My best guess, based only on scanning the release notes, is that the PR that fixed it was https://github.com/lancedb/lance/pull/2836.

$ pip freeze | grep lance
lancedb==0.13.0
pylance==0.17.0
$ python run.py
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bounds: list<item: list<item: int64>>, file_name: string> output fields: struct<file_name: string, bounds: list<item: list<item: int64>>>

$ pip install pylance==0.18.0
$ pip freeze | grep lance
lancedb==0.13.0
pylance==0.18.0
$ python run.py
👍
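
If it helps anyone else, here is a rough sketch of a startup check that flags a pylance install older than the version that worked in the test above. The 0.18.0 threshold comes only from that observation, and the helper names (`release_tuple`, `pylance_predates_fix`, `check_installed_pylance`) are made up for illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption: taken from the pip freeze comparison above, not from official release notes.
FIXED_IN = (0, 18, 0)


def release_tuple(version_string):
    """Parse the leading 'X.Y.Z' of a version string, ignoring suffixes such as 'b0'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def pylance_predates_fix(version_string):
    """True if the given pylance version is older than the assumed fix version."""
    return release_tuple(version_string) < FIXED_IN


def check_installed_pylance():
    """Return True/False for the installed pylance, or None if it is not installed."""
    try:
        installed = version("pylance")
    except PackageNotFoundError:
        return None
    return pylance_predates_fix(installed)


status = check_installed_pylance()
if status:
    print("pylance is older than 0.18.0; hybrid search on nested struct columns may fail")
```

Pre-release suffixes are ignored by this parse, so a 0.18.0 beta would be treated the same as 0.18.0.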

Oct 07 '24 22:10 jameswu1991