bug(python): hybrid search crashes when combining fts and vector results
LanceDB version
v0.13.0
What happened?
(table
    .search(query_type='hybrid', vector_column_name='vector')
    .vector(phrase_embedding)
    .text(phrase)
    .where(condition_string)
    .limit(10)
    .to_arrow()
)
Traceback (most recent call last):
File "/Users/james/exp/2024-10-04/run.py", line 99, in <module>
hybrid_results = hybrid_search(phrase, phrase_embedding, condition_string)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/james/exp/2024-10-04/run.py", line 78, in hybrid_search
.to_arrow()
^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/centauri-ai/lib/python3.12/site-packages/lancedb/query.py", line 1059, in to_arrow
results = self._reranker.rerank_hybrid(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/centauri-ai/lib/python3.12/site-packages/lancedb/rerankers/rrf.py", line 64, in rerank_hybrid
combined_results = self.merge_results(vector_results, fts_results)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/anaconda3/envs/centauri-ai/lib/python3.12/site-packages/lancedb/rerankers/base.py", line 147, in merge_results
combined = pa.concat_tables(
^^^^^^^^^^^^^^^^^
File "pyarrow/table.pxi", line 5309, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bounds: list<item: list<item: int64>>, file_name: string> output fields: struct<file_name: string, bounds: list<item: list<item: int64>>>
Expected: hybrid search completes without raising an exception.
Are there known steps to reproduce?
source code
import lancedb
import openai
import pandas as pd
import pyarrow as pa
openai_client = openai.OpenAI(
    api_key='redacted',
    organization='org-redacted',
)
# Embed a single string with the OpenAI embeddings endpoint.
def get_embedding(content):
    result = openai_client.embeddings.create(
        input=content,
        model='text-embedding-3-small',
    )
    return result.data[0].embedding
contents = [
    'Water Margin',
    'Journey to the West',
    'Romance of the Three Kingdoms',
    'Dream of the Red Chamber',
]
# Each row carries a nested metadata struct; the nesting is what triggers the bug.
embeddings = [{
    'content': content,
    'vector': get_embedding(content),
    'metadata': {
        'file_name': 'dummy.pdf',
        'bounds': [[0, 0], [1, 1]],
    },
} for content in contents]
embeddings_df = pd.DataFrame(embeddings)
vdb = lancedb.connect('.lancedb')
table_name = 'my_table'
if table_name in vdb.table_names():
    vdb.drop_table(table_name)
table = vdb.create_table(table_name, embeddings_df)
table.create_fts_index('content')
phrase = "computation of interest"
phrase_embedding = get_embedding(phrase)
condition_string = "metadata.file_name = 'dummy.pdf'"
def full_text_search(phrase, condition_string):
    return (table
        .search(phrase)
        .where(condition_string)
        .limit(10)
        .to_arrow()
    )
def vector_search(phrase_embedding, condition_string):
    return (table
        .search(phrase_embedding)
        .where(condition_string)
        .limit(10)
        .to_arrow()
    )
def hybrid_search(phrase, phrase_embedding, condition_string):
    return (table
        .search(query_type='hybrid', vector_column_name='vector')
        .vector(phrase_embedding)
        .text(phrase)
        .where(condition_string)
        .limit(10)
        .to_arrow()
    )
fts_results = full_text_search(phrase, condition_string)
vs_results = vector_search(phrase_embedding, condition_string)
print('Full Text Search Schema:')
print(fts_results.schema)
print('---')
print('Vector Search Schema:')
print(vs_results.schema)
# pa.concat_tables([fts_results, vs_results], promote_options='default')
hybrid_results = hybrid_search(phrase, phrase_embedding, condition_string)
print(hybrid_results)
I performed some testing and this is only reproducible under these conditions:
- hybrid search
- lancedb version >=0.11.0 (I tested with 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, and 0.13.0)
  - I believe the original change was https://github.com/lancedb/lancedb/pull/1456
- in the table schema, file_name is embedded under metadata.file_name. I can't repro if file_name is a root-level table column name
these are the fts / vector search result schemas:
$ python run.py
Full Text Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
child 0, item: float
metadata: struct<bounds: list<item: list<item: int64>>, file_name: string>
child 0, bounds: list<item: list<item: int64>>
child 0, item: list<item: int64>
child 0, item: int64
child 1, file_name: string
_score: double
---
Vector Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
child 0, file_name: string
child 1, bounds: list<item: list<item: int64>>
child 0, item: list<item: int64>
child 0, item: int64
_distance: float
As you can see, FTS yields the metadata struct fields in the order [bounds, file_name], whereas vector search yields [file_name, bounds]. Prior to v0.11.0, both searches returned the same order, [bounds, file_name].
I have also tried manually specifying a schema via:
vdb.create_table('my_table', embeddings_df, schema=my_schema)
to attempt to enforce correct ordering of schema fields
my_schema
from pyarrow import field, float32, float64, int64, list_, schema, string, struct
my_schema = schema(
    [
        field(
            'metadata',
            struct(
                [
                    field('bounds', list_(list_(float64()))),
                    field('file_name', string()),
                ]
            ),
        ),
        field('content', string()),
        field('vector', list_(float32(), list_size=1536)),  # text-embedding-3-small
    ]
)
to no avail
Your reproduction example works fine on the main branch (but the bug is reproducible on the latest stable release), so I think it might have been fixed. Can you try again after building from source? Can you also specify which OS you're using? I tested it on Mac.
I can confirm the issue is no longer reproducible on the latest main branch (I'm on e61ba7f at time of writing). I built on a Mac and installed the wheel into my testing virtualenv.
(lancedb) JamesMPB:python james$ pwd
/Users/james/src/lancedb/python
(lancedb) JamesMPB:python james$ maturin develop
...
(lancedb) JamesMPB:python james$ readlink -f ../target/wheels/lancedb-0.14.0b0-cp38-abi3-macosx_11_0_arm64.whl
/Users/james/src/lancedb/target/wheels/lancedb-0.14.0b0-cp38-abi3-macosx_11_0_arm64.whl
(2024-10-01) JamesMPB:2024-10-04 james$ pip install /Users/james/src/lancedb/target/wheels/lancedb-0.14.0b0-cp38-abi3-macosx_11_0_arm64.whl
(2024-10-01) JamesMPB:2024-10-04 james$ python run.py
stdout
Full Text Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
child 0, file_name: string
child 1, bounds: list<item: list<item: int64>>
child 0, item: list<item: int64>
child 0, item: int64
_score: double
---
Vector Search Schema:
content: string
vector: fixed_size_list<item: float>[1536]
child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
child 0, file_name: string
child 1, bounds: list<item: list<item: int64>>
child 0, item: list<item: int64>
child 0, item: int64
_distance: float
pyarrow.Table
content: string
vector: fixed_size_list<item: float>[1536]
child 0, item: float
metadata: struct<file_name: string, bounds: list<item: list<item: int64>>>
child 0, file_name: string
child 1, bounds: list<item: list<item: int64>>
child 0, item: list<item: int64>
child 0, item: int64
_relevance_score: float
----
content: [["红楼梦 (Dream of the Red Chamber)","三国演义 (Romance of the Three Kingdoms)","水浒传 (Water Margin)","西游记 (Journey to the West)"]]
vector: [[[-0.023994146,-0.007690431,-0.017557846,0.029957188,0.00567613,...,-0.001564707,0.015783131,-0.030170154,0.0018752821,-0.035565287],[-0.018358452,0.029828534,-0.0016523133,-0.026921516,-0.008531466,...,-0.01299732,0.02709004,0.0059562274,0.029070182,-0.008262883],[0.037150122,0.05927541,0.0033321213,0.014491699,-0.019689808,...,0.00902702,0.012001695,-0.0043256995,0.027408212,-0.012649944],[0.01638289,0.0028170177,0.009147022,-0.0048110243,-0.060096376,...,0.016791634,0.007092256,-0.0031208135,0.02439206,-0.012571632]]]
metadata: [
-- is_valid: all not null
-- child 0 type: string
["dummy.pdf","dummy.pdf","dummy.pdf","dummy.pdf"]
-- child 1 type: list<item: list<item: int64>>
[[[0,0],[1,1]],[[0,0],[1,1]],[[0,0],[1,1]],[[0,0],[1,1]]]]
_relevance_score: [[0.032522473,0.032522473,0.015873017,0.015625]]
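As a side note on the output above, the `_relevance_score` values are consistent with reciprocal rank fusion using a constant of 60 and ranks starting at 1, which is an assumption inferred from the numbers and from the `rrf.py` frame in the traceback, not something confirmed in this thread. The sketch below uses hypothetical rankings, chosen only because they reproduce the printed scores: the top two rows appear in both result lists, the last two only in the vector results.

```python
def rrf_scores(rankings, k=60):
    """Sum 1 / (k + rank) over every ranked list, with ranks starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical rankings (assumption): not taken from the actual search results.
vector_ranking = [
    "Dream of the Red Chamber",
    "Romance of the Three Kingdoms",
    "Water Margin",
    "Journey to the West",
]
fts_ranking = [
    "Romance of the Three Kingdoms",
    "Dream of the Red Chamber",
]

scores = rrf_scores([vector_ranking, fts_ranking])
for title, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{title}: {score:.9f}")
```

Under these assumptions the two rows found by both searches score 1/61 + 1/62 each, and the remaining rows score 1/63 and 1/64, matching the four values printed in the table.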
I did some additional testing, and it looks like the change that fixed the issue is the upgrade to pylance==0.18.0. My best guess is that the PR that fixed it was https://github.com/lancedb/lance/pull/2836, based on scanning the release notes alone.
$ pip freeze | grep lance
lancedb==0.13.0
pylance==0.17.0
$ python run.py
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bounds: list<item: list<item: int64>>, file_name: string> output fields: struct<file_name: string, bounds: list<item: list<item: int64>>>
$ pip install pylance==0.18.0
$ pip freeze | grep lance
lancedb==0.13.0
pylance==0.18.0
$ python run.py
👍
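For anyone pinned to lancedb 0.13.0, a small guard like the following can flag an environment that still has the older pylance. This is a sketch under assumptions: the 0.18.0 threshold comes only from the testing described above, `pylance_has_fix` and `parse_release` are hypothetical helper names, and the parser looks only at the first three numeric components, so pre-release suffixes such as `b0` are ignored.

```python
import re
from importlib import metadata

# Assumption: threshold taken from the manual testing in this thread.
MIN_PYLANCE = (0, 18, 0)


def parse_release(version):
    """Extract (major, minor, patch) from a version string like '0.14.0b0'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def pylance_has_fix(installed=None):
    """Return True or False, or None when pylance is not installed."""
    if installed is None:
        try:
            installed = metadata.version("pylance")
        except metadata.PackageNotFoundError:
            return None
    return parse_release(installed) >= MIN_PYLANCE


status = pylance_has_fix()
if status is None:
    print("pylance is not installed")
elif status:
    print("pylance is at least 0.18.0")
else:
    print("pylance is older than 0.18.0; hybrid search may hit the struct ordering error")
```

Passing a version string explicitly, for example `pylance_has_fix("0.17.0")`, checks a pinned requirement without touching the installed environment.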