[FEATURE] support vector and FTS indexes in lancedb connector

Open badmonster0 opened this issue 3 weeks ago • 5 comments

What is the use case? support vector and FTS indexes in lancedb connector


❤️ Contributors, please refer to 📙Contributing Guide. Unless the PR can be sent immediately (e.g. just a few lines of code), we recommend leaving a comment on the issue, such as "I'm working on it" or "Can I work on this issue?", to avoid duplicating work. Our Discord server is always open and friendly.

badmonster0 avatar Dec 01 '25 21:12 badmonster0

Created #1358. During testing, I noticed that LanceDB's FTS support is for the Cloud and Enterprise editions only. Nevertheless, we can check in this feature first and wait for user feedback to see if there are any problems.

georgeh0 avatar Dec 02 '25 06:12 georgeh0

thanks @georgeh0 !! cc @prrao87

badmonster0 avatar Dec 02 '25 06:12 badmonster0

Thank you! Will try it out in some workflows and raise issues if they arise.

prrao87 avatar Dec 02 '25 13:12 prrao87

Hi @georgeh0, I think the current implementation of FTS indexes (specifically, this statement) is incorrect. LanceDB OSS does support FTS, but this is totally our fault (it's poorly documented, and I'm fixing that part here).

Basically, when working with a LanceDB AsyncTable through the async API, which is what CocoIndex defaults to, we need to avoid create_fts_index and instead use the more general method create_index, as defined in this async test in our test suite. The documentation issue arose on our end because an older API relied on tantivy (a full-text search engine in the Rust ecosystem) before we had an async Python API. We then migrated to a native FTS index supported by the Lance format, to support async tables more fully.

This minimal example should show how to interface with the FTS index in LanceDB open source (no need for enterprise edition).

import asyncio
import lancedb
import polars as pl
from lancedb.index import FTS

data = pl.DataFrame(
    {
        "id": [1, 2],
        "text": ["His first language is spanish", "Her first language is english"],
    }
)

async def main(data: pl.DataFrame):
    uri = "ex_lancedb"
    db = await lancedb.connect_async(uri)
    tbl = await db.create_table("my_text", data=data, mode="overwrite")

    await tbl.create_index("text", config=FTS(language="English"))

    response = await tbl.search("spanish", query_type="fts")

    result = await response.limit(1).to_polars()
    print(result)


if __name__ == "__main__":
    asyncio.run(main(data))

Should output:

shape: (1, 3)
┌─────┬───────────────────────────────┬──────────┐
│ id  ┆ text                          ┆ _score   │
│ --- ┆ ---                           ┆ ---      │
│ i64 ┆ str                           ┆ f32      │
╞═════╪═══════════════════════════════╪══════════╡
│ 1   ┆ His first language is spanish ┆ 0.693147 │
└─────┴───────────────────────────────┴──────────┘

Hope this can be addressed inside cocoindex! For immediate reference, the test suite in test_index.py should contain all the async tests for index creation, and it's designed so that the relevant indexes and their parameters can be passed in a more general way. For now, I think a regular FTS index that accepts its relevant parameters and an IVF_PQ index that also accepts its relevant parameters should be sufficient.
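
To make the "more general way" concrete, here is a minimal sketch of how a connector could forward index parameters to create_index on an AsyncTable. It is an illustration under stated assumptions, not the CocoIndex implementation: IndexSpec, CONFIG_CLASS_NAMES, config_class_name, build_config, apply_indexes and the "vector" column name are all hypothetical. Only FTS and IvfPq from lancedb.index, and the create_index(column, config=...) call shown in the example above, are taken from LanceDB itself.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class IndexSpec:
    """Hypothetical connector-side description of one index (not a CocoIndex API)."""

    column: str
    kind: str  # "FTS" or "IVF_PQ"
    params: dict[str, Any] = field(default_factory=dict)


# Hypothetical kind names mapped to config class names in lancedb.index.
CONFIG_CLASS_NAMES = {"FTS": "FTS", "IVF_PQ": "IvfPq"}


def config_class_name(spec: IndexSpec) -> str:
    """Return the lancedb.index class name for a spec, rejecting unknown kinds."""
    try:
        return CONFIG_CLASS_NAMES[spec.kind]
    except KeyError:
        raise ValueError(f"unsupported index kind: {spec.kind!r}") from None


def build_config(spec: IndexSpec) -> Any:
    """Instantiate the LanceDB config object, forwarding params unchanged."""
    import lancedb.index as lance_index  # imported lazily; requires lancedb

    return getattr(lance_index, config_class_name(spec))(**spec.params)


async def apply_indexes(tbl: Any, specs: list[IndexSpec]) -> None:
    """Create each index on an AsyncTable via the general create_index method."""
    for spec in specs:
        await tbl.create_index(spec.column, config=build_config(spec))


# Example specs; the "vector" column name is an assumption for illustration.
specs = [
    IndexSpec("text", "FTS", {"language": "English"}),
    IndexSpec("vector", "IVF_PQ", {"distance_type": "cosine"}),
]
```

Inside an async function, `await apply_indexes(tbl, specs)` would then create both indexes on an existing AsyncTable. Keeping params as a plain dict means new index options can be passed through without changing the connector, at the cost of LanceDB validating them only at index creation time.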

Thanks!

prrao87 avatar Dec 02 '25 21:12 prrao87

Thanks for keeping us informed, @prrao87, we will take a look! cc @georgeh0

badmonster0 avatar Dec 03 '25 01:12 badmonster0