
Request for compare/contrast with other solutions

Open billnye2 opened this issue 11 months ago • 1 comments

Hi, thanks for your previous help,

I was wondering if you could provide a high-level overview comparing the pros and cons of Lance with other solutions such as:

  • TileDB (array / dataframe database)
  • Zarr
  • HDF5
  • other array-like DBs
  • or even Postgres with a single row and array fields storing all data

After coming to pyarrow, I realized its random-access speed was limiting, so when I found Lance I was surprised at how well it performed here. Then I realized Lance looks like a database on disk, with its separate data files, transactions, metadata, etc. After researching a bit, I thought the best solution for random access without loading everything into RAM would be some sort of array database that is optimized and purpose-built for performing lookups on a supplied index. That is how I found TileDB, and there are probably lots of others too. An array-only DB seems very simple compared to Postgres or other DBs, so this concept must have existed for decades by now; I'd be surprised if it hasn't. Any comparison between Lance and other solutions would be very cool!

Thanks!

billnye2 avatar Mar 08 '24 00:03 billnye2

There's a lot we could say, but to keep it short:

  • Versus Postgres: Lance is a data lake format, so compute and storage are separate, whereas with Postgres and similar DBs you need to manage instances and keep them up.
  • Zarr and HDF5: I'm not too familiar with these, but they seem to be more array-based and hierarchical, whereas Lance is tabular (based on Apache Arrow) with a focus on arrays. So while they might have somewhat better support for NumPy, our support for OLAP query engines like DuckDB and Polars is much better.
  • TileDB: I haven't looked at this before, but it looks like a database focused on arrays and multi-model data, which is similar to Lance. However, I don't see anything on secondary indices, particularly vector search indices. So I think those indices are the main benefit of Lance over that system.

wjones127 avatar Mar 11 '24 21:03 wjones127
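
For readers unfamiliar with the access pattern discussed in this thread ("lookups on a supplied index" without loading everything into RAM), here is a minimal standard-library sketch of the general idea. It is not how Lance, TileDB, Zarr, or HDF5 are implemented; the fixed-width row layout, the file name, and the `write_rows` / `take` helpers are all hypothetical and exist only for illustration.

```python
import os
import struct
import tempfile

# Hypothetical layout: every row is 4 little-endian float64 values (32 bytes).
ROW_FORMAT = "<4d"
ROW_SIZE = struct.calcsize(ROW_FORMAT)


def write_rows(path, rows):
    """Write rows back to back so row i starts at byte offset i * ROW_SIZE."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(ROW_FORMAT, *row))


def take(path, indices):
    """Read only the requested rows by seeking straight to their offsets."""
    out = []
    with open(path, "rb") as f:
        for i in indices:
            f.seek(i * ROW_SIZE)
            out.append(struct.unpack(ROW_FORMAT, f.read(ROW_SIZE)))
    return out


# Illustrative data: 1000 rows, then fetch three of them in arbitrary order.
path = os.path.join(tempfile.mkdtemp(), "rows.bin")
write_rows(path, [(float(i), i * 2.0, i * 3.0, i * 4.0) for i in range(1000)])
picked = take(path, [999, 0, 42])
```

Because every row has the same width, the cost of a lookup depends on how many rows are requested rather than on the size of the file. Real formats add the parts this sketch leaves out, such as variable-width values, compression, chunking, and metadata.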