lance
Request for compare/contrast with other solutions
Hi, thanks for your previous help,
I was wondering if you could provide a high-level overview comparing the pros/cons of Lance vs. other solutions such as:
- TileDB (array / dataframe database)
- Zarr
- HDF5
- other array-like DBs
- or even Postgres with a single row and array fields storing all data
After coming to pyarrow, I realized its random-access speed was limiting, so I was surprised to find that Lance performs well at this. Then I realized Lance looks like a database on disk, with its separate data files, transactions, metadata, etc. After researching a bit, I thought the best solution for random access without loading everything into RAM would be some sort of array database that is optimized and purpose-built for lookups on a supplied index. That is how I found TileDB, and there are probably lots of others too. An array-only DB seems very simple compared to Postgres or other DBs, so I'd be surprised if this concept hadn't existed for decades by now. Any comparison between Lance and other solutions would be very cool!
Thanks!
There's a lot we could say, but to keep it short:
- Versus Postgres: Lance is a data lake format, so compute and storage are separate. With Postgres and similar DBs, you need to manage instances and keep them up.
- Zarr and HDF5: I'm not too familiar with these, but they seem to be more array-based and hierarchical, whereas Lance is tabular (based on Apache Arrow) with a focus on arrays. So while they might have somewhat better support for NumPy, our support for OLAP query engines like DuckDB and Polars is much better.
- TileDB: I haven't looked at this before, but it looks like a database focused on arrays and multi-model data, which is similar to Lance. However, I don't see anything on secondary indices, particularly vector search indices. So I think those indices are the main benefit of Lance over that system.
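
For context on the random-access point raised in the question, here is a toy sketch of "lookups on a supplied index without loading everything into RAM". It is not how Lance, TileDB, Zarr, or HDF5 store data; it only assumes a single column of fixed-width float64 values, so row `i` can be found by seeking to a byte offset. The names `write_column` and `take`, and the file layout, are hypothetical and chosen just for this illustration.

```python
import os
import struct
import tempfile

# Assumed layout: one little-endian float64 per row, no header.
RECORD = struct.Struct("<d")

def write_column(path, values):
    """Write values as fixed-width records so row i starts at byte i * RECORD.size."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))

def take(path, indices):
    """Read only the requested rows by seeking to each row's byte offset."""
    out = []
    with open(path, "rb") as f:
        for i in indices:
            f.seek(i * RECORD.size)
            (value,) = RECORD.unpack(f.read(RECORD.size))
            out.append(value)
    return out

path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, [i * 0.5 for i in range(1000)])

# Fetch three arbitrary rows without reading the other 997.
rows = take(path, [3, 999, 42])
print(rows)  # [1.5, 499.5, 21.0]
```

Real formats add a lot on top of this idea (variable-width values, compression, metadata, versioning, indices), which is where the differences between the systems above come from.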