apply compression to lance files

Open prabhatsharma opened this issue 1 year ago • 3 comments

Parquet files can be natively compressed using zstd or other codecs. Is it possible to apply compression to Lance files?

prabhatsharma avatar Mar 15 '23 09:03 prabhatsharma

Currently we haven't focused on compression. It will take a little more work to enable compression while still preserving random access performance. It's on the roadmap for sure, though.

changhiskhan avatar Mar 15 '23 20:03 changhiskhan

Our Parquet files are huge; compression is definitely needed.

kesavkolla avatar Mar 15 '23 21:03 kesavkolla

> Our Parquet files are huge; compression is definitely needed.

Are you able to post the distribution here? For example, which columns and types you have, and any estimates of how much space each column takes? That would definitely help us prioritize.

Also, you have timestamp data, right? How regular is the interval? E.g., hardware sensors tend to generate very regular intervals, while clickstreams generally have small and irregular intervals.
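One rough way to answer the interval question is to look at the gaps between consecutive timestamps and report their coefficient of variation (standard deviation divided by mean). The sketch below is only an illustration: the function name `interval_regularity` and the sample values are hypothetical, and timestamps are assumed to be plain numbers such as epoch seconds. A value near 0 suggests sensor-like regular data, and larger values suggest clickstream-like data.

```python
from statistics import mean, pstdev


def interval_regularity(timestamps):
    """Summarize the gaps between consecutive timestamps.

    Hypothetical helper: returns the mean gap, the population standard
    deviation of the gaps, and the coefficient of variation (std / mean).
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two timestamps")
    avg = mean(gaps)
    spread = pstdev(gaps)
    # If every timestamp is identical the mean gap is 0; report infinity.
    cv = spread / avg if avg else float("inf")
    return {"mean_gap": avg, "std_gap": spread, "cv": cv}


# Made-up sample data, for illustration only.
sensor = [0, 10, 20, 30, 40]   # perfectly regular gaps
clicks = [0, 1, 7, 8, 30]      # small, irregular gaps

print(interval_regularity(sensor))
print(interval_regularity(clicks))
```

Regular gaps generally suit delta-style encodings well, which is presumably why the distinction matters for prioritizing compression work.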

Thanks!

changhiskhan avatar Mar 16 '23 04:03 changhiskhan

Closing due to inactivity. Compression is on the roadmap. Once we have a more concrete design, we'll open a new issue with more specific details for discussion.

changhiskhan avatar Jul 02 '23 08:07 changhiskhan

Any update on compression for lance files? I have a parquet file with roughly 25 million rows. There are 388 columns with the following dtypes:

| dtype | count |
| --- | --- |
| int32 | 365 |
| categorical (large_string -> int32) | 11 |
| double | 7 |
| bool | 2 |
| int64 | 1 |
| date32[day] | 1 |
| large_string | 1 |

With LZ4 compression, the Parquet file is about 5.6 GB, but with the default settings in Lance I get 26 .lance files adding up to 40.3 GB.
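As a back-of-the-envelope cross-check, the dtype counts above can be turned into an estimate of the raw, uncompressed size. This is a sketch under stated assumptions: exactly 25 million rows, fixed-width Arrow-style values, bit-packed booleans, categorical columns counted as int32 codes only, and the single large_string column, dictionary values, and validity bitmaps ignored.

```python
# Assumed bytes per value for an uncompressed, fixed-width layout.
BYTES_PER_VALUE = {
    "int32": 4,
    "categorical_codes": 4,  # int32 codes only; dictionary values ignored
    "double": 8,
    "bool": 1 / 8,           # assumes bit-packed booleans
    "int64": 8,
    "date32": 4,
}

# Column counts from the dtype table (large_string omitted: variable width).
COLUMN_COUNTS = {
    "int32": 365,
    "categorical_codes": 11,
    "double": 7,
    "bool": 2,
    "int64": 1,
    "date32": 1,
}

ROWS = 25_000_000  # "roughly 25 million", so this is an approximation


def estimate_raw_bytes(rows, counts, widths):
    """Estimate uncompressed size as rows * sum(count * width) per dtype."""
    bytes_per_row = sum(counts[name] * widths[name] for name in counts)
    return rows * bytes_per_row


raw_gb = estimate_raw_bytes(ROWS, COLUMN_COUNTS, BYTES_PER_VALUE) / 1e9
print(f"estimated raw size: {raw_gb:.1f} GB")            # 39.3 GB
print(f"Lance / Parquet ratio: {40.3 / 5.6:.1f}x")       # 7.2x
```

Under these assumptions the estimate lands close to the 40.3 GB observed, which is consistent with the Lance files holding essentially uncompressed data, while the LZ4 Parquet file is roughly 7 times smaller.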

Love the project and appreciate your work on it!

benmayersohn avatar Mar 05 '24 14:03 benmayersohn