apply compression to lance files
Parquet files can be natively compressed using zstd or other codecs. Is it possible to apply compression to Lance files?
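For context, this is the kind of native codec support Parquet already exposes; a minimal pyarrow sketch (the table and file name are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Small stand-in table for illustration.
table = pa.table({
    "id": pa.array(range(1_000_000), type=pa.int32()),
    "value": pa.array([i * 0.5 for i in range(1_000_000)], type=pa.float64()),
})

# Parquet supports picking a codec (and level) at write time.
pq.write_table(table, "example.zstd.parquet", compression="zstd", compression_level=3)
```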
Currently we haven't focused on compression. It will take a little more work to enable compression while still preserving random access performance. It's on the roadmap for sure, though.
Our parquet files are huge, so compression is definitely a needed thing.
Are you able to post the distribution here? E.g. what columns and types, and any estimates for how much space each column takes? That would definitely help us prioritize.
Also, you have timestamp data, right? How regular is the interval? E.g. hardware sensors tend to generate very regular intervals, while click streams generally have small and irregular intervals.
Thanks!
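For anyone wanting to share that kind of breakdown, here's a rough sketch of pulling per-column compressed/uncompressed sizes from parquet metadata with pyarrow (the file path is a placeholder):

```python
from collections import defaultdict

import pyarrow.parquet as pq

# Sum compressed/uncompressed bytes per column across all row groups.
meta = pq.ParquetFile("data.parquet").metadata  # placeholder path
sizes = defaultdict(lambda: [0, 0])
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        sizes[chunk.path_in_schema][0] += chunk.total_compressed_size
        sizes[chunk.path_in_schema][1] += chunk.total_uncompressed_size

# Print columns, largest (compressed) first.
for name, (compressed, raw) in sorted(sizes.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: {compressed / 2**20:.1f} MiB compressed, {raw / 2**20:.1f} MiB uncompressed")
```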
Closing due to inactivity. Compression is on the roadmap. Once we have a more concrete design, we'll open a new issue with more specific details for discussion.
Any update on compression for lance files? I have a parquet file with roughly 25 million rows. There are 388 columns with the following dtypes:
| dtype | count |
|---|---|
| int32 | 365 |
| categorical (large_string -> int32) | 11 |
| double | 7 |
| bool | 2 |
| int64 | 1 |
| date32[day] | 1 |
| large_string | 1 |
With LZ4 compression the parquet file is about 5.6 GB, but with the default settings in lance I get 26 .lance files adding up to 40.3 GB.
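For reproduction, the conversion is roughly along these lines (just a sketch, assuming the pylance `write_dataset` API with default options; the paths are placeholders):

```python
import lance
import pyarrow.parquet as pq

# Read the existing parquet file and write it out as a Lance dataset
# with default settings. Paths are placeholders.
table = pq.read_table("data.parquet")
lance.write_dataset(table, "data.lance")
```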
Love the project and appreciate your work on it!