Lance applies no encoding/compression when converting a Parquet file to Lance

We have a lot of training data, which we currently store in Parquet format. Lance says it supports very wide schemas, which would solve our problem of having to read all the metadata from the Parquet footer even when only a few columns are projected.

I wrote the following code to convert a Parquet file to a Lance file, but the resulting Lance file is very large. It is almost the same size as the original uncompressed data, so it may not be using any encoding or compression. Is there something wrong?

import lance
import pyarrow.parquet as pq

src = "A0.parquet"
dst = "A0.lance"

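# Load the whole Parquet file into memory as an Arrow table.
table = pq.read_table(src)
schema = table.schema

# Write the table as a Lance dataset using the non-legacy file format.
lance.write_dataset(table, dst, schema=schema, use_legacy_format=False)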

Dec 24 '24 08:12 shenlei149

hi @shenlei149 could you share your schema? that would be helpful for investigating!

Dec 30 '24 17:12 BubbleCal

> could you share your schema? that would be helpful for investigating!

You can use this example file to reproduce the issue: https://drive.google.com/file/d/1VO2bEW_Q8lJamFQ4Id3WhLsxbua8x2oq/view?usp=drive_link

Dec 31 '24 04:12 shenlei149

@BubbleCal any updates on this?

Mar 28 '25 07:03 shenlei149

So the 2.0 file format isn't going to give you much compression. Compression was more thoroughly addressed in 2.1, which is still in beta. I just tried this file against 2.1 and it certainly is a fun one (300 MB/row, 14K columns!)

Looks like we're currently hitting a bug with boolean columns.

I'll work on it this week to see if I can get it working correctly so you can have a more valid comparison.
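
For anyone who wants to run the same comparison locally, here is a rough sketch. It assumes your pylance version accepts a data_storage_version argument on lance.write_dataset with the values "2.0" and "2.1"; please check that against the version you have installed. The helper names and output paths are made up for illustration. Since 2.1 is still in beta, the 2.1 write may fail on this particular file until the boolean column bug is fixed.

```python
from pathlib import Path


def path_size_bytes(path):
    """Total on-disk size of a file, or of every file under a directory.

    A Lance dataset is a directory, so its size is the sum of its files.
    """
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


def compare_storage_versions(src, versions=("2.0", "2.1")):
    """Write `src` once per storage version and return sizes in bytes."""
    # Imported lazily so path_size_bytes works without these packages.
    import lance
    import pyarrow.parquet as pq

    table = pq.read_table(src)
    sizes = {"parquet": path_size_bytes(src)}
    for version in versions:
        # Hypothetical output name, e.g. A0_v2.1.lance
        dst = f"{Path(src).stem}_v{version}.lance"
        # Assumption: data_storage_version selects the Lance file format.
        lance.write_dataset(
            table, dst, mode="overwrite", data_storage_version=version
        )
        sizes[version] = path_size_bytes(dst)
    return sizes
```

Calling compare_storage_versions("A0.parquet") should return a dict with the Parquet size and one entry per storage version, which makes it easy to see whether 2.1 actually shrinks the data.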

Mar 31 '25 23:03 westonpace