
no encodings/compression when converting parquet file to lance

Open shenlei149 opened this issue 1 year ago • 2 comments

We have a lot of training data, currently stored in Parquet format. Lance says it can support very wide schemas, which would solve our problem of having to read all the metadata from the Parquet footer even when only a few columns are projected.

I wrote the following code to convert a Parquet file to a Lance file, but the resulting Lance file is very large. It is almost the same size as the original data, so it may not be using any encoding or compression. Is there something wrong?

import lance
import pyarrow.parquet as pq

src = "A0.parquet"
dst = "A0.lance"

table = pq.read_table(src)
schema = table.schema

lance.write_dataset(table, dst, schema=schema, use_legacy_format=False)

shenlei149 avatar Dec 24 '24 08:12 shenlei149
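To put a number on "almost the same size", it can help to measure both outputs the same way. `lance.write_dataset` produces a directory rather than a single file, so its size is the sum of the files underneath it. The helper names below (`path_size_bytes`, `size_ratio`) are hypothetical and not part of Lance or PyArrow; this is only a minimal sketch using the standard library.

```python
import os


def path_size_bytes(path):
    """Return the size of a file, or the total size of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def size_ratio(src, dst):
    """Return size(dst) / size(src); values near 1.0 mean the sizes are similar."""
    return path_size_bytes(dst) / path_size_bytes(src)
```

With the file names from the snippet above, `size_ratio("A0.parquet", "A0.lance")` would give the Lance-to-Parquet size ratio once both exist on disk.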

Hi @shenlei149, could you share your schema? That would be helpful for investigating!

BubbleCal avatar Dec 30 '24 17:12 BubbleCal

> could you share your schema? that would be helpful for investigating!

You can use this example file to reproduce the issue: https://drive.google.com/file/d/1VO2bEW_Q8lJamFQ4Id3WhLsxbua8x2oq/view?usp=drive_link

shenlei149 avatar Dec 31 '24 04:12 shenlei149

@BubbleCal any updates on this?

shenlei149 avatar Mar 28 '25 07:03 shenlei149

So the 2.0 format isn't going to give you much compression. Compression was more thoroughly addressed in 2.1, which is still in beta. I just tried this file against 2.1 and it certainly is a fun one (300 MB/row, 14K columns!).

Looks like we're currently hitting a bug with boolean columns.

I'll work on it this week to see if I can get it working correctly so you can have a more valid comparison.

westonpace avatar Mar 31 '25 23:03 westonpace
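For anyone wanting to repeat the comparison against 2.1, the conversion from the original post might be adapted roughly as follows. This is a sketch under assumptions: it presumes the installed `pylance` release accepts `data_storage_version="2.1"` in `lance.write_dataset`, and the wrapper name `convert_parquet_to_lance` is hypothetical. Since 2.1 is described above as beta and a boolean-column bug is still open, the result may differ or fail on this particular file.

```python
def convert_parquet_to_lance(src, dst, version="2.1"):
    """Read a Parquet file and write it as a Lance dataset in the requested format version."""
    # Imported lazily so this module can be loaded without pyarrow/pylance installed.
    import lance
    import pyarrow.parquet as pq

    table = pq.read_table(src)
    # Assumption: this pylance version supports data_storage_version and the
    # value passed in `version`; older releases may reject it.
    return lance.write_dataset(table, dst, data_storage_version=version)
```

Calling `convert_parquet_to_lance("A0.parquet", "A0.lance")` would mirror the original script, with only the format version changed.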