no encodings/compression when converting parquet file to lance
We have a lot of training data, currently stored in Parquet format. Lance says it supports very wide schemas, which would solve our problem that we have to read all the metadata from the Parquet footer even when only a few columns are projected.
I wrote the following code to convert a Parquet file to a Lance file, but the resulting Lance file is very large. It is almost the same size as the original data, so it may not be applying any encoding or compression. Is there something wrong?
import lance
import pyarrow.parquet as pq

src = "A0.parquet"
dst = "A0.lance"

# read the whole Parquet file into an Arrow table, then write it out as a Lance dataset
table = pq.read_table(src)
schema = table.schema
lance.write_dataset(table, dst, schema=schema, use_legacy_format=False)
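For reference, here is how I compare the on-disk sizes (a minimal sketch; note that the Lance output at dst is a dataset directory, so its data files have to be summed):

import os

src_size = os.path.getsize(src)
# walk the Lance dataset directory and add up every file inside it
dst_size = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk(dst)
    for f in files
)
print(f"parquet: {src_size / 1e6:.1f} MB, lance: {dst_size / 1e6:.1f} MB")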
Hi @shenlei149, could you share your schema? That would be helpful for investigating!
You can use this example file to reproduce the issue: https://drive.google.com/file/d/1VO2bEW_Q8lJamFQ4Id3WhLsxbua8x2oq/view?usp=drive_link
@BubbleCal any updates on this?
So 2.0 isn't going to give you much compression. Compression was more thoroughly addressed in 2.1, which is still in beta. I just tried this file against 2.1, and it certainly is a fun one (300 MB/row, 14K columns!)
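If you want to try the 2.1 encodings yourself, newer pylance releases expose a data_storage_version parameter on write_dataset, which supersedes the deprecated use_legacy_format flag. A minimal sketch, assuming a build where "2.1" is an accepted version string (the exact accepted values may differ by release):

import lance
import pyarrow.parquet as pq

table = pq.read_table("A0.parquet")
# request the beta 2.1 file format instead of the stable 2.0 default;
# the "2.1" value here is an assumption and may vary between pylance versions
lance.write_dataset(table, "A0_v21.lance", data_storage_version="2.1")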
Looks like we're currently hitting a bug with boolean columns.
I'll work on it this week to see if I can get it working correctly so you can have a more valid comparison.