Add support for ORC format
- [x] Ability to read ORC file format based Iceberg tables
- [ ] Ability to create ORC file format based Iceberg tables
- [ ] Unit tests
- [x] Integration test for reads
- [ ] Integration test for writes
Hi @Fokko and @HonahX ✅ I have modified the read logic to read ORC file-based Iceberg tables and wrote an integration test too; it is working great.
Would love some guidance on the write side:
I found a way to create an ORC file-based Iceberg table via the Glue client (by passing the properties with `format=orc`).
But it is still producing Parquet data files when I append data. (Is it because the `DataFile` and `DeleteFile` logic defaults to the Parquet file format?)
```python
from decimal import Decimal

import pyarrow as pa

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # my default catalog is Glue
namespace = 'demo_ns'
table_name = 'test_table_dummy_orc_demo'
pylist = [
    {'decimal_col': Decimal('32768.1'), 'int_col': 1, 'string_col': "demo_one"},
    {'decimal_col': Decimal('44456.1'), 'int_col': 2, 'string_col': "demo_two"},
]
arrow_schema = pa.schema(
    [
        pa.field('decimal_col', pa.decimal128(33, 1)),
        pa.field('int_col', pa.int32()),
        pa.field('string_col', pa.string()),
    ],
)
arrow_table = pa.Table.from_pylist(pylist, schema=arrow_schema)
new_table = catalog.create_table(
    identifier=f'{namespace}.{table_name}',
    schema=arrow_schema,
    properties={
        'format': 'orc'
    },
)
new_table.append(arrow_table)
```
I believe we need to make a change in this `write_file` method to support ORC writes. The call chain is
`write_file` -> `dataframe_to_datafile` -> `append` / `overwrite`,
which is ultimately invoked by the user. Please correct me if I am heading in the wrong direction: https://github.com/apache/iceberg-python/blob/a11036873990cd9c8aae2c8af667e2974f4bac9d/pyiceberg/io/pyarrow.py#L1788
Hi @MehulBatra. Thanks for taking this! It looks like a great start.
> I believe we need to make a change in this write_file method to support ORC writes, as the link goes
Yes, I think this is the right place to add the ORC write logic. As you have noticed, the data file format is controlled by the table property `write.format.default`. Currently we do not support this property in pyiceberg, since we assume the format is Parquet.
We can add the property in https://github.com/apache/iceberg-python/blob/c4feda5db83cfb230caefa124d7a8f2600d920f7/pyiceberg/table/__init__.py#L206-L211 and document it here: https://github.com/apache/iceberg-python/blob/94e8a9835995e3b61f07f0dfb48d8a22a1e1d1b0/mkdocs/docs/configuration.md?plain=1#L53-L63
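A minimal sketch of that property lookup, defaulting to Parquet to preserve pyiceberg's current behaviour (the constant and helper names are hypothetical, not pyiceberg's actual code):

```python
from typing import Dict

# Iceberg's table property for the default data file format;
# the Python constant name here is illustrative.
WRITE_FORMAT_DEFAULT = "write.format.default"


def resolve_write_format(properties: Dict[str, str]) -> str:
    """Return the configured write format, falling back to parquet
    when the property is unset (pyiceberg's current assumption)."""
    return properties.get(WRITE_FORMAT_DEFAULT, "parquet").lower()
```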
In `write_file`, we check the `write.format.default` property and write to the correct format. For statistics, we may need a `data_file_statistics_from_orc` similar to https://github.com/apache/iceberg-python/blob/e61ef5770b4d73e683e2c78bebdd6c2165102a6b/pyiceberg/io/pyarrow.py#L1674-L1698
(we can make statistics collection as a follow-up feature since most statistics fields are optional)
Thanks, @HonahX, for the feedback. I will keep all of this in mind as I move forward!
I've added some comments for the read side.
We may try to merge the read support first and make write support a separate PR. WDYT?
@HonahX Works for me. I believe it will also benefit the community to get unblocked on the read side at least, while we grind on the write support. I have already started on the write-support changes and will raise a separate PR for them and attach it to #20.