iceberg-python

Add support for orc format

Open MehulBatra opened this issue 1 year ago • 5 comments

  • [x] Ability to read orc file format based Iceberg Table
  • [ ] Ability to create orc file format based Iceberg Table
  • [ ] Unit Test
  • [x] Integration Test for Reads
  • [ ] Integration Test for Writes

MehulBatra, Jun 03 '24 17:06

Hi @Fokko and @HonahX ✅ I have modified the read logic to read ORC-based Iceberg tables and wrote an integration test too; it is working great.

I would love some guidance on the following:

I found a way to create an ORC-based Iceberg table via the Glue client (by passing the table properties with format=orc).

However, appending data still produces Parquet data files. Is this because the DataFile and DeleteFile logic defaults to the Parquet file format?

from pyiceberg.catalog import load_catalog
from decimal import Decimal
import pyarrow as pa

catalog = load_catalog("default")  # my default catalog is Glue
namespace = 'demo_ns'
table_name = 'test_table_dummy_orc_demo'
pylist = [{'decimal_col': Decimal('32768.1'), 'int_col': 1, 'string_col': "demo_one"},
          {'decimal_col': Decimal('44456.1'), 'int_col': 2, 'string_col': "demo_two"}]
arrow_schema = pa.schema(
    [
        pa.field('decimal_col', pa.decimal128(33, 1)),
        pa.field('int_col',  pa.int32()),
        pa.field('string_col', pa.string()),
    ],
)
arrow_table = pa.Table.from_pylist(pylist, schema=arrow_schema)
new_table = catalog.create_table(
    identifier=f'{namespace}.{table_name}',
    schema=arrow_schema,
    properties={
        'format': 'orc'
    },
)

# Append to the table object returned by create_table
new_table.append(arrow_table)
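
One way to confirm which format the appended data files really use is to look at their leading magic bytes: Parquet files begin with the bytes PAR1 and ORC files begin with the bytes ORC. The sketch below assumes the data file has already been copied to a local path; sniff_data_file_format is a hypothetical helper written for illustration, not part of pyiceberg.

```python
from pathlib import Path

# Magic bytes defined by the respective file format specifications.
PARQUET_MAGIC = b"PAR1"  # Parquet files start (and end) with these 4 bytes
ORC_MAGIC = b"ORC"       # ORC files start with these 3 bytes


def sniff_data_file_format(path):
    """Return 'parquet', 'orc', or 'unknown' based on a local file's leading bytes.

    Hypothetical helper for illustration; not a pyiceberg API.
    """
    with Path(path).open("rb") as f:
        header = f.read(4)
    if header.startswith(PARQUET_MAGIC):
        return "parquet"
    if header.startswith(ORC_MAGIC):
        return "orc"
    return "unknown"
```

Running this against a data file downloaded from the table's data location shows whether the append wrote Parquet or ORC, regardless of the file extension.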

MehulBatra, Jun 05 '24 07:06

I believe we need to make a change in this write_file method to support ORC writes, since the call chain is:

write_file->dataframe_to_datafile->append || overwrite

and append or overwrite is what the user ultimately calls. Please correct me if I am heading in the wrong direction: https://github.com/apache/iceberg-python/blob/a11036873990cd9c8aae2c8af667e2974f4bac9d/pyiceberg/io/pyarrow.py#L1788

MehulBatra, Jun 05 '24 15:06

Hi @MehulBatra. Thanks for taking this! It looks like a great start.

> I believe we need to make a change in this write_file method to support ORC writes

Yes, I think this is the right place to add the ORC write logic. As you have noticed, the data file format is controlled by the table property write.format.default. Currently we do not support this property in pyiceberg since we assume the format is Parquet.

We can add the property in https://github.com/apache/iceberg-python/blob/c4feda5db83cfb230caefa124d7a8f2600d920f7/pyiceberg/table/__init__.py#L206-L211 and document it here: https://github.com/apache/iceberg-python/blob/94e8a9835995e3b61f07f0dfb48d8a22a1e1d1b0/mkdocs/docs/configuration.md?plain=1#L53-L63

In write_file, we can check the write.format.default property and write in the correct format. For statistics, we may need a data_file_statistics_from_orc similar to https://github.com/apache/iceberg-python/blob/e61ef5770b4d73e683e2c78bebdd6c2165102a6b/pyiceberg/io/pyarrow.py#L1674-L1698 (we can make statistics collection a follow-up feature, since most statistics fields are optional).
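
The format check described above could look roughly like the following. This is a minimal sketch and not pyiceberg's actual write_file: the helper names resolve_write_format and write_arrow_table are hypothetical, the property name is the one listed in the Iceberg table-properties documentation (write.format.default), and the only library calls used are pyarrow.orc.write_table and pyarrow.parquet.write_table. Statistics collection is deliberately left out.

```python
# Property name as listed in the Iceberg table-properties documentation.
# The default mirrors the current assumption that data files are Parquet.
WRITE_FORMAT_PROPERTY = "write.format.default"
WRITE_FORMAT_DEFAULT = "parquet"
SUPPORTED_FORMATS = ("parquet", "orc")


def resolve_write_format(table_properties):
    """Pick the data file format from a dict of table properties.

    Hypothetical helper for illustration; not a pyiceberg API.
    """
    fmt = table_properties.get(WRITE_FORMAT_PROPERTY, WRITE_FORMAT_DEFAULT).lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported {WRITE_FORMAT_PROPERTY}: {fmt!r}")
    return fmt


def write_arrow_table(arrow_table, path, table_properties):
    """Write a pyarrow.Table to `path` in the format the properties ask for.

    Returns the format that was used. Hypothetical helper, no statistics.
    """
    fmt = resolve_write_format(table_properties)
    if fmt == "orc":
        # pyarrow is imported locally so the property lookup works without it.
        import pyarrow.orc as orc

        orc.write_table(arrow_table, path)
    else:
        import pyarrow.parquet as pq

        pq.write_table(arrow_table, path)
    return fmt
```

With a lookup like this, a table created with properties={'format': 'orc'} would still resolve to Parquet, because only write.format.default is consulted. That would be consistent with the Parquet data files reported earlier in the thread.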

HonahX, Jun 10 '24 07:06

Thanks, @HonahX, for the feedback. I will take all of this into account as I move forward!

MehulBatra, Jun 10 '24 19:06

> I've added some comments for the read side.
>
> We may try to merge the read support first and make write support a separate PR. WDYT?

@HonahX Works for me. I believe it will also benefit the community to get unblocked on the read side at least; meanwhile, we can grind on the write support. I have already started on the write support changes and will raise a separate PR for them, linked to #20.

MehulBatra, Jun 17 '24 08:06