
Support string and binary view types.

Open · ritchie46 opened this issue 1 year ago · 3 comments

Arrow has adopted the (IMO) much better binary and string view types. Supporting these would mean we could move data zero-copy to delta-rs. Currently it fails:

df = pl.DataFrame({"a": ["foo", "bar"]})
df.write_delta("/tmp/bar.bar", mode="overwrite")
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In[5], line 2
      1 df = pl.DataFrame({"a": ["foo", "bar"]})
----> 2 df.write_delta("/tmp/bar.bar", mode="overwrite")

File ~/code/polars/py-polars/polars/dataframe/frame.py:3968, in DataFrame.write_delta(self, target, mode, overwrite_schema, storage_options, delta_write_options, delta_merge_options)
   3965     delta_write_options["schema_mode"] = "overwrite"
   3967 schema = delta_write_options.pop("schema", None)
-> 3968 write_deltalake(
   3969     table_or_uri=target,
   3970     data=data,
   3971     schema=schema,
   3972     mode=mode,
   3973     storage_options=storage_options,
   3974     large_dtypes=True,
   3975     **delta_write_options,
   3976 )
   3977 return None

File ~/miniconda3/lib/python3.10/site-packages/deltalake/writer.py:519, in write_deltalake(table_or_uri, data, schema, partition_by, mode, file_options, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, overwrite_schema, schema_mode, storage_options, partition_filters, predicate, large_dtypes, engine, writer_properties, custom_metadata)
    514 else:
    515     file_options = ds.ParquetFileFormat().make_write_options(
    516         use_compliant_nested_type=False
    517     )
--> 519 ds.write_dataset(
    520     data,
    521     base_dir="/",
    522     basename_template=f"{current_version + 1}-{uuid.uuid4()}-{{i}}.parquet",
    523     format="parquet",
    524     partitioning=partitioning,
    525     # It will not accept a schema if using a RBR
    526     schema=schema if not isinstance(data, RecordBatchReader) else None,
    527     file_visitor=visitor,
    528     existing_data_behavior="overwrite_or_ignore",
    529     file_options=file_options,
    530     max_open_files=max_open_files,
    531     max_rows_per_file=max_rows_per_file,
    532     min_rows_per_group=min_rows_per_group,
    533     max_rows_per_group=max_rows_per_group,
    534     filesystem=filesystem,
    535     max_partitions=max_partitions,
    536 )
    538 if table is None:
    539     write_deltalake_pyarrow(
    540         table_uri,
    541         schema,
   (...)
    549         custom_metadata,
    550     )

File ~/miniconda3/lib/python3.10/site-packages/pyarrow/dataset.py:1030, in write_dataset(data, base_dir, basename_template, format, partitioning, partitioning_flavor, schema, filesystem, file_options, use_threads, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, file_visitor, existing_data_behavior, create_dir)
   1027         raise ValueError("Cannot specify a schema when writing a Scanner")
   1028     scanner = data
-> 1030 _filesystemdataset_write(
   1031     scanner, base_dir, basename_template, filesystem, partitioning,
   1032     file_options, max_partitions, file_visitor, existing_data_behavior,
   1033     max_open_files, max_rows_per_file,
   1034     min_rows_per_group, max_rows_per_group, create_dir
   1035 )

File ~/miniconda3/lib/python3.10/site-packages/pyarrow/_dataset.pyx:4010, in pyarrow._dataset._filesystemdataset_write()

File ~/miniconda3/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: string_view
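
For context, the failure happens because the input reaches pyarrow as a string_view column; a quick way to confirm (a sketch, assuming a polars build that exports Arrow view types):

import polars as pl

df = pl.DataFrame({"a": ["foo", "bar"]})
print(df.to_arrow().schema)
# a: string_view  (with a polars/pyarrow combination that uses view types)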

ritchie46 avatar Jun 20 '24 07:06 ritchie46

@ritchie46 That's the pyarrow writer engine; we don't really have control there to change this. Our Rust writer relies on the arrow_cast crate, which supports utf8_view: https://docs.rs/arrow-cast/52.0.0/src/arrow_cast/cast/mod.rs.html#709

Can you check with delta_write_options = {"engine": "rust"}? This should at least run, but it will probably cast utf8_view to utf8 in the Rust writer, since our delta-schema-to-arrow-schema conversion naively translates a primitive string to arrow utf8.
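
For example, a minimal sketch of the suggested call (same repro as above; exact cast behavior depends on the installed deltalake version):

import polars as pl

df = pl.DataFrame({"a": ["foo", "bar"]})
# Route the write through delta-rs's Rust engine instead of the pyarrow dataset writer.
# Note: the Rust writer may still cast utf8_view down to utf8 internally.
df.write_delta(
    "/tmp/bar.bar",
    mode="overwrite",
    delta_write_options={"engine": "rust"},
)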

Also which pyarrow version did you use here?

ion-elgreco avatar Jun 20 '24 08:06 ion-elgreco

> Also which pyarrow version did you use here?

Pyarrow 16

> Can you check with delta_write_options = {"engine": "rust"}? This should at least run, but it will probably cast utf8_view to utf8 in the Rust writer, since our delta-schema-to-arrow-schema conversion naively translates a primitive string to arrow utf8.

Hmm.. the whole cast is the thing I want to circumvent. :/ Are there still impediments on the Rust side?

ritchie46 avatar Jun 20 '24 10:06 ritchie46

> Hmm.. the whole cast is the thing I want to circumvent. :/ Are there still impediments on the Rust side?

I understand :) On the Rust side we should allow utf8-view to be passed through, but right now we always cast record batches to the delta schema after converting it to an arrow schema.

There we don't yet have a way to let a delta string map to arrow utf8, large utf8, or utf8 view; it always becomes arrow utf8. That's also a problem today if your source holds a large-utf8 array whose data is too big to fit into a regular utf8 array.
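
A minimal sketch of that cast in pyarrow terms (the real cast happens in the Rust writer via arrow_cast; pyarrow 16+ assumed): casting string_view to plain utf8 materializes new offset/data buffers, which is exactly the copy being discussed.

import pyarrow as pa

view_tbl = pa.table({"a": pa.array(["foo", "bar"], type=pa.string_view())})
# What the writer's schema normalization effectively does today:
utf8_tbl = view_tbl.cast(pa.schema([pa.field("a", pa.string())]))
print(utf8_tbl.schema)  # a: string -- buffers were copied, not passed through zero-copy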

ion-elgreco avatar Jun 20 '24 10:06 ion-elgreco