delta-rs
Support string and binary view types.
Arrow has adopted the (IMO) much better binary and string view types. Supporting these would mean we could move data zero-copy to delta-rs. Currently it fails:
import polars as pl

df = pl.DataFrame({"a": ["foo", "bar"]})
df.write_delta("/tmp/bar.bar", mode="overwrite")
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
Cell In[5], line 2
1 df = pl.DataFrame({"a": ["foo", "bar"]})
----> 2 df.write_delta("/tmp/bar.bar", mode="overwrite")
File ~/code/polars/py-polars/polars/dataframe/frame.py:3968, in DataFrame.write_delta(self, target, mode, overwrite_schema, storage_options, delta_write_options, delta_merge_options)
3965 delta_write_options["schema_mode"] = "overwrite"
3967 schema = delta_write_options.pop("schema", None)
-> 3968 write_deltalake(
3969 table_or_uri=target,
3970 data=data,
3971 schema=schema,
3972 mode=mode,
3973 storage_options=storage_options,
3974 large_dtypes=True,
3975 **delta_write_options,
3976 )
3977 return None
File ~/miniconda3/lib/python3.10/site-packages/deltalake/writer.py:519, in write_deltalake(table_or_uri, data, schema, partition_by, mode, file_options, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, name, description, configuration, overwrite_schema, schema_mode, storage_options, partition_filters, predicate, large_dtypes, engine, writer_properties, custom_metadata)
514 else:
515 file_options = ds.ParquetFileFormat().make_write_options(
516 use_compliant_nested_type=False
517 )
--> 519 ds.write_dataset(
520 data,
521 base_dir="/",
522 basename_template=f"{current_version + 1}-{uuid.uuid4()}-{{i}}.parquet",
523 format="parquet",
524 partitioning=partitioning,
525 # It will not accept a schema if using a RBR
526 schema=schema if not isinstance(data, RecordBatchReader) else None,
527 file_visitor=visitor,
528 existing_data_behavior="overwrite_or_ignore",
529 file_options=file_options,
530 max_open_files=max_open_files,
531 max_rows_per_file=max_rows_per_file,
532 min_rows_per_group=min_rows_per_group,
533 max_rows_per_group=max_rows_per_group,
534 filesystem=filesystem,
535 max_partitions=max_partitions,
536 )
538 if table is None:
539 write_deltalake_pyarrow(
540 table_uri,
541 schema,
(...)
549 custom_metadata,
550 )
File ~/miniconda3/lib/python3.10/site-packages/pyarrow/dataset.py:1030, in write_dataset(data, base_dir, basename_template, format, partitioning, partitioning_flavor, schema, filesystem, file_options, use_threads, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group, file_visitor, existing_data_behavior, create_dir)
1027 raise ValueError("Cannot specify a schema when writing a Scanner")
1028 scanner = data
-> 1030 _filesystemdataset_write(
1031 scanner, base_dir, basename_template, filesystem, partitioning,
1032 file_options, max_partitions, file_visitor, existing_data_behavior,
1033 max_open_files, max_rows_per_file,
1034 min_rows_per_group, max_rows_per_group, create_dir
1035 )
File ~/miniconda3/lib/python3.10/site-packages/pyarrow/_dataset.pyx:4010, in pyarrow._dataset._filesystemdataset_write()
File ~/miniconda3/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: string_view
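For reference, the same error can be reproduced at the pyarrow layer without polars or deltalake in the loop. A minimal sketch, assuming pyarrow 16 (where the view types exist but the dataset Parquet writer cannot handle them yet); the output path is just an example:

import pyarrow as pa
import pyarrow.dataset as ds

# A column using the new string_view layout.
tbl = pa.table({"a": pa.array(["foo", "bar"], type=pa.string_view())})

# Fails with the same ArrowNotImplementedError: Unhandled type for
# Arrow to Parquet schema conversion: string_view
ds.write_dataset(tbl, "/tmp/bar_view", format="parquet")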
@ritchie46 That's the pyarrow writer engine; we don't really have control over that to change this. Our Rust writer relies on the arrow_cast crate, which supports utf8_view: https://docs.rs/arrow-cast/52.0.0/src/arrow_cast/cast/mod.rs.html#709.
Can you check with delta_write_options = {"engine": "rust"}? This should at least run, but it will probably cast utf8_view to utf8 in the Rust writer, since our Delta-schema-to-Arrow-schema conversion naively translates the primitive string type to arrow utf8.
Also which pyarrow version did you use here?
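Spelled out, the suggested check from the polars side would look like this sketch (delta_write_options is forwarded to write_deltalake, as the traceback above shows; the path is just an example):

import polars as pl

df = pl.DataFrame({"a": ["foo", "bar"]})
# Route the write through delta-rs's Rust writer instead of the
# pyarrow dataset writer; utf8_view may still be cast to utf8 inside.
df.write_delta(
    "/tmp/bar.bar",
    mode="overwrite",
    delta_write_options={"engine": "rust"},
)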
> Also which pyarrow version did you use here?

Pyarrow 16

> Can you check with delta_write_options = {"engine": "rust"}? This should at least run, but it will probably cast utf8_view to utf8 in the Rust writer, since our Delta-schema-to-Arrow-schema conversion naively translates the primitive string type to arrow utf8.

Hmm.. the whole cast is the thing I want to circumvent. :/ Are there still impediments on the Rust side?
I understand :) On the Rust side we should allow utf8_view to be passed through, but right now we always cast record batches to the Delta schema after converting it to an Arrow schema.
In that conversion we don't yet have a way to let a Delta string become arrow utf8, large utf8, or utf8_view; it always becomes arrow utf8. That is already a problem today if your source holds a large-string array whose data is too big to cast down to plain utf8.
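To make the pass-through idea concrete, here is a hypothetical sketch in pyarrow terms (resolve_string_type is not a delta-rs API; it just illustrates resolving a Delta string field against the source Arrow type instead of hard-coding utf8):

import pyarrow as pa

# All Arrow representations a Delta "string" field could map to.
STRING_FLAVORS = (pa.string(), pa.large_string(), pa.string_view())

def resolve_string_type(source_type: pa.DataType) -> pa.DataType:
    # Hypothetical helper: keep whichever string flavor the source
    # already uses, so no cast (and no copy) is needed; fall back to
    # plain utf8 for everything else.
    return source_type if source_type in STRING_FLAVORS else pa.string()

# utf8_view and large_utf8 inputs pass through instead of being cast.
print(resolve_string_type(pa.string_view()))   # string_view
print(resolve_string_type(pa.large_string()))  # large_string

The Rust-side equivalent would be the Delta-to-Arrow schema conversion taking the source DataType into account, per the comment above.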