String types not supported via tensorflow_io.arrow
Hello,
We (@maartenbreddels and I) are attempting to use tensorflow_io.arrow
to feed arrow data to a tensorflow Estimator, and are encountering issues when the input features are of type string.
This is the error that is being raised:
TypeError: Unsupported type in conversion from Arrow: string
I am attaching a fully reproducible example below.
I assume this is a known issue or a known missing feature. Could you please let us know if there are any known workarounds, and whether addressing this is on your (short-term) roadmap?
Thank you! Jovan.
Reproducible example (pyarrow 0.16.0, tensorflow_io 0.12.0):
```python
from functools import partial

import pyarrow
import pandas as pd
import tensorflow as tf
import tensorflow_io.arrow as arrow_io
from tensorflow_io.arrow.python.ops.arrow_dataset_ops import arrow_schema_to_tensor_types

# Get the training data
df = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')

# Simplify by selecting only a few columns
features = ['sex', 'class', 'fare']
df = df[features]

# When choosing numerical columns only - everything works:
# features = ['fare', 'age', 'n_siblings_spouses']
# df = df[features]

# Convert to a pyarrow Table
table = pyarrow.Table.from_pandas(df, preserve_index=False)


def generator(chunk_size):
    for batch in table.to_batches(chunk_size):
        yield batch


def get_batch_arrow_schema(arrow_batch):
    output_types, output_shapes = arrow_schema_to_tensor_types(arrow_batch.schema)
    return output_types, output_shapes


def to_dataset(batch_size=32):
    # Set up the iterator factory - for convenience
    iterator_factory = partial(generator, batch_size)
    # Get the arrow schema
    output_types, output_shapes = get_batch_arrow_schema(next(iterator_factory()))
    # Define the TF dataset
    ds = arrow_io.ArrowStreamDataset.from_record_batches(
        record_batch_iter=iterator_factory(),
        output_types=output_types,
        output_shapes=output_shapes,
        batch_mode='auto',
        record_batch_iter_factory=iterator_factory)
    # Reshape the data into the appropriate format
    ds = ds.map(lambda *tensors: (dict(zip(features, tensors[:-1])), tensors[-1]))
    return ds


ds = to_dataset()  # Raises: "TypeError: Unsupported type in conversion from Arrow: string"
```
/cc @BryanCutler to take a look.
Thanks @JovanVeljanoski, this is something that has been on my todo list for a while, but I just haven't had the time. As a workaround, could you encode your string features to a numeric array? PyArrow has support for dictionary encoding for this, only that tensorflow_io.arrow won't be able to use the dictionary batch.
Hi @BryanCutler, thank you for your response. Indeed, we can encode the string features prior to passing them to tensorflow_io.arrow, but I was hoping to leverage some of the tf.feature_column hashing/embedding options. In any case, I was just wondering what the status was.
Thank you for the great work on all this! Please let us know if/when you decide to tackle this feature. Cheers!
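For context, the tf.feature_column hashing option mentioned above boils down to the hashing trick: each raw string is mapped deterministically into a fixed number of integer buckets. A minimal sketch of the underlying op (the bucket count here is made up for illustration, and `tf.feature_column.categorical_column_with_hash_bucket` is, to my understanding, roughly equivalent to this under the hood):

```python
import tensorflow as tf

# Map raw strings deterministically into one of `hash_bucket_size` integer ids.
hash_bucket_size = 10
ids = tf.strings.to_hash_bucket_fast(['male', 'female', 'male'], hash_bucket_size)

# Equal strings always land in the same bucket, so the resulting ids can feed
# an embedding or one-hot layer in place of the raw string tensors.
```

This is why string tensors coming straight out of the dataset would be convenient: the hashing/embedding step could then live inside the TF graph rather than as a preprocessing pass.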
Hello, I think following this made it work: https://github.com/tensorflow/io/pull/1092#pullrequestreview-469875844. However, it does not work when the batch size on the pyarrow side is not set to 1:
```python
from functools import partial

import pyarrow
import pandas as pd
import tensorflow as tf
import tensorflow_io.arrow as arrow_io
from tensorflow_io.arrow.python.ops.arrow_dataset_ops import arrow_schema_to_tensor_types

# Get the training data
df = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')

# Simplify by selecting only a few columns
features = ['sex', 'class', 'fare']
df = df[features]

# When choosing numerical columns only - everything works:
# features = ['fare', 'age', 'n_siblings_spouses']
# df = df[features]

# Convert to a pyarrow Table
table = pyarrow.Table.from_pandas(df, preserve_index=False)


def generator(chunk_size):
    for batch in table.to_batches(chunk_size):
        yield batch


def get_batch_arrow_schema(arrow_batch):
    output_types, output_shapes = arrow_schema_to_tensor_types(arrow_batch.schema)
    return output_types, output_shapes


def to_dataset(batch_size=1):
    # Set up the iterator factory - for convenience
    iterator_factory = partial(generator, batch_size)
    # Get the arrow schema
    output_types, output_shapes = get_batch_arrow_schema(next(iterator_factory()))
    # Define the TF dataset
    ds = arrow_io.ArrowStreamDataset.from_record_batches(
        record_batch_iter=iterator_factory(),
        output_types=output_types,
        output_shapes=output_shapes,
        batch_size=40,
        batch_mode='drop_remainder',
        record_batch_iter_factory=iterator_factory)
    # Reshape the data into the appropriate format
    # ds = ds.map(lambda *tensors: (dict(zip(features, tensors[:-1])), tensors[-1]))
    return ds


ds = to_dataset()
for elem in ds:
    print(elem[0].shape)  # shape of size 40
```
Otherwise, on Colab I get a segfault with: ```Check failed: 1 == NumElements() (1 vs. 32) Must have a one element tensor``` (from tensorflow/core/framework/tensor.cc:673).
Smallest repro:
```python
import pyarrow
import tensorflow as tf
import tensorflow_io.arrow as arrow_io

aa = pyarrow.array(['a', 'bb', 'ccc'])
ubb = pyarrow.Table.from_pydict({'uu': aa}).to_batches(2)  # if set to 1 here, it's fine
ads = arrow_io.ArrowDataset.from_record_batches(
    ubb, columns=(0,), output_types=(tf.string,), batch_mode='auto')
dd = next(iter(ads))
```
Thanks @BryanCutler for the great work bringing Arrow to TensorFlow.
I am currently using it and have the same problem for tf.string. Can you shed some light on what causes this issue? Maybe I can take it over and fix it.
@tanguycdls @BryanCutler I created a PR for this feature; feel free to check it out: https://github.com/tensorflow/io/pull/1472