
String types not supported via tensorflow_io.arrow

Open · JovanVeljanoski opened this issue 4 years ago · 6 comments

Hello,

We (@maartenbreddels and I) are attempting to use tensorflow_io.arrow to feed arrow data to a tensorflow Estimator, and are encountering issues when the input features are of type string.

This is the error that is raised: `TypeError: Unsupported type in conversion from Arrow: string`. A fully reproducible example is attached below.

I assume this is a known issue or a known missing feature. Could you let us know whether there are any known workarounds, and whether addressing this is on your (short-term) roadmap?

Thank you! Jovan.

Reproducible example (pyarrow 0.16.0, tensorflow_io 0.12.0):

```python
from functools import partial
import pyarrow
import pandas as pd
import tensorflow as tf
import tensorflow_io.arrow as arrow_io
from tensorflow_io.arrow.python.ops.arrow_dataset_ops import arrow_schema_to_tensor_types


# Get the training data
df = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')

# Simplify by selecting only a few columns
features = ['sex', 'class', 'fare']
df = df[features]

# For example, choosing these numerical columns only - everything works
# features = ['fare', 'age', 'n_siblings_spouses']
# df = df[features]

# Convert to a pyarrow Table
table = pyarrow.Table.from_pandas(df, preserve_index=False)

def generator(chunk_size):
    for batch in table.to_batches(chunk_size):
        yield batch

def get_batch_arrow_schema(arrow_batch):
    output_types, output_shapes = arrow_schema_to_tensor_types(arrow_batch.schema)
    return output_types, output_shapes

def to_dataset(batch_size=32):
    # Set up the iterator factory - for convenience
    iterator_factory = partial(generator, batch_size)
    # Get the arrow schema
    output_types, output_shapes = get_batch_arrow_schema(next(iterator_factory()))

    # Define the TF dataset
    ds = arrow_io.ArrowStreamDataset.from_record_batches(record_batch_iter=iterator_factory(),
                                                         output_types=output_types,
                                                         output_shapes=output_shapes,
                                                         batch_mode='auto',
                                                         record_batch_iter_factory=iterator_factory)

    # Reshape the data into the appropriate format
    ds = ds.map(lambda *tensors: (dict(zip(features, tensors[:-1])), tensors[-1]))

    return ds

ds = to_dataset()   # Raises: "TypeError: Unsupported type in conversion from Arrow: string"
```

JovanVeljanoski avatar Mar 09 '20 11:03 JovanVeljanoski

/cc @BryanCutler to take a look.

yongtang avatar Mar 09 '20 17:03 yongtang

Thanks @JovanVeljanoski , this is something that has been on my todo list for a while but I just haven't had the time. As a workaround, could you encode your string features to a numeric array? PyArrow has support for dictionary encoding for this, only that tensorflow_io.arrow won't be able to use the dictionary batch.

BryanCutler avatar Mar 09 '20 17:03 BryanCutler
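As an illustration of the dictionary-encoding workaround suggested above, here is a minimal pure-Python stand-in sketch. The `encode_strings` helper and the sample values are hypothetical; in practice you would call pyarrow's `Array.dictionary_encode()` and feed the resulting integer indices (rather than the strings) to tensorflow_io.arrow.

```python
def encode_strings(values):
    # Stand-in for pyarrow's Array.dictionary_encode(): map each string
    # to an integer code, assigned in order of first appearance.
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return codes, list(dictionary)

codes, dictionary = encode_strings(["male", "female", "male"])
# codes are plain integers, which tensorflow_io.arrow can consume;
# the dictionary itself is kept on the side for decoding later
```

The trade-off, as noted above, is that the dictionary mapping has to be managed outside of tensorflow_io.arrow, since the dictionary batch itself is not consumable.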

Hi @BryanCutler, thank you for your response. Indeed, we can encode the string features before passing them to tensorflow_io.arrow, but I was hoping to leverage some of the tf.feature_column hashing/embedding options. In any case, I was just wondering what the status was.

Thank you for the great work on all this! Please let us know if/when you decide to tackle this feature. Cheers!

JovanVeljanoski avatar Mar 11 '20 10:03 JovanVeljanoski
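The tf.feature_column hashing mentioned above maps each raw string to a fixed integer bucket id. A minimal pure-Python sketch of that idea follows; the `hash_bucket` helper is hypothetical and uses MD5 rather than TensorFlow's actual fingerprint function, so the bucket ids will differ from what `tf.feature_column.categorical_column_with_hash_bucket` produces.

```python
import hashlib

def hash_bucket(value, num_buckets=10):
    # Deterministically map a string to one of num_buckets integer ids,
    # similar in spirit to categorical_column_with_hash_bucket.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

buckets = [hash_bucket(v) for v in ["male", "female", "male"]]
# identical strings always land in the same bucket
```

Unlike dictionary encoding, hashing needs no stored vocabulary, at the cost of possible bucket collisions.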

Hello, I think following this review comment made it work: https://github.com/tensorflow/io/pull/1092#pullrequestreview-469875844. However, it does not work when batch_size is not set to 1 on the pyarrow side:

```python
from functools import partial
import pyarrow
import pandas as pd
import tensorflow as tf
import tensorflow_io.arrow as arrow_io
from tensorflow_io.arrow.python.ops.arrow_dataset_ops import arrow_schema_to_tensor_types


# Get the training data
df = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')

# Simplify by selecting only a few columns
features = ['sex', 'class', 'fare']
df = df[features]

# For example, choosing these numerical columns only - everything works
# features = ['fare', 'age', 'n_siblings_spouses']
# df = df[features]

# Convert to a pyarrow Table
table = pyarrow.Table.from_pandas(df, preserve_index=False)

def generator(chunk_size):
    for batch in table.to_batches(chunk_size):
        yield batch

def get_batch_arrow_schema(arrow_batch):
    output_types, output_shapes = arrow_schema_to_tensor_types(arrow_batch.schema)
    return output_types, output_shapes

def to_dataset(batch_size=1):
    # Set up the iterator factory - for convenience
    iterator_factory = partial(generator, batch_size)
    # Get the arrow schema
    output_types, output_shapes = get_batch_arrow_schema(next(iterator_factory()))

    # Define the TF dataset
    ds = arrow_io.ArrowStreamDataset.from_record_batches(record_batch_iter=iterator_factory(),
                                                         output_types=output_types,
                                                         output_shapes=output_shapes,
                                                         batch_size=40,
                                                         batch_mode='drop_remainder',
                                                         record_batch_iter_factory=iterator_factory)

    # Reshape the data into the appropriate format
    #ds = ds.map(lambda *tensors: (dict(zip(features, tensors[:-1])), tensors[-1]))

    return ds

ds = to_dataset()
for elem in ds:
    print(elem[0].shape)  # (size 40)
```

Otherwise, on Colab I get a segfault with: `Check failed: 1 == NumElements() (1 vs. 32) Must have a one element tensor` from tensorflow/core/framework/tensor.cc:673.

Smallest repro:

```python
import tensorflow_io.arrow as arrow_io
import pyarrow
import tensorflow as tf

aa = pyarrow.array(['a', 'bb', 'ccc'])
ubb = pyarrow.Table.from_pydict({'uu': aa}).to_batches(2)  # if set to 1 here, it's fine
ads = arrow_io.ArrowDataset.from_record_batches(ubb, columns=(0,),
                                                output_types=(tf.string,), batch_mode='auto')
dd = next(iter(ads))
```

tanguycdls avatar Mar 29 '21 14:03 tanguycdls
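To make the "one row per record batch" workaround above concrete, here is a pure-Python stand-in sketch. The `single_row_batches` helper and its column-dict representation are hypothetical; with pyarrow you would simply pass 1 to `Table.to_batches()`.

```python
def single_row_batches(columns):
    # Stand-in for pyarrow's Table.to_batches(1): emit one-row "batches"
    # so each downstream string tensor holds exactly one element.
    n = len(next(iter(columns.values())))
    for i in range(n):
        yield {name: [values[i]] for name, values in columns.items()}

batches = list(single_row_batches({"uu": ["a", "bb", "ccc"]}))
# each batch carries a single row, matching the report that
# to_batches(1) avoids the crash while to_batches(2) does not
```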

> Thanks @JovanVeljanoski , this is something that has been on my todo list for a while but I just haven't had the time. As a workaround, could you encode your string features to a numeric array? PyArrow has support for dictionary encoding for this, only that tensorflow_io.arrow won't be able to use the dictionary batch.

Thanks @BryanCutler for the great work bringing Arrow to TensorFlow. I am currently using it and have hit the same problem with tf.string. Can you shed some light on what causes this issue? Maybe I can take it over and fix it.

austinzh avatar Jun 27 '21 00:06 austinzh

@tanguycdls @BryanCutler I created a PR for this feature; feel free to check it out: https://github.com/tensorflow/io/pull/1472

austinzh avatar Jul 12 '21 19:07 austinzh