io tfio.IODataset.from_parquet load array element failed

Open dpwang95 opened this issue 2 years ago • 18 comments

import pandas as pd
import tensorflow_io as tfio

test_data = pd.DataFrame({'a':[[1,2,3],[4,5,6]], 'b':['q','p']})
test_data.to_parquet('a.parquet')
another_data = tfio.experimental.IODataset.from_parquet('a.parquet').as_numpy_iterator()
next(another_data)

The result is:

OrderedDict([('a.list.item', 1), ('b', b'q')])

It seems to load only the first element of the array.

Nov 16 '21 07:11 dpwang95

I need this as well. When I use tfio.IODataset.from_parquet to load a similar Parquet file with array elements, it always reports "tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 0 of dimension 0 out of bounds. [Op:StridedSlice] name: IOFromParquet/ParquetIODataset/strided_slice/"

Jan 04 '22 06:01 GitWhilebear

Same problem.

import pandas as pd
import tensorflow_io as tfio

df = pd.DataFrame({
    "scores": [[.1,.2,.3], [.4,.5,.6], [.7,.8,.9]],
})
df.to_parquet("test.parquet")
ds = tfio.IODataset.from_parquet("test.parquet")
for el in ds:
    print(el)
print(ds.element_spec)

gives

OrderedDict([(b'scores.list.item', <tf.Tensor: shape=(), dtype=float64, numpy=0.1>)])
OrderedDict([(b'scores.list.item', <tf.Tensor: shape=(), dtype=float64, numpy=0.2>)])
OrderedDict([(b'scores.list.item', <tf.Tensor: shape=(), dtype=float64, numpy=0.3>)])
OrderedDict([(b'scores.list.item', TensorSpec(shape=(), dtype=tf.float64, name=None))])

The tfio version is 0.24.0.

Feb 17 '22 08:02 kvarekamp

+1 Same issue. Anyone found a resolution?

Feb 18 '22 21:02 arash2060

+1 same issue.

Feb 21 '22 18:02 wangxj03

+1 same issue

Feb 23 '22 20:02 deanoserrentino

+1 same issue

Mar 15 '22 23:03 deanbudd

Having the same issue. I'd love to be using IODataset.from_parquet instead of putting together a custom generator and creating a dataset from that. Any idea if/when this issue will be picked up?

Mar 28 '22 18:03 dholland42

+1 same issue. Has anyone found the right way to use it?

Apr 25 '22 07:04 Trangle

+1 same issue.

Since I am working in Databricks/PySpark, I will likely use the petastorm library to load the parquet files into a TensorFlow dataset, using their make_petastorm_dataset() wrapped around their make_batch_reader() function. This does not really fix the problem in tensorflow-io, but it could be an option for some of you. I'd also be glad to be kept in the loop on a solution!