
tfio.IODataset.from_parquet load array element failed

Open dpwang95 opened this issue 2 years ago • 18 comments

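import pandas as pd
import tensorflow_io as tfio

# Write a frame with one list-valued column, then read it back through tfio.
test_data = pd.DataFrame({'a':[[1,2,3],[4,5,6]], 'b':['q','p']})
test_data.to_parquet('a.parquet')
another_data = tfio.experimental.IODataset.from_parquet('a.parquet').as_numpy_iterator()
next(another_data)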

The result is:

OrderedDict([('a.list.item', 1), ('b', b'q')])

It seems that only the first element of the array is loaded.

dpwang95 avatar Nov 16 '21 07:11 dpwang95

I have the same need. When I use tfio.IODataset.from_parquet to load the same parquet file with an array element, it always reports "tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 0 of dimension 0 out of bounds. [Op:StridedSlice] name: IOFromParquet/ParquetIODataset/strided_slice/"

GitWhilebear avatar Jan 04 '22 06:01 GitWhilebear

Same problem.

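import pandas as pd
import tensorflow_io as tfio

df = pd.DataFrame({
    "scores": [[.1,.2,.3], [.4,.5,.6], [.7,.8,.9]],
})
df.to_parquet("test.parquet")
ds = tfio.IODataset.from_parquet("test.parquet")
# Each printed element is a single scalar rather than a full list.
for el in ds:
    print(el)
print(ds.element_spec)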

gives

OrderedDict([(b'scores.list.item', <tf.Tensor: shape=(), dtype=float64, numpy=0.1>)])
OrderedDict([(b'scores.list.item', <tf.Tensor: shape=(), dtype=float64, numpy=0.2>)])
OrderedDict([(b'scores.list.item', <tf.Tensor: shape=(), dtype=float64, numpy=0.3>)])
OrderedDict([(b'scores.list.item', TensorSpec(shape=(), dtype=tf.float64, name=None))])

The tfio version is 0.24.0.

kvarekamp avatar Feb 17 '22 08:02 kvarekamp

+1 Same issue. Anyone found a resolution?

arash2060 avatar Feb 18 '22 21:02 arash2060

+1 same issue.

wangxj03 avatar Feb 21 '22 18:02 wangxj03

+1 same issue

deanoserrentino avatar Feb 23 '22 20:02 deanoserrentino

+1 same issue

deanbudd avatar Mar 15 '22 23:03 deanbudd

Having the same issue. I'd love to be using IODataset.from_parquet instead of putting together a custom generator and creating a dataset from that. Any idea if/when this issue will be picked up?

dholland42 avatar Mar 28 '22 18:03 dholland42

+1 same issue. Has anyone found the right way to use it?

Trangle avatar Apr 25 '22 07:04 Trangle

+1 same issue.

Since I am working in Databricks/PySpark, I will likely use the petastorm library to load the parquet files into a TensorFlow dataset using their make_petastorm_dataset() wrapped around their make_batch_reader() function. This is not really fixing the problem with tensorflow IO but could be an option for some of you. And I'll gladly be kept in the loop for a solution!

RobindeGrootNL avatar May 17 '22 13:05 RobindeGrootNL

Hello, your email has been received.

Trangle avatar May 17 '22 13:05 Trangle

It seems this problem is not going to be solved for now!

Trangle avatar Jun 16 '22 05:06 Trangle

Also facing this issue on 0.26.0. Any updates on when this will be fixed?

Harshith-Batchu avatar Sep 02 '22 04:09 Harshith-Batchu

Hello, your email has been received.

Trangle avatar Sep 02 '22 04:09 Trangle

+1

dahiyaaneesh avatar Dec 14 '22 09:12 dahiyaaneesh

Facing the same issue with 0.31.0.

shivamsbatra avatar Mar 01 '23 06:03 shivamsbatra

Hello, your email has been received.

Trangle avatar Mar 01 '23 06:03 Trangle

+1

leandrolcampos avatar Mar 01 '23 20:03 leandrolcampos

+1

John1203 avatar Sep 25 '23 10:09 John1203

+1

nicholas-entis avatar Jan 17 '24 17:01 nicholas-entis

+1

thisisjaymehta avatar Mar 19 '24 12:03 thisisjaymehta