image array byte string encoding/decoding problem
I built a dataset in a Spark environment using spark-tensorflow-connector and read it in a stand-alone Python training script.
The dataset contains an image byte string alongside several other features, but the value of the image binary string feature is not what I expect.
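For context, the batch values shown below come from a tf.data pipeline roughly like the following (a simplified sketch, not my exact script; the feature spec just mirrors the column names used in this issue):
import tensorflow as tf

feature_spec = {
    'imageurl': tf.io.FixedLenFeature([], tf.string),
    'image_bin': tf.io.FixedLenFeature([], tf.string),
}
files = tf.io.gfile.glob(dst_path + '/part-*')   # dst_path = connector output directory
ds = (tf.data.TFRecordDataset(files)
      .map(lambda x: tf.io.parse_single_example(x, feature_spec))
      .batch(32))
batch = next(iter(ds))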
expected
res = requests.get(batch['imageurl'][0].numpy(), stream=True)
im = Image.open(res.raw).resize([224,224])
im_np = np.array(im)
len(im.tobytes()), im_np.tobytes()[:50]
> (150528,
'\x83\x94\xa4\x8f\x94\x9a\x87\x88\x8c\x8d\x98\x9e\x7f\x80\x84~\x8e\x8ee`f\x9d\x9e\xa0\x8b\x92\xa2\xa3\x98\xa8\x9f\x9d\xa8\xa4\xa6\xb5\xc0\xb2\xb1\x88\x84\x85\xff\xff\xff\xff\xff\xff\xff\xff')
value in tfrecord
# in eager execution mode
batch['image_bin'][0].numpy()[:50]
> '\xc2\x83\xc2\x94\xc2\xa4\xc2\x8f\xc2\x94\xc2\x9a\xc2\x87\xc2\x88\xc2\x8c\xc2\x8d\xc2\x98\xc2\x9e\x7f\xc2\x80\xc2\x84~\xc2\x8e\xc2\x8ee`f\xc2\x9d\xc2\x9e\xc2\xa0\xc2\x8b\xc2\x92\xc2\xa2\xc2'
I found that the image binary value in the tfrecord is encoded in UTF-8.
len(batch['image_bin'][0].numpy().decode('utf-8')), batch['image_bin'][0].numpy().decode('utf-8')[:50]
> (150528,
u'\x83\x94\xa4\x8f\x94\x9a\x87\x88\x8c\x8d\x98\x9e\x7f\x80\x84~\x8e\x8ee`f\x9d\x9e\xa0\x8b\x92\xa2\xa3\x98\xa8\x9f\x9d\xa8\xa4\xa6\xb5\xc0\xb2\xb1\x88\x84\x85\xff\xff\xff\xff\xff\xff\xff\xff')
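Since every decoded code point is below 256, the original bytes can apparently be recovered on the reading side by re-encoding with latin-1 (a workaround sketch, not a real fix):
# workaround sketch: undo the unwanted UTF-8 encoding after parsing
raw = batch['image_bin'][0].numpy().decode('utf-8').encode('latin-1')
im_np = np.frombuffer(raw, dtype=np.uint8).reshape(224, 224, 3)   # 150528 = 224*224*3 bytes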
Below is my PySpark code.
import requests
import numpy as np
from PIL import Image
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, BinaryType, StringType

# define functions
def download(url):
    res = requests.get(url, stream=True)
    im = Image.open(res.raw).resize((224, 224))
    return [np.array(im)[i].tobytes() for i in range(3)]

def download2(url):
    res = requests.get(url, stream=True)
    im = Image.open(res.raw).resize((224, 224))
    return np.array(im).tobytes()

# udfs
download_udf_bin = udf(download2, BinaryType())
download_udf_str = udf(download2, StringType())
download_udf_arr_bin = udf(download, ArrayType(BinaryType()))
download_udf_arr_str = udf(download, ArrayType(StringType()))

# assign new columns
df = df.withColumn('image_arr_bin', download_udf_arr_bin(df['imageurl']))
df = df.withColumn('image_str_str', download_udf_arr_str(df['imageurl']))
....
I tested four approaches (BinaryType, StringType, ArrayType(BinaryType()), ArrayType(StringType())), but the results are the same.
Is this normal or a bug?
How did you build your image dataset? The reading code doesn't seem to use spark-tensorflow-connector.
@manuzhang Here is the code that writes the dataframe to a TFRecord dataset with spark-tensorflow-connector.
df = df.repartition(repartition_size)
df.write.format("tfrecords").option("recordType", "Example").save(dst_path)
When reading the image dataset built by spark-tensorflow-connector, the image byte strings come back double UTF-8 encoded.
Can you load it back with df.read?
@seizeTheDayMin were you able to solve this? I am running into the same issue
@blake-varden Sorry for the late reply. Unfortunately, I couldn't solve this problem, so I didn't use Python-based Spark (PySpark) for this task and processed it with Scala-based Spark instead. There is no such issue in Scala-based Spark.
I noticed the issue is caused by a problem with schema inference. Assuming the data is stored in a column of BinaryType(), the data that is written can be correctly parsed using TensorFlow, which proves the encoding part is correct. However, if you load the data back, schema inference will decode the bytes into StringType(), and this causes the problem.
The solution is to specify the schema explicitly when reading. For example:
df = (spark.read
      .schema('jpeg binary, url string, label int')
      .format("tfrecords")
      .option("recordType", "SequenceExample")
      .load(dir_tfrecord))
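A quick way to verify (a sketch, using the column names from the example above):
# sanity check: the column should come back as binary, not string
df.printSchema()                              # 'jpeg' should be listed as binary
row = df.select('jpeg').first()
print(type(row['jpeg']), len(row['jpeg']))    # expect a bytearray with the original byte length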
@blake-varden @seizeTheDayMin I found a solution for this: you can return bytearray(image)
as a PySpark BinaryType in Python 2, or bytes(image, encoding="raw_unicode_escape")
in Python 3. Here image is a binary string like requests.get().content
https://stackoverflow.com/a/65874499/15948434
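For example, the download2 UDF from the original post could be adjusted along these lines (a sketch of the bytearray variant, which works in both Python 2 and 3; not tested against the original dataset):
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType
from PIL import Image
import numpy as np
import requests

def download2(url):
    res = requests.get(url, stream=True)
    im = Image.open(res.raw).resize((224, 224))
    # wrap the raw bytes in a bytearray so PySpark keeps them as BinaryType,
    # instead of treating them as a unicode string and re-encoding them as UTF-8
    return bytearray(np.array(im).tobytes())

download_udf_bin = udf(download2, BinaryType())
df = df.withColumn('image_bin', download_udf_bin(df['imageurl']))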