image array byte string encoding/decoding problem
I built a dataset in a Spark environment using spark-tensorflow-connector and read it in a stand-alone Python training script.
The dataset contains an image byte string alongside several other features, but the value of the image binary string feature is not what I expect.
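For context, the batch values shown below come from a tf.data pipeline roughly like the following (a simplified sketch, not my exact script; the feature spec just mirrors the column names used in this issue):
import tensorflow as tf

feature_spec = {
    'imageurl': tf.io.FixedLenFeature([], tf.string),
    'image_bin': tf.io.FixedLenFeature([], tf.string),
}
files = tf.io.gfile.glob(dst_path + '/part-*')   # dst_path = connector output directory
ds = (tf.data.TFRecordDataset(files)
      .map(lambda x: tf.io.parse_single_example(x, feature_spec))
      .batch(32))
batch = next(iter(ds))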
expected
res = requests.get(batch['imageurl'][0].numpy(), stream=True)
im = Image.open(res.raw).resize([224,224])
im_np = np.array(im)
len(im.tobytes()), im_np.tobytes()[:50]
> (150528,
'\x83\x94\xa4\x8f\x94\x9a\x87\x88\x8c\x8d\x98\x9e\x7f\x80\x84~\x8e\x8ee`f\x9d\x9e\xa0\x8b\x92\xa2\xa3\x98\xa8\x9f\x9d\xa8\xa4\xa6\xb5\xc0\xb2\xb1\x88\x84\x85\xff\xff\xff\xff\xff\xff\xff\xff')
value in tfrecord
# in eager execution mode
batch['image_bin'][0].numpy()[:50]
> '\xc2\x83\xc2\x94\xc2\xa4\xc2\x8f\xc2\x94\xc2\x9a\xc2\x87\xc2\x88\xc2\x8c\xc2\x8d\xc2\x98\xc2\x9e\x7f\xc2\x80\xc2\x84~\xc2\x8e\xc2\x8ee`f\xc2\x9d\xc2\x9e\xc2\xa0\xc2\x8b\xc2\x92\xc2\xa2\xc2'
I found that the image binary value in the tfrecord is encoded in UTF-8.
len(batch['image_bin'][0].numpy().decode('utf-8')), batch['image_bin'][0].numpy().decode('utf-8')[:50]
> (150528,
u'\x83\x94\xa4\x8f\x94\x9a\x87\x88\x8c\x8d\x98\x9e\x7f\x80\x84~\x8e\x8ee`f\x9d\x9e\xa0\x8b\x92\xa2\xa3\x98\xa8\x9f\x9d\xa8\xa4\xa6\xb5\xc0\xb2\xb1\x88\x84\x85\xff\xff\xff\xff\xff\xff\xff\xff')
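Since every decoded code point is below 256, the original bytes can apparently be recovered on the reading side by re-encoding with latin-1 (a workaround sketch, not a real fix):
# workaround sketch: undo the unwanted UTF-8 encoding after parsing
raw = batch['image_bin'][0].numpy().decode('utf-8').encode('latin-1')
im_np = np.frombuffer(raw, dtype=np.uint8).reshape(224, 224, 3)   # 150528 = 224*224*3 bytes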
Below is my PySpark code.
import requests
import numpy as np
from PIL import Image
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, BinaryType, StringType

# define functions
def download(url):
    res = requests.get(url, stream=True)
    im = Image.open(res.raw).resize((224, 224))
    return [np.array(im)[i].tobytes() for i in range(3)]

def download2(url):
    res = requests.get(url, stream=True)
    im = Image.open(res.raw).resize((224, 224))
    return np.array(im).tobytes()

# udfs
download_udf_bin = udf(download2, BinaryType())
download_udf_str = udf(download2, StringType())
download_udf_arr_bin = udf(download, ArrayType(BinaryType()))
download_udf_arr_str = udf(download, ArrayType(StringType()))

# assign new columns
df = df.withColumn('image_arr_bin', download_udf_arr_bin(df['imageurl']))
df = df.withColumn('image_str_str', download_udf_arr_str(df['imageurl']))
....
I tested four approaches (BinaryType, StringType, ArrayType(BinaryType()), ArrayType(StringType())), but the results are the same.
Is this normal or a bug?
How did you build your image dataset? The reading code doesn't seem to use spark-tensorflow-connector.
@manuzhang Here is the code that writes the dataframe to a TFRecord dataset with spark-tensorflow-connector.
df = df.repartition(repartition_size)
df.write.format("tfrecords").option("recordType", "Example").save(dst_path)
When reading the image dataset built by spark-tensorflow-connector, the image byte strings come back double UTF-8 encoded.
Can you load it back with df.read?
@seizeTheDayMin were you able to solve this? I am running into the same issue
@blake-varden Sorry for the late reply. Unfortunately, I couldn't solve this problem, so I didn't use Python-based Spark (PySpark) for this task and processed it with Scala-based Spark instead. There is no such issue in Scala-based Spark.
I noticed the issue is caused by a problem with schema inference. Assuming the data is stored in a column of BinaryType(), the data that is written can be correctly parsed using TensorFlow, which proves the encoding part is correct. However, if you load the data back, schema inference will decode the bytes into StringType(), and this causes the problem.
The solution is to specify the schema explicitly when reading. For example:
df = (spark.read
      .schema('jpeg binary, url string, label int')
      .format("tfrecords")
      .option("recordType", "SequenceExample")
      .load(dir_tfrecord))
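A quick way to verify (a sketch, using the column names from the example above):
# sanity check: the column should come back as binary, not string
df.printSchema()                              # 'jpeg' should be listed as binary
row = df.select('jpeg').first()
print(type(row['jpeg']), len(row['jpeg']))    # expect a bytearray with the original byte length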
@blake-varden @seizeTheDayMin I found a solution for this: you can return bytearray(image)
as a PySpark BinaryType in Python 2, or bytes(image, encoding="raw_unicode_escape")
in Python 3. Here image is a binary string like requests.get().content
https://stackoverflow.com/a/65874499/15948434
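For example, the download2 UDF from the original post could be adjusted along these lines (a sketch of the bytearray variant, which works in both Python 2 and 3; not tested against the original dataset):
from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType
from PIL import Image
import numpy as np
import requests

def download2(url):
    res = requests.get(url, stream=True)
    im = Image.open(res.raw).resize((224, 224))
    # wrap the raw bytes in a bytearray so PySpark keeps them as BinaryType,
    # instead of treating them as a unicode string and re-encoding them as UTF-8
    return bytearray(np.array(im).tobytes())

download_udf_bin = udf(download2, BinaryType())
df = df.withColumn('image_bin', download_udf_bin(df['imageurl']))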