Does it support sequence file from pyspark?

Open twmht opened this issue 9 years ago • 1 comments

Hi,

Recently I used PySpark to write images to a sequence file.

I use scikit-image and NumPy to convert the image data to a byte array and back, but I failed to restore the image from the sequence file.

Here is how I write the image to the sequence file in Spark:

from PIL import Image
from io import BytesIO
import numpy as np
from skimage import io  # scikit-image; provides io.imread

def write():
    bg = io.imread(image_file_name)
    # check that the bytes can be restored to the image array
    np.frombuffer(bg.tobytes(), dtype=np.uint8).reshape(bg.shape)
    # the key encodes the file name and the array shape (rows, columns, channels)
    return [('image:%s-%d-%d-%d' % (filename[0], bg.shape[0], bg.shape[1], bg.shape[2]), bg.tobytes())]

But restoring the image from the sequence file fails:

    from hadoop.io import SequenceFile
    import numpy as np

    IMAGE_KEY = 'image:'  # assumed: matches the 'image:' prefix used in write()

    reader = SequenceFile.Reader('image.seq')

    key_class = reader.getKeyClass()
    value_class = reader.getValueClass()
    print(type(value_class))

    key = key_class()
    value = value_class()
    print(type(value))

    #reader.sync(4042)
    position = reader.getPosition()
    while reader.next(key, value):
        #  print '*' if reader.syncSeen() else ' ',
        #  print '[%6s] %6s %6s' % (position, key.toString(), value.toString())
        key_str = key.toString()
        if key_str.startswith(IMAGE_KEY):
            # the key stores bg.shape, i.e. (rows, columns, channels), as strings;
            # rsplit keeps hyphens that are part of the file name
            filename, height, width, channel = key_str[len(IMAGE_KEY):].rsplit('-', 3)
            # failed to convert to image
            np.frombuffer(value.getBytes(), dtype=np.uint8).reshape(int(height), int(width), int(channel))
        position = reader.getPosition()

    reader.close()

Here is the sequence file:

https://drive.google.com/file/d/0B18-oWPEXrIWMVpkME9RUFdCOEE/view?usp=sharing

Thanks for the help.

twmht, Sep 06 '16 06:09

It is trickier than that. This project only gives you the raw byte stream; the actual bytes are serialized with standard Java serialization. In the best case, you only have to skip the first 4 bytes to get at the BytesWritable's contents. In the worst case, you have to deserialize the value with https://github.com/tcalmant/python-javaobj
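
A minimal sketch of the best case, for illustration only. It assumes the value you get back is exactly Hadoop's BytesWritable wire form: a 4-byte big-endian length followed by that many payload bytes. Whether the reader hands you the bytes in this form, or PySpark has wrapped them further, is not confirmed here. The helper names decode_bytes_writable and restore_image are hypothetical, and the record is built by hand instead of being read from image.seq.

```python
import struct

import numpy as np


def decode_bytes_writable(raw):
    """Return the payload of a serialized BytesWritable.

    Assumes `raw` starts with a 4-byte big-endian length prefix.
    """
    (length,) = struct.unpack(">i", raw[:4])
    payload = raw[4:4 + length]
    if len(payload) != length:
        raise ValueError("expected %d payload bytes, got %d" % (length, len(payload)))
    return payload


def restore_image(key_str, raw, prefix="image:"):
    """Rebuild the array from a key like 'image:<name>-<rows>-<cols>-<channels>'."""
    # rsplit from the right so hyphens inside the file name are preserved
    filename, rows, cols, channels = key_str[len(prefix):].rsplit("-", 3)
    shape = (int(rows), int(cols), int(channels))  # the key holds strings
    pixels = np.frombuffer(decode_bytes_writable(raw), dtype=np.uint8)
    return filename, pixels.reshape(shape)


# Synthetic record standing in for one (key, value) pair from the file.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
body = image.tobytes()
# Two trailing bytes imitate a backing buffer that is larger than the payload.
raw = struct.pack(">i", len(body)) + body + b"\x00\x00"
name, restored = restore_image("image:my-photo.png-2-3-3", raw)
```

Slicing by the decoded length, instead of taking everything after the prefix, keeps any trailing buffer bytes out of the array, which would otherwise make the reshape fail.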

Alternatively, write the RDD with saveAsPickleFile() and read the result with https://github.com/src-d/sparkpickle

vmarkovtsev, Nov 08 '16 09:11