Hadoop
Does it support sequence files written from pyspark?
Hi,
Recently I used pyspark to write images to a sequence file. I used scikit-image and numpy to convert the image data to a bytearray and back, but I failed to restore the images from the sequence file.
Here is how I write the image data to the sequence file in Spark:
import numpy as np
from skimage import io

def write(image_file_name):
    bg = io.imread(image_file_name)
    # sanity check: the raw bytes can be reshaped back into the original image array
    np.fromstring(bg.tobytes(), dtype=np.uint8).reshape((bg.shape[0], bg.shape[1], bg.shape[2]))
    # key encodes the file name and the array shape, value holds the raw pixel bytes
    return [('image:%s-%d-%d-%d' % (image_file_name, bg.shape[0], bg.shape[1], bg.shape[2]), bg.tobytes())]
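The pairs returned by write() are then saved with PySpark, roughly like this (a sketch: the real job is larger, 'some_image.png' and the output path are placeholders, and the value is wrapped in a bytearray so that it is stored as BytesWritable):

# sketch: write the (key, pixel-bytes) pairs with PySpark's saveAsSequenceFile
pairs = [(k, bytearray(v)) for k, v in write('some_image.png')]
sc.parallelize(pairs).saveAsSequenceFile('image.seq')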
But it fails when I try to restore the images from the sequence file:
import numpy as np
from hadoop.io import SequenceFile

IMAGE_KEY = 'image:'

reader = SequenceFile.Reader('image.seq')
key_class = reader.getKeyClass()
value_class = reader.getValueClass()
print type(value_class)

key = key_class()
value = value_class()
print type(value)

# reader.sync(4042)
position = reader.getPosition()
while reader.next(key, value):
    # print '*' if reader.syncSeen() else ' ',
    # print '[%6s] %6s %6s' % (position, key.toString(), value.toString())
    key_str = key.toString()
    if key_str.startswith(IMAGE_KEY):
        # the key was written as 'image:<name>-<height>-<width>-<channels>'
        filename, height, width, channel = key_str[len(IMAGE_KEY):].split('-')
        # this is the step that fails to restore the image
        np.fromstring(value.getBytes(), dtype=np.uint8).reshape(int(height), int(width), int(channel))
    position = reader.getPosition()
reader.close()
Here is the sequence file:
https://drive.google.com/file/d/0B18-oWPEXrIWMVpkME9RUFdCOEE/view?usp=sharing
Thanks for the help.
It is trickier than that. This project only gives you the raw byte stream, and the actual bytes are serialized with standard Java serialization. In the best case, you only have to skip a 4-byte offset to read the BytesWritable contents. In the worst case, you have to apply https://github.com/tcalmant/python-javaobj
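A minimal sketch of the best case, assuming raw holds the serialized BytesWritable record returned by the reader and that height, width and channel were parsed from the key as above:

import struct
import numpy as np

# sketch: BytesWritable is serialized as a 4-byte big-endian length followed by the payload
length = struct.unpack('>i', raw[:4])[0]
payload = raw[4:4 + length]
img = np.frombuffer(payload, dtype=np.uint8).reshape(int(height), int(width), int(channel))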
Alternatively, try saveAsPickleFile() and read the result back with https://github.com/src-d/sparkpickle
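A rough sketch of that route, assuming the write() function above and sparkpickle's load_gen() generator (check the sparkpickle README for the exact API; the output path and part file name here are hypothetical):

# Spark side: store the (key, raw pixel bytes) pairs as a pickle file instead of a sequence file
sc.parallelize(write('some_image.png')).saveAsPickleFile('images_pickle')

# Reader side, without Spark
import numpy as np
import sparkpickle

with open('images_pickle/part-00000', 'rb') as f:
    for key, raw in sparkpickle.load_gen(f):
        name, h, w, c = key[len('image:'):].split('-')
        img = np.frombuffer(raw, dtype=np.uint8).reshape(int(h), int(w), int(c))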