Enable querying of images with PXF
This feature allows querying of different image formats (jpg, png, etc.) from
external sources such as HDFS/HCFS into Greenplum as integer arrays. The user
can specify how many images to load into each row using the variable
`FILES_PER_FRAGMENT` in the DDL.
For working with MADlib it is favorable to load a large number of images into
each Greenplum tuple. For example, you can load 675 or so 256 x 256 images into
a Greenplum tuple, represented as a 675 x 256 x 256 x 3 integer array. To do
this, we needed streaming behavior to avoid loading all of the images into
memory at once. This commit introduces the new classes `StreamingField`,
`StreamingResolver`, `StreamingImageAccessor`, `StreamingImageResolver`, and
`StreamingImageReadBridge` to tackle this problem.
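To give a sense of scale for the example above, here is a small sketch that
computes the number of elements in a 675 x 256 x 256 x 3 array. The class and
method names are hypothetical, and the byte figure assumes each element is held
as a 4-byte Java `int`, which is only a rough proxy for the real footprint.

```java
// Hypothetical helper, not part of PXF: sizes an images x height x width x channels array.
final class ImageArraySize {
    // Multiplies the four dimensions, failing loudly on long overflow.
    static long elementCount(long images, long height, long width, long channels) {
        return Math.multiplyExact(
                Math.multiplyExact(images, height),
                Math.multiplyExact(width, channels));
    }

    public static void main(String[] args) {
        long elements = elementCount(675, 256, 256, 3);
        // Assumption: one 4-byte int per element.
        long bytes = Math.multiplyExact(elements, (long) Integer.BYTES);
        System.out.println(elements + " ints, " + bytes + " bytes");
        // prints "132710400 ints, 530841600 bytes"
    }
}
```

Under that assumption a single tuple is about 506 MiB, which is why holding
every image for a row in memory before writing it out is not practical.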
The basic approach is for the `StreamingField` to hold a reference to a
`StreamingResolver` type, which in turn holds a reference to an `Accessor`.
When the `StreamingField` is being formatted in the `BridgeOutputBuilder`, it
calls back to the accessor/resolver to read and resolve more images. This is
done using an iterator pattern (`next()` and `hasNext()`).
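The callback chain described above can be sketched roughly as follows. This is
a simplified illustration, not the code in this commit: every `Sketch*` class
is a hypothetical stand-in, and the "image" is just its path wrapped in braces
instead of decoded pixel data.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an accessor: hands out one image path at a time.
final class SketchAccessor {
    private final Iterator<String> paths;

    SketchAccessor(List<String> paths) { this.paths = paths.iterator(); }

    boolean hasNext() { return paths.hasNext(); }

    String next() { return paths.next(); }
}

// Hypothetical stand-in for a streaming resolver: reads from the accessor
// only when the caller asks for the next image.
final class SketchStreamingResolver implements Iterator<String> {
    private final SketchAccessor accessor;
    int resolved = 0; // how many images have been resolved so far

    SketchStreamingResolver(SketchAccessor accessor) { this.accessor = accessor; }

    @Override
    public boolean hasNext() { return accessor.hasNext(); }

    @Override
    public String next() {
        if (!accessor.hasNext()) throw new NoSuchElementException();
        resolved++;
        return "{" + accessor.next() + "}"; // placeholder for pixel data
    }
}

// Hypothetical stand-in for the field: it carries the resolver, not the data.
final class SketchStreamingField {
    final SketchStreamingResolver resolver;

    SketchStreamingField(SketchStreamingResolver resolver) { this.resolver = resolver; }
}

// Hypothetical stand-in for the output builder: formats the field by pulling
// one image at a time, so only the current image needs to be in memory.
final class SketchOutputBuilder {
    static void write(SketchStreamingField field, StringBuilder out) {
        out.append('{');
        boolean first = true;
        while (field.resolver.hasNext()) {
            if (!first) out.append(',');
            out.append(field.resolver.next());
            first = false;
        }
        out.append('}');
    }
}
```

For example, writing a field backed by the paths `a.jpg` and `b.jpg` appends
`{{a.jpg},{b.jpg}}` to the output, with each element resolved only when the
builder reaches it.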
When fragmenting a large number (1.8 million) of images from Places365, we saw
a memory spike, which led to `StreamingFragmenter`,
`StreamingHdfsMultiFileFragmenter`, and `StreamingFragmentsResponse` being
introduced. As with the classes described above, the
`StreamingFragmentsResponse` holds a reference to a `StreamingFragmenter` type
and can call back to fetch more fragments. The iterator pattern is also used
here.
The chunking is done by directory, so the user can spread the files across
different directories and provide the option `STREAM_FRAGMENTS=true` in the DDL
to reduce memory consumption during the fragmenter part of the query.
`StreamingHdfsMultiFileFragmenter` searches for any file recursively from the
user-provided path, but doesn't accept wildcard paths like `/path/*/*.jpg`.
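A minimal sketch of chunking by directory with an iterator is shown below. It
is an illustration only: `DirectoryChunkedIterator` is a hypothetical name, and
an in-memory map stands in for the filesystem listing that the real fragmenter
would perform against HDFS/HCFS.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch: yields file paths one directory at a time, so only the
// listing of the current directory is held while iterating.
final class DirectoryChunkedIterator implements Iterator<String> {
    // Assumption: this map stands in for listing a directory on the filesystem.
    private final Map<String, List<String>> filesByDirectory;
    private final Deque<String> pendingDirectories;
    private Iterator<String> current = Collections.emptyIterator();
    int directoriesListed = 0; // how many directories have been opened so far

    DirectoryChunkedIterator(Map<String, List<String>> filesByDirectory) {
        this.filesByDirectory = filesByDirectory;
        this.pendingDirectories = new ArrayDeque<>(filesByDirectory.keySet());
    }

    @Override
    public boolean hasNext() {
        // Move to the next directory only once the current one is exhausted;
        // empty directories are skipped.
        while (!current.hasNext() && !pendingDirectories.isEmpty()) {
            String directory = pendingDirectories.poll();
            current = filesByDirectory.get(directory).iterator();
            directoriesListed++;
        }
        return current.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }
}
```

With this shape, spreading files across more directories keeps each listing
small, which matches the motivation for chunking by directory.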
The user must also define the table with 3 columns of type `TEXT[]`, which hold
the full path to each image, the directory where the image is located (which
serves as a label for machine learning), and the filename of the original
image. Finally, the last column must be of type `INT[]` and will be of size
`FILES_PER_FRAGMENT` x 256 x 256 x 3. If `FILES_PER_FRAGMENT` is not provided,
it defaults to 1.
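As a rough illustration of how the three `TEXT[]` columns relate to a row,
the sketch below groups image paths into rows of at most `FILES_PER_FRAGMENT`
entries. `ImageRow` and `ImageRowGrouper` are hypothetical names. The sketch
assumes the directory column holds the name of the parent directory and that
rows are filled in listing order, neither of which is confirmed by this commit
message; the `INT[]` pixel column is omitted.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the three TEXT[] columns of one row.
final class ImageRow {
    final List<String> fullPaths = new ArrayList<>();
    final List<String> directories = new ArrayList<>();
    final List<String> names = new ArrayList<>();
}

// Hypothetical helper: packs up to filesPerFragment images into each row.
final class ImageRowGrouper {
    static List<ImageRow> group(List<String> imagePaths, int filesPerFragment) {
        if (filesPerFragment < 1) {
            throw new IllegalArgumentException("filesPerFragment must be at least 1");
        }
        List<ImageRow> rows = new ArrayList<>();
        ImageRow current = null;
        for (String imagePath : imagePaths) {
            // Start a new row when there is none yet or the current one is full.
            if (current == null || current.fullPaths.size() == filesPerFragment) {
                current = new ImageRow();
                rows.add(current);
            }
            Path path = Paths.get(imagePath);
            Path parent = path.getParent();
            Path directoryName = parent == null ? null : parent.getFileName();
            current.fullPaths.add(imagePath);
            // Assumption: the label is the parent directory's name ("" if none).
            current.directories.add(directoryName == null ? "" : directoryName.toString());
            current.names.add(path.getFileName().toString());
        }
        return rows;
    }
}
```

Grouping `/data/cat/a.jpg`, `/data/cat/b.jpg`, and `/data/dog/c.jpg` with a
limit of 2 gives two rows: the first with directories `cat, cat`, and the
second with the single directory `dog`.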
Co-authored-by: Nikhil Kak [email protected]
Co-authored-by: Alex Denissov [email protected]
Co-authored-by: Francisco Guerrero [email protected]