
Enable querying of images with PXF

Open oliverralbertini opened this issue 5 years ago • 0 comments

This feature allows querying of different image formats (jpg, png, etc.) from external sources such as HDFS/HCFS into Greenplum as integer arrays. The user can specify how many images to load into each row using a variable FILES_PER_FRAGMENT in the DDL.

When working with MADlib, it is favorable to load a large number of images into each Greenplum tuple. For example, you can load roughly 675 images of 256 x 256 pixels into a single tuple, represented as a 675 x 256 x 256 x 3 integer array. Doing this required streaming behavior so that not all images are held in memory at once. This commit introduces the new classes StreamingField, StreamingResolver, StreamingImageAccessor, StreamingImageResolver, and StreamingImageReadBridge to tackle this problem.

The basic approach is for the StreamingField to hold a reference to a StreamingResolver type, which in turn holds a reference to an Accessor. When the StreamingField is being formatted in the BridgeOutputBuilder, it calls back to the accessor/resolver to read and resolve more images. This is done using an iterator pattern (next() and hasNext()).
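The callback relationship described above can be illustrated with a minimal sketch. This is not the PXF implementation: LazyResolver and StreamingFieldSketch are hypothetical names, the source iterator merely stands in for an accessor, and the resolve function stands in for decoding one image. The point is only that the formatter pulls one item at a time through hasNext() and next(), so nothing is resolved until it is about to be written.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical stand-in for a streaming resolver: resolves one item per next() call. */
class LazyResolver<S, T> implements Iterator<T> {
    private final Iterator<S> source;      // plays the role of the accessor
    private final Function<S, T> resolve;  // plays the role of decoding a single image
    private int resolvedCount = 0;         // how many items have been resolved so far

    LazyResolver(Iterator<S> source, Function<S, T> resolve) {
        this.source = source;
        this.resolve = resolve;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public T next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        resolvedCount++;
        return resolve.apply(source.next());
    }

    int resolvedCount() {
        return resolvedCount;
    }
}

/** Hypothetical formatter: writes items as they are resolved, never holding them all. */
class StreamingFieldSketch {
    static <T> String format(Iterator<T> items) {
        StringBuilder out = new StringBuilder("{");
        while (items.hasNext()) {
            out.append(items.next());      // calls back into the resolver here
            if (items.hasNext()) {
                out.append(',');
            }
        }
        return out.append('}').toString();
    }
}
```

In this sketch the output is accumulated in a StringBuilder for simplicity; a real bridge would presumably write each resolved item to the response stream instead, so that only one item is in memory at a time.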

When fragmenting a large number of images (1.8 million, from places365) we saw a memory spike, which led to the introduction of StreamingFragmenter, StreamingHdfsMultiFileFragmenter, and StreamingFragmentsResponse. As with StreamingField above, the StreamingFragmentsResponse holds a reference to a StreamingFragmenter type and can call back to it to fetch more fragments. The iterator pattern is also used here.

The chunking is done by directory, so the user can spread the files across different directories and provide the option STREAM_FRAGMENTS=true in the DDL to reduce memory consumption during the fragmenter part of the query. StreamingHdfsMultiFileFragmenter recursively searches for files under the user-provided path, but does not accept wildcard paths such as /path/*/*.jpg.
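Chunking by directory can be sketched as an iterator that lists one directory per step. This is an illustration only, under loud assumptions: DirectoryChunkIterator is a hypothetical name, and it walks the local filesystem with java.nio.file rather than HDFS/HCFS, which is what the actual fragmenter targets. At any moment it holds only the pending directories and the files of a single directory, rather than every file under the root.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Stream;

/** Hypothetical sketch: yields the files of one directory per next() call. */
class DirectoryChunkIterator implements Iterator<List<Path>> {
    private final Deque<Path> pendingDirs = new ArrayDeque<>(); // directories not yet listed
    private List<Path> nextChunk;                               // null until computed

    DirectoryChunkIterator(Path root) {
        pendingDirs.push(root);
    }

    @Override
    public boolean hasNext() {
        // Keep listing directories until one with files is found or none remain.
        while (nextChunk == null && !pendingDirs.isEmpty()) {
            List<Path> files = listOneDirectory(pendingDirs.pop());
            if (!files.isEmpty()) {
                nextChunk = files;
            }
        }
        return nextChunk != null;
    }

    @Override
    public List<Path> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        List<Path> chunk = nextChunk;
        nextChunk = null;
        return chunk;
    }

    /** Lists a single directory: files are returned, subdirectories are queued for later. */
    private List<Path> listOneDirectory(Path dir) {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.sorted().forEach(p -> {
                if (Files.isDirectory(p)) {
                    pendingDirs.push(p);
                } else {
                    files.add(p);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

With this shape, peak memory tracks the largest single directory rather than the whole tree, which is consistent with the advice to spread files across directories.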

The user must also define the table with three columns of type TEXT[], which hold the full path to each image, the directory where the image is located (this serves as a label for machine learning), and the filename of the original image. The fourth and last column must be of type INT[] and will have dimensions FILES_PER_FRAGMENT x 256 x 256 x 3. If FILES_PER_FRAGMENT is not provided, it defaults to 1.

Co-authored-by: Nikhil Kak [email protected]
Co-authored-by: Alex Denissov [email protected]
Co-authored-by: Francisco Guerrero [email protected]

oliverralbertini, Jan 03 '20 17:01