
image dataset on HDFS

Open sdikby opened this issue 9 years ago • 7 comments

Hi everybody,

I am curious about how CaffeOnSpark processes an image dataset on HDFS. I have gone through the source code a little, but I didn't find how. For example, how does it deal with the block-size problem? Or have I misunderstood everything, and the image dataset to train on (I suppose millions of images in some cases) is not saved on HDFS, with only the trained models being saved there? I hope someone here will give me their time and answer my questions; I would be very thankful :)

sdikby avatar Nov 26 '16 16:11 sdikby

Depending on your data format, the dataset is handled by the relevant class. For example, if you use a data frame to store your images, labels, etc., then the file below will read the dataset. Essentially, it uses Spark's data frame API. HDFS is natively supported by Spark, and CaffeOnSpark takes advantage of that native support.

https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/DataFrameSource.scala#L80-L107
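Conceptually, what a data-frame source like the linked `DataFrameSource` does can be sketched in a few lines of plain Python (no Spark required; the column names `data` and `label` here are illustrative, not CaffeOnSpark's required schema): select the relevant columns from the table and yield one (data, label) sample per row.

```python
def read_samples(table, data_col="data", label_col="label"):
    """Minimal stand-in for a data-frame source: pick two columns
    out of a table and yield one training sample per row."""
    for row in table:
        yield row[data_col], row[label_col]

# A tiny "table": each row is a dict of named columns.
table = [{"data": b"\x00\x01", "label": 7},
         {"data": b"\x02\x03", "label": 3}]
samples = list(read_samples(table))
```

In the real code path, Spark's data frame API does the column selection and the rows live in HDFS-backed partitions, so block size is handled by Spark/HDFS, not by CaffeOnSpark itself.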

junshi15 avatar Nov 26 '16 17:11 junshi15

@junshi15 Thank you for your reply. Unfortunately, I fail to understand the data workflow, from getting the image dataset to train on into HDFS, up to its use by CaffeOnSpark. I mean:

  1. How can CaffeOnSpark be used to get an image dataset from local/remote storage into HDFS, and how is it stored?
  2. How does the user provide the dataset path in HDFS to CaffeOnSpark (hdfs://path/to/image_dataset)? What is the use of the LMDB database in HDFS?

PS: I am totally new to Caffe/CaffeOnSpark.

sdikby avatar Nov 28 '16 12:11 sdikby

First, you prepare the dataset. An image dataset can be stored on HDFS in multiple formats (e.g. sequence file, data frame, LMDB; LMDB is discouraged for large datasets since it is not a distributed file format).

Then you tell CaffeOnSpark where the dataset is located and what format it is in, e.g.: https://github.com/yahoo/CaffeOnSpark/blob/master/data/lenet_dataframe_train_test.prototxt#L10-L12
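For illustration, the data layer in the prototxt points at the dataset location and names the reader class, roughly as in the fragment below. This is a paraphrase of the linked example, not a copy: the exact field names, the placement of `source_class`, and the class name are assumptions, so check the linked file for the authoritative form.

```
layer {
  name: "data"
  type: "MemoryData"
  # Assumed reader class; CaffeOnSpark picks the source implementation from this.
  source_class: "com.yahoo.ml.caffe.ImageDataFrame"
  memory_data_param {
    # Dataset location; an hdfs:// (or file://) URI, handled natively by Spark.
    source: "hdfs:///path/to/train_dataframe"
    batch_size: 64
  }
}
```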

junshi15 avatar Nov 28 '16 14:11 junshi15

Thank you once more, @junshi15. If I understood you well, the image-preparation part between HDFS and CaffeOnSpark is done manually? In other words, loading the image dataset into HDFS from local storage and converting it to a specific format.

Another question: if some images (JPG or PNG) are to be stored in a sequence file or data frame on HDFS, they will be merged into one file, right? So where will the metadata be stored? And the labels?

sdikby avatar Nov 28 '16 14:11 sdikby

Yes, you need to generate the dataset manually before training/testing. We provide some example tools: https://github.com/yahoo/CaffeOnSpark/tree/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/tools You can build your own conversion tools if those don't meet your requirements.

The best format, in my opinion, is the data frame, which is just a table: you can have columns like image, label, etc. You fill in the columns, generate the data frame file, and tell CaffeOnSpark where the file is located and which column is which.
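The "fill in the columns" step can be sketched in plain Python (a hedged illustration, no Spark required; the column names `image_name`, `data`, and `label` are hypothetical). In a real pipeline you would hand rows like these to Spark, e.g. via `spark.createDataFrame(...)`, and save the resulting data frame to HDFS.

```python
def build_rows(examples):
    """Turn (name, raw_bytes, label) triples into table rows,
    one dict of named columns per image."""
    return [{"image_name": name, "data": raw, "label": label}
            for name, raw, label in examples]

# Raw JPEG/PNG bytes go straight into the "data" column; the bytes
# here are placeholders, not a valid image.
rows = build_rows([("cat1.jpg", b"\xff\xd8\x00", "cat"),
                   ("dog1.jpg", b"\xff\xd8\x01", "dog")])
```

This also answers the metadata question above: the image name, the label, and any other metadata are just additional columns in the same table, stored alongside the pixel data.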

junshi15 avatar Nov 28 '16 17:11 junshi15

@junshi15 Sorry for the very late reply, but I am still curious about how CaffeOnSpark/Spark deals with image storage (read/write operations) on HDFS. It would be very nice of you to explain it to me. Is a data frame a persisted storage format, or is it created only during task processing and then exists no more? If so, what are the possible ways to store images, image metadata, and knowledge extracted from images on HDFS? I thank you for your time and help.

sdikby avatar Feb 01 '17 17:02 sdikby

CaffeOnSpark does not read individual image files (although Caffe does). The images need to be saved in a Spark-friendly format; the data frame is one of them. You can think of a data frame as a table: you can have one column named "image name", one column named "data", which holds the actual pixels or the raw JPEG bytes, and one column named "label", to indicate whether it is a "dog", a "cat", or whatever labels you have.

Each row of the table is a training or test example. Spark takes the table, partitions it, then feeds each partition to an individual CaffeNet for training or testing.
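The partitioning step can be sketched like this (plain Python, purely illustrative; Spark's actual partitioner is more sophisticated, but the effect is the same: every worker gets its own slice of rows to feed to its CaffeNet).

```python
def partition(rows, num_workers):
    """Round-robin split of table rows across workers, imitating how a
    data frame is divided into partitions, one consumer per partition."""
    parts = [[] for _ in range(num_workers)]
    for i, row in enumerate(rows):
        parts[i % num_workers].append(row)
    return parts

# Ten example rows (stand-ins for table rows) spread over three workers.
parts = partition(list(range(10)), 3)
```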

We have a tool that converts images to a data frame; you can use it as a template: https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/tools/Binary2DataFrame.scala

junshi15 avatar Feb 01 '17 22:02 junshi15