[one-cmds] Define standard data format for `one-infer`

Open yunjayh opened this issue 2 years ago • 10 comments

What

Let's design a standard data format for I/O data, whose file type can be .npy, .h5, or .bin. Defining a standard format can mean fixing the file naming convention or the data hierarchy.

Why

one-infer is designed to run models for multiple backends through one wrapping binary, such as the tflite backend, the circle runtime, or even an npu backend. It is better to have a standard data format for the I/O of every backend, so that result values from one type of model can be compared with those from another.

yunjayh avatar Jun 09 '22 05:06 yunjayh

Here is my first suggestion.

Let's assume there is a model which has 4 tensors as input and 2 tensors for output.

.npy file naming convention (the convention for .bin files is the same except for the extension)

$ tree .
.
├── {input_filename_prefix}.0.npy
├── {input_filename_prefix}.1.npy
├── {input_filename_prefix}.2.npy
├── {input_filename_prefix}.3.npy
├── {output_filename_prefix}.0.npy
└── {output_filename_prefix}.1.npy
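
The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper name `tensor_filenames` and the prefixes `"input"` and `"output"` are hypothetical and not part of one-infer.

```python
from typing import List


def tensor_filenames(prefix: str, count: int, ext: str = "npy") -> List[str]:
    """Return one file name per tensor: '{prefix}.{index}.{ext}', index from 0."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return [f"{prefix}.{index}.{ext}" for index in range(count)]


# Model with 4 input tensors and 2 output tensors, as in the example above.
input_files = tensor_filenames("input", 4)
output_files = tensor_filenames("output", 2)

# The .bin convention differs only in the extension.
bin_files = tensor_filenames("input", 4, ext="bin")
```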

.h5 file data hierarchy example

$ some-cmd-show-h5-hierarchy io_data.h5
# GROUP "/"
# ㄴGROUP "input"
#   ㄴDATASET "0"
#     ㄴDATA ... [shape (1,299,299,3)]
#   ㄴDATASET "1"
#     ㄴDATA ... [shape (2,2)]
#   ㄴDATASET "2"
#     ㄴDATA ... [shape (20,20)]
#   ㄴDATASET "3"
#     ㄴDATA ... [shape (2,2)]
# ㄴGROUP "output"
#   ㄴDATASET "0"
#     ㄴDATA ... [shape (1)]
#   ㄴDATASET "1"
#     ㄴDATA ... [shape (1, 1000)]
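
As a rough sketch of the hierarchy above, the dataset paths can be derived from the tensor order alone. The function name `h5_layout` is hypothetical, and the code only builds a path-to-shape mapping with the standard library; actually writing the file (for example with an HDF5 library) is left out.

```python
from typing import Dict, Sequence, Tuple

Shape = Tuple[int, ...]


def h5_layout(input_shapes: Sequence[Shape],
              output_shapes: Sequence[Shape]) -> Dict[str, Shape]:
    """Map each dataset path ('/input/0', '/output/1', ...) to its shape."""
    layout: Dict[str, Shape] = {}
    for group, shapes in (("input", input_shapes), ("output", output_shapes)):
        for index, shape in enumerate(shapes):
            # Datasets are named by their index within the group.
            layout[f"/{group}/{index}"] = tuple(shape)
    return layout


# Shapes taken from the example hierarchy above.
layout = h5_layout(
    input_shapes=[(1, 299, 299, 3), (2, 2), (20, 20), (2, 2)],
    output_shapes=[(1,), (1, 1000)],
)
for path, shape in layout.items():
    print(path, shape)
```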

I have a few concerns with the above design.

  • It does not scale to multiple sets of inputs, for example in the case of a user who wants to run inference 1000 times for a model.
  • Before writing h5 data, the driver should check whether filename.h5 exists and, if it already exists, should append.

yunjayh avatar Jun 09 '22 05:06 yunjayh

The overall format looks nice :)

> In the case of a user who wants to run inference 1000 times for a model.

I don't see what the problem is. Maybe you want to store 1000 different inputs in one .h5 file?

> Before writing h5 data, the driver should check whether filename.h5 exists and, if it already exists, should append.

I think this is about data file generation, not about the format definition itself. Or is there something else?

seanshpark avatar Jun 09 '22 07:06 seanshpark

> I don't see what the problem is. Maybe you want to store 1000 different inputs in one .h5 file?

Exactly. I assumed some backend driver can run multiple inferences with a single command-line execution. But, as of now, it's not a problem at all, IMHO. Even if there is such a case, the user can run one-infer multiple times with different input file names. (I'm sorry if you cannot follow how I raised the problem and then solved it myself. If so, please let me know.)

> I think this is about data file generation, not about the format definition itself. Or is there something else?

Yes. To be clearer, this is not a problem but a caution for the data generation step. :D Thank you for fixing my words!

yunjayh avatar Jun 09 '22 07:06 yunjayh

> ... if it already exists, should append.

Although this is about data generation, why append rather than overwrite? Sharing a scenario may help.

seanshpark avatar Jun 09 '22 07:06 seanshpark

> Although this is about data generation, why append rather than overwrite? Sharing a scenario may help.

Ah! Sorry for the lack of explanation.

The word append was a little bit ambiguous. Imagine a scenario where a user runs an inference process and some data is going to be dumped to a data.h5 file.

If data.h5 already exists:
  (1) If the /output/** dataset is not assigned: append the output data (this is the case I called "append").
  (2) If the /output/** dataset is already occupied: overwrite the output data.
Else (data.h5 doesn't exist): create data.h5 and dump the output to /output/**.
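
The rule above can be written as a small decision function. This is a minimal sketch under assumptions: the name `plan_output_write` and the returned labels are hypothetical, and the existing file is modeled as a plain set of dataset paths rather than a real HDF5 file.

```python
from typing import Optional, Set


def plan_output_write(existing_datasets: Optional[Set[str]], output_path: str) -> str:
    """Decide how to dump one output dataset.

    existing_datasets is None when data.h5 does not exist yet; otherwise it
    holds the dataset paths already present in the file.
    """
    if existing_datasets is None:
        return "create"      # no file yet: make data.h5, then dump
    if output_path in existing_datasets:
        return "overwrite"   # case (2): /output/** is already occupied
    return "append"          # case (1): /output/** is not assigned yet


# File does not exist.
print(plan_output_write(None, "/output/0"))
# File holds only inputs, so the output is appended.
print(plan_output_write({"/input/0", "/input/1"}, "/output/0"))
# File already holds this output, so it is overwritten.
print(plan_output_write({"/input/0", "/output/0"}, "/output/0"))
```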

yunjayh avatar Jun 09 '22 08:06 yunjayh

I think the format itself looks OK to me. Maybe you want to get feedback from @hyunsik-yoon?

seanshpark avatar Jun 09 '22 08:06 seanshpark

One thing popped up. As the inputs and outputs are accessed by index ("0" for the first one), all the types of models we are going to execute should provide a way of accessing inputs/outputs by index.

seanshpark avatar Jun 09 '22 09:06 seanshpark

Then, with this format, let's move on to the data converter (between npy, h5, and bin).

> As the inputs and outputs are accessed by index ("0" for the first one), all the types of models we are going to execute should provide a way of accessing inputs/outputs by index.

That's right. I haven't considered that point yet.

yunjayh avatar Jun 10 '22 04:06 yunjayh

I think the goal of this issue has been achieved. Now it's time to discuss how to convert from one data format to another. I'll close this issue and open another one.

yunjayh avatar Jun 20 '22 03:06 yunjayh

The discussion about the format and hierarchy of h5 is done for now. But it hasn't been implemented yet, so I'll reopen this issue and keep it as work in progress. Sorry for the confusion. :sob:

yunjayh avatar Jun 20 '22 08:06 yunjayh