
Docker image for CPU

Open hadusam opened this issue 5 years ago • 5 comments

Introduction

We don't really need a GPU for the convert step, but the Docker image we currently provide needs a GPU driver to run. I think it would be better to provide a Docker image for CPU.

What is preventing this?

If we run convert with tensorflow-cpu, the following error occurs.

self = <tensorflow.python.client.session.Session object at 0x7f26eaf44d30>
fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f26f408abf8>
args = ({<tensorflow.python.pywrap_tensorflow_internal.TF_Output; proxy of <Swig Object of type 'TF_Output *' at 0x7f26ef166c...ywrap_tensorflow_internal.TF_Output; proxy of <Swig Object of type 'TF_Output *' at 0x7f26ef00be10> >], [], None, None)
message = "Node 'conv1/BatchNorm/FusedBatchNormV3' (type: '_FusedConv2D', num of outputs: 1) does not have output 1"
m = None

    def _do_call(self, fn, *args):
      try:
        return fn(*args)
      except errors.OpError as e:
        message = compat.as_text(e.message)
        m = BaseSession._NODEDEF_NAME_RE.search(message)
        node_def = None
        op = None
        if m is not None:
          node_name = m.group(3)
          try:
            op = self._graph.get_operation_by_name(node_name)
            node_def = op.node_def
          except KeyError:
            pass
        message = error_interpolation.interpolate(message, self._graph)
        if 'only supports NHWC tensor format' in message:
          message += ('\nA possible workaround: Try disabling Grappler optimizer'
                      '\nby modifying the config for creating the session eg.'
                      '\nsession_config.graph_options.rewrite_options.'
                      'disable_meta_optimizer = True')
>       raise type(e)(node_def, op, message)
E       tensorflow.python.framework.errors_impl.OutOfRangeError: Node 'conv1/BatchNorm/FusedBatchNormV3' (type: '_FusedConv2D', num of outputs: 1) does not have output 1

/usr/local/pyenv/versions/3.6.3/envs/python3.6/lib/python3.6/site-packages/tensorflow_core/python/client/session.py:1384: OutOfRangeError

See #106 for the details of this error. It happens in lmnet/export, in a feature that exists for debugging and testing. https://github.com/blue-oil/blueoil/blob/4ad07e2ea85b005188f1fa0db1cc1d0daefcc1d1/lmnet/executor/export.py#L112-L137

According to the comments in that code, this feature is only for testing and debugging, so I think we can avoid the error and provide a CPU version. The feature is already guarded by a flag (save_npy_for_debug): https://github.com/blue-oil/blueoil/blob/4ad07e2ea85b005188f1fa0db1cc1d0daefcc1d1/blueoil/cmd/convert.py#L160-L164

So the error is easy to avoid by changing the default value of save_npy_for_debug to False. https://github.com/blue-oil/blueoil/blob/4ad07e2ea85b005188f1fa0db1cc1d0daefcc1d1/blueoil/cmd/convert.py#L209

Tasks

If the above change is approved, I will do the following tasks:

  • [ ] Change the default value of save_npy_for_debug to False and make it configurable via a CLI option
  • [ ] Create Dockerfile for CPU and replace docker image for tests (run all tests on this image)
  • [ ] Add test for Dockerfile for GPU (run e2e test once)
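The first task could look roughly like the sketch below. It uses the standard-library argparse purely for illustration; the parser, the program name, and the exact option spelling are assumptions, and the real convert command may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for illustration; not the actual convert CLI."""
    parser = argparse.ArgumentParser(prog="convert")
    # BooleanOptionalAction (Python 3.9+) creates both --save-npy-for-debug
    # and --no-save-npy-for-debug from a single definition.
    parser.add_argument(
        "--save-npy-for-debug",
        action=argparse.BooleanOptionalAction,
        default=False,  # proposed new default: do not dump .npy files
        help="Dump op outputs as .npy files for debugging and testing.",
    )
    return parser


# With no arguments the flag is off; passing --save-npy-for-debug turns it on.
args = build_parser().parse_args(["--save-npy-for-debug"])
print(args.save_npy_for_debug)
```

With a default of False, a CPU-only run would skip the failing export path unless the user explicitly asks for the debug dump.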

hadusam avatar Jan 23 '20 08:01 hadusam

As discussed in #752, we usually need the save_npy_for_debug option. So, to provide a CPU version of Blueoil, we need to solve the BatchNorm/FusedBatchNormV3 error.

This error seems to be a TensorFlow bug, so I created an issue. https://github.com/tensorflow/tensorflow/issues/36456

hadusam avatar Feb 04 '20 08:02 hadusam

@iizukak

But we often want npy files to run lm_*.so files. The default value of True is good for me.

I have a question about this. We have lm_*.elf files and these need npy files.

$ ./lm_x86.elf 
Error: The number of arguments is invalid
Use: ./lm_x86.elf <.npy debug input file> <.npy debug expected output file>

In my understanding, lm_*.elf needs only two npy files: one for input (e.g. 000_images_placeholder:0.npy) and one for output (e.g. XXX_output:0.npy). But currently we dump the outputs of all ops in the network. Are these really necessary? In inference especially, I think the FusedBatchNormV3:1-5 outputs are meaningless. See TensorFlow Official Doc. If they are not necessary, I would like to apply a temporary fix that excludes those specific operation outputs (FusedBatchNormV3:1-5) from the saved npy files to work around this issue.
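A minimal sketch of that temporary fix, assuming the export step has a list of tensor names of the form `op_name:output_index` before it dumps them. The function name and the name-based matching are assumptions for illustration; the actual export code may select tensors differently, for example by op type.

```python
import re

# Matches outputs 1-5 of an op named FusedBatchNormV3 (optionally with a
# TensorFlow uniquifying suffix such as _1), e.g.
# "conv1/BatchNorm/FusedBatchNormV3:3". Output 0 is the normalized tensor
# and is kept.
_SKIPPED_OUTPUTS = re.compile(r"(^|/)FusedBatchNormV3(_\d+)?:[1-5]$")


def tensors_to_dump(tensor_names):
    """Return the tensor names that should still be saved as .npy files."""
    return [name for name in tensor_names if not _SKIPPED_OUTPUTS.search(name)]


names = [
    "images_placeholder:0",
    "conv1/BatchNorm/FusedBatchNormV3:0",
    "conv1/BatchNorm/FusedBatchNormV3:1",
    "conv1/BatchNorm/FusedBatchNormV3:5",
    "output:0",
]
print(tensors_to_dump(names))
```

Only the auxiliary batch-norm outputs are dropped, so the input, output, and intermediate layer dumps remain available.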

hadusam avatar Feb 05 '20 02:02 hadusam

@hadusam I'm not sure whether we need all ops. Some of the intermediate layers are used by the elf file. But we may not need FusedBatchNormV3:1-5. It's worth trying to remove them.

iizukak avatar Feb 05 '20 05:02 iizukak

After #789, we can run convert without a GPU 🎉

BTW, I found that our Docker image can run without a GPU by setting CUDA_VISIBLE_DEVICES=-1 or by removing the nvidia runtime from the docker run options. 😮 So I think we already have a Docker image that runs on CPU without a GPU environment. 😄
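As a rough illustration of the first approach, the sketch below builds an environment with CUDA_VISIBLE_DEVICES=-1 that could be passed to a child process. The helper name is hypothetical, and the command to launch is left out on purpose.

```python
import os


def cpu_only_env(base_env):
    """Return a copy of base_env with all CUDA devices hidden."""
    env = dict(base_env)  # copy so the caller's environment is not modified
    # -1 is not a valid device index, so CUDA applications see no GPUs.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


env = cpu_only_env(os.environ)
print(env["CUDA_VISIBLE_DEVICES"])
```

The resulting mapping can be handed to `subprocess.run(..., env=env)`. The variable has to be in place before the process initializes CUDA, which is why it is set on the child's environment rather than after startup.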

I will add some documentation on how to run without a GPU, and after that I will close this issue.

hadusam avatar Feb 06 '20 02:02 hadusam

@hadusam

I tried to add the documentation about CUDA_VISIBLE_DEVICES=-1. But now I think it might be better to set CUDA_VISIBLE_DEVICES=-1 as the default in the Docker image.

Then the convert command would automatically use the CPU without users having to set CUDA_VISIBLE_DEVICES. In exchange, users would always have to specify the GPU number for training. But that feels more natural to me.

What do you think about this?

tk26eng avatar Jun 17 '20 04:06 tk26eng