Docker image for CPU
Introduction
We don't really need a GPU for the convert part, but the current docker image we provide needs a GPU driver to run. I think it is better to provide a docker image for CPU.
What is preventing this?
If we run convert with tensorflow-cpu, the following error occurs:
```
self = <tensorflow.python.client.session.Session object at 0x7f26eaf44d30>
fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f26f408abf8>
args = ({<tensorflow.python.pywrap_tensorflow_internal.TF_Output; proxy of <Swig Object of type 'TF_Output *' at 0x7f26ef166c...ywrap_tensorflow_internal.TF_Output; proxy of <Swig Object of type 'TF_Output *' at 0x7f26ef00be10> >], [], None, None)
message = "Node 'conv1/BatchNorm/FusedBatchNormV3' (type: '_FusedConv2D', num of outputs: 1) does not have output 1"
m = None

    def _do_call(self, fn, *args):
      try:
        return fn(*args)
      except errors.OpError as e:
        message = compat.as_text(e.message)
        m = BaseSession._NODEDEF_NAME_RE.search(message)
        node_def = None
        op = None
        if m is not None:
          node_name = m.group(3)
          try:
            op = self._graph.get_operation_by_name(node_name)
            node_def = op.node_def
          except KeyError:
            pass
        message = error_interpolation.interpolate(message, self._graph)
        if 'only supports NHWC tensor format' in message:
          message += ('\nA possible workaround: Try disabling Grappler optimizer'
                      '\nby modifying the config for creating the session eg.'
                      '\nsession_config.graph_options.rewrite_options.'
                      'disable_meta_optimizer = True')
>       raise type(e)(node_def, op, message)
E       tensorflow.python.framework.errors_impl.OutOfRangeError: Node 'conv1/BatchNorm/FusedBatchNormV3' (type: '_FusedConv2D', num of outputs: 1) does not have output 1

/usr/local/pyenv/versions/3.6.3/envs/python3.6/lib/python3.6/site-packages/tensorflow_core/python/client/session.py:1384: OutOfRangeError
```
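For reference, the workaround hinted at in the error message above can be tried like this. This is a minimal sketch for the TF 1.x API; whether disabling the Grappler meta optimizer (which performs rewrites such as the Conv2D + BatchNorm fusion into _FusedConv2D) actually avoids this particular error on CPU is an assumption, not something verified here.

```python
import tensorflow as tf

# Build a session config with the Grappler meta optimizer disabled,
# as suggested by the error message above (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.rewrite_options.disable_meta_optimizer = True

# Graphs run in this session are executed without Grappler rewrites
# such as the _FusedConv2D fusion (assumed, not verified here).
with tf.Session(config=config) as sess:
    pass  # run the export/convert graph here
```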
The details of this error can be seen in #106.
This error happens in lmnet/export, and this feature is for debugging and testing:
https://github.com/blue-oil/blueoil/blob/4ad07e2ea85b005188f1fa0db1cc1d0daefcc1d1/lmnet/executor/export.py#L112-L137
According to the comments, this is for testing or debugging. So I think we can avoid this error and provide a CPU version. To avoid this error, we already gate this behavior behind a flag (save_npy_for_debug):
https://github.com/blue-oil/blueoil/blob/4ad07e2ea85b005188f1fa0db1cc1d0daefcc1d1/blueoil/cmd/convert.py#L160-L164
So it's easy to change the default value of save_npy_for_debug to False:
https://github.com/blue-oil/blueoil/blob/4ad07e2ea85b005188f1fa0db1cc1d0daefcc1d1/blueoil/cmd/convert.py#L209
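For illustration, the proposed change could look something like the sketch below. It assumes the convert command is wired up with click; the option name comes from this issue, but the actual signature in convert.py is not reproduced here.

```python
import click

# Hypothetical sketch of the proposed CLI change; the real convert command
# in blueoil/cmd/convert.py takes more options than shown here.
@click.command()
@click.option(
    "--save-npy-for-debug/--no-save-npy-for-debug",
    default=False,  # proposed new default: skip the debug npy dump
    help="Dump npy files of intermediate tensors for debugging lm_*.elf.",
)
def convert(save_npy_for_debug):
    click.echo("save_npy_for_debug={}".format(save_npy_for_debug))

if __name__ == "__main__":
    convert()
```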
Tasks
If the above change is approved, I will do the following tasks:
- [ ] Change the default value of save_npy_for_debug to False and make it configurable via a CLI option
- [ ] Create a Dockerfile for CPU and replace the docker image used for tests (run all tests on this image)
- [ ] Add a test for the GPU Dockerfile (run the e2e test once)
As discussed in #752, we usually need the save_npy_for_debug option. So, to provide a CPU version of Blueoil, we need to solve that error about BatchNorm/FusedBatchNormV3.
This error seems to be a TensorFlow bug, so I created an issue: https://github.com/tensorflow/tensorflow/issues/36456
@iizukak
But we often want npy files to run the lm_*.so files, so the default value of True is good for me.
I have a question about this. We have lm_*.elf files, and these need npy files:

```
$ ./lm_x86.elf
Error: The number of arguments is invalid
Use: ./lm_x86.elf <.npy debug input file> <.npy debug expected output file>
```
In my understanding, we need only two npy files for lm_*.elf: one for the input (e.g. 000_images_placeholder:0.npy) and one for the output (e.g. XXX_output:0.npy). But currently, we dump the outputs of all ops in the network. Are these really necessary?
And especially in inference, the outputs FusedBatchNormV3:1-5 are meaningless, I think. See the TensorFlow official doc.
If they are not necessary, I would like to apply a temporary fix that removes those specified outputs (FusedBatchNormV3:1-5) from the npy dump to escape this issue.
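A rough sketch of what such a temporary fix might look like, assuming the export code collects a list of tf.Tensor objects before dumping them; the regex and helper name here are hypothetical, not taken from export.py:

```python
import re

# Hypothetical helper: skip the auxiliary outputs of FusedBatchNormV3
# (output slots 1-5, which carry batch statistics and reserved space
# rather than activations) when collecting tensors for the npy dump.
_SKIP_PATTERN = re.compile(r"FusedBatchNormV3(_\d+)?:[1-5]$")

def filter_debug_outputs(tensors):
    """Drop tensors whose npy dumps would be meaningless for debugging."""
    return [t for t in tensors if not _SKIP_PATTERN.search(t.name)]
```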
@hadusam
I'm not sure whether we need all ops or not; some of the middle layers are used by the elf file. But we may not need FusedBatchNormV3:1-5, so it's worth trying to remove them.
After #789, we can run convert without a GPU 🎉
BTW, I found that our docker image can run without a GPU by setting CUDA_VISIBLE_DEVICES=-1 or by removing the nvidia runtime from the docker run options. 😮
So I think we already have a docker image that can run in a CPU-only environment. 😄
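As a quick sanity check that this works, TensorFlow can be asked which devices it sees. This is a minimal, Blueoil-agnostic check; note that CUDA_VISIBLE_DEVICES has to be set before TensorFlow is imported:

```python
import os

# Hide all GPUs; this must happen before TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

# Expect an empty list: no GPU devices are visible, so ops fall back to CPU.
print(tf.config.experimental.list_physical_devices("GPU"))
```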
I will add some documentation about how to run without a GPU, and after that I will close this issue.
@hadusam
I tried to add the documentation about CUDA_VISIBLE_DEVICES=-1. But now I am thinking that it might be better to set CUDA_VISIBLE_DEVICES=-1 as the default in our Docker image. Then the convert command would automatically use the CPU without setting CUDA_VISIBLE_DEVICES. In exchange, users would always have to specify the GPU number for training. But that feels more natural to me. What do you think?