Support tensorflow debugger

Open theedge456 opened this issue 7 years ago • 7 comments

To debug my model, I thought I could connect my program to TensorBoard to decipher this cryptic message:

TensorFlowException TF_INVALID_ARGUMENT "In[0] is not a matrix\n\t [[Node: MatMul_70 = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device=\"/job:localhost/replica:0/task:0/device:CPU:0\"](Const_41, Mean_69)]]"

I could not find an equivalent of the Python function:

tf_debug.TensorBoardDebugWrapperSession("machine:7000")

Is it implemented? If not, is it in the pipeline?

Fabien

theedge456 avatar Sep 05 '18 16:09 theedge456

There isn't any support for the tensorflow debugger right now. I'm not sure what work is required to support it.

A short-term workaround might be to use asGraphDef to get the graph as a proto, then write it to a file and load it into tensorboard so that you can more easily inspect the graph to figure out what part of your code that MatMul is coming from.
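A minimal sketch of that workaround, assuming the model is available as a Build action and that proto-lens's showMessage is used to render the GraphDef in text format. dumpGraphDef is a hypothetical helper name, not something the library provides:

```haskell
import Data.ProtoLens (showMessage)
import qualified TensorFlow.Core as TF

-- | Render the graph described by a Build action as a text-format GraphDef
-- and write it to the given file. 'dumpGraphDef' is a hypothetical helper,
-- not part of the tensorflow package.
dumpGraphDef :: FilePath -> TF.Build a -> IO ()
dumpGraphDef path model = writeFile path (showMessage (TF.asGraphDef model))
```

Calling dumpGraphDef "graph.pbtxt" on your own Build action should produce a file in which you can search for the node name from the error (MatMul_70 here) and look at its inputs.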

For the cryptic error messages: we should prioritize https://github.com/tensorflow/haskell/issues/24 so that these look like nice compiler errors that point to the line of code causing the issue.
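As for what the message itself says: MatMul only accepts rank-2 inputs, so "In[0] is not a matrix" means the first argument (Const_41 here) does not have exactly two dimensions. The check can be sketched in plain Haskell, with shapes modelled as lists of Int. The Shape alias and matMulShape are illustrative names, not part of the tensorflow package:

```haskell
-- Shapes modelled as plain lists of dimension sizes (illustrative only).
type Shape = [Int]

-- | Result shape of a plain 2-D matrix product, or a description of why
-- the operands are rejected.
matMulShape :: Shape -> Shape -> Either String Shape
matMulShape [m, k] [k', n]
  | k == k'   = Right [m, n]
  | otherwise = Left ("inner dimensions differ: " ++ show k ++ " vs. " ++ show k')
-- First operand is rank 2, so the second one must be the problem.
matMulShape [_, _] _ = Left "In[1] is not a matrix"
-- Anything else means the first operand is not rank 2.
matMulShape _ _ = Left "In[0] is not a matrix"
```

With this sketch, matMulShape [500] [500, 784] gives Left "In[0] is not a matrix", which is the situation you get when a vector or scalar constant is passed where a matrix is expected.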

fkm3 avatar Sep 05 '18 17:09 fkm3

Actually, instead of asGraphDef, you can use logGraph to write to a tensorboard log file directly: https://tensorflow.github.io/haskell/haddock/tensorflow-logging-0.2.0.0/TensorFlow-Logging.html#v:logGraph

Just make sure to do that before you try to run the graph, otherwise you'll get the tensorflow runtime exception first.

fkm3 avatar Sep 05 '18 17:09 fkm3

logGraph lets me start TensorBoard. Unfortunately, the graph loading process hangs at about 30% with the message: Data: Parsing graph.pbtxt

I made a little progress but I don't understand the following message: TensorFlowException TF_INVALID_ARGUMENT "Incompatible shapes: [784,500] vs. [500,784]\n\t [[Node: Mul_43 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Relu_40, Transpose_42)]]"

Does it mean that the dimensions in the node Mul_43 are incorrect? Thanks for the effort anyway.

theedge456 avatar Sep 06 '18 15:09 theedge456

Hmm. You may need to make sure the withEventWriter call exits before the error happens, otherwise it may not have flushed the file write yet and so the graph.pbtxt will be incomplete.

TF.withEventWriter "/path/to/logs" $ \eventWriter ->
    TF.logGraph eventWriter graph

-- Other code that actually runs the graph.

> Does it mean that the dimensions in the node Mul_43 are incorrect?

That does seem to be what it is saying, but the dimensions look compatible to me... If you have any code you can share, I can take a look.
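For context: [784,500] and [500,784] would be compatible for a matrix product (MatMul), but Mul is element-wise and follows NumPy-style broadcasting, under which those two shapes do not match. A minimal sketch of that rule in plain Haskell, with shapes modelled as lists of Int (Shape and broadcastShapes are illustrative names, not part of the tensorflow package):

```haskell
-- Shapes modelled as plain lists of dimension sizes (illustrative only).
type Shape = [Int]

-- | Broadcast two shapes for an element-wise op. Dimensions are aligned
-- from the right; each pair must be equal or contain a 1.
-- Returns Nothing when the shapes are incompatible.
broadcastShapes :: Shape -> Shape -> Maybe Shape
broadcastShapes xs ys = reverse <$> go (reverse xs) (reverse ys)
  where
    -- When one shape runs out, the remaining leading dimensions carry over.
    go [] bs = Just bs
    go as [] = Just as
    go (a : as) (b : bs)
      | a == b    = (a :) <$> go as bs
      | a == 1    = (b :) <$> go as bs
      | b == 1    = (a :) <$> go as bs
      | otherwise = Nothing
```

Here broadcastShapes [784, 500] [500, 784] is Nothing, while broadcastShapes [784, 500] [784, 500] is Just [784, 500]. Since one operand of Mul_43 is Transpose_42, a transpose that should not be there (or one that is missing on the other operand) is a plausible cause, but that is only a guess without seeing the code.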

fkm3 avatar Sep 06 '18 17:09 fkm3

code.tar.gz

I tried to remove all the unnecessary code from the file. The cabal project is built in a sandbox. The error is:

TensorFlowException TF_INVALID_ARGUMENT "Incompatible shapes: [500,784] vs. [784,500]\n\t [[Node: Mul_7 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_inputToto/XTiti_0_0, ReadVariableOp_6)]]"

theedge456 avatar Sep 07 '18 14:09 theedge456

I had to make a few edits to get the code to compile, e.g. I got this error:

.../src/RBM.hs:117:45: error:
    Variable not in scope: h0 :: TFT.Tensor v0 t0
    |
117 |         TFL.scalarSummary (pack "update_w") h0 -- update_w
    |                                             ^^

After renaming h_sampleProbArg to h0 and adding a Main module, I was able to build. I couldn't reproduce the error, though; it ran fine for me.

fkm3 avatar Sep 11 '18 01:09 fkm3

I switched to the Python version of the code, as it runs flawlessly. Thanks for your support anyway.

theedge456 avatar Sep 11 '18 15:09 theedge456