BlockLSTM output values differ from Python
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): 2.3.1
- Python version: 3.8
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with `python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"`.
Describe the current behavior
The output values of BlockLSTM in TF Java differ from those produced by the equivalent Python code.
Describe the expected behavior
I expected BlockLSTM in Java to produce the same output values as in Python.
Code to reproduce the issue
I am using TF Java 0.2.0 and created a simple spike with hardcoded values in Java like this:
public static void main(String[] args) {
EagerSession session = EagerSession.create();
Ops tf = Ops.create(session);
Scope scope = new Scope(session);
// Input sequence x: shape [seq_len=2, batch=1, input_size=2]
float[][][] rawInputSequence = {{{0.1f, 0.2f}}, {{0.3f, 0.4f}}};
Operand<TFloat32> inputSequence = tf.constant(rawInputSequence);
int cellSize = 3;
int[] cellShape = {1, cellSize};
Operand<TInt32> cellDims = Constant.vectorOf(scope, cellShape);
Operand<TInt64> seqLenMax = tf.array(2L);
// Initial cell state (cs_prev) and hidden state (h_prev): shape [batch=1, cell_size]
Operand<TFloat32> initialCellState = Zeros.create(scope, cellDims, TFloat32.DTYPE);
Operand<TFloat32> initialHiddenState = Zeros.create(scope, cellDims, TFloat32.DTYPE);
// Bias b: one value per unit for each of the four gates, shape [cell_size * 4]
long[] biasShape = {cellSize * 4};
Operand<TInt64> biasDim = Constant.vectorOf(scope, biasShape);
Operand<TFloat32> bias = Zeros.create(scope, biasDim, TFloat32.DTYPE);
// Weight matrix w: shape [input_size + cell_size, cell_size * 4] = [5, 12]
FloatNdArray matrix = StdArrays.ndCopyOf(new float[][]{
{1.6652163f, 1.366376f, 0.7786316f, 0.9834321f, 1.6551187f, -0.6363001f, -0.4229284f, 0.63195646f, 0.6605189f, -0.6906152f, 3.1515226f, 1.970373f},
{1.9458166f, 0.9790728f, 0.7476161f, -1.6813406f, -0.75150734f, 0.13104685f, 0.004470979f, 0.009482844f, -1.1464607f, 0.5036645f, 1.3567412f, 0.71478313f},
{0.5393334f, -0.6881541f, 1.5186735f, 1.3431606f, -0.61521095f, -2.1862414f, 1.2603592f, -0.33593372f, -0.48804748f, -0.34496853f, -0.8777565f, 0.9202126f},
{1.3439888f, 0.32253885f, -0.7401764f, 0.10057431f, -1.3759913f, 0.08382488f, 0.56741005f, 2.207029f, -0.0066946335f, -0.8636334f, 1.9623716f, 0.14416508f},
{-0.925145f, 0.2283957f, 0.79638815f, 0.2288384f, 0.7052175f, -0.18524477f, -2.308545f, 1.2240901f, 2.014674f, 0.6235778f, -0.15852839f, 0.17711076f}
});
Operand<TFloat32> weightMatrix = Constant.tensorOf(scope, matrix);
// Peephole weights wci/wcf/wco: shape [cell_size]; the same vector is reused for all three
FloatNdArray gatesMatrix = StdArrays.ndCopyOf(new float[]{1.6652163f, 1.366376f, 0.7786316f});
Operand<TFloat32> weightGates = Constant.tensorOf(scope, gatesMatrix);
BlockLSTM<TFloat32> blockLSTM = BlockLSTM.create(scope, seqLenMax, inputSequence, initialCellState,
    initialHiddenState, weightMatrix, weightGates, weightGates, weightGates, bias);
}
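For reference, the op's seven outputs (i, cs, f, o, ci, co, h) correspond to the Java values listed under "Other info / logs" below. A minimal sketch of printing them from inside the same main method could look like this; the asTensor() accessor is an assumption and may need to be swapped for the equivalent accessor in TF Java 0.2.0:

```java
// Sketch: print every BlockLSTM output; i()/cs()/f()/o()/ci()/co()/h() are the
// generated accessors for the op outputs. asTensor() is assumed here and may
// differ in older TF Java releases.
java.util.Map<String, Operand<TFloat32>> outputs = new java.util.LinkedHashMap<>();
outputs.put("Input Gate", blockLSTM.i());
outputs.put("Cell State", blockLSTM.cs());
outputs.put("Forget State", blockLSTM.f());
outputs.put("Output Gate", blockLSTM.o());
outputs.put("Cell Input", blockLSTM.ci());
outputs.put("Cell Output", blockLSTM.co());
outputs.put("Hidden Output", blockLSTM.h());
outputs.forEach((name, output) -> {
  StringBuilder values = new StringBuilder(name + ": ");
  output.asTensor().scalars().forEach(scalar -> values.append(scalar.getFloat()).append(" "));
  System.out.println(values);
});
```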
For comparison, I created the equivalent code in Python:
input_sequence = tf.constant([[[0.1, 0.2]], [[0.3, 0.4]]])
cell_size = 3
ini_cell_state = tf.zeros(shape=[1, cell_size])
ini_hidden_state = tf.zeros(shape=[1, cell_size])
bias = tf.zeros(shape=[cell_size * 4])
seq_len_max = tf.constant([2], dtype="int64")
weight_matrix = tf.constant([
[1.6652163, 1.366376, 0.7786316, 0.9834321, 1.6551187, -0.6363001, -0.4229284, 0.63195646, 0.6605189, -0.6906152, 3.1515226, 1.970373],
[1.9458166, 0.9790728, 0.7476161, -1.6813406, -0.75150734, 0.13104685, 0.004470979, 0.009482844, -1.1464607, 0.5036645, 1.3567412, 0.71478313],
[0.5393334, -0.6881541, 1.5186735, 1.3431606, -0.61521095, -2.1862414, 1.2603592, -0.33593372, -0.48804748, -0.34496853, -0.8777565, 0.9202126],
[1.3439888, 0.32253885, -0.7401764, 0.10057431, -1.3759913, 0.08382488, 0.56741005, 2.207029, -0.0066946335, -0.8636334, 1.9623716, 0.14416508],
[-0.925145, 0.2283957, 0.79638815, 0.2288384, 0.7052175, -0.18524477, -2.308545, 1.2240901, 2.014674, 0.6235778, -0.15852839, 0.17711076]
])
weight_gates = tf.constant([1.6652163, 1.366376, 0.7786316])
block_lstm = tf.raw_ops.BlockLSTM(seq_len_max=seq_len_max, x=input_sequence, cs_prev=ini_cell_state,
h_prev=ini_hidden_state, w=weight_matrix,
wci=weight_gates, wcf=weight_gates,
wco=weight_gates, b=bias)
Other info / logs
Output values for Java:
Input Gate: [0.6354535, 0.5823559, 0.5566029, 0.7944415, 0.6913817, 0.61125195]
Cell State: [-0.026291894, 0.037853386, -0.09006144, -0.016073283, 0.14818707, -0.25373507]
Forget State: [0.4407978, 0.50380254, 0.49064595, 0.4001113, 0.5333729, 0.4756381]
Output Gate: [0.50791717, 0.6425618, 0.58418906, 0.48629615, 0.825764, 0.70244145]
Cell Input: [-0.041375007, 0.06500043, -0.16180556, -0.0069905715, 0.18513232, -0.34502697]
Cell Output: [-0.026285835, 0.03783531, -0.08981872, -0.016071897, 0.1471118, -0.24842645]
Hidden Output: [-0.013351027, 0.024311526, -0.052471116, -0.007815702, 0.12147963, -0.17450504]
Output values for Python:
Input Gate: [[[0.6354535 0.5823559 0.5566029]]
[[0.7784116 0.7010059 0.5999128]]]
Cell State: [[[-0.14840049 0.00885719 -0.02081871]]
[[-0.45026863 0.16232333 0.0025551 ]]]
Forget State: [[[0.7228417 0.7436625 0.69778234]]
[[0.6925149 0.7713927 0.67951703]]]
Output Gate: [[[0.50791717 0.6425618 0.58418906]]
[[0.501899 0.8273453 0.69143474]]]
Cell Input: [[[-0.23353477 0.01520923 -0.03740317]]
[[-0.44642073 0.2218112 0.02784033]]]
Cell Output: [[[-0.1473206 0.00885695 -0.02081571]]
[[-0.4221198 0.1609125 0.0025551 ]]]
Hidden Output: [[[-0.07482666 0.00569114 -0.01216031]]
[[-0.2118615 0.13313021 0.00176668]]]
@danilojsl I faced a similar problem when building layers out of a few ops (BlockLSTM is in reality a complex op). As I understand it, this could be related to the determinism problem in TensorFlow described here.
Try repeating your experiment many times to see how stable the results are between runs.
If you share the complete code together with its outputs, I will run it on my machine and post my results too. If it is a determinism issue, we will most likely get different results; if I obtain the same result, the problem could be in different settings (maybe something is missing) or in the Python implementation.
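A rough sketch of that repeat-and-compare check, assuming the spike above is wrapped into a hypothetical helper runBlockLstmHiddenOutput() that returns the hidden-state values as a float[], could be:

```java
// Hypothetical determinism check: run the same eager computation several times
// and compare the hidden-state output against the first run.
float[] reference = runBlockLstmHiddenOutput(); // hypothetical wrapper around the spike above
for (int run = 1; run <= 10; run++) {
  float[] current = runBlockLstmHiddenOutput();
  boolean identical = java.util.Arrays.equals(reference, current);
  System.out.println("run " + run + " identical to run 0: " + identical);
}
```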
There's a determinism problem in the TF C API with how it builds the gradient operations too. I've reported it upstream but no one has looked at it yet.
Great, Adam, we are thinking in the same direction here! Could you post a link here to the issue about the TF C API and gradients?
https://github.com/tensorflow/tensorflow/issues/48855
The C API gradient issue is pretty bad. You can see it in this test here - https://github.com/tensorflow/java/blob/master/tensorflow-framework/src/test/java/org/tensorflow/framework/optimizers/GradientDescentTest.java#L121, where it gives inconsistent answers running the same tiny gradient computation 10 times. The test is a little weird due to #317, but that's a separate issue.