Bug in tf_agents.bandits.policies.linalg.conjugate_gradient?
Hello,
I called conjugate_gradient from tf_agents.bandits.policies.linalg with several batch sizes, always using the same example: b_mat consists of batch_size identical columns. For each batch size, conjugate_gradient returns a different result. That looks wrong to me. Since every column of b_mat is the same, every column of the solution should be the same, and the solution should not change with the batch size. This also affects the predicted rewards computed in _predict_mean_reward_and_variance.
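For reference, here is an independent NumPy sketch of batched conjugate gradient (not the tf_agents code) together with the experiment described above. b_mat holds batch_size copies of one example, and each result is compared with np.linalg.solve. The sketch assumes a_mat is symmetric positive definite, which standard CG requires. The test matrix (features.T @ features + identity) is a made-up stand-in for a regularized design matrix, and batched_cg is a hypothetical name. Because alpha, beta, and the convergence test are all computed per column, the columns cannot influence each other, so the answer should not depend on batch_size. One hypothesis worth checking, not verified against the tf_agents source, is whether the library's loop uses a single batch-wide stopping condition. That would make the iteration count, and therefore the result, depend on how many columns b_mat has. Comparisons use np.allclose because matrix-multiply kernels may round differently for different batch widths, so bit-for-bit equality is not expected even from a correct solver.

```python
import numpy as np


def batched_cg(a_mat, b_mat, tol=1e-10, max_iter=None):
    """Solve a_mat @ x = b_mat for every column of b_mat with conjugate gradient.

    Assumes a_mat is symmetric positive definite. Step sizes (alpha, beta) and
    the convergence test are computed per column, so columns never interact.
    """
    n = a_mat.shape[0]
    max_iter = n if max_iter is None else max_iter
    x = np.zeros_like(b_mat, dtype=float)
    r = b_mat - a_mat @ x           # residual, one column per right-hand side
    p = r.copy()                    # search directions
    rs_old = np.sum(r * r, axis=0)  # squared residual norm of each column
    for _ in range(max_iter):
        active = np.sqrt(rs_old) >= tol  # per-column test, not a batch-wide sum
        if not active.any():
            break
        a_p = a_mat @ p
        p_a_p = np.sum(p * a_p, axis=0)
        # Converged columns take a zero step so they stay frozen.
        alpha = np.where(active, rs_old / np.where(active, p_a_p, 1.0), 0.0)
        x = x + alpha * p
        r = r - alpha * a_p
        rs_new = np.sum(r * r, axis=0)
        beta = np.where(active, rs_new / np.where(active, rs_old, 1.0), 0.0)
        p = r + beta * p
        rs_old = rs_new
    return x


rng = np.random.default_rng(0)
d = 5
features = rng.normal(size=(20, d))
# Assumed SPD test matrix, a made-up stand-in for a regularized design matrix.
a_mat = features.T @ features + np.eye(d)
b_col = rng.normal(size=(d, 1))
reference = np.linalg.solve(a_mat, b_col)

for batch_size in (1, 4, 16):
    b_mat = np.tile(b_col, (1, batch_size))  # batch_size copies of one example
    x = batched_cg(a_mat, b_mat)
    print(batch_size, np.allclose(x, reference))
```

Running the same loop against the tf_agents function (with TensorFlow tensors in place of the NumPy arrays) is one way to see whether its output drifts with batch_size while this per-column version does not.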
I then replaced the conjugate_gradient call with tf.matmul(tf.linalg.inv(a_mat), b_mat) and got the expected result: all columns of the solution are identical, and the solution is the same for every batch size.
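The inverse-based workaround can be checked with a similar harness. This sketch uses NumPy as a stand-in for TensorFlow (np.linalg.inv in place of tf.linalg.inv). solve_via_inverse and columns_match are hypothetical helpers for illustration, and the SPD test matrix is made up. It also compares against np.linalg.solve, which solves the system through a factorization instead of forming the inverse explicitly. That is usually the numerically preferred way to compute inv(A) @ B.

```python
import numpy as np


def solve_via_inverse(a_mat, b_mat):
    """NumPy analogue of tf.matmul(tf.linalg.inv(a_mat), b_mat)."""
    return np.linalg.inv(a_mat) @ b_mat


def columns_match(x, rtol=1e-10, atol=1e-12):
    """True if every column of x is numerically equal to the first column."""
    return np.allclose(x, x[:, :1], rtol=rtol, atol=atol)


rng = np.random.default_rng(1)
d = 4
features = rng.normal(size=(30, d))
a_mat = features.T @ features + np.eye(d)  # made-up symmetric positive definite matrix
b_col = rng.normal(size=(d, 1))

for batch_size in (1, 2, 16):
    b_mat = np.tile(b_col, (1, batch_size))  # batch_size copies of one example
    x_inv = solve_via_inverse(a_mat, b_mat)
    x_solve = np.linalg.solve(a_mat, b_mat)  # factorizes instead of inverting
    print(batch_size, columns_match(x_inv), np.allclose(x_inv, x_solve))
```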
Could you check whether this is a bug? And if it is, why not simply use tf.matmul(tf.linalg.inv(a_mat), b_mat) instead of reinventing the wheel here?
Thanks.