I think that the advantage value here should be based on the old actor:
target_v = reward + args.gamma * self.critic_net(next_state)
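
For reference, here is a minimal sketch of how the target value and advantage could be computed while keeping the value estimates fixed for the update (no gradient flows through them). The names `reward`, `state`, `next_state`, `args.gamma`, and `self.critic_net` follow the snippet above; everything else is an assumption, not the repository's actual code:

```python
import torch

# Sketch only: evaluate the critic under no_grad so the value estimates
# used for the target and the advantage are treated as fixed ("old")
# during this policy update.
with torch.no_grad():
    # Bootstrapped TD target: r + gamma * V(s')
    target_v = reward + args.gamma * self.critic_net(next_state)
    # Advantage: TD target minus the value of the current state
    advantage = target_v - self.critic_net(state)
```

The detached `advantage` can then be plugged into the clipped surrogate loss, while the critic is trained separately against `target_v`.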