Return (inv) root of KroneckerProductAddedDiagLinearOperator as lazy
This is a clone of https://github.com/cornellius-gp/gpytorch/pull/1430
Previously, _root_decomposition and _root_inv_decomposition returned the (inv) root from the eigendecomposition as a dense tensor, which can be inefficient. This now returns the root as a MatmulLazyTensor instead. E.g., a matrix-vector product of the root with some vector v is now implicitly computed as q_matrix @ (evals * v), an elementwise scaling followed by a single matmul, rather than (q_matrix @ diag(evals)) @ v, which can make a big difference since q_matrix is a Kronecker product.
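As a minimal sketch of the reordering in plain PyTorch (q_matrix, evals, and v are placeholder names, not the actual implementation):

```python
import torch

n = 6
# Stand-ins for the eigendecomposition pieces; in the real operator,
# q_matrix is a (lazy) Kronecker product of eigenvector factors.
q_matrix = torch.linalg.qr(torch.randn(n, n)).Q
evals = torch.rand(n) + 0.5
v = torch.randn(n)

# Dense path: materialize the root q_matrix @ diag(evals), then multiply.
out_dense = (q_matrix @ torch.diag(evals)) @ v

# Lazy path: scale v elementwise first, then apply a single matmul.
out_lazy = q_matrix @ (evals * v)

assert torch.allclose(out_dense, out_lazy, atol=1e-6)
```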
This can improve runtime, but more importantly it significantly reduces the memory footprint, since we don't need to instantiate the (inv) root, only its constituent components.
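A hedged sketch of how such a lazy root can be assembled, assuming linear_operator's MatmulLinearOperator and DiagLinearOperator (the actual diff may differ in details):

```python
from linear_operator.operators import DiagLinearOperator, MatmulLinearOperator

def lazy_root(q_matrix, evals):
    # q_matrix: lazy Kronecker product of eigenvector factors (never densified).
    # evals: eigenvalues of the sum, including the constant diagonal shift.
    # Represents q_matrix @ diag(sqrt(evals)) without forming the dense root.
    return MatmulLinearOperator(q_matrix, DiagLinearOperator(evals.clamp_min(0.0).sqrt()))
```

Only the Kronecker factors and the eigenvalue vector live in memory; the full root matrix is never instantiated.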
This also fixes an issue with KroneckerProductAddedDiagLinearOperator implicitly assuming the diagonal to be constant and returning incorrect results when it was not. The changes here make the operator fall back to the superclass implementation for non-constant diagonals.
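A minimal sketch of that guard (the class skeleton and the _diag_tensor attribute are assumptions, not the exact diff):

```python
from linear_operator.operators import AddedDiagLinearOperator, ConstantDiagLinearOperator

class KroneckerProductAddedDiagLinearOperator(AddedDiagLinearOperator):
    def _root_decomposition(self):
        # The eigendecomposition shortcut is only valid when the added diagonal
        # is a scalar multiple of the identity; for anything else, defer to the
        # generic superclass implementation.
        if not isinstance(self._diag_tensor, ConstantDiagLinearOperator):
            return super()._root_decomposition()
        ...  # constant-diagonal fast path (lazy root as sketched above)
```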
I am not sure why gradients aren't being returned properly when a ConstantDiagLinearOperator is used (hence the test failure).