darts
Approximate Architecture Gradient
I have a few questions about the section "Approximate Architecture Gradient" in the paper:
- Why does evaluating the finite difference require only two forward passes for the weights and two backward passes for α, and why does this reduce the complexity from O(|α||w|) to O(|α| + |w|)?
- Looking at equation 7, we have a second-order partial derivative which is computationally expensive to compute. To solve this, the finite difference method is used. <-- How is the second-order partial derivative related to the finite difference method?
- We also note that when momentum is enabled for weight optimisation, the one-step unrolled learning objective in equation 6 is modified accordingly and all of our analysis still applies. <-- How is momentum directly related to the need to apply the chain rule to equation 6?
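For context on the first two questions, here is a minimal NumPy sketch of the finite-difference trick from equation 8 of the paper: the mixed second-order term ∇²_{α,w} L_train(w, α) · v (with v = ∇_{w'} L_val(w', α)) is approximated by evaluating ∇_α L_train at two perturbed weight points w ± ε·v, which costs only two extra forward/backward passes instead of forming the full Hessian. The quadratic toy losses below are my own assumption purely so the gradients are closed-form; they are not the paper's losses.

```python
import numpy as np

# Toy smooth losses standing in for the real train/val losses (assumption):
#   L_train(w, a) = 0.5*||w||^2 + a @ w      L_val(w, a) = 0.5*||w - a||^2
def grad_w_train(w, a):   # ∇_w L_train = w + a
    return w + a

def grad_a_train(w, a):   # ∇_α L_train = w
    return w

def grad_w_val(w, a):     # ∇_w L_val = w - a
    return w - a

def hessian_vector_fd(w, a, v, eps=1e-2):
    """Central finite difference for ∇²_{α,w} L_train(w, α) · v (eq. 8).

    Only two evaluations of ∇_α L_train at perturbed weights are needed,
    i.e. O(|α| + |w|) work rather than the O(|α||w|) cost of the full
    mixed Hessian."""
    w_plus, w_minus = w + eps * v, w - eps * v
    return (grad_a_train(w_plus, a) - grad_a_train(w_minus, a)) / (2 * eps)

rng = np.random.default_rng(0)
w = rng.standard_normal(4)
a = rng.standard_normal(4)
xi = 0.1                                   # inner learning rate ξ

w_prime = w - xi * grad_w_train(w, a)      # one-step unrolled weights (eq. 6)
v = grad_w_val(w_prime, a)                 # v = ∇_{w'} L_val(w', α)

# For this toy L_train the mixed Hessian ∇²_{α,w} L_train is the identity,
# so the exact Hessian-vector product is simply v.
approx = hessian_vector_fd(w, a, v)
assert np.allclose(approx, v, atol=1e-6)
```

The second correction term in equation 7, −ξ ∇²_{α,w} L_train · ∇_{w'} L_val, is then assembled from this approximation, which is where the "two forward passes for the weights and two backward passes for α" accounting comes from.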