
Gradient calculation in paper

vb123er951 opened this issue 5 years ago · 11 comments

Hi, I have recently become interested in CSPNet and am reading the paper: https://arxiv.org/pdf/1911.11929.pdf. But I have a question about the gradient calculation on page 4. In the paper the gradients are calculated as

$w_1' = f(w_1, g_0)$
$w_2' = f(w_2, g_0, g_1)$
...
$w_k' = f(w_k, g_0, g_1, g_2, \ldots, g_{k-1})$

Shouldn't this part be calculated as follows?

$w_1' = f(w_1, g_0, g_1, g_2, \ldots, g_k)$
$w_2' = f(w_2, g_1, g_2, \ldots, g_k)$
...
$w_k' = f(w_k, g_k)$

I also want to confirm the definition of $g_i$: is it the partial derivative of the error with respect to the weights? That is,

I am very confused about this part; I hope you can help me.

vb123er951 avatar May 15 '20 09:05 vb123er951

[image]

WongKinYiu avatar May 15 '20 09:05 WongKinYiu

I am still confused about this part (what is $g_i$?). Does it mean: [image] [image] If so, why does the paper say "We can find that large amount of gradient information are reused for updating weights of different dense layers. This will result in different dense layers repeatedly learn copied gradient information."? Can you help me? Thank you.

baopmessi avatar Jan 12 '21 07:01 baopmessi

If [image] [image], then how is the gradient of the weights of layer 0 expressed? And what is the meaning of [image]?

Pcyslist avatar Oct 17 '21 09:10 Pcyslist

If you define [image], then you would have to define the $g_0$ of the ${k+1}$-th layer as follows: [image] [image]. So the $g_0$s in the red rectangle will not be the same thing. Your explanation of $g_0$ contradicts the repeated $g_0$ in your paper.

Pcyslist avatar Oct 17 '21 10:10 Pcyslist

Please notice that $g_{i}$ represents the gradient propagated to the $i^{th}$ layer, not the gradient generated from the $i^{th}$ layer.

The equation in the red rectangle contains only one timestamp of the full weight update, so it shows the case of the out-degrees of the $k$-th layer. The full weight-updating process accumulates the gradients over all timestamps.

It is too complicated to show the timestamps $t_{j}$ in one equation. If you want to add the gradient information of the ${k+1}$-th layer to this equation, that means adding gradient generated at the ${(t-1)}$-th timestamp. In that case, the $g$ terms have to carry timestamp annotations too, for example $g^{t}_{0}$ and $g^{t-1}_{0}$. To understand more details about the timestamps and the partitioning of gradients, you may want to see Figures 5 and 6 of the PRN paper. Edit: in the general case, you have to annotate every gradient with from-where, to-where, and its timestamp.

WongKinYiu avatar Oct 17 '21 10:10 WongKinYiu

Thanks for your reply, @WongKinYiu. As you said, the gradient of the ${k+1}$-th layer is generated at the ${k-1}$-th timestamp, which is the general rule of the back-propagation weight-update algorithm. So in equation 6, when calculating the gradient information generated by the $k$-th layer, we only need the generated gradients of layers $k+1$ to $k+n$ and the weight information of layers $1$ to $k-1$, but not the gradient information of layers $1$ to $k-1$, because those gradients $g_i$ have not yet been generated. So why do you update $w_k$ in the formula with $g_i$ ($1 \le i \le k-1$) that has not been generated yet? Maybe I mean you should replace $g_i$ with $w_i$. Your explanation of $g_i$ is that it is the gradient propagated to the $i$-th layer, but what does the update of $w_k$ have to do with $g_i$? Maybe $w_i$ would be OK?

Pcyslist avatar Oct 17 '21 14:10 Pcyslist

Please notice that $g_{i}$ represents the gradient propagated to the $i^{th}$ layer, not the gradient generated from the $i^{th}$ layer. That means in the equation, $g_{0}$, $g_{1}$, ... are all generated by the $k$-th layer at timestamp $t$, and then propagated to the $0$-th, $1$-th, ... layers. In your description, $g_{i}$ still means the gradient generated by the $i$-th layer, which is not the same as our definition.

At a specific timestamp $t$, the gradient propagates to all layers that have a shortcut connection to the current layer. Since DenseNet has shortcuts connecting to all previous layers, the gradient used to update the $k$-th layer will also propagate to all of the $0$-th, $1$-th, ..., ${k-1}$-th layers. And because the architecture contains concatenations, the equation becomes (1) and (2). From (1), you can see that the input of the $k$-th layer is the concatenation of the outputs of all previous layers. Obviously, the gradient for updating the weights of the $k$-th layer will propagate to all previous layers according to their channel dimensions.

Just take a glance at the figure and you can see which layers' weights $g_{0}$, $g_{1}$, ... are used to update. [image]
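The channel-wise split of the gradient through a concatenation can be sketched with a toy example (a minimal sketch, not from the paper's code; the names `w1`, `w2a`, `w2b` and the scalar "channels" are illustrative assumptions): a two-layer dense block where the gradient that updates layer 2 also supplies the pieces (the $g_0$, $g_1$ of this thread) that propagate back to the earlier features.

```python
# Toy 2-layer dense block with scalar "channels" and manual backprop.
# Layer 2 consumes concat([x0, x1]); with scalar channels that is just
# a weighted sum with one weight per concatenated channel.

def forward(x0, w1, w2a, w2b):
    x1 = w1 * x0                  # layer 1 output
    x2 = w2a * x0 + w2b * x1      # layer 2 on the concatenated input
    return x1, x2

def backward(x0, w1, w2a, w2b):
    x1, _ = forward(x0, w1, w2a, w2b)
    dL_dx2 = 1.0                  # take loss L = x2
    # Gradient of layer 2's concatenated input, split per channel:
    g0_piece = dL_dx2 * w2a       # part propagated to the x0 channel
    g1_piece = dL_dx2 * w2b       # part propagated to the x1 channel
    # Weight gradients for layer 2:
    dL_dw2a = dL_dx2 * x0
    dL_dw2b = dL_dx2 * x1
    # Layer 1's update is driven by g1_piece, i.e. by gradient that has
    # already flowed through layer 2 -- the "reused" gradient information.
    dL_dw1 = g1_piece * x0
    return g0_piece, g1_piece, dL_dw1, dL_dw2a, dL_dw2b
```

With concrete numbers, the layer 1 weight gradient computed from the propagated piece matches a finite-difference check of the loss, illustrating that the gradient arriving at each previous layer is just a channel slice of the gradient that updated the later layer.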

WongKinYiu avatar Oct 17 '21 15:10 WongKinYiu

@WongKinYiu Thanks very much for your patient reply. Good luck to you.

Pcyslist avatar Oct 17 '21 15:10 Pcyslist

@WongKinYiu Dear author, I still don't understand why we should use $g_{0}$ to update $w_1$. In your description, $g_{0}$ equals [image], but if we are going to update $w_1$, we should use [image] in order to calculate [image] using the chain rule. That is why I think there is no connection between [image] and [image]. Only when the variable $x_k$ changes can it affect the weights in previous layers, while the weights in later layers do nothing to previous ones.

And another question: [image] What do you mean by "truncated"?

NeoZng avatar Feb 06 '22 09:02 NeoZng

After reading the author's interpretation above, why do I still think the gradient [image] should be propagated to the $i$-th layer?

JianjianSha avatar Apr 13 '22 07:04 JianjianSha