
Gradient calculation in paper

vb123er951 opened this issue 5 years ago · 11 comments

Hi, I have recently become interested in CSPNet and am reading the paper: https://arxiv.org/pdf/1911.11929.pdf. But I have a question about the gradient calculation on page 4. In the paper the gradients are calculated as

$w_1' = f(w_1, g_0)$
$w_2' = f(w_2, g_0, g_1)$
...
$w_k' = f(w_k, g_0, g_1, g_2, \ldots, g_{k-1})$

Shouldn't this part be calculated as follows?

$w_1' = f(w_1, g_0, g_1, g_2, \ldots, g_k)$
$w_2' = f(w_2, g_1, g_2, \ldots, g_k)$
...
$w_k' = f(w_k, g_k)$

I also want to confirm the definition of $g_i$: is it the partial derivative of the error with respect to the weights? That is,

I am very confused about this part; I hope you can help me.

vb123er951 avatar May 15 '20 09:05 vb123er951

[image]

WongKinYiu avatar May 15 '20 09:05 WongKinYiu

I am still confused about this part (what is $g_i$?). Does it mean: [image] [image] If so, why does the paper say "We can find that large amount of gradient information are reused for updating weights of different dense layers. This will result in different dense layers repeatedly learn copied gradient information."? Can you help me? Thank you.

baopmessi avatar Jan 12 '21 07:01 baopmessi

If [image] [image], then how is the gradient of the weights of layer 0 expressed? And what is the meaning of [image]?

Pcyslist avatar Oct 17 '21 09:10 Pcyslist

If you define [image], then you would have to define the $g_0$ of the ${k+1}$-th layer as follows: [image] [image]. So the $g_0$s in the red rectangle will not be the same thing. Your explanation of $g_0$ contradicts the repeated $g_0$ in your paper.

Pcyslist avatar Oct 17 '21 10:10 Pcyslist

Please notice that $g_{i}$ represents the gradient propagated to the $i^{th}$ layer, not the gradient generated from the $i^{th}$ layer.

The equation in the red rectangle contains only one timestamp of the full weight update, so it shows the case of the out-degrees of the $k$-th layer. The full weight-updating process accumulates the gradients over all timestamps.

It is too complicated to show the timestamps $t_{j}$ in one equation. If you want to add the gradient information of the ${k+1}$-th layer to this equation, that means adding gradient generated at the ${(t-1)}$-th timestamp. In that case, the $g$ terms have to carry timestamp annotations too, for example $g^{t}_{0}$ and $g^{t-1}_{0}$. To understand more details about the timestamps and the partitioning of gradients, you may want to see Figures 5 and 6 of the PRN paper. Edit: in the general case, you have to annotate every gradient with from-where, to-where, and its timestamp.

WongKinYiu avatar Oct 17 '21 10:10 WongKinYiu

Thanks for your reply, @WongKinYiu. As you said, the gradient of the ${k+1}$-th layer is generated at the ${k-1}$-th timestamp, which is the general rule of the back-propagation weight-update algorithm. So in equation 6, when calculating the gradient information generated by the $k$-th layer, we only need the generated gradients of layers $k+1$ to $k+n$ and the weight information of layers $1$ to $k-1$, but not the gradient information of layers $1$ to $k-1$, because those gradients $g_i$ have not yet been generated. So why do you update $w_k$ in the formula with $g_i$ ($1 \le i \le k-1$) that has not been generated yet? Maybe I mean you should replace $g_i$ with $w_i$. Your explanation of $g_i$ is that it is the gradient propagated to the $i$-th layer, but what does the update of $w_k$ have to do with $g_i$? Maybe $w_i$ would be OK?

Pcyslist avatar Oct 17 '21 14:10 Pcyslist

Please notice that $g_{i}$ represents the gradient propagated to the $i^{th}$ layer, not the gradient generated from the $i^{th}$ layer. That means in the equation, $g_{0}$, $g_{1}$, ... are all generated by the $k$-th layer at timestamp $t$, and then propagated to the $0$-th, $1$-th, ... layers. In your description, $g_{i}$ still means the gradient generated by the $i$-th layer, which is not the same as our definition.

At a specific timestamp $t$, the gradient propagates to all layers that have a shortcut connection to the current layer. Since DenseNet has shortcuts connecting to all previous layers, the gradient used to update the $k$-th layer will also propagate to all of the $0$-th, $1$-th, ..., ${k-1}$-th layers. And because the architecture contains concatenations, the equation becomes (1) and (2). From (1), you can see that the input of the $k$-th layer is the concatenation of the outputs of all previous layers. Obviously, the gradient for updating the weights of the $k$-th layer will propagate to all previous layers according to their channel dimensions.

Just take a glance at the figure and you can see which layers' weights $g_{0}$, $g_{1}$, ... are used to update. [image]
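The channel-wise split of the gradient through a concatenation can be sketched with a toy example (a minimal sketch, not from the paper's code; the names `w1`, `w2a`, `w2b` and the scalar "channels" are illustrative assumptions): a two-layer dense block where the gradient that updates layer 2 also supplies the pieces (the $g_0$, $g_1$ of this thread) that propagate back to the earlier features.

```python
# Toy 2-layer dense block with scalar "channels" and manual backprop.
# Layer 2 consumes concat([x0, x1]); with scalar channels that is just
# a weighted sum with one weight per concatenated channel.

def forward(x0, w1, w2a, w2b):
    x1 = w1 * x0                  # layer 1 output
    x2 = w2a * x0 + w2b * x1      # layer 2 on the concatenated input
    return x1, x2

def backward(x0, w1, w2a, w2b):
    x1, _ = forward(x0, w1, w2a, w2b)
    dL_dx2 = 1.0                  # take loss L = x2
    # Gradient of layer 2's concatenated input, split per channel:
    g0_piece = dL_dx2 * w2a       # part propagated to the x0 channel
    g1_piece = dL_dx2 * w2b       # part propagated to the x1 channel
    # Weight gradients for layer 2:
    dL_dw2a = dL_dx2 * x0
    dL_dw2b = dL_dx2 * x1
    # Layer 1's update is driven by g1_piece, i.e. by gradient that has
    # already flowed through layer 2 -- the "reused" gradient information.
    dL_dw1 = g1_piece * x0
    return g0_piece, g1_piece, dL_dw1, dL_dw2a, dL_dw2b
```

With concrete numbers, the layer 1 weight gradient computed from the propagated piece matches a finite-difference check of the loss, illustrating that the gradient arriving at each previous layer is just a channel slice of the gradient that updated the later layer.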

WongKinYiu avatar Oct 17 '21 15:10 WongKinYiu

@WongKinYiu Thanks very much for your patient reply. Good luck to you.

Pcyslist avatar Oct 17 '21 15:10 Pcyslist

@WongKinYiu Dear author, I still don't understand why we should use $g_{0}$ to update $w_1$. In your description, $g_{0}$ equals [image], but if we are going to update $w_1$, we should use [image] in order to calculate [image] using the chain rule. That is why I think there is no connection between [image] and [image]. Only when the variable $x_k$ changes can it affect the weights in previous layers, while the weights in later layers do nothing to previous ones.

And another question: [image] What do you mean by "truncated"?

NeoZng avatar Feb 06 '22 09:02 NeoZng

After reading the author's interpretation above, why do I still think the gradient [image] should be propagated to the $i$-th layer?

JianjianSha avatar Apr 13 '22 07:04 JianjianSha