Gradients seem incorrect in recursive formulation
Describe the bug When I was trying something like an exponentially weighted moving average, I noticed that the gradients may be incorrect.
To Reproduce
import taichi as ti
import numpy as np

ti.init(arch=ti.cuda)
row_num, column_num = 1, 2
data = ti.field(ti.f32, shape=(row_num, column_num), needs_grad=True)
np.random.seed(1)
data_np = np.random.rand(row_num, column_num)
print(f"data is {data_np}")
ewma = ti.field(ti.f32, shape=row_num, needs_grad=True)
loss = ti.field(ti.f32, shape=(), needs_grad=True)
data.from_numpy(data_np)
@ti.kernel
def calc_ewma():
    for row in range(row_num):
        for col in range(column_num):
            if col == 0:
                ewma[row] = data[row, col]
            else:
                # ewma[row] *= 0.5                                  # 1
                # ewma[row] += 0.5 * data[row, col]                 # 1
                ewma[row] = 0.5 * data[row, col] + 0.5 * ewma[row]  # 2

@ti.kernel
def calc_loss():
    for row in range(row_num):
        loss[None] += ewma[row] ** 2

@ti.kernel
def print_grads():
    for row in range(row_num):
        for col in range(column_num):
            print(data.grad[row, col])

with ti.Tape(loss, clear_gradients=True):
    calc_ewma()
    calc_loss()
print_grads()
Log/Screenshots If my analysis is correct and matches my code, then in this simple setting:
e0 = d0                    # e0 = ewma[0] after `col == 0`, d0 = data[0, 0]
e1 = 0.5 * d0 + 0.5 * d1   # e1 = ewma[0] after `col == 1`, d1 = data[0, 1]
loss = e1^2
# thus
d loss / d e1 = 2 * e1
d e1 / d d0 = 0.5
d e1 / d d1 = 0.5
# by chain rule
d loss / d d0 = d loss / d d1 = 2 * e1 * 0.5 = e1 = 0.5 * d0 + 0.5 * d1
but the log is
[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-t4n_eqlj
[Taichi] version 0.7.20, llvm 10.0.0, commit 284f75ed, linux, python 3.8.8
[Taichi] Starting on arch=cuda
data is [[0.417022 0.72032449]]
[Taichi] materializing...
1.706020
0.568673
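For reference, here is a quick NumPy check of the derivation above against these numbers (a small sketch that only reuses the seeded data from the log):

import numpy as np

np.random.seed(1)
d0, d1 = np.random.rand(1, 2)[0]   # 0.417022, 0.72032449, as printed above

e1 = 0.5 * d0 + 0.5 * d1           # final ewma[0]
expected = 2.0 * e1 * 0.5          # d loss / d d0 = d loss / d d1 = e1

print(expected)                    # 0.5686732..., expected for both entries,
                                   # while the log shows 1.706020 and 0.568673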
Additional comments
I don't know whether the code marked #2 breaks the global data access rule, but I believe the code marked #1 does not; either way, both variants give the same result.
Hi @ifsheldon ,
Thanks for reporting this. I took a look, and the global data access rule is indeed the problem.
Taking the simpler #1 as an example:
@ti.kernel
def calc_ewma():
    for _ in range(1):
        for col in range(column_num):
            if col == 0:
                ewma[0] = data[0, 0] * 0.5
            else:
                ewma[0] *= 0.5  # 1
If you print the IR, the gradient kernel becomes something equivalent to:
# pseudo code
@ti.kernel
def calc_ewma_grad():
    for _ in range(1):
        for col in reversed(range(column_num)):  # col = 1, 0
            tmp = 0.5 * ewma.grad[0]
            if col == 0:
                data.grad[0, 0] += tmp
            else:
                ewma.grad[0] += tmp  # oops
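To see how this kind of accumulation produces the exact numbers in the log, here is a plain-Python replay of the reverse pass for the original #2 statement, mirroring the same "oops" pattern (a sketch, not Taichi's actual generated code):

d0, d1 = 0.417022, 0.72032449        # the seeded data from the log

# forward (#2): ewma = d0, then ewma = 0.5 * d1 + 0.5 * ewma
e1 = 0.5 * d1 + 0.5 * d0

g_ewma = 2.0 * e1                    # seed from loss = ewma ** 2
g_data = [0.0, 0.0]

# reversed col loop: col = 1 first, statement "ewma = 0.5 * data + 0.5 * ewma"
tmp = 0.5 * g_ewma
g_data[1] += tmp                     # contribution to data.grad[0, 1]
g_ewma += tmp                        # "oops": should be g_ewma = tmp, because the
                                     # forward assignment overwrote ewma[0]

# col = 0, statement "ewma = data[0, 0]"
g_data[0] += g_ewma

print(g_data)                        # [1.7060197..., 0.5686732...] -- matches the log;
                                     # the correct result would be [e1, e1] = [0.5687, 0.5687]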
Not sure if this is applicable to your actual code, but is it possible to extend ewma to be of shape (column_num,) and write to different cells in calc_ewma()?
Finally, printing out the SNode ID of each field can be helpful during debugging:
"""
data=1
data_grad=3
ewma=5
ewma_grad=7
loss=9
loss_grad=11
"""
print(f'data={data.snode.parent().id}')
print(f'data_grad={data.grad.snode.parent().id}')
print(f'ewma={ewma.snode.parent().id}')
print(f'ewma_grad={ewma.grad.snode.parent().id}')
print(f'loss={loss.snode.parent().id}')
print(f'loss_grad={loss.grad.snode.parent().id}')
And below are the finalized IRs for calc_ewma() and its gradient kernel:
[compile_to_offloads.cpp:operator()@22] [calc_ewma_c32_0] Bit struct stores optimized:
kernel {
$0 = offloaded range_for(0, 1) grid_dim=0 block_dim=32
body {
<i32> $1 = const [1]
<f32> $2 = const [0.5]
<i32> $3 = const [0]
<i32> $4 = const [2]
$5 : for in range($3, $4) (vectorize 1) (bit_vectorize -1) block_dim=adaptive {
<i32> $6 = loop $5 index 0
<i32> $7 = cmp_eq $6 $3
<i32> $8 = bit_and $7 $1
<*gen> $9 = get root
<*gen> $10 = [S0root][root]::lookup($9, $3) activate = false
$11 : if $8 {
<*gen> $12 = get child [S0root->S1dense] $10
<*gen> $13 = [S1dense][dense]::lookup($12, $3) activate = false
<*f32> $14 = get child [S1dense->S2place<f32>] $13
<f32> $15 = global load $14
<f32> $16 = mul $15 $2
<*gen> $17 = get child [S0root->S5dense] $10
<*gen> $18 = [S5dense][dense]::lookup($17, $3) activate = false
<*f32> $19 = get child [S5dense->S6place<f32>] $18
$20 : global store [$19 <- $16]
} else {
<*gen> $21 = get child [S0root->S5dense] $10
<*gen> $22 = [S5dense][dense]::lookup($21, $3) activate = false
<*f32> $23 = get child [S5dense->S6place<f32>] $22
<f32> $24 = global load $23
<f32> $25 = mul $24 $2
$26 : global store [$23 <- $25]
}
}
}
}
[compile_to_offloads.cpp:operator()@22] [calc_ewma_c33_0_grad_grad] Bit struct stores optimized:
kernel {
$0 = offloaded range_for(0, 1) grid_dim=0 block_dim=32
body {
<i32> $1 = const [1]
<f32> $2 = const [0.5]
<i32> $3 = const [0]
<i32> $4 = const [2]
$5 : reversed for in range($3, $4) (vectorize 1) (bit_vectorize -1) block_dim=adaptive {
<i32> $6 = loop $5 index 0
<i32> $7 = cmp_eq $6 $3
<i32> $8 = bit_and $7 $1
<*gen> $9 = get root
<*gen> $10 = [S0root][root]::lookup($9, $3) activate = false
<*gen> $11 = get child [S0root->S7dense] $10
<*gen> $12 = [S7dense][dense]::lookup($11, $3) activate = false
<*f32> $13 = get child [S7dense->S8place<f32>] $12
<f32> $14 = global load $13
<f32> $15 = mul $14 $2
$16 : if $8 {
<*gen> $17 = get child [S0root->S3dense] $10
<*gen> $18 = [S3dense][dense]::lookup($17, $3) activate = false
<*f32> $19 = get child [S3dense->S4place<f32>] $18
<f32> $20 = atomic add($19, $15)
} else {
<f32> $21 = atomic add($13, $15)
}
}
}
}
Can you explain more about why my code marked #1 breaks the global data access rule? Is it because I used *= instead of +=? I remember the rule only says that += can be used.
Not sure if this is applicable to your actual code, but is it possible to extend ewma to be of shape (column_num,) and write to different cells in calc_ewma()?
As for the use case, this EWMA example is primarily for testing the behavior of Taichi's autodiff, because EWMA is the simplest case I could come up with that contains a recursive formulation. I don't think extending ewma matters much in this example: the example is basically an EWMA calculator parallelized along rows, and I cannot see why transposing my "data" and parallelizing the EWMA calculation along columns would mitigate this issue.
What I actually want to do is more complicated: the alpha blending in direct volume rendering, which is somewhat similar to the following:
e = 0.0
for data in data_array:
    e = (1 - e) * data + e
So, I have to deal with gradient calculation with recursive formulations.
Back to this EWMA example, is such a gradient error an expected result? If so, do you have any workarounds? Or, is Taichi currently limited in tracing gradients in such recursive structures?
Yeah, it's likely that this is a bug due to a combination of recursive structures and control flows. I was suggesting reading from and writing to different locations when you do the calculation; could you check whether that mitigates your issue?
it's likely that this is a bug due to a combination of recursive structures and control flows
I don't think it is the combination that caused the issue, but rather the R/W pattern you suggested, so I changed the code as follows. The code below works and gives the correct result. I think my expanded ewma serves as an explicit "tape" as the steps evolve in time.
import taichi as ti
import numpy as np
ti.init(arch=ti.cuda)
row_num, column_num = 1, 2
data = ti.field(ti.f32, shape=(row_num, column_num), needs_grad=True)
np.random.seed(1)
data_np = np.random.rand(row_num, column_num)
print(f"data is {data_np}")
ewma = ti.field(ti.f32, shape=(row_num, column_num), needs_grad=True) # changed here
loss = ti.field(ti.f32, shape=(), needs_grad=True)
data.from_numpy(data_np)
@ti.kernel
def calc_ewma():
    for row in range(row_num):
        for col in range(column_num):
            if col == 0:
                ewma[row, col] = data[row, col]  # changed
            else:
                ewma[row, col] = 0.5 * data[row, col] + 0.5 * ewma[row, col - 1]  # changed

@ti.kernel
def calc_loss():
    for row in range(row_num):
        loss[None] += ewma[row, column_num - 1] ** 2

@ti.kernel
def print_grads():
    for row in range(row_num):
        for col in range(column_num):
            print(data.grad[row, col])

with ti.Tape(loss, clear_gradients=True):
    calc_ewma()
    calc_loss()
print_grads()
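For the alpha blending use case mentioned earlier, the same explicit-tape pattern could be applied in a similar way. Below is a hedged sketch; the field names samples and blended and the one-ray-per-row layout are assumptions for illustration, not part of the original code:

import taichi as ti

ti.init(arch=ti.cuda)

ray_num, sample_num = 4, 16
samples = ti.field(ti.f32, shape=(ray_num, sample_num), needs_grad=True)  # hypothetical input
blended = ti.field(ti.f32, shape=(ray_num, sample_num), needs_grad=True)  # explicit per-step "tape"

@ti.kernel
def alpha_blend():
    for ray in range(ray_num):
        for s in range(sample_num):
            if s == 0:
                blended[ray, s] = samples[ray, s]  # e starts at 0, so (1 - 0) * data + 0
            else:
                prev = blended[ray, s - 1]
                # e = (1 - e) * data + e, written to a fresh cell each step
                blended[ray, s] = (1.0 - prev) * samples[ray, s] + prev

Each step reads the previous cell and writes a new one, so no field element is both read and overwritten by the same kernel, which is what made the EWMA fix above work.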
However, such an explicit "tape" seems space-consuming. Imagine we have an RNN with n parameters and we expect it to handle m steps of input; we would need at least m*n space for tracking gradients this way. Are there any ways to optimize the space consumption of calculating gradients over time steps?
You can use the classical treeverse algorithm: basically, suppose the tape is n long; only store the elements (checkpoints) at indices 0.5n, 0.75n, 0.875n, etc., recomputing the other elements as needed. When a checkpoint is consumed in backpropagation, replan the geometric sequence on the 'other side of the tree' (for example, once the 0.5n checkpoint has been backpropagated, replan at 0.25n, 0.375n, 0.4375n, etc.).
This allows O(log n) memory and O(n log n) time for an O(n)-long feedforward computation.
Another approach is to implement DTR (Dynamic Tensor Rematerialization), which tries to do the above scheme but uses a bunch of systems-programming insight to get better constants.
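To make the idea concrete, here is a plain-Python sketch of a simplified bisection flavor of checkpointed reverse-mode applied to the EWMA recurrence (not the optimal binomial treeverse schedule; the function names are made up for illustration):

def forward_step(e, d_k):
    # one EWMA step: e_new = 0.5 * d_k + 0.5 * e
    return 0.5 * d_k + 0.5 * e

def backprop_segment(e_lo, lo, hi, g_out, d, g_d):
    # Backprop through steps lo..hi-1 given only the checkpointed state e_lo
    # (the state right before step lo). g_out is d(loss)/d(state after step hi-1).
    # Returns d(loss)/d(e_lo). Live states on the recursion stack: O(log n).
    if hi - lo == 1:
        g_d[lo] += 0.5 * g_out      # d e_new / d d_lo = 0.5
        return 0.5 * g_out          # d e_new / d e_lo = 0.5
    mid = (lo + hi) // 2
    e_mid = e_lo
    for k in range(lo, mid):        # recompute the state at `mid` from the checkpoint
        e_mid = forward_step(e_mid, d[k])
    g_mid = backprop_segment(e_mid, mid, hi, g_out, d, g_d)
    return backprop_segment(e_lo, lo, mid, g_mid, d, g_d)

# usage: loss = e ** 2 after running the recurrence over d
d = [0.417022, 0.72032449, 0.3, 0.7]
e = d[0]
for k in range(1, len(d)):
    e = forward_step(e, d[k])

g_d = [0.0] * len(d)
g_e0 = backprop_segment(d[0], 1, len(d), 2.0 * e, d, g_d)
g_d[0] += g_e0                      # the initial state was e = d[0]
print(g_d)                          # same result as a full-tape backward pass for this recurrence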
When I try to do automatic differentiation with this checkpointing scheme, the gradient calculation is still incorrect, because the segmented data is stored with a local numbering rather than the global one. How can I solve this problem?