Address ref objects and seq overhead in autograd

Open · mratsim opened this issue 5 years ago · 1 comment

Following the refactoring of the autograd in #333, we got a huge perf and memory usage improvement by not creating a graph at all in inference mode: 21% on example 06 (Shakespeare text generation).

This is probably due to nodes created during inference not being cleaned up and choking the GC.
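
A minimal sketch of the inference-mode guard in a forward proc (the proc shape and the no_grad field are assumptions; requires_grad appears in the snippets below): the value is always computed, but the node is only registered when a gradient is actually needed.

proc relu*[TT](a: Variable[TT]): Variable[TT] =
  new result
  result.context = a.context
  result.value = relu(a.value)                   # forward pass always runs
  if a.requires_grad and not a.context.no_grad:
    result.relu_cache(a)                         # graph bookkeeping only when training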

However, isolating the procs in charge of registering nodes into the graph (the foo_cache procs and register_node) highlighted a critical overhead.

See how much time linear_cache, sigmoid_cross_entropy_cache and relu_cache take compared to their enclosing forward procs, with special attention to the register_node proc:

[Profiler screenshot: 2018-12-16_00-32-17]

The cache procs are pretty much self-contained; for example, relu_cache:

proc relu_cache[TT](result: Variable[TT], a: Variable[TT]) =
  # Gate
  var gate: ReluActivation[TT]
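  # heap-allocates a GC-managed ref object on every forward call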
  new gate
  gate.cache = result.value

  # Result setup
  result.grad = zeros_like(result.value)
  result.requires_grad = true

  # Add to graph
  register_node(
    "Relu",
    gate,
    relu_backward_ag[TT],
    result,
    a
  )

And register_node:

func push[TT](ctx: Context[TT], node: Node[TT]) {.inline.} =
  ## Append a new operation to the context
  ctx.nodes.add(node)

func register_node*[TT](
        name: static string,
        gate: Gate[TT],
        backward: Backward[TT],
        result: Variable[TT] or seq[Variable[TT]],
        parents: varargs[Variable[TT]]) =
  ## Add an operation / gate as a new node in the computation graph
  var node: Node[TT]

  node.gate = gate
  node.backward = backward
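  # @parents copies the varargs into a freshly allocated seq for every node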
  node.parents = @parents

  when result is Variable:
    node.payload = Payload[TT](kind: pkVar, variable: result)
  else:
    node.payload = Payload[TT](kind: pkSeq, sequence: result)

  when defined(debug):
    node.name = name

  parents[0].context.push node
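
The costs show up right here: new gate in each cache proc allocates a ref object, @parents allocates a seq, and ctx.nodes.add can reallocate the graph's backing storage, all on every forward call. One possible mitigation, sketched with hypothetical names (MaxParents, CompactNode and register_node_compact do not exist in Arraymancer), is to store parents inline in the node:

const MaxParents = 3   # assumed upper bound on a gate's arity

type CompactNode[TT] = object
  gate: Gate[TT]
  backward: Backward[TT]
  payload: Payload[TT]
  parents: array[MaxParents, Variable[TT]]   # inline, no heap allocation
  nParents: int

func register_node_compact[TT](
        ctx: Context[TT],
        gate: Gate[TT],
        backward: Backward[TT],
        payload: Payload[TT],
        parents: varargs[Variable[TT]]) =
  var node: CompactNode[TT]
  node.gate = gate
  node.backward = backward
  node.payload = payload
  node.nParents = parents.len
  for i in 0 ..< parents.len:
    node.parents[i] = parents[i]   # copy refs, no seq allocation
  ctx.nodes.add node               # assumes ctx.nodes is seq[CompactNode[TT]]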

mratsim · Dec 15 '18 23:12

An alternative design to explore: instead of having the parent Variables own their child results, the child results could own their parents (see this article: https://upcoder.com/17/who-owns-who).

This also maps cleanly to having the root be the final loss function and the inputs be the leaves. See colah's blog:

[Computational graph diagrams from colah's blog]

This may work around https://github.com/nim-lang/Nim/issues/9974. Furthermore, it would allow the GC to collect intermediate variables that are not used by a live computation result.
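
A sketch of that inverted ownership (an assumption, not current Arraymancer code): each Variable carries the gate and parents that produced it, so the whole graph is reachable from the final loss, and any intermediate that nothing refers to anymore becomes collectable.

type
  Variable[TT] = ref object
    value: TT
    grad: TT
    requires_grad: bool
    gate: Gate[TT]                # nil for leaf variables
    backward: Backward[TT]
    parents: seq[Variable[TT]]    # a child owns its parents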

We would still need a way to replay the operations in reverse, in the proper order and context. A seq is the easiest, but then it would also be a "co-owner" of each Variable. Otherwise we don't save the order and must introduce a topological sort phase before backprop, as sketched below. (Could we use something like an intrusive singly linked list instead?)
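
A sketch of that topological sort phase, assuming the children-own-parents Variable above (topoSort is a hypothetical name; the visited set is keyed on the ref's address): a recursive post-order DFS from the loss yields parents before children, so traversing the result in reverse gives a valid backprop order.

import sets, hashes

proc topoSort[TT](loss: Variable[TT]): seq[Variable[TT]] =
  var
    order: seq[Variable[TT]]
    visited: HashSet[pointer]            # ref identity via raw address
  proc visit(v: Variable[TT]) =
    if v.isNil or cast[pointer](v) in visited:
      return
    visited.incl cast[pointer](v)
    for p in v.parents:
      visit(p)
    order.add v                          # parents first, children after
  visit(loss)
  result = order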

One way to solve context ownership would be to store the variables in the context, and only use indexes into that context for the parent/child relationships.
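
A sketch of that index-based design (all names hypothetical): the context is the sole owner of every Variable, graph edges are plain integers, and the nodes seq keeps the operation order for free.

type
  VarId = int
  IdxNode[TT] = object
    gate: Gate[TT]
    backward: Backward[TT]
    parents: seq[VarId]            # indices into ctx.variables
    payload: VarId
  IdxContext[TT] = ref object
    variables: seq[Variable[TT]]
    nodes: seq[IdxNode[TT]]        # already in creation (topological) order

func register[TT](ctx: IdxContext[TT], v: Variable[TT]): VarId =
  ctx.variables.add v
  result = ctx.variables.high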

Another area to explore would be optimising the payload. Currently it naively references the result variable. There is no copy, but it does increase refcounting traffic. It could be a pointer to the gradient field instead (but then we run into https://github.com/nim-lang/Nim/issues/9974 again).
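
Sketched (GradPayload is a hypothetical name), the payload becomes an unsafe view; the trade-off is that something else must now guarantee the owning Variable stays alive:

type GradPayload[TT] = object
  grad: ptr TT   # raw pointer into Variable.grad, no refcounting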

An unrelated optimisation would be, like random and its global RNG state, to have a global default context for autograd. It should only be initialised on first use (using the {.global.} pragma) so that people using only the tensor part of Arraymancer don't pay a spurious (though tiny) cost.
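
A sketch of that lazy initialisation (defaultContext is a hypothetical name; newContext is Arraymancer's existing constructor):

proc defaultContext*[TT](): Context[TT] =
  var ctx {.global.}: Context[TT]   # one instance per TT instantiation
  if ctx.isNil:
    ctx = newContext(TT)            # cost paid only on first use
  result = ctx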

mratsim · Dec 16 '18 10:12