Arraymancer
Address ref objects and seq overhead in autograd
Following the refactoring of the autograd in #333, we got a huge performance and memory-usage improvement by not creating a graph at all when in inference mode: 21% on example 06, Shakespeare text generation.
This is probably due to nodes created during inference not being cleaned up and choking the GC.
However, isolating the procs in charge of registering nodes into the graph (`foo_cache` and `register_node`) highlighted a critical overhead. See how much time `linear_cache`, `sigmoid_cross_entropy_cache` and `relu_cache` take compared to the procs they wrap, with special attention to the `register_node` proc.
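To get a feel for where the time goes, a rough standalone microbenchmark (not Arraymancer code; all names here are illustrative) can compare a plain proc call against one that also allocates a ref object and appends it to a seq, mimicking what each `foo_cache` + `register_node` pair does per layer call:

```nim
import std/monotimes, std/times

type Node = ref object
  name: string

var graph: seq[Node]

proc noRegister(x: float): float =
  # plain computation, no graph bookkeeping
  x * 2

proc withRegister(x: float): float =
  # same computation plus a ref allocation and a seq append,
  # as done once per registered node
  var n: Node
  new n
  n.name = "Relu"
  graph.add n
  x * 2

proc bench(f: proc(x: float): float, iters: int): Duration =
  let start = getMonoTime()
  var acc = 0.0
  for i in 0 ..< iters:
    acc += f(1.0)
  doAssert acc > 0  # keep the loop from being optimised away
  getMonoTime() - start

echo "plain:      ", bench(noRegister, 100_000)
echo "registered: ", bench(withRegister, 100_000)
```

The gap between the two timings approximates the per-node allocation and seq-growth overhead discussed above.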
The cache procs are pretty much self-contained; for example, `relu_cache`:
```nim
proc relu_cache[TT](result: Variable[TT], a: Variable[TT]) =
  # Gate
  var gate: ReluActivation[TT]
  new gate
  gate.cache = result.value

  # Result setup
  result.grad = zeros_like(result.value)
  result.requires_grad = true

  # Add to graph
  register_node(
    "Relu",
    gate,
    relu_backward_ag[TT],
    result,
    a
  )
```
And `register_node`:
```nim
func push[TT](ctx: Context[TT], node: Node[TT]) {.inline.} =
  ## Append a new operation to the context
  ctx.nodes.add(node)

func register_node*[TT](
      name: static string,
      gate: Gate[TT],
      backward: Backward[TT],
      result: Variable[TT] or seq[Variable[TT]],
      parents: varargs[Variable[TT]]) =
  ## Add an operation / gate as a new node in the computation graph
  var node: Node[TT]

  node.gate = gate
  node.backward = backward
  node.parents = @parents

  when result is Variable:
    node.payload = Payload[TT](kind: pkVar, variable: result)
  else:
    node.payload = Payload[TT](kind: pkSeq, sequence: result)

  when defined(debug):
    node.name = name

  parents[0].context.push node
```
An alternative design to explore: instead of having the parent Variables own their child result, we could have the child result own its parents (see this article: https://upcoder.com/17/who-owns-who).
This also maps cleanly to having the root be the final loss function and the inputs be the leaves; see colah's blog post on computational graphs.
This may work around https://github.com/nim-lang/Nim/issues/9974. Furthermore, this would allow the GC to collect intermediate variables that are no longer used by a live computation result.
We would still need a way to replay the operations in reverse, in the proper order and context. A seq is the easiest, but then it would also be a "co-owner" of each Variable. Otherwise we don't save the order and must introduce a topological-sort phase before backprop. (Could we use something like an intrusive singly-linked list instead?)
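A minimal sketch of the "children own their parents" design, with the topological sort that backprop ordering would then require (all names here are illustrative, not Arraymancer's):

```nim
type
  Node = ref object
    name: string
    parents: seq[Node]   # a node keeps its parents alive; the loss node
                         # transitively owns the whole graph

proc visit(n: Node, order: var seq[string], seen: var seq[Node]) =
  # Depth-first topological sort: parents are emitted before their
  # children, so backprop can walk the result in reverse.
  if n in seen: return
  seen.add n
  for p in n.parents:
    visit(p, order, seen)
  order.add n.name

let a = Node(name: "input")
let b = Node(name: "linear", parents: @[a])
let c = Node(name: "relu", parents: @[b])
let loss = Node(name: "loss", parents: @[c])

var order: seq[string]
var seen: seq[Node]
visit(loss, order, seen)
doAssert order == @["input", "linear", "relu", "loss"]
```

With this ownership direction, dropping the `loss` reference releases the whole chain, while intermediate Variables no longer reachable from a live result become collectable.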
One way to solve context ownership would be to store Variables in the context, and use only indexes into that context for the parent/child relationships.
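The index-based scheme could look roughly like this (hypothetical types, not Arraymancer's): Variables live in a single seq owned by the Context, and nodes refer to parents and results by index, so there are no cross-references between Variables and no per-edge refcounting.

```nim
type
  VarId = int
  Variable = object
    value: float
    grad: float
  Node = object
    name: string
    parents: seq[VarId]  # indexes into Context.variables
    res: VarId
  Context = object
    variables: seq[Variable]
    nodes: seq[Node]

proc newVariable(ctx: var Context, value: float): VarId =
  # the context is the sole owner; callers only get an index back
  ctx.variables.add Variable(value: value)
  ctx.variables.high

proc register(ctx: var Context, name: string,
              parents: seq[VarId], res: VarId) =
  ctx.nodes.add Node(name: name, parents: parents, res: res)

var ctx: Context
let x = ctx.newVariable(1.5)
let y = ctx.newVariable(ctx.variables[x].value * 2)
ctx.register("mul2", @[x], y)
doAssert ctx.variables[y].value == 3.0
```

The trade-off is that every access goes through the context, and freeing individual Variables early is no longer possible without extra bookkeeping.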
Another area to explore would be optimising the payload. Currently it naively references the result variable: there is no copy, but it does increase refcounting. It could instead be a pointer to the gradient field (but then we run into https://github.com/nim-lang/Nim/issues/9974 again).
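A sketch of that payload idea, with stand-in types (not Arraymancer's): storing a raw pointer to the gradient field avoids the refcount bump, at the cost of the caller having to guarantee the Variable outlives the pointer.

```nim
type
  Tensor = object
    data: seq[float]
  Variable = ref object
    value: Tensor
    grad: Tensor

proc gradAddr(v: Variable): ptr Tensor =
  # no refcount increment on `v`, unlike storing the Variable itself;
  # unsafe if `v` is collected while the pointer is still held
  addr v.grad

var v = Variable(value: Tensor(data: @[1.0]),
                 grad: Tensor(data: @[0.0]))
let g = v.gradAddr
g[].data[0] = 42.0
doAssert v.grad.data[0] == 42.0
```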
An unrelated optimisation would be, like `random` and its global `rng` state, to have a global default context for autograd. It should only be initialised on first use (using the {.global.} pragma) so that people using only the tensor part of Arraymancer don't pay a spurious (though tiny) cost.
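The lazy-initialisation pattern with {.global.} could be sketched like this (a simplified stand-in `Context`, not Arraymancer's):

```nim
type
  Context = ref object
    nodes: seq[string]

proc defaultContext(): Context =
  # {.global.} gives `ctx` a single program-wide slot; it starts as nil
  # and is only allocated on the first call, so pure tensor users that
  # never call this pay nothing.
  var ctx {.global.}: Context
  if ctx.isNil:
    new ctx
  ctx

let c1 = defaultContext()
c1.nodes.add "Relu"
let c2 = defaultContext()
doAssert c1 == c2               # same instance on every call
doAssert c2.nodes == @["Relu"]
```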