neuron
Separate data parallelization from model parallelization.
Split backpropagate() into two versions: one sequential in the data, one parallel in the data.
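A minimal sketch of the two variants, using a single neuron with squared loss as a stand-in for the real model. The names Neuron, Example, gradientFor, backpropagateSeq, and backpropagatePar are illustrative, not an existing API; the parallel version assumes Scala parallel collections (the scala-parallel-collections module on 2.13+).

```scala
// Hypothetical types for illustration; not part of the original code.
case class Example(x: Vector[Double], y: Double)

class Neuron(val w: Vector[Double]) {
  private def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Per-example gradient of the squared loss 0.5 * (w.x - y)^2.
  def gradientFor(ex: Example): Vector[Double] = {
    val err = dot(w, ex.x) - ex.y
    ex.x.map(_ * err)
  }

  // Sequential in data: fold over the examples one at a time.
  def backpropagateSeq(data: Seq[Example]): Vector[Double] =
    data.foldLeft(Vector.fill(w.length)(0.0)) { (acc, ex) =>
      acc.zip(gradientFor(ex)).map { case (a, g) => a + g }
    }

  // Parallel in data: the same computation over a parallel collection.
  // Assumes the scala-parallel-collections dependency on Scala 2.13+.
  def backpropagatePar(data: Seq[Example]): Vector[Double] = {
    import scala.collection.parallel.CollectionConverters._
    data.par
      .map(gradientFor)
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  }
}
```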
Another workaround is to return the derivatives as explicit outputs and use aggregate to combine them into the overall gradient.
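A sketch of this variant, reusing the hypothetical Neuron and Example types from above: each example's gradient is an explicit return value, and aggregate combines worker-local partial sums, so no shared state is ever mutated.

```scala
object GradientAggregate {
  import scala.collection.parallel.CollectionConverters._

  def overallGradient(neuron: Neuron, data: Seq[Example]): Vector[Double] = {
    val zero = Vector.fill(neuron.w.length)(0.0)
    data.par.aggregate(zero)(
      // seqop: fold one example's gradient into a worker-local partial sum
      (acc, ex) => acc.zip(neuron.gradientFor(ex)).map { case (a, g) => a + g },
      // combop: merge partial sums produced by different workers
      (a, b) => a.zip(b).map { case (u, v) => u + v }
    )
  }
}
```

Because gradientFor is pure and aggregate only combines values, the neuron's weights never change during the pass.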
I like the second solution for handling data parallelization: it keeps all state immutable.