
Interfacing to Mocha's backpropagation algorithm

Open vollmersj opened this issue 10 years ago • 4 comments

Mocha is a really nice project and has backpropagation implemented for many different layer types and neurons. However, what is the best way to interface with it so as to obtain one large parameter vector? Is there a way around copying pieces into every blob? Currently, I am using copy!(net.states[i].parameters[j].blob, slice), where slice is a slice of my big parameter vector.

This can be put together into a function that does the backpropagation, given an array NNInds containing the indices of the corresponding slices (a sketch of how NNInds could be built follows the function below).

Copying the memory produces some overhead; this should not matter much for large networks, but there must be a better way.

# Backpropagation: copy the flat parameter vector `para` into the net's blobs,
# run forward/backward, and gather the gradients back into one flat vector.
# `net` and `solver` are assumed to be defined in the enclosing scope.
function evaluateNN(para, nPara, NNInds)
    # copy parameter slices into the layer blobs (skipping data and loss layers)
    for i = 2:(length(net.states)-1)
        for j = 1:length(net.states[i].parameters)
            copy!(net.states[i].parameters[j].blob,
                  para[NNInds[i-1][j][1] : NNInds[i-1][j][2]])
        end
    end
    # loss
    val = forward(net, solver.params.regu_coef)
    backward(net, solver.params.regu_coef)
    # gradient
    gradient = zeros(nPara)
    for i = 2:(length(net.states)-1)
        for j = 1:length(net.states[i].parameters)
            gradient[NNInds[i-1][j][1] : NNInds[i-1][j][2]] =
                net.states[i].parameters[j].gradient.data[:]
        end
    end
    return (val, gradient)
end
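
For reference, the index table NNInds (and the total count nPara) could be built along these lines. This is only a sketch: it takes the element counts from the gradient's .data array, which is an assumption about the blob layout rather than documented Mocha API.

# Sketch: build the NNInds index table used above, plus the total parameter count.
function buildNNInds(net)
    NNInds = Any[]
    offset = 0
    for i = 2:(length(net.states)-1)                 # same layer range as above
        layer_inds = Any[]
        for j = 1:length(net.states[i].parameters)
            n = length(net.states[i].parameters[j].gradient.data)   # element count (assumed layout)
            push!(layer_inds, [offset + 1, offset + n])              # (start, stop) into the flat vector
            offset += n
        end
        push!(NNInds, layer_inds)
    end
    return NNInds, offset    # offset == nPara, the total number of parameters
end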

vollmersj commented Sep 04 '15

Unfortunately there is no better way, because for obvious reasons the gradients are stored separately. I'm quite curious, though: why do you want a flat, huge vector? If you really want that, an easier way is to flatten each Param and then concatenate them all, as in the sketch below. The memory overhead cannot be avoided, though.
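
For illustration, a minimal sketch of that flatten-and-concatenate approach. It assumes copy!(dst::Array, blob) copies in this direction and that length() is defined for the blob; both should be checked against the actual blob API.

# Sketch: flatten each parameter blob into a host array and concatenate.
function flatten_params(net)
    pieces = Vector{Float64}[]
    for i = 2:(length(net.states)-1)                 # same layer range as above
        for j = 1:length(net.states[i].parameters)
            blob = net.states[i].parameters[j].blob
            buf = zeros(length(blob))                # host-side buffer (length(blob) is assumed)
            copy!(buf, blob)                         # blob -> flat array (assumed direction)
            push!(pieces, buf)
        end
    end
    return vcat(pieces...)                           # one large parameter vector
end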

pluskid commented Sep 06 '15

Thank you for your response. There might be a way around the memory overhead by using pointers:

x = zeros(8)
p = pointer_to_array(pointer(x, 3), (3, 2))   # p shares memory with x[3:8], no copy
p[:, 1] = 100.0
p[:, 2] = 200.0
@show x    # => [0.0, 0.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0]

Initialising the layer blobs would then require picking an appropriate chunk out of the memory. Would this be possible?

Having one parameter vector makes it easier to try different tuning algorithms.
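
As a rough sketch of the idea in plain Julia (cpu memory only): the shapes below are made up, and whether a Mocha blob could be made to wrap such a view instead of allocating its own array is exactly the open question.

# Sketch: carve blob-shaped views out of one flat parameter vector, without copying.
big = zeros(10)
dims = [(2, 3), (4,)]            # hypothetical per-parameter shapes
views = Any[]
offset = 1
for d in dims
    push!(views, pointer_to_array(pointer(big, offset), d))   # view into big
    offset += prod(d)
end
views[1][1, 1] = 1.0             # writes straight into `big`
@show big[1]                     # => 1.0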

vollmersj commented Sep 06 '15

Yes, this is technically possible for the cpu backend only. Though I doubt it will be a serious issue, because nowadays cpu memory is very large. If you have huge models, the bottleneck will then be the computation, especially when using a cpu backend. For the gpu backend, the memory is on the gpu device and cannot be directly shared with the cpu.

pluskid commented Sep 06 '15

For "cpu backend only", sorry I am on a phone and the auto correction is so crazy that it does not know cpu.

pluskid commented Sep 06 '15