New layer architecture
New layer architecture prototype
- 🧭 Architecture
Relates to #155 .
Changes proposed by this PR:
- Static network graph is separated from invocation context. a) Static graph captures layers, connections between them and shapes of the units of data. b) Invocation context specifies the batch size and stores all data associated with an invocation (data, gradients).
- Batch size is now explicit in the context instead of being implicitly extracted by layers from incoming data.
- Separation into Layer and ILayer is now gone, everything is now handled in layer implementations (with "leaf" layers focusing on data manipulations while container layers focusing on network composition).
Notes to reviewer:
This is still a very early prototype not intended for merging:
- Solver architecture not changed and just crudely hacked to support new network architecture.
- Shared weights not supported.
- Serialization not supported.
- Mnist example compiles and runs but doesn't converge (there is a bug somewhere, i'm sure).
A good order for exploring this PR is starting at comments in net/mod.rs, net/layer.rs, net/descriptor.rs and net/context.rs.
I did an initial pass, it simplifies things on the user end, but what I see as plus, on the other hand side, it removes the ability to mix execution backends iiuc? I'll do another pass soon.
it removes the ability to mix execution backends
You mean use different backends for net vs loss in Solver? Yeah, this was a shortcut that I did. All the changes to the solver part of the framework were more or less minimally invasive hacks. This was so I could use the buffer holding the network output directly as an input to the loss layer.
In principle, we can keep the ability to have different backed for loss layer through either of these approaches:
- Do not store the
BackendinContextand pass it as a separate arg toLayer::compute_output()andcompute_gradients(). (In reality, there is little reason to store it there, except for parameter passing convenience.) Then we can have a singleContextholding all the buffers but 2Backends. - Have 2 separate
Contextinstances (with separateBackends): one for net and another for loss. This would require somehow copying the network output data from one context to another: either via copying the buffer data bytes or (more intelligently) just injecting a buffer reference.
Or did you mean mixing backends in different invocations of the network? I think already nothing precludes that, as layers don't store Backend object internally.
Changes proposed by this PR:
1. Static network graph is separated from invocation context. a) Static graph captures layers, connections between them and shapes of the units of data. b) Invocation context specifies the batch size and stores all data associated with an invocation (data, gradients).
As mentioned earlier, this makes passing different execution contexts more difficult from what I can see API wise, since the creation of the layers then would have to hold an Rc<B> to the backend.
Storing the associated data as part of the descriptor is not something that seems idiomatic. The descriptor becomes the owner of the actual learned weights iiuc.
A plus of this is, that all layers now have to use the same storage and not be backend specific and also allow things to extend more quickly to other serialization formats.
One use case that must be supported, is to load external network definitions that only share the same input and output dimensions. This allows to i.e. hotswap networks during runtime.
2. Batch size is now explicit in the context instead of being implicitly extracted by layers from incoming data.
I think this is the biggest gain in the new architecture.
3. Separation into Layer and ILayer is now gone, everything is now handled in layer implementations (with "leaf" layers focusing on data manipulations while container layers focusing on network composition).
:+1:
This is the first pass, it generally looks very promising, I have to give it another pass in hopefully less than 24d from now :sweat_smile:
As mentioned earlier, this makes passing different execution contexts more difficult from what I can see API wise, since the creation of the layers then would have to hold an
Rc<B>to the backend.
I think I misunderstood your earlier. Are you saying that we can't create the network using backend B1 and then pass to it a Context which uses a different backend B2? The same limitation exists now too, though?
Spoiler: I'm toying with an idea of separating backend from context:
pub trait Layer<B: IBackend>: Debug {
fn compute_output(&self, backend: &B, context: &mut Context);
}
which I think is cleaner (and Context has no use of backend internally anyway). Still the layer is locked to the backend used during creation, and I don't see an easy way around it unless we change it to something like:
pub trait Layer: Debug {
fn compute_output(&self, backend: &dyn IBackend + LayerOps<f32>, context: &mut Context);
}
(the latter will not compile, but hopefully the idea is clear).
Storing the associated data as part of the descriptor is not something that seems idiomatic. The descriptor becomes the owner of the actual learned weights iiuc.
Well the descriptor is just a convenient way of exposing data from a Layer which the outside world needs to know about. The same can be done with trait functions (much like ILayer::learnable_weights() currently does). Descriptor helps reduce boilerplate code in layers.
The question of ownership is a bit fuzzy with Rc<RefCell<>>, but in my implementation for example Linear layer holds pointers to weights, not relying on the Descriptor:
pub struct Linear {
// Weight (A) and bias (b) for the linear operation y = Ax + b.
weight: Rc<RefCell<LearnableParams>>,
bias: Rc<RefCell<LearnableParams>>,
}
One use case that must be supported, is to load external network definitions that only share the same input and output dimensions. This allows to i.e. hotswap networks during runtime.
This should be already be supported I think. At least I don't see any immediate issues.
This is the first pass, it generally looks very promising, I have to give it another pass in hopefully less than 24d from now sweat_smile
Thanks. I have some updates on my end which I hope to push in about a week. Some cleanups on network side, plus I'm looking into solvers, as I need Adam optimizer for my tasks.
OK, pushed a refreshed version. I couldn't implement Adam since it requires squaring tensors, which is not supported by the backends currently, but I've added some placeholders for it in the new train module.
Added Adam implementation, for now without backend support.
Alright, this a significant chunk of work :heart: I'd like to discuss how we can move towards filling in the missing pieces and a path to getting the adjusted arch back to master.
I think the remaining part should be pretty mechanical -- port other layers to the new infra, write unit tests, etc. I'm happy to do all of that, or we can split the work.
I think it's probably a good idea to commit the current work to a branch, maybe even split in several PRs, to make the review more manageable. The currently missing pieces can be committed as separate PRs into the branch. The branch will have old and new code alongside until everything is ported, after which old code will be deleted.
Do you still want to do an in-depth review of the core infra? I'd be definitely more comfortable if someone can double-check the file structure, names, etc.
I'll get to that. One thing that came to mind was, bring ready to impl auto differentiation with the new arch. The old one was a bit clunky in that regard.
bring ready to impl auto differentiation
Sorry, not sure what this means. Could you elaborate?
That was supposed to be being - what I meant was, the API should be able to represent inference and training passes both separately and in one step
Sorry, can you clarify this? I think it already does it.
Right now the API provides 2 types of abstraction: Network and Trainer (Trainer also operates on Network, but it utilizes a lower-level API of the top Layer):
-
Networkis the API for using the net viaNetwork::transform(). -
Traineris the API for training the net viaTrainer::train_minibatch(). The latter also returns the result of the forward pass.
Both APIs hide the low-level details like constructing a Context, pushing inputs into it, extracting outputs, etc.
I think we can move forward with this large refactor. We could have a sync call if you'd like? Sorry for the delay(s)
Sure, happy to have a call! I'm in PDT timezone, so it seems the acceptable overlapping time range is your evening and my morning. How about Jun 24, 19:00 Munich time? If it works, I can send a Google Meet invite.
Sure, happy to have a call! I'm in PDT timezone, so it seems the acceptable overlapping time range is your evening and my morning. How about Jun 24, 19:00 Munich time? If it works, I can send a Google Meet invite.
That'd work, please drop to [email protected] - if you get a bounce ( I hope not) it's due some email forwarding service issues, which are hopefully dealt with by now :crossed_fingers:
Hey :wave: - I created https://github.com/spearow/juice/tree/arch-refactor where we should land the changeset first. You should also have received an invite that allows you to create branches.
How much do we want the RNN layer to be implemented in the new arch before switching to it? I'm looking into it, but it will likely require some extensive changes of the backend.
cudnnRNNForwardInference() is deprecated, and its replacement, cudnnRNNForward(), requires batch size-dependent descriptors. It's doable, but will likely take me quite some time.
As far as I can tell, the existing RNN implementation is not used anywhere. I'm not even sure it's implemented correctly.
Sorry, this is an old PR, at this point superseded by all the recent ones. I used it to ask the question: https://github.com/spearow/juice/pull/159#issuecomment-1484253106 (my bad, probably should have asked directly).