juice icon indicating copy to clipboard operation
juice copied to clipboard

New layer architecture

Open hweom opened this issue 4 years ago • 18 comments

New layer architecture prototype

  • 🧭 Architecture

Relates to #155 .

Changes proposed by this PR:

  1. Static network graph is separated from invocation context. a) Static graph captures layers, connections between them and shapes of the units of data. b) Invocation context specifies the batch size and stores all data associated with an invocation (data, gradients).
  2. Batch size is now explicit in the context instead of being implicitly extracted by layers from incoming data.
  3. Separation into Layer and ILayer is now gone, everything is now handled in layer implementations (with "leaf" layers focusing on data manipulations while container layers focusing on network composition).

Notes to reviewer:

This is still a very early prototype not intended for merging:

  1. Solver architecture not changed and just crudely hacked to support new network architecture.
  2. Shared weights not supported.
  3. Serialization not supported.
  4. Mnist example compiles and runs but doesn't converge (there is a bug somewhere, i'm sure).

A good order for exploring this PR is starting at comments in net/mod.rs, net/layer.rs, net/descriptor.rs and net/context.rs.

hweom avatar Mar 12 '22 18:03 hweom

I did an initial pass, it simplifies things on the user end, but what I see as plus, on the other hand side, it removes the ability to mix execution backends iiuc? I'll do another pass soon.

drahnr avatar Mar 21 '22 09:03 drahnr

it removes the ability to mix execution backends

You mean use different backends for net vs loss in Solver? Yeah, this was a shortcut that I did. All the changes to the solver part of the framework were more or less minimally invasive hacks. This was so I could use the buffer holding the network output directly as an input to the loss layer.

In principle, we can keep the ability to have different backed for loss layer through either of these approaches:

  • Do not store the Backend in Context and pass it as a separate arg to Layer::compute_output() and compute_gradients(). (In reality, there is little reason to store it there, except for parameter passing convenience.) Then we can have a single Context holding all the buffers but 2 Backends.
  • Have 2 separate Context instances (with separate Backends): one for net and another for loss. This would require somehow copying the network output data from one context to another: either via copying the buffer data bytes or (more intelligently) just injecting a buffer reference.

Or did you mean mixing backends in different invocations of the network? I think already nothing precludes that, as layers don't store Backend object internally.

hweom avatar Mar 23 '22 00:03 hweom

Changes proposed by this PR:

1. Static network graph is separated from invocation context.
   a) Static graph captures layers, connections between them and shapes of the units of data.
   b) Invocation context specifies the batch size and stores all data associated with an invocation (data, gradients).

As mentioned earlier, this makes passing different execution contexts more difficult from what I can see API wise, since the creation of the layers then would have to hold an Rc<B> to the backend.

Storing the associated data as part of the descriptor is not something that seems idiomatic. The descriptor becomes the owner of the actual learned weights iiuc.

A plus of this is, that all layers now have to use the same storage and not be backend specific and also allow things to extend more quickly to other serialization formats.

One use case that must be supported, is to load external network definitions that only share the same input and output dimensions. This allows to i.e. hotswap networks during runtime.

2. Batch size is now explicit in the context instead of being implicitly extracted by layers from incoming data.

I think this is the biggest gain in the new architecture.

3. Separation into Layer and ILayer is now gone, everything is now handled in layer implementations (with "leaf" layers focusing    on data manipulations while container layers focusing on network composition).

:+1:


This is the first pass, it generally looks very promising, I have to give it another pass in hopefully less than 24d from now :sweat_smile:

drahnr avatar Apr 13 '22 13:04 drahnr

As mentioned earlier, this makes passing different execution contexts more difficult from what I can see API wise, since the creation of the layers then would have to hold an Rc<B> to the backend.

I think I misunderstood your earlier. Are you saying that we can't create the network using backend B1 and then pass to it a Context which uses a different backend B2? The same limitation exists now too, though?

Spoiler: I'm toying with an idea of separating backend from context:

pub trait Layer<B: IBackend>: Debug {
    fn compute_output(&self, backend: &B, context: &mut Context);
}

which I think is cleaner (and Context has no use of backend internally anyway). Still the layer is locked to the backend used during creation, and I don't see an easy way around it unless we change it to something like:

pub trait Layer: Debug {
    fn compute_output(&self, backend: &dyn IBackend + LayerOps<f32>, context: &mut Context);
}

(the latter will not compile, but hopefully the idea is clear).

Storing the associated data as part of the descriptor is not something that seems idiomatic. The descriptor becomes the owner of the actual learned weights iiuc.

Well the descriptor is just a convenient way of exposing data from a Layer which the outside world needs to know about. The same can be done with trait functions (much like ILayer::learnable_weights() currently does). Descriptor helps reduce boilerplate code in layers.

The question of ownership is a bit fuzzy with Rc<RefCell<>>, but in my implementation for example Linear layer holds pointers to weights, not relying on the Descriptor:

pub struct Linear {
    // Weight (A) and bias (b) for the linear operation y = Ax + b.
    weight: Rc<RefCell<LearnableParams>>,
    bias: Rc<RefCell<LearnableParams>>,
}

One use case that must be supported, is to load external network definitions that only share the same input and output dimensions. This allows to i.e. hotswap networks during runtime.

This should be already be supported I think. At least I don't see any immediate issues.

This is the first pass, it generally looks very promising, I have to give it another pass in hopefully less than 24d from now sweat_smile

Thanks. I have some updates on my end which I hope to push in about a week. Some cleanups on network side, plus I'm looking into solvers, as I need Adam optimizer for my tasks.

hweom avatar Apr 14 '22 02:04 hweom

OK, pushed a refreshed version. I couldn't implement Adam since it requires squaring tensors, which is not supported by the backends currently, but I've added some placeholders for it in the new train module.

hweom avatar Apr 28 '22 01:04 hweom

Added Adam implementation, for now without backend support.

hweom avatar May 07 '22 23:05 hweom

Alright, this a significant chunk of work :heart: I'd like to discuss how we can move towards filling in the missing pieces and a path to getting the adjusted arch back to master.

drahnr avatar May 22 '22 07:05 drahnr

I think the remaining part should be pretty mechanical -- port other layers to the new infra, write unit tests, etc. I'm happy to do all of that, or we can split the work.

I think it's probably a good idea to commit the current work to a branch, maybe even split in several PRs, to make the review more manageable. The currently missing pieces can be committed as separate PRs into the branch. The branch will have old and new code alongside until everything is ported, after which old code will be deleted.

Do you still want to do an in-depth review of the core infra? I'd be definitely more comfortable if someone can double-check the file structure, names, etc.

hweom avatar May 22 '22 22:05 hweom

I'll get to that. One thing that came to mind was, bring ready to impl auto differentiation with the new arch. The old one was a bit clunky in that regard.

drahnr avatar May 27 '22 08:05 drahnr

bring ready to impl auto differentiation

Sorry, not sure what this means. Could you elaborate?

hweom avatar May 31 '22 00:05 hweom

That was supposed to be being - what I meant was, the API should be able to represent inference and training passes both separately and in one step

drahnr avatar Jun 04 '22 07:06 drahnr

Sorry, can you clarify this? I think it already does it.

Right now the API provides 2 types of abstraction: Network and Trainer (Trainer also operates on Network, but it utilizes a lower-level API of the top Layer):

  • Network is the API for using the net via Network::transform().
  • Trainer is the API for training the net via Trainer::train_minibatch(). The latter also returns the result of the forward pass.

Both APIs hide the low-level details like constructing a Context, pushing inputs into it, extracting outputs, etc.

hweom avatar Jun 11 '22 18:06 hweom

I think we can move forward with this large refactor. We could have a sync call if you'd like? Sorry for the delay(s)

drahnr avatar Jun 23 '22 05:06 drahnr

Sure, happy to have a call! I'm in PDT timezone, so it seems the acceptable overlapping time range is your evening and my morning. How about Jun 24, 19:00 Munich time? If it works, I can send a Google Meet invite.

hweom avatar Jun 24 '22 05:06 hweom

Sure, happy to have a call! I'm in PDT timezone, so it seems the acceptable overlapping time range is your evening and my morning. How about Jun 24, 19:00 Munich time? If it works, I can send a Google Meet invite.

That'd work, please drop to [email protected] - if you get a bounce ( I hope not) it's due some email forwarding service issues, which are hopefully dealt with by now :crossed_fingers:

drahnr avatar Jun 24 '22 07:06 drahnr

Hey :wave: - I created https://github.com/spearow/juice/tree/arch-refactor where we should land the changeset first. You should also have received an invite that allows you to create branches.

drahnr avatar Jun 27 '22 15:06 drahnr

How much do we want the RNN layer to be implemented in the new arch before switching to it? I'm looking into it, but it will likely require some extensive changes of the backend.

cudnnRNNForwardInference() is deprecated, and its replacement, cudnnRNNForward(), requires batch size-dependent descriptors. It's doable, but will likely take me quite some time.

As far as I can tell, the existing RNN implementation is not used anywhere. I'm not even sure it's implemented correctly.

hweom avatar Mar 26 '23 23:03 hweom

Sorry, this is an old PR, at this point superseded by all the recent ones. I used it to ask the question: https://github.com/spearow/juice/pull/159#issuecomment-1484253106 (my bad, probably should have asked directly).

hweom avatar Mar 28 '23 05:03 hweom