WIP: homomorphic setting
The current implementation of `chain`'s initialization does shape inference that guesses too much. For instance, `relu >> relu >> relu`, initialized with X and Y, would define the whole network and set every layer's `nO` to Y's width, which is not necessarily what the user wants. You could argue that in this case the network is simply underspecified and should be declared more explicitly.
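To make the guessing concrete, here is a minimal sketch of the behaviour described above (the array shapes are made up for illustration):

```python
import numpy
from thinc.api import Relu, chain

X = numpy.zeros((8, 4), dtype="f")  # 4 input features
Y = numpy.zeros((8, 2), dtype="f")  # 2 outputs

model = chain(Relu(), Relu(), Relu())
model.initialize(X=X, Y=Y)

# Per the description above, the current inference fills every unset nO
# in from Y, so all three layers end up with Y's width (2):
for layer in model.layers:
    print(layer.get_dim("nO"))
```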
Guessing hidden widths also meant we made mistakes. With an ensemble textcat with an inline transformer, two embedding layers are concatenated. But if the transformer's embedding width isn't known upon initialization, the second embedding layer would just receive its `nO` from Y, which is clearly wrong.
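Schematically (the layer choices and widths below are made up; this is not the actual spaCy architecture), the problem looks like this:

```python
from thinc.api import Linear, Softmax, chain, concatenate

tok2vec_embed = Linear(nO=96)  # width known at construction time
transformer_embed = Linear()   # width only known once the transformer is loaded

textcat = chain(concatenate(tok2vec_embed, transformer_embed), Softmax())

# Under the current inference, initializing textcat with X and Y would fill
# transformer_embed's missing nO from Y's width (the number of labels),
# which has nothing to do with the embedding width.
```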
This PR restricts transferring Y across the network, and instead introduces a "homomorphic" setting that should at least allow us to do better inference for layers where `nI == nO`.
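The intended rule, roughly (a sketch of the idea, not the code in this PR):

```python
from typing import Optional

def infer_nO(nI: Optional[int], homomorphic: bool) -> Optional[int]:
    """Sketch: if a layer is homomorphic and its input width is known,
    its output width can be inferred to match. Otherwise leave nO unset
    rather than guessing it from Y."""
    if homomorphic and nI is not None:
        return nI
    return None
```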
TODO
- [ ] Think about versioning of `chain`, as this change is breaking across `spacy` and `spacy-transformers`. However, those libraries don't actually use `chain.v1`, they just import `thinc.api.chain`.
- [ ] Look at unit test in notebooks: example 01
- [ ] Look more into the issue of inline transformer+textcat, as this might represent a use-case that requires a more complex solution to this problem
- [ ] Find a way to specify relations between dimensions in a network architecture, so that all related dimensions are set when one of them is specified?
- [ ] Test spaCy & spacy-transformers thoroughly against this branch
The failing test is about Jupyter notebook example 01, which states:
> Some combinators work on a layer and a numeric argument. For instance, the `clone` combinator creates a number of copies of a layer, and chains them together into a deep feed-forward network. The shape inference is especially handy here: we want the first and last layers to have different shapes, so we can avoid providing any dimensions into the layer we clone. We then just have to specify the first layer's output size, and we can let the rest of the dimensions be inferred from the data.
and runs:

```python
from thinc.api import Linear, clone

# n_hidden, X and Y are defined in earlier notebook cells
model = clone(Linear(), 5)
model.layers[0].set_dim("nO", n_hidden)
model.initialize(X=X, Y=Y)
```
This network is underspecified on purpose, so we need to decide whether we still want to support this (as we used to) or not (as this PR proposes). After reflecting on it, I think the original code is not good practice: Y's output dimension gets propagated to all the intermediate `Linear` layers, which results in a rather atypical and probably not very useful architecture.
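As a sketch, the dims of the cloned network can be inspected like this; per the description above, every layer after the first ends up with `nO` equal to Y's width rather than a genuine hidden width:

```python
# Inspect the dims the old inference produces for the cloned network.
for i, layer in enumerate(model.layers):
    nI = layer.get_dim("nI") if layer.has_dim("nI") else None
    nO = layer.get_dim("nO") if layer.has_dim("nO") else None
    print(i, nI, nO)
```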