
[speculative] What does it look like to use StatsModels with a neural net?

oxinabox opened this issue 6 years ago • 4 comments

This is something we've talked about a bit; I'm now opening an issue to collect those thoughts. At some point this might become a package. We are going to think about using it with Flux, just for the sake of having a concrete example.

  • Input and output should both be matrices, even if there is only 1 column in the output. #110
  • Observations should be in the last dimension: https://github.com/JuliaStats/StatsModels.jl/issues/115
  • a&b, a+b, and a*b should all be treated the same, since all terms can interact in NNs
  • Categorical variables should be given full dummy coding, i.e. one-hot. Ideally they should end up as Flux.onehot vectors, which turn multiplication with the weight matrices into indexing operations.

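As a rough illustration of the first, second, and fourth points with today's APIs, here is a sketch (assuming StatsModels and DataFrames are available; the column names are made up) of producing a Flux-ready design matrix with one-hot categoricals and observations in the last dimension:

```julia
# Sketch: feed StatsModels output to a Flux-style model.
# Assumes StatsModels.jl and DataFrames.jl; `y`, `x`, `g` are illustrative columns.
using StatsModels, DataFrames

df = DataFrame(y = rand(6), x = rand(6),
               g = repeat(["a", "b", "c"], 2))

f = @formula(y ~ x + g)

# Hint full dummy (one-hot) coding for the categorical column,
# instead of the default of dropping a reference level:
sch = schema(f, df, Dict(:g => StatsModels.FullDummyCoding()))
y, X = modelcols(apply_schema(f, sch), df)

# StatsModels puts observations in rows; Flux wants them in the
# last dimension (columns), so transpose before feeding a model:
Xt = permutedims(X)   # size (features, observations)
```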
What else?

oxinabox avatar Jun 13 '19 15:06 oxinabox

So here are some random thoughts I've had. Quality varies, so take these as purely speculative at this point (not necessarily strong suggestions).

I think it could open up a lot of opportunities if formulas could chain terms together in various ways, similar to how Turing.jl allows a series of equations using @model.

So the classic Chain(Conv(...), Conv(...),...) could also be written as a series of equations:

```julia
@formula begin
    layer1 ~ Conv(training_images, ...)
    layer2 ~ Conv(layer1, ...)
end
```

I know this is more verbose, but it also leaves room for several interesting ways of intuitively customizing other aspects of a neural network.

Control over kernel weights

The following syntax is probably not ideal, but with formulas you could precondition weights on certain distributions.

```julia
@formula begin
    layer1_kernel_weights ~ Kernel(Normal(), (2, 2), channel1 => channel2)
    layer1 ~ Conv(training_images, layer1_kernel_weights)
    layer2 ~ Conv(layer1, ...)
end
```

Treating the kernels as statistical weights, and a neural network layer as a traditional statistical model, would also make very simple transfer learning possible, because you could just take another model's weights: `m2 = @formula(newlayer ~ Conv(training_images, weights(m1)))`.

Simple topology manipulation

Because formulas define relationships between symbols, rather than just chaining layers together, you can easily do a bunch of topological manipulation without new custom layers. So a DenseNet could be something like:

```julia
@formula begin
    layer1 ~ training_variables
    layer2 ~ layer1
    layer3 ~ layer1 + layer2
end
```

This is nice because it doesn't require any novel syntax to implement, and the concatenation uses the same syntax you'd typically see in a formula.
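For concreteness, here is one way the `layer3 ~ layer1 + layer2` concatenation semantics might lower to existing Flux building blocks, assuming `+` in the formula means feature concatenation (as in DenseNet); the layer sizes and names are illustrative only:

```julia
# Sketch: DenseNet-style concatenation with plain Flux constructs.
# Assumes Flux.jl; sizes are arbitrary for illustration.
using Flux

layer1 = Dense(8 => 16, relu)
layer2 = Dense(16 => 16, relu)

# SkipConnection(l, vcat) computes vcat(l(h), h), i.e. it concatenates
# a layer's output with that layer's input along the feature dimension:
densenetish = Chain(layer1, SkipConnection(layer2, vcat))

x = rand(Float32, 8, 4)   # (features, observations)
densenetish(x)            # size (32, 4): layer2's output stacked on layer1's
```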

Per-layer loss functions and gradient updates

Again, this syntax is in no way polished, but being able to easily specify per-layer loss functions would be pretty interesting.

```julia
@formula(ŷ ~ x | abs(ŷ - mean(y)))
```

Maybe a gradient update could be specified with something like Δ to choose the optimiser for backpropagation.

```julia
@formula(y ~ x Δ ADAM)
```

Tokazama avatar Jun 14 '19 11:06 Tokazama

@Tokazama that is a cool idea, but I think it is outside the scope of StatsModels.jl. StatsModels' @formula is a DSL for feature engineering, whereas that is a (nifty) DSL for model definition. Seriously, don't stop with this line of thought, but I don't think it belongs in this package. (New package: NNModels.jl? EndToEndModels.jl?)

oxinabox avatar Jun 14 '19 11:06 oxinabox

Those are some very cool ideas. I tend to think of a formula as specifying a single many-to-many transformation, so chaining a bunch of those together to specify a NN topology certainly makes sense to me. I agree with @oxinabox that the specifics seem out of scope for StatsModels, though; what's NOT out of scope is making sure that we don't build in overly restrictive assumptions about how the abstractions we have here are going to be used in other packages.

kleinschmidt avatar Jun 14 '19 17:06 kleinschmidt

To @oxinabox's original questions: the interaction stuff can be handled at the apply_schema stage (using the third argument for the model context), as can making one-hot encoding the default.
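For anyone unfamiliar with that extension point, a rough sketch of what dispatching on the model context could look like (the `NNModel` type is hypothetical, not anything that exists today):

```julia
# Sketch: dispatch apply_schema on a hypothetical NN model context.
# Assumes StatsModels.jl; `NNModel` is an illustrative placeholder type.
using StatsModels

abstract type NNModel <: StatisticalModel end

# Treat an interaction `a & b` the same as the main effects `a + b` in
# an NN context, since the network itself can model interactions:
function StatsModels.apply_schema(t::InteractionTerm, sch::StatsModels.Schema,
                                  Mod::Type{<:NNModel})
    # Returns a tuple of terms, i.e. the constituent main effects:
    apply_schema.(t.terms, Ref(sch), Ref(Mod))
end
```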

The "always matrix" and obs-dims stuff is more complicated, and I think is part of the general problem of allowing formula consumers control over the destination container (e.g., sparse matrix, GPU array, row/column major).

kleinschmidt avatar Jun 14 '19 17:06 kleinschmidt