ProvingGround
Deep learning: why, how
Optimization vs deep learning
- We can optimize generation when it is expressed in terms of known elements and equations.
- We can extend this by smoothing goals, for example.
- However, we can instead try to train deep learning models:
  - directly from goals
  - possibly relative to givens, for which we can plug in discovered lemmas.
One may also want deep learning just for speed, as (smoothing and) optimization can take a long time.
Why representation learning
- A form of representation learning is just passing through a low-dimensional layer, perhaps straightaway (see the sketch after this list).
- However, the point of explicit representation learning is that we can use synthetic problems, which are easier to set up with "correct" answers.
- In our situation, since we want embeddings of distributions, our representations may not be a single embedding layer.
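
A minimal sketch of the first option, a low-dimensional bottleneck layer: the names and the plain linear layers below are hypothetical, not the ProvingGround implementation; training such an encoder/decoder pair on synthetic problems with known answers is one way to set up explicit representation learning.

```scala
import scala.util.Random

// Hypothetical sketch: a linear bottleneck, not the ProvingGround implementation.
object BottleneckSketch {
  type Vec = Vector[Double]

  // Dense linear layer given a weight matrix (rows = output dimension).
  def dense(weights: Vector[Vec], input: Vec): Vec =
    weights.map(row => row.zip(input).map { case (w, x) => w * x }.sum)

  def randomMatrix(rows: Int, cols: Int, rnd: Random): Vector[Vec] =
    Vector.fill(rows)(Vector.fill(cols)(rnd.nextGaussian() * 0.1))

  def main(args: Array[String]): Unit = {
    val rnd     = new Random(0)
    val highDim = 128 // dimension of raw features
    val lowDim  = 8   // the low-dimensional representation layer

    val encode = randomMatrix(lowDim, highDim, rnd)
    val decode = randomMatrix(highDim, lowDim, rnd)

    val features       = Vector.fill(highDim)(rnd.nextGaussian())
    val representation = dense(encode, features)       // low-dimensional embedding
    val reconstruction = dense(decode, representation) // used to define a training loss

    println(s"representation dimension: ${representation.size}")
  }
}
```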
Transformer architecture, position data
Many useful ideas are in the BERT work and the work building on it. Adapting these to our situation:
- We can use position vectors rather than linearizing our generation trees (see the sketch after this list).
- We are trying to predict the terms on the RHS of equations, which may be either nodes or built up by substitution in nodes.
- In the longer run we learn to use attention, based on both positions and actual terms.
- Using terms: a crucial difference in our case is that our model should allow using a term from the statement as-is, without necessarily recognizing it.
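
As a sketch of using position vectors instead of linearizing generation trees, the code below (hypothetical names, not the ProvingGround API) records each term together with its root-to-node path; these paths can then be embedded and fed to attention alongside the terms themselves.

```scala
// Hypothetical sketch: positions as root-to-node paths in a generation tree.
object PositionSketch {
  sealed trait GenTree
  case class Leaf(term: String)                             extends GenTree
  case class Node(term: String, children: Vector[GenTree])  extends GenTree

  // Collect every term together with its position vector (indices of children
  // followed from the root), so attention can use positions without linearizing.
  def withPositions(tree: GenTree, path: Vector[Int] = Vector()): Vector[(String, Vector[Int])] =
    tree match {
      case Leaf(t) => Vector((t, path))
      case Node(t, children) =>
        (t, path) +: children.zipWithIndex.flatMap {
          case (child, i) => withPositions(child, path :+ i)
        }
    }

  def main(args: Array[String]): Unit = {
    val tree = Node("f(a, g(b))", Vector(Node("g(b)", Vector(Leaf("b"))), Leaf("a")))
    withPositions(tree).foreach { case (term, pos) => println(s"$term at position $pos") }
  }
}
```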
Base level: Premise selection
- For us, a premise is a term on the right-hand side of an equation, perhaps obtained after substitution (resolving).
- Since we resolve to different depths, we have premises at different depths.
- Similarly, the ingredients of the statement are terms and types at different depths.
- Crucially, we can use ingredients of the statement in the proof as they are, so our model includes probabilities for this.
- Further, we predict not just individual premises but groups that go together (a sketch of the output shape follows this list).
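
A sketch of the shape of the premise-selection output under the assumptions above (all names hypothetical, not the ProvingGround model): candidates may be resolved to different depths or taken as-is from the statement, and raw scores from a model are normalized into probabilities; grouped premises would be scored jointly rather than independently.

```scala
// Hypothetical sketch of premise-selection output, not the ProvingGround model.
object PremiseSelectionSketch {
  // A candidate premise: either resolved from equations to some depth,
  // or an ingredient of the statement used as-is.
  case class Candidate(term: String, depth: Int, fromStatement: Boolean)

  // Softmax turns raw scores into a probability distribution over candidates.
  def softmax(scores: Vector[Double]): Vector[Double] = {
    val exps  = scores.map(s => math.exp(s - scores.max))
    val total = exps.sum
    exps.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    val candidates = Vector(
      Candidate("lemma1", depth = 1, fromStatement = false),
      Candidate("hypothesis", depth = 0, fromStatement = true), // used as-is
      Candidate("lemma2(x)", depth = 2, fromStatement = false)
    )
    // Scores would come from the learned model; here they are placeholders.
    val scores = Vector(0.5, 1.2, -0.3)
    candidates.zip(softmax(scores)).foreach {
      case (c, p) => println(f"${c.term}%-12s p=$p%.3f (fromStatement=${c.fromStatement})")
    }
  }
}
```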
Refined model
- The data includes positions for terms, and these are used as inputs.
- We use multi-head attention.
- The output should include not just premises but also predicted positions.
- It is better not to view the output as a translation (even one with predicted positions).
- Instead, we view it as similar to a game, with an outcome (score) and a value; the moves are given by a policy.
- A basic score is obtained by comparing the weights of terms and types; if we have formal equations, everything has a positive weight.
- Unlike a game, we can make multiple moves - indeed the Bot framework is exactly for this.
- To take advantage of this, one should have multiple rewards, for example rewarding both
  - gathering: proving a lot of results.
  - hunting: proving hard results.
- This may, for example, lead to lemmas being chosen based on non-triviality rather than just on having simple statements.
- We have a further choice, for example in our base versus tangent split as well as in base-mixin: some terms are used to generate recursively, while other terms are used only in combination with those generated from the base.
- Outcome: We can measure outcomes by efficiency of proving - the (weighted) fraction of types that are proved/disproved for a given number of terms (with various cutoffs); see the sketch after this list.
- Moves may not lead to an immediate improvement in outcome, but we can look a few moves ahead.
- Eventually we can have policy and value networks; to start with we carry out all moves but with resources allocated based on weights, and look directly at outcomes.
- Layered outcomes: We can give extra credit for types not proved/disproved with an earlier collection of terms.
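
A minimal sketch of the outcome measure described above (types, weights and the novelty bonus are hypothetical placeholders, not the actual scoring code): the weighted fraction of target types that are proved/disproved within a budget of terms, with extra credit for types not settled by an earlier collection of terms.

```scala
// Hypothetical sketch of an outcome measure, not the ProvingGround implementation.
object OutcomeSketch {
  case class Target(tp: String, weight: Double)

  // Weighted fraction of targets proved/disproved, with a bonus for targets
  // that were not already settled by an earlier collection of terms.
  def outcome(
      targets: Vector[Target],
      settledNow: Set[String],
      settledEarlier: Set[String],
      noveltyBonus: Double = 0.5
  ): Double = {
    val totalWeight = targets.map(_.weight).sum
    val credit = targets.collect {
      case Target(tp, w) if settledNow.contains(tp) =>
        if (settledEarlier.contains(tp)) w else w * (1.0 + noveltyBonus)
    }.sum
    credit / totalWeight
  }

  def main(args: Array[String]): Unit = {
    val targets = Vector(Target("A", 1.0), Target("B", 2.0), Target("C", 1.0))
    val score   = outcome(targets, settledNow = Set("A", "B"), settledEarlier = Set("A"))
    println(f"outcome: $score%.3f") // B gets extra credit since it is newly settled
  }
}
```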
Predicting outcomes, not moves
- Rather than trying to learn good moves directly, we should try to predict outcomes of our moves, then search for the best moves (see the sketch after this list).
- This is similar to games, where we learn the value of a position, which is the expected outcome.
- We should also learn to predict other auxiliary quantities.
- In games, we do predict moves to some extent, which gives our policy; this is used in non-monotonic searches. We may need similar learning here too.
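
A minimal sketch (hypothetical names, with a placeholder in place of a learned value network) of searching over predicted outcomes: each available move is scored by the predicted value of the resulting state, and the best one is chosen, rather than predicting the move directly.

```scala
// Hypothetical sketch: choosing moves by predicted outcome (value), not by a move model.
object ValueSearchSketch {
  case class State(terms: Set[String])
  case class Move(name: String, step: State => State)

  // In practice this would be a learned value network predicting the outcome measure.
  def predictedValue(state: State): Double = state.terms.size.toDouble // placeholder

  // Score each move by the value of the state it leads to and pick the best.
  def bestMove(state: State, moves: Vector[Move]): Move =
    moves.maxBy(m => predictedValue(m.step(state)))

  def main(args: Array[String]): Unit = {
    val state = State(Set("a"))
    val moves = Vector(
      Move("add-b", s => s.copy(terms = s.terms + "b")),
      Move("noop", s => s)
    )
    println(s"chosen move: ${bestMove(state, moves).name}")
  }
}
```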