Enhancement - Antlr5 - roadmap / explorations / machine translation using neural nets
I've been reviewing TensorFlow recently and wanted to share this idea. It is most likely not suitable for this repo, but I'm not sure of a better place to log it.
In essence, imagine one day translating code from one language to another as simply as using Google Translate. This is actually quite doable today using TensorFlow + SyntaxNet.
What follows is one possible approach, although more recent developments in DRAGNN could probably be factored in. DRAGNN at its core builds its own grammar files. (I haven't fully got my head around how this could be seeded.)
Consider this: given a grammar file,
- somehow / programmatically generate valid code in multiple languages (or use precanned examples from Wikipedia);
- parse the code into an AST in a "super language" (e.g. Swift);
- train the net on these AST representations: given this AST in one language, the corresponding AST in another language is...;
- use a DCGAN to forge valid / compilable code.
Consider that this trained model would be thrown away each month with competing models.
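The "programmatically generate valid code" step can be made concrete: given a grammar, random sentences can be sampled from it. A minimal Python sketch with an invented toy grammar (not tied to any ANTLR API):

```python
import random

# A toy context-free grammar: each nonterminal maps to a list of
# alternatives; each alternative is a sequence of symbols.
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["factor", "*", "term"], ["factor"]],
    "factor": [["(", "expr", ")"], ["NUM"]],
}

def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a token list; past max_depth, always pick
    the shortest alternative so the derivation terminates."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [str(rng.randint(0, 9))] if symbol == "NUM" else [symbol]
    alts = GRAMMAR[symbol]
    alt = min(alts, key=len) if depth >= max_depth else rng.choice(alts)
    tokens = []
    for sym in alt:
        tokens.extend(generate(sym, rng, depth + 1, max_depth))
    return tokens

rng = random.Random(42)
print(" ".join(generate("expr", rng)))  # a random but syntactically valid expression
```

Every sampled sentence is valid by construction, so arbitrarily large parallel training corpora could be produced this way.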
What do you find missing in ANTLR 4 for implementing such a translator?
In all honesty, it's an unknown. Tackling this problem alone is not my intention, but consider having the support of industry to help. Perhaps a competition to translate code would be appropriate; that would flesh out the problem and find out what's missing. It could be sponsored by IBM / Google / Microsoft / NVIDIA / Intel.
The training data is important, but being able to programmatically generate code may be critical for feeding back into training.
Is there currently a tool that can take a CFG for some language X and generate a code generator that translates an AST produced by ANTLR 4 into language X?
Or can the code-generation part only be written by hand?
We (@PositiveTechnologies) use a unified AST (UST) in our open-source pattern matching engine PT.PM. The UST is obtained by converting the ANTLR parse tree, which in turn comes from the parser and thus from the grammar.
We are also developing a new proprietary engine for analyzing data flows on the unified AST. For this, the UST is converted to a CFG, to a PDG, and to a combined representation, the CPG (UST + CFG + PDG).
So, you can use the first project as a base for unified CFG.
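To illustrate the conversion idea (this is a sketch, not PT.PM's actual code or node taxonomy): language-specific parse-tree node kinds are mapped onto a smaller set of unified kinds, so different grammars converge on one representation.

```python
from dataclasses import dataclass, field

@dataclass
class UstNode:
    kind: str                  # unified kind, e.g. "InvocationExpression"
    text: str = ""
    children: list = field(default_factory=list)

# Language-specific parse-tree rule name -> unified kind.
# These mappings are illustrative, not PT.PM's actual tables.
KIND_MAP = {
    "functionCall": "InvocationExpression",      # e.g. from one grammar
    "methodInvocation": "InvocationExpression",  # same concept, another grammar
    "identifier": "IdToken",
}

def to_ust(rule_name, text="", children=()):
    """Map one language-specific parse-tree node to a unified node."""
    node = UstNode(KIND_MAP.get(rule_name, rule_name), text)
    node.children = [to_ust(*c) for c in children]
    return node

# Call nodes from two different grammars converge on one unified kind:
a = to_ust("functionCall", children=[("identifier", "foo")])
b = to_ust("methodInvocation", children=[("identifier", "foo")])
print(a.kind, b.kind)
```

Downstream analyses (CFG, PDG, CPG construction) then only need to understand the unified kinds, not every source grammar.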
updates in this space https://github.com/src-d/code2vec - fyi @zurk
Hello, guys. Yes, we are implementing this article: https://arxiv.org/pdf/1803.09473.pdf using our own tooling, such as https://github.com/src-d/ml (for machine learning on source code) and https://github.com/bblfsh/bblfshd/ to get a Universal AST, so that we can work with all languages in the same way.
As for a Google Translate for code, it is a really cool idea and can work in some cases, but it has a lot of hidden pitfalls. For example, many languages have a sort function, and in some of them NaN values end up at the beginning of the list, in others at the end. That alone can change your program's behavior completely. Ok, have fun :)
P.S.: You guys have a really good tool!
Some work by Pengcheng Yin @pcyin & Graham Neubig
A Syntactic Neural Model for General-Purpose Code Generation https://arxiv.org/abs/1704.01696 We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.
This paper proposes a syntax-driven neural code generation approach that generates an abstract syntax tree by sequentially applying actions from a grammar model.
https://github.com/pcyin/NL2code
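The paper's core trick can be sketched in a few lines: instead of emitting raw tokens, the model emits grammar actions, so the output is well-formed by construction. A toy Python sketch (the grammar, rule names, and action format are invented for illustration):

```python
# APPLY_RULE expands a nonterminal with a production; GEN_TOKEN emits a
# terminal predicted by the model. This toy grammar has one production.
RULES = {
    "r_call": ("expr", ["name", "(", "arg", ")"]),  # expr -> name ( arg )
}
NONTERMINALS = {"expr", "name", "arg"}

def decode(actions):
    """Replay an action sequence as a leftmost derivation; by
    construction the result is always syntactically well-formed."""
    stack, out, actions = ["expr"], [], list(actions)
    while stack:
        sym = stack.pop(0)
        if sym not in NONTERMINALS:
            out.append(sym)               # grammar literal, emit directly
            continue
        kind, payload = actions.pop(0)
        if kind == "APPLY_RULE":
            _head, body = RULES[payload]
            stack = body + stack          # expand leftmost nonterminal
        else:                             # GEN_TOKEN
            out.append(payload)
    return out

actions = [("APPLY_RULE", "r_call"),
           ("GEN_TOKEN", "sort"),   # fills `name`
           ("GEN_TOKEN", "xs")]     # fills `arg`
print(" ".join(decode(actions)))    # sort ( xs )
```

In the paper the action sequence is predicted by a neural network conditioned on the natural-language description; here it is hard-coded to show the mechanism.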
@RaphaelOlivier - also noteworthy. https://github.com/RaphaelOlivier/sempar-codgen
Contributions by @sriniiyer (code + paper): Summarizing Source Code using a Neural Attention Model (CODENN) https://github.com/sriniiyer/codenn https://github.com/sriniiyer/codenn/blob/master/summarizing_source_code.pdf
I've also been investigating this possibility. Just came across this repo: https://github.com/pcyin/tranX
related - natural language to executable code https://github.com/pcyin/NL2code https://arxiv.org/abs/1704.01696
fyi @pcyin / @neubig
UPDATE - https://github.com/github/CodeSearchNet
- backward access
- reference subtyped rules (#-labeled rules)
- rules as sets, supporting UNION, EXCLUDE and other operators
@inshua Could you elaborate on these?
- backward access
- reference subtyped rules (#-labeled rules)

For example, VB supports this syntax:
For i = 1 To 10
For j = 1 To 10
...
Next j, i ' close both j and i
I have solved it, but if ANTLR supported backward references it would be better:
nextStmt:
      NEXT                                 # OnlyNext
    | NEXT identifier                      # NextId
    | nextStmt#NextId(-1) identifier       # NextIdMore   // the backward reference is a NextId
    | nextStmt#NextIdMore(-1) identifier   # NextIdMore2
    ;
Here I also show referencing subtyped rules: nextStmt#NextId and nextStmt#NextIdMore.
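For reference, the `Next j, i` case can also be handled today without backward references, by parsing `NEXT identifier (',' identifier)*` generically and checking the nesting semantically, e.g. in a listener that keeps a stack of open loop variables. A language-agnostic sketch (hypothetical helper, not an actual ANTLR listener):

```python
# Semantic check for VB-style `Next j, i`: verify the listed variables
# close loops in innermost-first order.
def check_next(loop_stack, next_ids):
    """loop_stack: open loop variables, innermost last.
    next_ids: identifiers on the Next statement, as written."""
    for ident in next_ids:
        if not loop_stack or loop_stack[-1] != ident:
            raise SyntaxError(f"Next {ident} does not match an open loop")
        loop_stack.pop()        # this loop is now closed

stack = ["i", "j"]              # For i ... For j ...
check_next(stack, ["j", "i"])   # Next j, i  -> closes both
print(stack)                    # []
```

The grammar stays simple and the ordering constraint moves into a semantic pass, which is the usual workaround in ANTLR today.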
- rules as sets, supporting UNION, EXCLUDE

We can write a rule as

rule : (rule1 | rule2);

That's good, but if we treat a rule as a set, this is just equivalent to rule1 UNION rule2; we should support rule1 SUBTRACT rule2 and rule1 AND rule2 too. e.g.
// wrong rules, just for a presentation
multi: '*' | '/';
op : '+' | '-' | multi;
multiExpr: expr multi expr;
expr: multiExpr (op - multi) multiExpr; // `op - multi` yields '+' and '-'
And rule1 AND rule2 would be very useful too.
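For what it's worth, ANTLR 4's only built-in set operator is `~` (complement) on token sets; subtracting one parser rule from another has to be expanded by hand. A sketch of the manual workaround for the example above (rule names are invented, and it mirrors the intentionally simplified rules):

```antlr
multi     : '*' | '/' ;
addSubOp  : '+' | '-' ;            // written out by hand: op minus multi
op        : addSubOp | multi ;     // the full set, still available
multiExpr : expr multi expr ;
expr      : multiExpr addSubOp multiExpr ;
```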
rule as set, support UNION, EXCLUDE

Not sure such functionality should be integrated into ANTLR. It's outside EBNF, and it's hard to imagine cases where it's required.
Yes, it's very useful, like subtyped rules. I'll post more cases as I encounter new ones.