antlr4

Enhancement - Antlr5 - roadmap / explorations / machine translation using neural nets

Open johndpope opened this issue 7 years ago • 15 comments

I've been reviewing TensorFlow recently and wanted to share this idea. It is most likely not suitable for this repo, but I'm not sure of a better place to log it.

In essence, think of one day translating code from one language to another as simply as using Google Translate. This is actually very doable today using TensorFlow + SyntaxNet.

What follows is one possible approach, although more recent developments in DRAGNN could probably be factored in. DRAGNN at its core builds its own grammar files. (I haven't fully got my head around how this could be seeded.)

Consider that, given a grammar file:

  • Somehow programmatically generate valid code in multiple languages (or use precanned samples from Wikipedia).
  • Parse the AST in a super language (Swift).
  • Train the net on these AST representation(s): given this AST in this language, the corresponding AST in this other language is...
  • Use a DCGAN to forge valid, compilable code.
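To make the "programmatically generate valid code from a grammar" step concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy arithmetic grammar, the `GRAMMAR` dict layout, and the `generate` / `generate_source` helpers are hypothetical and are not part of ANTLR or of any tool mentioned in this thread. It simply expands rules at random, with a depth limit so that expansion terminates.

```python
import random

# Hypothetical toy grammar (not an ANTLR grammar): each nonterminal maps to a
# list of alternatives, and each alternative is a sequence of symbols.
# Any symbol that is not a key of GRAMMAR is treated as a terminal token.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"], ["term", "-", "expr"]],
    "term": [["factor"], ["factor", "*", "term"]],
    "factor": [["NUM"], ["(", "expr", ")"]],
    "NUM": [["1"], ["2"], ["3"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a flat list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # The first alternative of every rule above is non-recursive,
        # so forcing it past the depth limit guarantees termination.
        alternative = alternatives[0]
    else:
        alternative = rng.choice(alternatives)
    tokens = []
    for part in alternative:
        tokens.extend(generate(part, rng, depth + 1, max_depth))
    return tokens


def generate_source(seed):
    """Return one random, syntactically valid expression as text."""
    rng = random.Random(seed)  # seeded so samples are reproducible
    return " ".join(generate("expr", rng))


if __name__ == "__main__":
    for seed in range(3):
        print(generate_source(seed))
```

A real grammar would need a smarter strategy for picking terminating alternatives, and syntactically valid output is not necessarily compilable or meaningful, which is presumably where the training and feedback steps come in.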

Consider that this trained model would be thrown away each month with competing models.

johndpope avatar May 10 '17 21:05 johndpope

What do you miss in ANTLR 4 for implementing such a translator?

KvanTTT avatar May 13 '17 20:05 KvanTTT

In all honesty, it's an unknown. Tackling this problem alone is not my intention, but consider having industry support to help. Perhaps a competition to translate code would be appropriate; it would flesh out the problem and reveal what's missing. It could be sponsored by IBM / Google / Microsoft / NVIDIA / Intel.

The training data is important, but being able to programmatically generate code may be critical for feeding back into training.

johndpope avatar May 14 '17 02:05 johndpope

Is there currently a tool that lets you write a CFG for some language X and then generates a code generator that can translate an AST produced by ANTLR 4 into language X?

Or can the code generation part only be written by hand?

linonetwo avatar May 16 '17 12:05 linonetwo

We (@PositiveTechnologies) use a unified AST (UST) in our open source pattern matching engine PT.PM. The UST is obtained by converting an ANTLR parse tree, which comes from the parser and thus from the grammar.

We are also developing a new proprietary engine for analyzing data flows on the unified AST. For this, the UST is converted to a CFG, to a PDG, and to a combined representation, the CPG (UST + CFG + PDG).

So, you can use the first project as a base for a unified CFG.

KvanTTT avatar May 16 '17 14:05 KvanTTT

Updates in this space: https://github.com/src-d/code2vec (FYI @zurk)

johndpope avatar Jul 10 '18 14:07 johndpope

Hello, guys. Yes, we are implementing this paper: https://arxiv.org/pdf/1803.09473.pdf using our own tooling, such as https://github.com/src-d/ml (for machine learning on source code) and https://github.com/bblfsh/bblfshd/ to get a Universal AST so that we can work with all languages in the same way.

As for a Google Translate for code: it is a really cool idea and may be possible in some cases, but it has a lot of hidden pitfalls. For example, many languages have a sort function, and in some of them NaN values end up at the beginning of the list, in others at the end. OK, have fun :) It can change your program's behavior completely.
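A small Python sketch of the NaN pitfall, under stated assumptions: `sort_nan_first` and `sort_nan_last` are hypothetical helpers that stand in for the sort conventions of two different languages; they do not reproduce any particular language's built-in sort. The point is only that code taking "the first element after sorting" silently changes meaning when the convention changes.

```python
import math


def sort_nan_first(values):
    """Sort floats ascending, placing NaN values at the beginning."""
    # The key's first component groups NaNs (0) ahead of real numbers (1).
    return sorted(values, key=lambda v: (0, 0.0) if math.isnan(v) else (1, v))


def sort_nan_last(values):
    """Sort floats ascending, placing NaN values at the end."""
    return sorted(values, key=lambda v: (1, 0.0) if math.isnan(v) else (0, v))


def smallest_after_sort(sort_fn, values):
    """Typical user code: 'the minimum is the first element after sorting'."""
    return sort_fn(values)[0]


if __name__ == "__main__":
    data = [3.0, float("nan"), 1.0, 2.0]
    print(smallest_after_sort(sort_nan_first, data))
    print(smallest_after_sort(sort_nan_last, data))
```

With the same input, `smallest_after_sort` yields NaN under one convention and the true minimum under the other, so a line-by-line translation that maps one `sort` onto another can be syntactically perfect and still wrong.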

P.S.: You guys have a really good tool!

zurk avatar Jul 12 '18 11:07 zurk

Some work by Pengcheng Yin @pcyin & Graham Neubig

A Syntactic Neural Model for General-Purpose Code Generation
https://arxiv.org/abs/1704.01696

We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.

This paper proposes a syntax-driven neural code generation approach that generates an abstract syntax tree by sequentially applying actions from a grammar model.

https://github.com/pcyin/NL2code

@RaphaelOlivier - also noteworthy. https://github.com/RaphaelOlivier/sempar-codgen

johndpope avatar Sep 18 '18 15:09 johndpope

Contributions by @sriniiyer (code + paper): Summarizing Source Code using a Neural Attention Model (CODENN)
https://github.com/sriniiyer/codenn
https://github.com/sriniiyer/codenn/blob/master/summarizing_source_code.pdf

johndpope avatar Sep 27 '18 16:09 johndpope

I've also been investigating this possibility. Just came across this repo: https://github.com/pcyin/tranX

bitnom avatar May 15 '19 01:05 bitnom

related - natural language to executable code https://github.com/pcyin/NL2code https://arxiv.org/abs/1704.01696

fyi @pcyin / @neubig

UPDATE - https://github.com/github/CodeSearchNet

johndpope avatar Jul 08 '19 00:07 johndpope

  • backward access
  • reference subtyped rule (# marked rule)
  • rule as set, support UNION, EXCLUDE and other operators

inshua avatar Jul 13 '22 06:07 inshua

@inshua Could you elaborate on these?

KvanTTT avatar Jul 13 '22 10:07 KvanTTT

@inshua Could you elaborate on these?

  • backward access
  • reference subtyped rule (# marked rule)

For example, VB supports this syntax:

For i = 1 To 10 
  For j = 1 To 10
    ...
Next j, i  ' close both j and i

I have solved it, but it would be better if ANTLR supported backward references:

nextStmt:
    NEXT    # OnlyNext
    | NEXT identifier   #NextId
    | nextStmt#NextId(-1) identifier  # NextIdMore   // backward1 is NextId
    | nextStmt#NextIdMore(-1) identifier # NextIdMore2
;

Here I also show references to subtyped rules: nextStmt#NextId and nextStmt#NextIdMore.
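For comparison, one way to handle `Next j, i` without any grammar extension is to parse the identifier list loosely and check it against the enclosing For loops in a separate pass. The Python sketch below only illustrates that idea; it is an assumption on my part, not how the problem was actually solved above, and `close_loops` is a hypothetical helper rather than an ANTLR API.

```python
def close_loops(open_loops, next_stmt):
    """Return the loops still open after applying one VB `Next` statement.

    open_loops: loop variables of the enclosing For loops, outermost first,
                e.g. ["i", "j"].
    next_stmt:  source text such as "Next", "Next j" or "Next j, i".
    Raises ValueError when the statement does not match the open loops.
    """
    # Drop a trailing VB comment (everything after the first apostrophe).
    text = next_stmt.split("'", 1)[0].strip()
    head, _, rest = text.partition(" ")
    if head.lower() != "next":
        raise ValueError(f"not a Next statement: {next_stmt!r}")

    names = [name.strip() for name in rest.split(",") if name.strip()]
    remaining = list(open_loops)

    if not names:
        # A bare `Next` closes only the innermost loop.
        if not remaining:
            raise ValueError("Next without a matching For")
        remaining.pop()
        return remaining

    # `Next j, i` closes loops from the innermost outwards, in listed order.
    for name in names:
        if not remaining or remaining[-1].lower() != name.lower():
            raise ValueError(f"Next {name} does not match the innermost For")
        remaining.pop()
    return remaining


if __name__ == "__main__":
    print(close_loops(["i", "j"], "Next j, i  ' close both j and i"))
```

The trade-off is that the grammar stays plain EBNF while the "which loop does this identifier close" question moves into a listener or visitor, which is where backward references would otherwise be needed.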

  • rule as set, support UNION, EXCLUDE

We can write a rule as

rule : (rule1 | rule2);

That's fine, but if we treat a rule as a set, this is just equivalent to rule1 UNION rule2. We should also support rule1 SUBTRACT rule2 and rule1 AND rule2, e.g.

// not valid rules, just for illustration
multi: '*' | '/';
op : '+' | '-' | multi;
multiExpr: expr multi expr;
expr: multiExpr (op - multi) multiExpr;   // `op - multi` yields '+' '-'

rule1 AND rule2 would be very useful too.
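For rules that are just alternatives of single tokens, the proposed operators behave like ordinary set operations. The following Python sketch shows that reading; the names `MULTI`, `OP`, `union`, `subtract` and `intersect` are hypothetical and only mirror the example rules above, not any existing ANTLR feature.

```python
# Token-only rules modelled as sets of literals (mirrors `multi` and `op` above).
MULTI = frozenset({"*", "/"})
OP = frozenset({"+", "-"}) | MULTI


def union(a, b):
    """rule1 UNION rule2: tokens accepted by either rule."""
    return a | b


def subtract(a, b):
    """rule1 SUBTRACT rule2: tokens accepted by rule1 but not by rule2."""
    return a - b


def intersect(a, b):
    """rule1 AND rule2: tokens accepted by both rules."""
    return a & b


if __name__ == "__main__":
    print(sorted(subtract(OP, MULTI)))
    print(sorted(intersect(OP, MULTI)))
```

This only covers finite token sets. For arbitrary parser rules the picture is harder, because context-free languages are not closed under intersection or difference, so a general SUBTRACT or AND operator would go beyond what a context-free grammar can express.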

inshua avatar Jul 29 '22 03:07 inshua

rule as set, support UNION, EXCLUDE

Not sure such functionality should be integrated into ANTLR. It's out of EBNF and it's hard to imagine cases where it's required.

KvanTTT avatar Jul 29 '22 10:07 KvanTTT

rule as set, support UNION, EXCLUDE

Not sure such functionality should be integrated into ANTLR. It's out of EBNF and it's hard to imagine cases where it's required.

Yes, it's very useful, like subtyped rules. I'll post more cases as I come across them.

inshua avatar Aug 05 '22 03:08 inshua