[PROPOSAL] Source code transformations
Idea comes from https://github.com/src-d/blog/pull/233#discussion_r209758185
- Title: Source code match/traverse/transform APIs
- Author(s): Alex, ?
- Short description: An overview of existing approaches to match/traverse/transform source code
- Categories: language analysis
- Deadlines: no
Table of contents
Very rough grounds that would be covered
OSS:
- Golang: go fix/go fmt -r
- Cpp: clang-tidy
- C: coccinelle
- Java: JTransformer
- Example-based refactorings: Java: error-prone/ Golang: eg
- Python: Bowler
Proprietary/from talks or papers (material)
- ClangMR/JavacFlume (see details section below)
- Semmle QL (query only, e.g. through the non-OSS CLI)
With some basic examples of using each API, and a conclusion on why bblfsh is the best tool for us.
Management
This section will be filled by @campoy.
- State: (proposed | writing | written | published)
- Scheduled:
- Link to post:
Social Media
- Wording for tweet:
- Hashtags:
- Subreddits:
NOTE Please write in short lines so the review is easier to do.
Preliminary content comes from a previous blog post
{{% center %}} … {{% /center %}}
## Technical details
Building on the internal success story of the [ClangMR tool](https://research.google.com/pubs/pub41342.html)
for matching, traversing, and transforming C++ Abstract Syntax Trees (ASTs) at scale, similar tooling was built for Java.
{{% youtube ZpvvmvITOrk %}}
Project [Error-Prone](https://github.com/google/error-prone) is a compiler extension that can perform
arbitrary analysis on the *fully typed AST*. Note that such input cannot be obtained by using
only a parser, even one as advanced as [babelfish](https://doc.bblf.sh/): running a full build is
required in order to do things like symbol resolution. In the end, after running a number of checker plugins,
Error-Prone outputs simple text replacements with suggested code fixes.
The project is open source and is well documented in a [number](https://research.google.com/pubs/pub38275.html)
of [papers](https://research.google.com/pubs/pub41876.html). Another, closed-source tool called JavacFlume was built to scale
the application of those fixes to the whole codebase. I would guess it looks something like
an Apache Spark job that applies patches in some generic format.
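To make "simple text replacements" concrete, here is a minimal Python sketch of what such a generic patch format could look like.
The `Replacement` record and `apply_replacements` helper are hypothetical names made up for this illustration;
they are not the actual Error-Prone or JavacFlume format, which is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replacement:
    """Hypothetical patch record: replace source[start:end] with text."""
    start: int  # offset of the first character to replace
    end: int    # offset one past the last character to replace
    text: str   # replacement text


def apply_replacements(source, replacements):
    """Apply non-overlapping replacements to a source string."""
    ordered = sorted(replacements, key=lambda r: r.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start < prev.end:
            raise ValueError(f"overlapping replacements: {prev} and {cur}")
    # Apply back to front, so offsets of earlier replacements are not shifted.
    for r in reversed(ordered):
        source = source[:r.start] + r.text + source[r.end:]
    return source


patched = apply_replacements(
    "foo.Bar(1); foo.Bar(2);",
    [Replacement(4, 7, "Baz"), Replacement(16, 19, "Baz")],
)
print(patched)
```

Keeping the fixes as plain offset-based records is what would let a separate job apply them in bulk without re-running the analysis.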
Here is an example of how a full pipeline looks for C++:
{{% grid %}}
{{% caption src="https://cdn-images-1.medium.com/max/4224/1*KpJ5fj4njR1HTDfzhLCQkg.png" title="ClangMR processing pipeline illustration"%}}
“Large-Scale Automated Refactoring Using ClangMR” by
[Hyrum Wright](https://research.google.com/pubs/HyrumWright.html), Daniel Jasper, Manuel Klimek, [Chandler Carruth](https://research.google.com/pubs/ChandlerCarruth.html), Zhanyong Wan
{{% /caption %}}
{{% /grid %}}
Although it is not disclosed, an attentive reader might have noticed that the **Compilation Index** part of the
pipeline is very similar to a [Compilation Database](https://kythe.io/docs/kythe-compilation-database.html)
in the open source Kythe project.
It might be interesting to take a closer look at an example of an API for AST query and transformation for C++.
### C++ Example
> *rename all calls to Foo::Bar with 1 argument to Foo::Baz, independent of the name of the instance variable,
> or whether it is called directly or by pointer or reference*
{{% grid %}}
{{% grid-cell %}}

{{% /grid-cell %}}
{{% grid-cell %}}
This fragment will invoke a callback function on every call to *Foo::Bar* with a single argument.
{{% /grid-cell %}}
{{% /grid %}}
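As a rough analogue of such a matcher, here is a sketch using Python's standard `ast` module instead of Clang.
It is an illustration only, not the ClangMR API: the `find_single_arg_calls` helper and the sample source are made up,
and since a plain parser has no type information, it matches on the method name alone and cannot check that the receiver is a `Foo`.

```python
import ast

SOURCE = """\
foo.bar(1)
ptr.bar(x + y)
foo.bar(1, 2)
foo.bar()
other.baz(3)
"""


def find_single_arg_calls(source, method):
    """Return (line, column) of every call `<expr>.<method>(arg)` with one argument."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)  # a method-style call
            and node.func.attr == method              # with the right name
            and len(node.args) == 1                   # exactly one positional argument
            and not isinstance(node.args[0], ast.Starred)
            and not node.keywords                     # and no keyword arguments
        ):
            matches.append((node.lineno, node.col_offset))
    return sorted(matches)


matches = find_single_arg_calls(SOURCE, "bar")
print(matches)
```

The match is independent of the receiver's name (`foo` or `ptr`), which is the property the quoted task asks for.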
{{% grid %}}
{{% grid-cell %}}

{{% /grid-cell %}}
{{% grid-cell %}}
This callback will generate a code transformation: for the matched nodes it will replace the matching text of
the function name with "Baz".
Regarding code transformations in Java, **Error-Prone** has a similar low-level [patching API](http://errorprone.info/docs/patching)
that is very close to the native AST manipulation API. It was found to have a steep learning curve, similar to
Clang's, and thus poses a high entry barrier: even an experienced engineer would need a few weeks before becoming
productive at creating fix suggestions or refactorings.
{{% /grid-cell %}}
{{% /grid %}}
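The same "replace only the text of the function name" idea can be sketched in Python with the standard `ast` module (Python 3.8+ for end positions).
This is an analogue written for illustration, not the ClangMR callback itself: `rename_method_calls` is a made-up helper,
it matches on the method name only, and it assumes ASCII source so that column offsets equal character offsets.

```python
import ast


def rename_method_calls(source, old, new):
    """Rewrite `<expr>.old(arg)` to `<expr>.new(arg)`, touching only the name's text."""
    lines = source.splitlines(keepends=True)
    edits = []  # (line index, start column, end column) of each matched name
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == old
            and len(node.args) == 1
            and not node.keywords
        ):
            # An Attribute node ends where its attribute name ends.
            end = node.func.end_col_offset
            edits.append((node.func.end_lineno - 1, end - len(old), end))
    # Apply right to left, so earlier columns on the same line stay valid.
    for row, start, end in sorted(edits, reverse=True):
        lines[row] = lines[row][:start] + new + lines[row][end:]
    return "".join(lines)


print(rename_method_calls("foo.bar(1)\nptr.bar(x + y)\nfoo.bar(1, 2)\n", "bar", "baz"))
```

Editing the original text at recorded positions, rather than re-printing the tree, keeps formatting and comments outside the renamed name untouched.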
That is why a higher level API was built for Java: first as the separate [Refaster](https://research.google.com/pubs/pub41876.html)
project and then [integrated into Error-Prone](http://errorprone.info/docs/refaster) later.
So a usual workflow would look like this: after running all the checks and emitting a collection of suggested
fixes, shard the diffs into smaller patches, run all the tests over the changes and, if they pass, submit
the patches for code review.
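A minimal Python sketch of that workflow follows.
Everything in it is an assumption made for illustration: the function names, grouping by directory, the per-patch file limit,
and the `run_tests` / `send_for_review` callbacks stand in for whatever build and review systems are actually used.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def shard_fixes(fixes, max_files_per_patch=2):
    """Split {path: diff} into small patches, keeping one directory per patch."""
    by_dir = defaultdict(list)
    for path in sorted(fixes):
        by_dir[str(PurePosixPath(path).parent)].append(path)
    patches = []
    for directory in sorted(by_dir):
        paths = by_dir[directory]
        for i in range(0, len(paths), max_files_per_patch):
            patches.append({p: fixes[p] for p in paths[i:i + max_files_per_patch]})
    return patches


def submit_passing(patches, run_tests, send_for_review):
    """Send for review only the patches whose tests pass; return how many were sent."""
    sent = 0
    for patch in patches:
        if run_tests(patch):
            send_for_review(patch)
            sent += 1
    return sent


fixes = {"a/x.cc": "diff 1", "a/y.cc": "diff 2", "a/z.cc": "diff 3", "b/w.cc": "diff 4"}
patches = shard_fixes(fixes)
reviewed = []
sent = submit_passing(patches, lambda p: "b/w.cc" not in p, reviewed.append)
print(len(patches), sent)
```

Small patches keep each review focused and let a failing test block only the shard that caused it.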
{{% center %}} … {{% /center %}}
{{% center %}}
##### Thank you for reading, stay tuned and keep your codebase healthy!
{{% /center %}}
Hey Alex, maybe I'm lacking knowledge here but the title doesn't mean anything to me. Could you make it more beginner friendly?
Thank you for feedback, Francesc! It's totally WIP as I'm just gaining confidence in existing tools in this field.
The plan is basically to cover some "state of the art" tools for AST transformation (AKA refactoring), so the learnings could be applied to Bblfsh UAST manipulation API.
How about the title along the lines of "Source code transformations"?
OSS:
- Golang: go fix/go fmt -r
- Cpp: clang-tidy
- C: coccinelle
- Java: JTransformer
- Example-based refactorings: Java: error-prone/ Golang: eg
- Python: Bowler, python google/pasta
- Multilanguage https://comby.dev
Proprietary/from talks or papers (material)
- ClangMR/JavacFlume
- Semmle QL (only query)
Source code transformations makes it much more clear to me, yeah. Let me know when you have a draft of the blog so I can review.
I'd be curious to see if we can make it so the blog doesn't feel like a series of tools, and instead there's a story tying everything up.
that is very useful feedback, thank you and please let me think more about that. I would expect that even initial draft will take some time though - but will post it here asap.
Thanks again.
@campoy One story I can think of is:
take simple-but-educational example(s) of some issue in the code as a motivation, and then go through implementing:
- a code to detect it
- a code to suggest a fix for it in each of those systems.
Due to differences in host languages it could be hard to pick a single example, so it can be adjusted a bit for each specific language, keeping it sufficiently high-level.
A Nice 🍒 on top could be finishing it with the link to a blog post on "how to wrap it as a lookout analyzer" from #249 .
WDYT?
I like it, even if we find an example that only works for a specific language it should be easy to get people from other language communities understand the point of the article.
Refactoring prolog code: https://pdfs.semanticscholar.org/b48b/bc30427ef7429db83e190f91a579442121b6.pdf
@bzz did you get a chance to start a draft ?
Very preliminary - this is fairly ambitious and requires a lot of research. I would expect a shareable draft early next year.
@bzz Trying to plan our blog schedule for the upcoming weeks. Did you get a chance to work on this draft ?
@vcoisne made some progress on the research but not there yet. I will ping you as soon as I have some results to share!
This is still in my backlog.
Two more interesting contenders added to the description:
- https://github.com/google/pasta for python
- https://comby.dev for assembly, Bash, C/C++, C#, Clojure, CSS, Dart, Elm, Elixir, Erlang, Fortran, F#, Go, Haskell, HTML/XML, Java, Javascript/Typescript, JSON, Julia, LaTeX, Lisp, OCaml, Pascal, PHP, Python, Ruby, Rust, Scala, SQL, Swift, Text