autochrome
autochrome copied to clipboard
ClojureScript support
I'd be interested in using this library for https://github.com/Day8/re-frame-10x to diff data structures. The README notes that ClojureScript support would be difficult, what kinds of things would be needed to accomplish this?
When I wrote that, what I had in mind was my pervasive use of IdentityHashMap. I thought that since JS has no way to get the "memory address" of an object, that it would be annoying to create something that works like IdentityHashMap. But it turns out that ES6 Maps work exactly that way when you use objects as keys and would be perfect for the task. Besides that there is also java.util.PriotityQueue but surely there exists a js/cljs priority queue library which could be used. Also I bet I used Java string methods instead of clojure.string everywhere.
The parser is probably not too hard to port to cljs but if you want to diff data only, you may not need the parser at all. In theory you could just provide a different implementation of autochrome.tree. That was the idea behind that namespace anyway.
Something else that might come up when diffing data as opposed to code is that you may want some sort of elision. If you change only one thing deeply nested inside some huge structure, it would be great to see the parts that didn't change shown as .... This does not exist yet which is why autochrome does not show structural diffs for .edn files even though it very well could.
All that said though, porting to CLJS, better supporting use as a library, and implementing elision are all things which I think would be awesome and am very excited about!
I'm currently starting a port for CLJS, do you have any work in progress for it? If not, I can link my branch here :)
No, I never got started, but I remain very excited about the possibility! Please link @mauricioszabo !
Ok, its still on the very beginning: https://github.com/mauricioszabo/autochrome/tree/port-to-cljs
One thing that I'm having trouble with: the code you're currently using uses side-effects to generate the result, right? Can you describe, in a "macro way" maybe, how the code works?
If you consider object identity to be a side-effect then yes it does depend on side-effects. At a macro level it is just Dijkstra's algorithm; where each node is a "cursor", or a path into the tree structure, and each edge is a tree transformation (insert item, delete item, add parens, strip parens). The paren-manipulating edges are what make it a tree diff as opposed to a regular sequence diff. The other tricky part is that when you are diffing a file, you don't know if things were renamed or moved. So you can write a (tree-diff A B) function, but it's not that useful because you don't know what B goes with each A.
So the way it works (see align.clj for the code) is like this:
- You have a bag of source forms and a bag of target forms. "Bag" because order doesn't matter.
- For each source form
s:- Ask Dijkstra to find you a path, from
sto any target form, or the empty form (nil). - Whichever target form that Dijkstra returns is the one which has the lowest-cost diff against
s.- If it was a target form
t, thensandtare a matched pair, andtcan be removed from the bag of target forms. - If it was the empty form, then declare that
swas deleted.
- If it was a target form
- Ask Dijkstra to find you a path, from
There is a more detailed explanation of this in the Alignment section in the HTML readme.
There are side-effects in the dforms function but the only important ones are the standard Dijkstra priority queue and cost map. The atoms declared with the comment for difflog are only important for generating the rich log of the diff process which was crucial to generate the detailed algorithm explanation in the HTML readme. You can safely remove them if you are not interested in producing similar logs.
Throughout the whole diffing process, you have to consider tree node identity as opposed to equality. This is because you may have several tree nodes which are equal but not the same node, for instance if you are trying to diff [a a a] against [a a B a], you need to know which a you are currently looking at, otherwise the diff algorithm won't work. You also want to attach the diff metadata (added/deleted) to the right node so that the caller can paint the right nodes green/red. My implementation makes liberal use of identityHashCode for this, but another approach would be to do a first pass over the tree and attach a unique ID to every node, which could be used in the same way. I think the reason that I didn't do it that way is mainly "for performance" but also because it would mean that the renderer would need to take the tree-with-IDs as input which would be slightly annoying pass around like that.
To be honest, I was interested on a more "higher level" detail. When I asked about side-effects I was referring about this code, for example:
https://github.com/mauricioszabo/autochrome/blob/port-to-cljs/src/autochrome/core.clj#L25-L31
Like, the @diff/nprocessed is swapped, or reseted from other namespaces and then used for another things... what I wanted to know is the flow of your code: like, what each namespace do, how do you create and process the diff, etc :)
Sorry, I guess I misunderstood the question. The @diff/nprocessed atom is one of those that i mentioned that are declared wiht a comment "for difflog", they are only important for collecting statistics and do not impact the output in any way. In particular nprocessed is just the total number of states that we popped off the PQ across all of our Dijkstra's algorithm runs. The print statement is just bragging about how hard we had to work to generate your diff ;). Outside of stuff like that for reporting or debugging, there should be no mutable state that escapes a function call.
Invocation
- core
- basically just the entrypoint
- page
- actually generates the page; responsible for pulling the inputs (indirectly), calling the diff itself (indirectly) and rendering the outermost structure of the page. It does stuff like decide if there should be two panes or one for a file (only one pane needed if the whole file was added/deleted).
- github
- This ns is poorly named in retrospect, it is mostly responsible for pulling the source code that we are going to feed in as the left/right side of the diff. It has some awful code to apply patches to recover the left side of a particular PR diff on github, but it can also fetch from local git or just the filesystem
Serious work
- parse
- This a custom whitespace-preserving clojure parser/renderer. The parser that comes with clojure is not interested in whitespace, and also does things like expands syntax-quote expressions and qualifies some names at parse time, which makes it unsuitable for diffing where we need to parse into an AST, do the diff, and then re-serialize the exact original source code text from our AST. The
parsefunction should work for any clj(s) code and(comp render parse)should be the identity function over the domain of valid clj(s) code strings, i.e. it always roundtrips
- This a custom whitespace-preserving clojure parser/renderer. The parser that comes with clojure is not interested in whitespace, and also does things like expands syntax-quote expressions and qualifies some names at parse time, which makes it unsuitable for diffing where we need to parse into an AST, do the diff, and then re-serialize the exact original source code text from our AST. The
- tree
- this is currently just a regular ns but in theory it is more like a protocol, it has functions for all of the tree-structure operations that we are going to need, and you would be able to diff other sorts of trees as long as you can implement these functions. Currently it is only implemented for the AST format produced by the custom parser. The point is so that the diff algorithm itself isn't complected with the AST format.
- diff
- The actual path finding diff algorithm implementation
- align
- The alignment algorithm implementation, entry point for callers who want to diff files and don't already know the exact forms they want to diff against each other.
Rendering
- components
- this is mostly about rendering source code with line numbers, syntax highlighting, etc
- annotation
- the output of the diff is the IdentityHashMap that tells you how each subtree pointer was annotated, this ns has stuff which basically applies this to the tree by
assocthe annotations onto the tree nodes so that further rendering (which no longer needs pointer equality to work) doesnt have to worry about IdentityHashMap garbage.
- the output of the diff is the IdentityHashMap that tells you how each subtree pointer was annotated, this ns has stuff which basically applies this to the tree by
- xref
- this is for linking certain symbols to their documentation in the rendered html
- scope
- This is a rudimentary tree visitor/walker for the
parseAST format, it understands some common special forms and it's job is to know, at any point in the source code tree structure, which names are in local scope, and therefore should not be highlighted/linked as if they were clojure.core functions. In the context of this project it's a lot of work for little gain, it's mostly an artifact of autochrome's history as an outgrowth of static analysis / linting efforts at ladderlife. I thought that I might as well include it because it does make the syntax highlighting more correct; basically it's a cool feature that you are very unlikely to see in any competing product ;)
- This is a rudimentary tree visitor/walker for the
- styles
- My garbage CSS
- components
- This is basically a version of
parse/renderthat produces HTML DOM instead of strings, in other words it renders theparseAST format to html, while handling annotations that were attached byannotationns.
- This is basically a version of
- common
- some random stuff that it felt bad to duplicate, mostly related to the parser
- readme and difflog
- these are for the html readme and are technically unimportant otherwise, however the difflog feature is invaluable for debugging the algorithm should it ever be changed