syntax_suggest icon indicating copy to clipboard operation
syntax_suggest copied to clipboard

WIP Tree and heuristic based re-write

Open schneems opened this issue 3 years ago • 0 comments

This effort was originally started in response to #118 which I kept a log on experiments there for some time. The core goal is more performance. With a lot of benchmarking and optimizing it seems that the best way to get more performance is to make an algorithm that builds better guesses.

I then split out several promising branches with experiments and I've finally converged to this branch.

Beyond performance, one issue with the existing code is that searching for syntax errors is heavily coupled to the process of transforming the document. It searches as it transforms. There are two main abstractions CodeLine and CodeBlock. Both are mostly data values and don't give much semantic meaning to their contents. This is especially true when trying to optimize this codehttps://github.com/zombocom/dead_end/blob/622dfddda92e27235425dd5a370618c5257731e8/lib/dead_end/parse_blocks_from_indent_line.rb#L37. Basically we can try to make intelligent guesses, but sometimes we're wrong and when that happens we have already mutated our source so there's not an easy way to "go back".

Intelligent code structure: graph of nodes

This re-write splits out the searching from structure generating. It first builds a graph of nodes using indentation and the heuristic of which way code is "leaning" (i.e. def talk is incomplete and is leaning left, it needs to expand right/down to find a match). Each node is an attempt at building a valid code block. It also contains references to its "parents" one or more code blocks that it may hold. When generating nodes, the parent blocks are created first and then recursively joined to create children. Nodes also hold a reference to nodes both above and below them. This is an especially useful property as some code is not valid without a surrounding context. For example hello: "world" is not syntactically valid ruby code, but speak(hello: "world") is valid.

Diagnose nodes

Once the graph is created, then it can be searched. The entire document's worth of nodes is sent into a Diagnose class where it suggests how we could move forward to isolate the problem. This logic is heavily coupled to the shape of graph that's generated. For example, if we are diagnosing a node with 3 parents, we could systematically try validating the syntax while removing 1 parent at a time to see if it's removal makes the other parents valid. That might indicate (but not prove) that the removed node holds the problem.

Search the diagnosis

Once nodes are generated and a robust set of common problems can be diagnosed then searching the graph becomes somewhat straightforward. The path to eliminating all syntax errors is not linear so must be represented by a branching structure. Each walk of the graph is encoded into a Journey, and only when all possible journeys have ended successfully can the original document become parsable (indicating that all syntax errors are found/eliminated). The steps in this process are preserved versus in the current algorithm the searching process is destructive. If a bad guess is made, we could in theory detect it somehow and roll back.

Next

The performance of this approach is above and beyond my expectations. It is much faster. It's also possibly more accurate. However, it also produces different problems than the original. Right now I'm able to get most test cases passing with the same or better-than-original output. The edge cases where that's not true need to be investigated.

Some old properties like this problem code:

deffoo
  print 'lol'
end

Being reduced to:

deffoo
end

Were properties of the search algorithm itself as the v1 algorithm transforms/hides code as it searches. Since this algorithm is different it will tell you correctly that the problem is the end but we must figure out how to teach it that the deffoo is related, but that the interior print 'lol' is not.

Essentially it's a process of picking a failing case, trying to modify one of the algorithms mentioned above, or possibly adding some pre or post-processing step not yet present.

This progress is still very much considered experimental.

schneems avatar Jun 04 '22 13:06 schneems