yard icon indicating copy to clipboard operation
yard copied to clipboard

Extra trailing bracket when default value is negative

Open xaviershay opened this issue 9 years ago • 23 comments

For the following method:

class A
  def a(b = -1)
  end
end

This documentation is generated:

screen shot 2015-08-15 at 10 11 19 am

Note the extra bracket: a(b = -1))

xaviershay avatar Aug 15 '15 17:08 xaviershay

If @param documentation is provided, the extra bracket also appears in the default value displayed there.

xaviershay avatar Aug 15 '15 17:08 xaviershay

Gritty internal details from yard/lib/yard/parser/ruby/ruby_parser.rb: This is because Ripper is calling the on_rparen callback before it calls on_unary. So by the time on_unary initializes the AST node for -1, charno has already been bumped past the ')' character.

alexdowad avatar Oct 24 '15 19:10 alexdowad

Hmm. on_rparen is a lexer (scanner) event, on_unary is a parser event.

The Ruby parser (which is reused for Ripper) is bison-generated and therefore should be a LALR(1). The "(1)" means "it uses lookahead of one token", which means that the lexer should always read one token past the end of any construct before reducing that construct into a parse node. Does that make sense?

I can see that the scanner event for the end keyword happens before the parser event for a new bodystmt node. But in parse.y (in the MRI source), I can see that the end keyword is not considered part of a bodystmt node. This seems to add some support to my theory.

alexdowad avatar Oct 24 '15 20:10 alexdowad

I am just experimenting with Ripper further. In some cases, it appears to lex ahead of a construct before reducing it to a new parse node. In other cases, as soon as it sees the last token of a construct, it reduces it to a new parse node immediately (without using lookahead).

I don't remember my parser theory clearly enough to know if this is the expected behavior for a LALR parser or not. But it is certainly plausible that the parser might need to use lookahead in some cases, but in other cases not.

It's clear that YARD's approach of using the lexer events to determine "how far" it has reached in a source file, and how much of the textual source code "belongs to" an AST node, is very brittle.

alexdowad avatar Oct 24 '15 20:10 alexdowad

It's clear that YARD's approach of using the lexer events to determine "how far" it has reached in a source file, and how much of the textual source code "belongs to" an AST node, is very brittle.

I can confirm this. The problem is that lexical tokens and AST nodes don't line up cleanly due to the order that Ripper emits events-- there's a lot of juggling and assumptions around the way source is reconstructed from these events. If someone knows a better way to track locations, I'm all for a rewrite of ruby_parser.rb.

lsegal avatar Oct 24 '15 20:10 lsegal

I've just been looking at Ripper and scratching my head, wondering how it can possibly be used to get reliable information about how source code text corresponds to AST nodes.

I don't think it can. That is a serious deficiency.

I also looked at Rubinius' parser, rubinius-melbourne. It's not suitable either.

However, it looks like the parser gem can parse Ruby code and return the needed information. I may take a crack at rewriting ruby_parser.rb using parser and see how well it works...

alexdowad avatar Oct 24 '15 21:10 alexdowad

I'm not a fan of bringing in unnecessary required dependencies, and it looks like parser is pure-Ruby, which I imagine will bring significant performance costs to parsing. Both of those things together kind of make it a non-starter (perf being the more important of the two).

lsegal avatar Oct 24 '15 21:10 lsegal

I think the first order of business is to refactor RubyParser into a shape where the underlying parser implementation can be easily changed. Right now the implicit dependencies on Ripper are just too strong.

The parser events generated by Ripper do not follow the semantic structure of Ruby code; rather, they very closely follow the productions recognized by Ruby's LR parser. If you take any parser event which is fired by Ripper and convert it into an AST node, you get a very messy tree which includes many extraneous nodes -- artifacts of the way Ruby's yacc grammar was written.

Unfortunately, this is exactly what YARD does. I would like to refactor RubyParser so that 1) it produces a tree with only AST nodes which are meaningful, needed, and useful, and 2) an examination of the code quickly reveals what the relevant node types are and how they nest within each other. (Right now you have to read the entire codebase to figure out which node types will actually be used for something!)

Once that is done, it will be relatively easy to swap out Ripper for something else, and assess the impact on performance. If you decide that the impact is too great, you can stick with Ripper. How does that sound?

alexdowad avatar Oct 25 '15 05:10 alexdowad

I guess I don't really agree with where effort and complexity is being placed on solving this problem. It seems like there are simpler ways to solve this in a reasonable way than rewriting 500+ LOC, negatively impacting performance, and introducing a number of breaking API changes, in order to fix an extraneously emitted close paren character.

More importantly, to the breaking changes point (and something I admittedly forgot to mention), is that Ripper cannot [easily] be ripped out of YARD. The entire handler architecture API is bound to it. Any change that removes Ripper would have to come with a compatibility layer to bring back pretty much all of the AST production nodes that Ripper produces, because that AST is the effective API for YARD's handler arch. Furthermore, those "extraneous" nodes actually are meaningful in many cases-- any AST that scrubbed out, for instance, how comments line up on columns, or removed extraneous whitespace nodes, could actually affect YARD's ability to pull comments from source-- or a plugin's ability to do the same. So, unfortunately, it's a non-starter to try and change the AST structure at this point in time.

To be honest, there are tremendous benefits to sticking with Ripper, even if it is occasionally awkward, due to the fact that (a) it is supported as part of Ruby's core codebase, which means it's not going away, and (b) the fact that it's coupled to the grammar means it will always reflect the current syntax of Ruby in an accurate way. You can't get that from a third-party library like ast, which may be slow to support new syntaxes. I don't see any big reason to move off of Ripper.

Refactor the code to be easier to maintain, sure, but I don't think switching libraries is an option right now.

lsegal avatar Oct 25 '15 06:10 lsegal

I've done some benchmarking and am convinced that none of the alternatives to Ripper have acceptable performance.

What I am referring to as "extraneous nodes" are things like void_stmt nodes, which you can completely strip out of the AST without breaking anything in YARD. Or the paren nodes which indicate that a method was defined with parentheses around the argument list.

Removing unneeded nodes from the AST (actually, not generating them in the first place) will make your code more robust. Why? Because although you don't really care about those nodes, when Ripper changes and generates different numbers of them, or puts them in different places, that can break your code. Case in point: the YARD specs currently fail on MRI 2.3.0. Why? Because of minor tweaks in the Ruby parser, which cause Ripper to generate an extra void_stmt node. YARD doesn't need that node, it isn't doing anything for you, but it breaks your tests.

This leaves the question of how to make RubyParser robustly match source code text to AST nodes. Although the issue here is just an extra ), my experimentation with Ripper has convinced me that there must be other corner cases which the current RubyParser will fail on (we just don't know what they are).

I'm trying to think of a solution which is robust and clean, and won't require a ton of new code. The best thing I can think of right now is to see if the MRI guys will accept a patch which makes Ripper pass starting/ending character indices into the parser event handlers. Then YARD's RubyParser could be really and truly fixed on new VMs, but not old ones. Any other ideas?

alexdowad avatar Oct 25 '15 19:10 alexdowad

What I am referring to as "extraneous nodes" are things like void_stmt nodes, which you can completely strip out of the AST without breaking anything in YARD. Or the paren nodes which indicate that a method was defined with parentheses around the argument list.

Actually, this is my point. Either one of these nodes may be important to certain plugins. Since YARD is a generalized library that can be used for code analysis, being able to easily identify empty method definitions (by quickly creating a handler that looks for :void_stmt) is useful. Similarly, if you're doing any form of source transform, knowing that the method is wrapped in parens can actually be important (for example, for method signature reconstruction).

I would suggest looking at this the other way around. Instead of treating these nodes as superfluous, I think it's more appropriate to treat YARD's AST implementation as incomplete. It is clear that YARD does not currently support all nodes made available via Ripper.

Typically when implementing a full AST abstraction in an OO language, you would have one class per node, each with its proper attributes and behaviors modeled directly in code. Because we're relying on an s-exp form in this case, we're taking a lot of shortcuts for the sake of maintainability but at the cost of a complete robust implementation. The real solution here is to have a complete picture of Ripper's AST-- even if we ripped out "unnecessary" nodes, YARD is still not there yet.

As for YARD's failing tests, if YARD is failing due to new nodes popping up in an otherwise backwards-compatible way, that's a deficiency in our implementation and should be fixed by being more accepting of new node configurations. If we start blacklisting nodes like void_stmt, we only end up kicking the can down the road to when new weird nodes get introduced that trip up the parser in a similar way. Whitelisting supported nodes might sound like the answer, but if you're at the point where you're writing down every valid node, you're half-way to a complete AST implementation anyway, and we might as well just do that.

there must be other corner cases which the current RubyParser will fail on (we just don't know what they are)

There are, you can look through the git blame / history on the file to see adjustments made in the past to deal with this kind of issue.

The best thing I can think of right now is to see if the MRI guys will accept a patch which makes Ripper pass starting/ending character indices into the parser event handlers.

This is a pragmatic solution, IMO, and I've thought about trying that before. If Ripper gave information about source locations for semantic AST nodes, a ton of code could be ripped out of YARD. Even if it just provided a better way to collate token events with the nodes they are associated with, that would help a great deal. It would help plenty of other Ruby devs, too.

lsegal avatar Oct 25 '15 21:10 lsegal

There are several points I would like to respond to above, but let me ask this first:

If the Ruby core devs are willing to accept a patch to Ripper, what do we want the enhanced Ripper interface to look like?

I would like it if Ripper's parser event callbacks would pass starting and ending [line_number, column_number] pairs to the handler methods. But it's hard to see how that could be done without breaking a lot of existing code.

An alternative would be to move the invocation of "scanner" events out of the lexer, and into the parser. So rather than invoking scanner events as the tokens are recognized, invoke them as the tokens are reduced into parse nodes. This would make it totally unambiguous which tokens are part of which AST nodes.

Any thoughts?

alexdowad avatar Oct 26 '15 12:10 alexdowad

Hmm, interesting discovery. The "tokens" passed to Ripper's scanner event handlers are not identical to the tokens which the lexer passes to the parser internally.

Ripper keeps 2 chars pointers, call them: "lexed up to here", and "generated scanner events up to here". At certain points in the lexing process, it calls a function ripper_dispatch_scan_event. The function checks whether the 2 pointers are identical. If not, it takes everything between the 2, converts it to a Ruby string, passes it to a scanner event handler, then bumps the "generated scanner events" pointer up to where the "lexed" pointer is.

alexdowad avatar Oct 26 '15 12:10 alexdowad

Update - working on this, looking for error patterns.

def param_test1(a = -1, b = -1) => def param_test1(a = -1,, b = -1))

Notice the repeated comma in the parsing of 1st parameter....

MSP-Greg avatar Aug 12 '16 03:08 MSP-Greg

@lsegal Think I've got patch.

Is there anywhere else in YARD where a parameter, value, whatever may be parsed?

I'll check constant & cvar values, but I've seen MethodCallNode and YARD:: Handlers:: Ruby:: Base.method_call. I haven't yet looked thru the test files to see if either of those are tested, and hence, I don't know what their functionality is.

MSP-Greg avatar Aug 15 '16 04:08 MSP-Greg

@lsegal Well, I've got the issue fixed with named & unnamed parameters.

But, there maybe issues with

YARD::Handlers::Ruby::MethodHandle#format_args

I fed it an absurd method parameter string (every kind there is), and it may need some updating.

Would you be opposed to moving that method into YARD::Parser::Ruby::ParameterNode and deprecating the current method?

Most of its functionality comes from calling methods in ParameterNode, and I think it would make testing a bit more logical, since in the handler, testing has to go to the Registry to get the MethodObject.

Obviously, it will work either way.

MSP-Greg avatar Aug 16 '16 04:08 MSP-Greg

Would you be opposed to moving that method into YARD::Parser::Ruby::ParameterNode and deprecating the current method?

Generally speaking, yes, since the AST is not responsible for (re-)formatting its own nodes back into Ruby code. In fact, YARD doesn't even really deal with reformatting Ruby code except for this one edge case, which is specific to that particular handler, not the parser. I would recommend keeping it there, which avoids deprecation and still works correctly.

lsegal avatar Aug 16 '16 04:08 lsegal

@lsegal a test file of the fix, runs benchmark and rspec against current and patched, is located at:

https://github.com/MSP-Greg/MSP-Greg.github.io/blob/master/test/unary_fix_test.rb

Along with the tests in the files, the patch passes YARD tests with 2.3.2 & 2.2.4

MSP-Greg avatar Aug 17 '16 01:08 MSP-Greg

Do you mind opening this as a pull request (with just the fix / tests)? That would be easier to review and merge.

lsegal avatar Aug 17 '16 01:08 lsegal

@lsegal Be happy to. Seeing as test code is more personal that even 'normal' code, and that I used the file for my own testing, and I wasn't sure if you wanted a benchmark test. The test code has one method removed that was my 'inspect' code.

The PR is changing code in the RipperParser, and the tests use that code, along with the MethodHandler#format_args method. As you know, the chain of objects might be listed as:

RipperParser >> ParameterNode >> Method Handler >> MethodObject

So, where would you like me to add the tests?

MSP-Greg avatar Aug 17 '16 02:08 MSP-Greg

@MSP-Greg RipperParser is an internal class so it doesn't have its own test file IIRC. Anything but the MethodObject would be appropriate, the closer to the code the better, but don't create a new test file just for that. I think RubyParser has some catch-all tests for just this kind of thing. You can functional / regression test this instead of a pure unit style thing. Unit testing a parser is kind of crazy work.

lsegal avatar Aug 17 '16 02:08 lsegal

Although I've never been involved with a 'code' parser before, I was somewhat involved with XML and WebSocket at very early stages (that might date me somewhat). XML was interesting because, at times, there were very 'serious discussions' between the document centric and the data centric groups (I was a data guy).

One could say that Ripper is designed to parse 'human readable' code into structures / sets / lists / groups / etc that are machine readable. One could reassemble the parts from those structures, but that has another set of issues.

So, YARD's trying to keep track of ranges in the source. But since the Ripper is not really concerned with the 'human readable' delimiters, there's always going to be edge cases. Hence, as they are discovered, I don't have an issue with writing tests (of any type) for them, especially with Ruby 3 on the horizon.

I think RubyParser has some catch-all tests for just this kind of thing.

Maybe add a ripper_parser_spec.rb file and put these tests and the percent array literal tests in it? Use it as a parser catch-all for 'edge cases'?

BTW, I'll post the PR tomorrow.

MSP-Greg avatar Aug 17 '16 03:08 MSP-Greg

We just hit this with Sord (issue here), almost finished an issue before finding this one :) I'll leave it below since I already wrote it all out.

My issue description

I discovered this while working on Sord. (issue here)

Steps to reproduce

Create this file:

module A
  def x(a = -1)
    # code
  end
end
  1. Run the following command: yard doc
  2. Visit the generated YARD docs
  3. Note that the generated method is invalid.

Actual Output

image

The method parameter's default value is -1).

Expected Output

I'd expect the method parameter's default value to be -1, not -1).

Environment details:

  • OS: macOS Mojave (10.14.5)
  • Ruby version (ruby -v): 2.6.2
  • YARD version (yard -v): 0.9.20

I have read the Contributing Guide.

connorshea avatar Jul 08 '19 01:07 connorshea