phpdoc-parser Expose tokens

Expose tokens

Open dantleech opened this issue 6 years ago • 14 comments

:wave: this might be out of scope for PHPStan, but it would be really great to be able to modify docblocks whilst preserving their formatting (this would be useful, e.g., when renaming things in Phpactor).

This could be facilitated by exposing the tokens used to build each Node:

$paramTag->tokens[0]->type;    // Lexer::TOKEN_FOO
$paramTag->tokens[0]->value;  // text content of token

Would be happy to work on that if you think it makes sense here, no worries if not.

May 05 '18 08:05 dantleech

Not sure how it's related, since tokens and nodes are completely different things, but I was redirected here by @ondrejmirtes from #11.

It would be great if this package allowed format preserving print.

I already asked nikic how to do it, and he explained the basics to me. Basically, an AbstractNode class with set/get attributes is the first step towards it: https://github.com/nikic/PHP-Parser/commit/4d2a4d02b08b75d00067f8bf8b3c58a993abc0d0#diff-ee2208021c6e96ff1de44281aa029630R108

After a few weeks I managed to make a working prototype, but it needs a lot of boilerplate code, external storage of token positions, etc. to make it work.

Having this would also make it easy to configure FQN namespaces, the same way php-parser does, without modifying the output. And other SOLID features :)

May 05 '18 11:05 TomasVotruba

The Tolerant PHP Parser uses Tokens in its AST. So all properties which are rendered are Tokens, and you can walk the tree and reproduce the exact source code.

That was my rationale behind exposing the tokens. If a node has a list of tokens, then you can determine its starting offset and ending offset, and further filter by Token Type to, e.g., isolate just the FQN for a type.
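To make the idea concrete, here is a minimal sketch of a node that carries its tokens. The `Token` and `TokenNode` classes, the token type names and the offsets are all hypothetical and are not part of phpdoc-parser; the sketch assumes PHP 8.0+ for constructor promotion.

```php
<?php
declare(strict_types=1);

// Hypothetical token shape, for illustration only.
final class Token
{
    public function __construct(
        public string $type,
        public string $value,
        public int $offset, // byte offset of the token within the docblock
    ) {}
}

// Hypothetical node that keeps the tokens it was built from.
final class TokenNode
{
    /** @param Token[] $tokens */
    public function __construct(public array $tokens) {}

    // Offset of the first token.
    public function start(): int
    {
        return $this->tokens[0]->offset;
    }

    // Offset just past the last token.
    public function end(): int
    {
        $last = $this->tokens[count($this->tokens) - 1];
        return $last->offset + strlen($last->value);
    }

    /** @return Token[] tokens of the given type, re-indexed from 0 */
    public function ofType(string $type): array
    {
        return array_values(array_filter(
            $this->tokens,
            fn (Token $t): bool => $t->type === $type
        ));
    }

    // Concatenating the token values reproduces the original text.
    public function toString(): string
    {
        return implode('', array_map(fn (Token $t): string => $t->value, $this->tokens));
    }
}

// Tokens for "@param Foo\Bar $bar", assumed to start at offset 7 of a docblock.
$node = new TokenNode([
    new Token('TAG', '@param', 7),
    new Token('WHITESPACE', ' ', 13),
    new Token('IDENTIFIER', 'Foo\Bar', 14),
    new Token('WHITESPACE', ' ', 21),
    new Token('VARIABLE', '$bar', 22),
]);

$fqn = $node->ofType('IDENTIFIER')[0]; // the token holding the type name
```

Because whitespace is kept as tokens too, nothing is lost between the first and last offset.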

But it's different here, as the nodes have scalar properties for things like $name, $description, etc., not Tokens. So I'm not sure how well the approach I suggested would work.

Another approach would be adding the start offset and end offset from the token when the nodes are instantiated (I think this is how php-parser does it), but I'm not sure how well this would work as, e.g., MethodTagValue has a few scalar values, so any whitespace in between those would be lost (?).
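As a rough illustration of why start/end offsets help with format preservation: once a range is known, only that range needs to be rewritten and everything around it stays byte-for-byte the same. The `replaceRange` helper is hypothetical, and the offsets are found with `strpos` here only to keep the sketch self-contained; a real parser would supply them.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: replace the source range [start, end) and leave the rest untouched.
function replaceRange(string $source, int $start, int $end, string $replacement): string
{
    return substr_replace($source, $replacement, $start, $end - $start);
}

// A docblock with irregular spacing that a naive re-print would normalise away.
$docblock = "/**\n * @param   Foo   \$bar  the bar\n */";

// Stand-in for offsets a parser would record on the type node.
$start = strpos($docblock, 'Foo');
$end = $start + strlen('Foo');

$renamed = replaceRange($docblock, $start, $end, 'Baz');
```

This only covers the range itself; whitespace between several scalar values inside one node would still need separate offsets or tokens.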

Just an idea anyway.

May 05 '18 17:05 dantleech

@dantleech Hi, how far have you got? :)

This might be interesting for you https://github.com/rectorphp/phpdoc-parser-printer

Dec 29 '20 00:12 TomasVotruba

Pretty much nowhere :smile: (still parsing docblocks with regex as speed is critical). Thanks, I will keep an eye on that package.

Dec 29 '20 09:12 dantleech

This may be of interest: I created a new docblock parser, still in its early stages. It exposes all the tokens and provides a traversable AST with start/end positions for nodes, etc.

Feb 06 '21 18:02 dantleech

@dantleech What's different about your approach?

Feb 06 '21 18:02 ondrejmirtes

Lossless - so you can convert back to the text
Provides access to the start/end positions of the nodes (and the tokens from which they are composed)
The AST is traversable/queryable (can't remember how this one is implemented, but the Phpactor one provides an API similar to the Tolerant PHP Parser)
It's marginally faster than this one, perhaps due to a slightly different Lexer
Tolerant of incomplete docblocks (again, not sure how tolerant the PHPStan one is)

One goal was to have a parser as performant as the very basic one currently used by Phpactor (very dumb regexes, very fast, not very clever), as performance was the main reason for not using the PHPStan parser. As mentioned, though, the new one is only slightly faster than the PHPStan one, and not significantly so.

Feb 06 '21 19:02 dantleech

> performance was the main reason for not using the PHPStan parser

I don't get it. If performance was the main reason, why did you implement the parser the same way? 😕 If your parser is now marginally faster while still missing a lot of features, it's likely going to be slower than, or the same speed as, this one once you implement the missing features.

I understand the need to have a lossless parser. I just don't get the performance argument.

Feb 07 '21 09:02 JanTvrdik

The lossless ability can definitely be achieved here as well, @TomasVotruba already did it with decorators in https://github.com/rectorphp/phpdoc-parser-printer.

Feb 07 '21 10:02 ondrejmirtes

@ondrejmirtes So far it's not working. All nodes have to be regenerated, extended and overridden with a single attributes property. It would save ~80% of the code if the abstract node had attributes, as php-parser has: https://github.com/nikic/PHP-Parser/blob/8165cf69fab95ade34cb73d1dc1c23d08b57cbb2/lib/PhpParser/NodeAbstract.php#L148-L162
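A minimal sketch of what such an attributes property could look like, modelled loosely on the `setAttribute`/`getAttribute`/`hasAttribute` methods of php-parser's `NodeAbstract`. The `HasAttributes` trait, the `ExampleParamNode` class and the attribute keys are hypothetical and not what phpdoc-parser ships; PHP 8.0+ is assumed for the `mixed` type.

```php
<?php
declare(strict_types=1);

// Hypothetical trait: a generic key/value store that any node could reuse.
trait HasAttributes
{
    /** @var array<string, mixed> */
    private array $attributes = [];

    public function setAttribute(string $key, mixed $value): void
    {
        $this->attributes[$key] = $value;
    }

    public function hasAttribute(string $key): bool
    {
        return array_key_exists($key, $this->attributes);
    }

    // Returns the stored value, or $default when the key was never set.
    public function getAttribute(string $key, mixed $default = null): mixed
    {
        return $this->hasAttribute($key) ? $this->attributes[$key] : $default;
    }
}

// Hypothetical node standing in for a real tag-value node.
final class ExampleParamNode
{
    use HasAttributes;

    public function __construct(public string $type, public string $name) {}
}

$node = new ExampleParamNode('Foo\Bar', '$bar');

// Made-up attribute keys; a printer could read these back later.
$node->setAttribute('startOffset', 14);
$node->setAttribute('endOffset', 26);
```

With a shared store like this, extra data such as positions can be attached without subclassing every node type.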

The other problem is that the parser here does not store data about tokens, so every new *Node must be rewritten.

These 2 problems with phpdoc-parser are the reason for this issue.

Feb 07 '21 10:02 TomasVotruba

> I just don't get the performance argument.

Mostly I just wanted to try and write a parser :smile: Performance is a goal - I think significant improvements can still be made to the Lexer in this respect.

Feb 07 '21 10:02 dantleech

@TomasVotruba Feel free to contribute the attributes here, I suspected for a long time it'd make your job easier.

Feb 07 '21 10:02 ondrejmirtes

@ondrejmirtes That would be awesome :) I suspected there was no interest, given the instant close of my original issue.

I'm on it :+1:

Feb 07 '21 17:02 TomasVotruba

Related: attributes were added in 0.5 via https://github.com/phpstan/phpdoc-parser/pull/65 :tada:

Apr 06 '21 12:04 TomasVotruba
