PerlNavigator Rewrite POD parser and Markdown converter

This PR is made mainly in order to address issue #134, but goes beyond that and also addresses a bunch of other things in one go.

In essence, this PR rewrites the entire POD parsing and POD-to-Markdown conversion logic, trying to stay as compliant to the POD specification^1 as possible. Compliance is however broken in certain places in order to not over-complicate things. More on that below.

Furthermore, the conversion of POD inline formatting codes (e.g. B<foobar>) to their respective Markdown equivalents is also improved. The server-side escaping of HTML is removed, as it's generally the client's decision whether to sanitize HTML within markdown or not.

The last commit of this PR also adds a lot of unit tests that have aided me in writing all of this; they're added as a separate commit in case they're undesired. Though, I highly suggest keeping them, as they also test for certain quirks that this implementation allows.

The unit tests are not automatically executed however; they must be run manually in the server/ directory via npm test (requires npm ci to be called first).

For all of the above, it's best to read the commit messages for further details.

Highlights

Sub-sections, lists, etc. are now included when looking up the hover documentation for subroutines and methods.
- This means that for a subroutine named foo, all POD content from =head3 foo until the next =head3 (or =head2 or =head1) will be included and converted to Markdown.
- Previously, only the ordinary and verbatim paragraphs directly after =head3 foo were included.
Nested lists (=over ... =back) and data regions (=begin ... =end) are now supported and rendered according to their contents.
Headers now correspond to their markdown equivalents.
Subroutine and method lookup is now more precise.

Previews

These previews have been made in Neovim and might look different depending on your editor.

`WWW::Mechanize` - Before

Screenshot_20240902_162953

`WWW::Mechanize` - After

Screenshot_20240902_155827

`WWW::Mechanize::new` - Before

Screenshot_20240902_163219

`WWW::Mechanize::new` - After

Screenshot_20240902_163255

Regarding Spec Conformity

While the new parser is much more spec-compliant than the old one, there are still a couple of cases where conforming to the spec was deemed too complicated or having too little ROI.

The spec^1 for example does not explicitly specify whether =item commands may also exist outside of an =over ... =back region, so the new parser just supports it (as it doesn't really make much of a difference in terms of the implementation's logic).

Furthermore, =over ... =back regions have a couple of different "flavours"^2, each having different requirements. For simplicity's sake, none of these flavours are accounted for; instead, all content inside =over ... =back regions is treated in the same manner (except =head1 etc. commands, these are not allowed and will raise an error).

This means that mixed list types like the following are valid POD:

=pod

=over

=item * Foo

=item 42. Bar

=back

=cut

The above is equivalent to:

=pod

=over 4

=item *

Foo

=item 42.

Bar

=back

=cut

.. and is converted to the following Markdown:

- Foo
42. Bar

Closing Thoughts

Any feedback is highly appreciated! I've tested this implementation via WWW::Mechanize (as you can see above) and also via the provided unit tests; there could always be edge cases I missed though, however, because POD isn't a necessarily... straightforward format.

Sep 02 '24 15:09 Aequitosh

Oh, one thing I appear to have missed is including =items when looking up a symbol's name -- it seems that some docs out there in the wild group and enumerate their subroutines in =over ... =back lists.

Will see it I can support this gracefully somehow. I guess it's best to just include all content starting from the matched =item until another =item (on the same nesting level) is encountered. Shouldn't be too hard.

Apart from that, let me know what you think! I'm happy to answer any questions, of course.

Sep 03 '24 12:09 Aequitosh

Looks great, thanks! Working through it now as it's a large commit. Some comments:

I think being more error tolerant would be helpful, especially as POD doesn't always follow the POD standard. Generally, as long as it looks like pod, we should probably show it. Some examples that now break:

File::Basename and File::Compare are core modules that break entirely because they don't end with a =cut. It throws the error: message: '"=pod ... =cut" region beginning at line 400 was never closed (missing "=cut")'

File::Fetch errors with message: 'unexpected end of paragraphs while processing "=over ... =back" block'. This is because it doesn't have a =back line, only a =cut to end the list.

File::Temp throws an uncaught error on this line: let contents = matchResult.groups?.contents.trim() || ""; Although this fixes it: let contents = (matchResult.groups?.contents || "").trim();

I found the above examples by flipping through autocompletion results on use File::, which is also a nice way to quickly spot check a large number of modules.

And as you mentioned, the lookup from item needs to be added back: use File::Copy qw(move), the move doesn't work anymore.

Also, for translating =head\d into markdown, I see that it now does "#".repeat(headerPara.level). The prior version started at 3 levels down instead which I found was a better use of space. The enormous display of =head1 would waste substantial amounts of vertical space. See the comparison is below. I also previously had it remove the NAME: title as it felt unneccessary,.

Before:

After:

Overall though, this is great. Rearchitecting this is definitely helpful and I appreciate you diving in here. POD parsing is very painful.

Sep 10 '24 02:09 bscan

Thank you for the feedback!

Regarding fault-tolerance: Interesting, I didn't know POD was interpreted so "liberally" across the ecosystem (but alas, I should've figured, it's Perl after all). I'll definitely address your points and make it more forgiving; as forgiving as possible. Perhaps I'll just make it ignore errors completely, if possible.

Regarding headers: I did not know that VSCode would actually display them so prominently; I agree that it looks way too large! I'll use a level of 3 as the starting level again.

I'll also address the uncaught error, of course.

Besides that, I recently also found two small issues:

Nested lists aren't correctly indented; the last indentation level is always skipped in order to unindent the list in Markdown. Instead, the first indentation level should be skipped.
Multiple headings belonging to the same subroutine aren't rendered correctly, as the symbol lookup function will just consider everything from the first to the succeeding heading, without checking if the heading also belongs to the subroutine.
- In other words, if there's =head3 foo(bar => \%baz) followed by =head3 foo(bar => \@baz) (because the author wants to show that keys can have either array refs or hash refs as values), only the content starting from the former up until and excluding the latter header is included. This means that the actual docs are lost when rendered as Markdown.

I'll also correct those.

So in total, the following things need to be addressed:

[x] Use a level of 3 as starting level for headers (So, a =head1 becomes ### in MD)
[x] Make the parser more tolerant regarding errors, as the exact spec isn't upheld in the wild
- If possible, just ignore errors entirely (kind of done in some places already)
[x] Fix indentation of nested lists
[x] Fix consecutive headers belonging to the same subroutine being ignored, leading to the rendered docs missing content
[x] Fix =item commands not being taken into account when looking up a symbol's name
- Note that I'll probably not stay on spec here and only consider =item foo, =item * foo and =item N. foo, because being spec-compliant in this case is a huge hassle -- e.g. =item *\n\nfoo\n\n is painful to handle, because foo is now an ordinary paragraph
[x] Fix uncaught error (shows up with File::Temp)
[x] Remove =head1 NAME from top of document if it exists, as it's unnecessary

If there's anything else you find, let me know! I'll add it to the list above and address it.

Sep 10 '24 08:09 Aequitosh

Alright, I managed to churn through all this today, which was quicker than I expected.

All new commits address the points I listed above; I tried to be as granular as possible, so reviewing is a little easier.

I did run a quick format over the file and fixed up some minor code formatting things here and there; this did affect a couple existing commits, but the contents of those commits aren't changed otherwise. All new or altered behaviour is added through new commits. Sorry for the noise there! (I blame git absorb).

For good measure I also added a little formatting feature -- see the last commit.

If you notice any other issues, please let me know! I'm always happy to help.

Sep 10 '24 15:09 Aequitosh

Hmm, on second thought, I might revisit the symbol lookup behaviour regarding =item paragraphs... There do seem to be some packages that document their subroutines like this:


=over

=item *

$object->foo()

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

=item *

$object->bar()

Vestibulum ac placerat arcu. Nulla eget risus sed eros pretium tempus in sed purus.

=back

=cut

This is genuinely annoying to handle though, because you can't really parse it all at once; e.g. it might seem logical to just include the next paragraph in the item if there's a plain =item or =item * without any additional text, but that opens a can of worms of additional cases that ought to be handled.

I'll see if I can revisit the data model in that regard to just make handling =items much easier overall, or perhaps I'll cook up some sort of "peekable" or "rewindable" iterator, or something like that.

So overall, I'm not 100% satisfied with the =item handling -- so if you consider this fine to merge, please don't do so yet. I really want to make this proper and implement something that's maintainable.

Sep 13 '24 11:09 Aequitosh

Hello! Happy new year!

It's been a little while, but I've now fixed the =item handling in a separate commit. I've also updated some very tiny spelling issues and comments here and there in older commits and commit messages, but those shouldn't matter much.

This means that this PR is ready to be reviewed again. I appreciate any feedback!

FWIW I can also meld the commits logically together instead of adding fix ... commits here and there, but I hope it's a bit easier to review now as I'm not adding everything as a huge commit at once. If you've got any preferences in that regard anyhow, please let me know.

Jan 02 '25 14:01 Aequitosh

Hi @Aequitosh, Happy New Year and thanks for continuing to push this forward! There's a lot of code here, so it may take me a while to review, but I wanted to provide some initial feedback on functionality. It looks great overall and is working for me on the WWW::Mechanize example as well as a few other key POD files.

It might makes sense to remove all of the error checking logic, and attempt to smoothly handle any non-conforming POD. Currently, when an error is encountered, it fails to retrieve any documentation at all. Removing error handling would also simplify much of the code. Some examples that currently fail:

'=back' does not have matching '=over' looks occurs in File::Basename, File::Fetch, File::Find::Rule, B::Utils, Data::Float

unexpected command paragraph "over" in "=begin future ... =end future" block happens in the core module Encode::Encoding, and a similar error occurs in Reply::Plugin::TypeTiny

Other modules may fail to parse as well, especially people's local modules which are less likely to be properly formatted.

A minor suggestion is to change the 8 spaces in tabsToSpaces to either 2 or 4 spaces (I'm currently using 2 spaces since the syntax highlighting is already a visual indicator that the section is different). When displaying docs in the autocompletion callback, horizontal space is extremely limited in the default size. Here's an example with only one level of tab. Two or three levels may be unreadable.

Also, I see that you're using Neovim. Would you mind adding perl as the language identifier in the verbatim blocks in convertVerbatimPara and seeing if it works? Vscode already natively interprets all verbatim blocks as perl, and would be a nice Neovim improvement. The idea was suggested here: https://github.com/bscan/PerlNavigator/issues/152 . I don't mean to add scope-creep, but I think it would be an easy fix for you.

Otherwise, everything seems to be working and looks like a solid improvement. Thanks again!

Jan 02 '25 20:01 bscan

It might makes sense to remove all of the error checking logic, and attempt to smoothly handle any non-conforming POD. Currently, when an error is encountered, it fails to retrieve any documentation at all. Removing error handling would also simplify much of the code.

This is a solid approach. There's going to be a lot of broken Pod out there. 😄

Jan 03 '25 00:01 oalders

Hi @bscan, thanks for the detailed response! To keep things short, I agree with all your points here and see if I can address them soon, especially the error handling part. I did keep in mind that some POD out there is more lax in its adherence to the spec, but not that lax.

Thanks a bunch! I'll address these things in additional commits.

Jan 07 '25 09:01 Aequitosh

Hello! Was quicker than I thought. The commit messages should speak for themselves, but in summary, I addressed all of your points.

Regarding docstrings not popping up for e.g. File::Basename -- that was actually a small logic error on my part. Perhaps I could clean that part up more overall, but I'll refrain from doing that for the time being, I don't want to dump more commits onto you to look at.

Also, the docstrings look much better now with syntax highlighting in verbatim paragraphs, like you suggested:

(The two empty lines (instead of one) are a Neovim thing by the way, it hides the backticks for readability unless your cursor goes over the line containing them.)

Jan 07 '25 16:01 Aequitosh

PerlNavigator PerlNavigator copied to clipboard

Rewrite POD parser and Markdown converter

Highlights

Previews

WWW::Mechanize - Before

WWW::Mechanize - After

WWW::Mechanize::new - Before

WWW::Mechanize::new - After

Regarding Spec Conformity

Closing Thoughts

PerlNavigator
PerlNavigator copied to clipboard

`WWW::Mechanize` - Before

`WWW::Mechanize` - After

`WWW::Mechanize::new` - Before

`WWW::Mechanize::new` - After