pegdown icon indicating copy to clipboard operation
pegdown copied to clipboard

White spaces in links reference are not supported

Open tmortagne opened this issue 9 years ago • 23 comments

We have links reference that can contain white spaces but the following

[labal](Space.Space - NiceName)

is not even interpreted as a link.

tmortagne avatar Jan 08 '16 14:01 tmortagne

@tmortagne, it isn't a link. A link would be [labal](Space.Space - NiceName)

The parentheses and brackets should be reversed.

Reference links are only interpreted as such if the reference link resolves to a real reference. Otherwise, it is treated as a "fake" reference link and is output as is during HTML serialization.

Can you share a sample file of your markdown that you think is not being interpreted correctly so I can take a look at it? Maybe there is a bug in the parsing.

vsch avatar Jan 08 '16 14:01 vsch

Yes I meant

[labal](Space.Space - NiceName)

it was a bad copy/paste.

tmortagne avatar Jan 08 '16 14:01 tmortagne

@vsch This was my fault. I reported the link wrongly. The syntax has been updated by @tmortagne in the meantime.

With the correct syntax, it doesn't work :)

Trefex avatar Jan 08 '16 14:01 Trefex

Reference links are only interpreted as such if the reference link resolves to a real reference. Otherwise, it is treated as a "fake" reference link and is output as is during HTML serialization.

We don't use HTML serialization, we use pegdown as parser only and have our own reference sytax which does not seems to collide with link syntax in this case so I was expecting pegdown to just extract it and give it.

tmortagne avatar Jan 08 '16 14:01 tmortagne

@tmortagne , that too is correct because the link should provide a URL and spaces are not part of acceptable URL characters. You need to URL encode your link, spaces are either + or %20 the complete link syntax for markdown is:

[label](url_address "title")

as you can see after the space pegdown expects to see a ) or a double quoted string. Anything else it interprets as not a link.

vsch avatar Jan 08 '16 14:01 vsch

BTW, you can use spaces for links but only in wiki links [[wiki link]] which will link to a file named wiki-link.html, but only if you have the WIKILINKS extension enabled.

vsch avatar Jan 08 '16 14:01 vsch

the link should provide a URL

Ok the part I was missing is the fact that the reference is URL syntax so indeed it's a bug on our side and we should pass the reference in a URL decoder before dealing with it then. Thanks for the explanations.

tmortagne avatar Jan 08 '16 14:01 tmortagne

BTW, you can use spaces for links but only in wiki links [[wiki link]] which will link to a file named wiki-link.html, but only if you have the WIKILINKS extension enabled.

hmm actually I did not know there was several kinds of links, but is there a way to define a label with wiki links ?

tmortagne avatar Jan 08 '16 14:01 tmortagne

@tmortagne, I fixed an issue with link parsing in my fork: https://github.com/vsch/pegdown in the develop branch that deals with not using the serializer but working with the AST directly.

The current implementation makes it impossible to tell the difference between [reference] and [reference][] with a dummy reference because it sets node.referenceKey to null for the RefLinkNode and RefImageNode in both cases. I modified the parser to use a non-null dummy reference node for the latter syntax but this requires that you test node.referenceKey != null && node.referenceKey.getChildren() != 0 to detect if the node.getChildren() represents a label for the ref link or text and the reference is in the node.referenceKey.

I haven't had the chance to prepare and post a PR to have these changes merged into the next pegdown release. I don't know if that will help your use case, but thought I'd mention it.

I too use the AST directly to generate a parse tree for my Intellij plugin idea-multimarkdown and pegdown has a few limitations in that department that I am fixing as I find them.

vsch avatar Jan 08 '16 14:01 vsch

@tmortagne, yes: [[Wiki Link|Wiki Label]] the order of the two is implemented in the LinkRenderer. The parser does not care and just gives you the text between [[ and ]]. I use a modified LinkRenderer to apply GitHub Wiki Link syntax which is the reverse of the standard 'Creole' syntax pegdown uses: [[Wiki Label|Wiki Link]]

If you have the flexibility wiki links are probably the option for you since it doesn't care about the contents between delimiters.

vsch avatar Jan 08 '16 15:01 vsch

@tmortagne, if you want wiki links to be recognized just remember to pass Extensions.WIKILINKS as an option to the parser, otherwise it will ignore the wiki links in the input.

vsch avatar Jan 08 '16 15:01 vsch

Actually we do support wiki links but I'm not the one who worked on that so I'm discovering it. But looks like @Trefex need for label is not covered by this syntax.

tmortagne avatar Jan 08 '16 15:01 tmortagne

The parser does not care and just gives you the text between [[ and ]].

It's not exactly what I would call label support if I do the parsing myself but Ok I see what you mean :) At least the standard syntax support it so I should create an issue about properly support wiki link label syntax on our side.

tmortagne avatar Jan 08 '16 15:01 tmortagne

@tmortagne, I agree but pegdown is geared more towards html generation and takes liberties in the parser that are compensated in the ToHtmlSerializer.

You really need to look at the processing done for each node type in the ToHtmlSerializer() to make sure you handle all the gotchas in the parser. There are quite a few that I tripped up on, so the suggestion is a voice of experience.

vsch avatar Jan 08 '16 15:01 vsch

@tmortagne in my plugin that has to build a parse tree for IntelliJ I go through hoops and wind up doing secondary lexing of the node text to handle its limitations. I pain, I must admit. Still a lot less work than rewriting and testing the parser from scratch. I just fix the bugs and add a few extensions here and there to deal with the edge cases.

vsch avatar Jan 08 '16 15:01 vsch

Still a lot less work than rewriting and testing the parser from scratch

I'm not considering getting rid of pegdown, it's still great so far :)

tmortagne avatar Jan 08 '16 15:01 tmortagne

I guess this issue is more or less a duplicate of #92.

tmortagne avatar Jan 08 '16 15:01 tmortagne

@tmortagne, when I added wiki link label support I did not modify the parser to avoid backward compatibility but it would be easy to add to the parser so that Wiki link nodes will have a label and address fields. This would also allow adding an extension flag to select between Creole and GitHub syntax differences that currently is not possible because LinkRenderer does get the parser options.

It is an interesting option.

vsch avatar Jan 08 '16 15:01 vsch

@tmortagne, I think that that issue was addressed with the latest release since labels are now handled and properly generated in HTML if you want Creole syntax. However, if you want it in the AST that part needs pegdown parser modification or a parser plugin.

vsch avatar Jan 08 '16 15:01 vsch

Ok #92 might be seen from HTML rendering point of view. But that would certainly be nice in the AST :)

tmortagne avatar Jan 08 '16 15:01 tmortagne

@tmortagne, I just assume that most people must be using it to generate HTML. Otherwise they would have encountered and reported the bugs in the AST.

vsch avatar Jan 08 '16 15:01 vsch

@tmortagne, for my project the effort of maintaining and supporting pegdown related issues became unbearable.

I wound up rewriting commonmark-java to replace pegdown in my Markdown Navigator plugin for IntelliJ IDEs: https://github.com/vsch/idea-multimarkdown.

The parser project is https://github.com/vsch/flexmark-java has very detailed source based AST with source offset for every part of the element. I need that for syntax highlighting and other plugin source reliant features.

The AST is very regular, part of all the tests not just ones geared for AST testing and source offsets are free of idiosyncrasies and easily adjusted. It is fully modifiable unlike pegdown's with next, prev and parent links that you can walk, unlink and move around if needed. There are many ways to extend the parser including turning off core parsing features. I geared it for extensibility and made sure handling of AST nodes is uniform whether it is part of the core or part of an extension. So extensions can and do add their own nodes to the AST.

It is CommonMark 0.27 (GitHub Comments) compliant but has parser configuration options to emulate list indentation rules used by: markdown.pl, MultiMarkdown (like pegdown 4 space indents) and kramdown (GitHub Docs). The only extensions that pegdown has that I did not yet implement are: typographic quotes, smarts and definition lists. The rest of the extensions are available, with some extra ones that pegdown does not have.

As an added bonus and what motivated me to write my own parser: the parsing is 30-50x faster than pegdown on average documents and several thousand times faster on pegdown's pathological input like [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[.

Maintaining it is a walk in the park compared to what I had to do with pegdown. The tests are in a modified markdown format, actually CommonMark test spec format extended to include the AST. The AST generation is tested on every test not just some chosen few like it was on pegdown. I literally lost weeks debugging AST source offset errors in pegdown for cases that were not part of its AST tests.

I also have code to convert from pegdown Extensions flags to my uniform configuration API that I used in the initial migration from pegdown to flexmark-java. So if you are ever on the lookout for a pegdown replacement let me know and I will give you a hand in your migration trial.

vsch avatar Dec 13 '16 00:12 vsch

@vsch this does look very interesting indeed. Markdown is not in my TODO list these days but I created http://jira.xwiki.org/browse/MARKDOWN-16 to remember about it if someone want to looks at it in more details

tmortagne avatar Dec 13 '16 08:12 tmortagne