commonmark-java icon indicating copy to clipboard operation
commonmark-java copied to clipboard

Make inline parsing extensible

Open robinst opened this issue 6 years ago • 7 comments

Similar to how custom block parser can be registered today, it would be good to make inline parsing extensible as well. One use would be to improve autolinking, see #112.

The way this could work:

  • When we match a character during inline parsing, the last fallback is to treat it as text. Before we do that, if there are custom inline parsers, we could call each one (with some priority maybe?) and give them the opportunity to handle that.
  • In a custom inline parser, it could parse the input and return/add Nodes. It could also advance the current parsing position. So inline parsing needs to expose some API (or the custom inline parser would return instructions).

To support autolinking, a custom inline parser would also need to be able to get the text before the matching character, probably all the plain text characters up to that point. It would also need a way to detect if any brackets are active or not ("are we inside a link?").

robinst avatar Nov 10 '17 06:11 robinst

I'm pretty sure I'm not gonna pick this up in the near future. But if someone else wants to have a go at it, I'm willing to mentor. Help wanted :).

robinst avatar Nov 10 '17 07:11 robinst

Hi @robinst, I am willing to work on it. However as a newbie of both codebase and Commonmark spec. Just wanna clarify several points first to make sure I m on the same page as you guys.

  • By make inline parsing extendable, do you mean first refactor inlineParserImpl one central class into multiples inlineParserImpl for all inline components that listed on spec. So each of the components impl can have chance to parser the content in a whole line lv instead of how it handle at the moment where InlinePraserImpl.parseInline() ?
  • Also by this mean, interface BlockParser.parserInlines() would have a new signature which take a list of InlineParser as a input?

HiuKwok avatar Apr 20 '18 03:04 HiuKwok

Hi everyone, several commit have been made for let Both parser and doc parser class to accept multiple instances of InlineParser. 01f41492867e30ac067928989d1aaed679e9a735 9b28f867d156996ce5281301f7cd9bc18296d7ea e145fc521cd915a2a2cc10e09f95d65141636bcb

My next thought would be alter InlineParser interface to make it return a parsing progress object which is similar to BlockContinue class, so when block parser attempt to parser it content via lists of InlineParser. Every inlineParser would be able to pick where last InlineParser left up to.

Am I on the right track? Plz advise

HiuKwok avatar Apr 25 '18 02:04 HiuKwok

Every inlineParser would be able to pick where last InlineParser left up to.

Yes, that's the rough idea (similar to block parsers).

By make inline parsing extendable, do you mean first refactor inlineParserImpl one central class into multiples inlineParserImpl for all inline components that listed on spec. So each of the components impl can have chance to parser the content in a whole line lv instead of how it handle at the moment where InlinePraserImpl.parseInline() ?

Ideally yes.

Also by this mean, interface BlockParser.parserInlines() would have a new signature which take a list of InlineParser as a input?

Not sure about this bit, I think it would still be useful to have a single class for parsing all the inlines. Not sure what to call that though, if we use InlineParser for the thing that knows how to parse a single inline element syntax.. Maybe InlineContentParser?

robinst avatar Apr 26 '18 06:04 robinst

@robinst First of all, thx for the reply.

I think it would still be useful to have a single class for parsing all the inlines.

So more like a new parser InlineContentParser created which only know how to parse a specific inline object type?

Let say having a CodeSpanInlineContentParser which implement from InlineContentParser and only responsible to parse Code Span item and nothing more than that. Then return a InlineContinue Object which would let the next InlineContentParser know where to resume the parsing.

While all of above is happening under InlineParser.parse( ) so in one way InlineParser method signature stay unchanged. And the parsing process which be able to split into multiple class which implements InlineContentParser.

HiuKwok avatar Apr 26 '18 08:04 HiuKwok

Hi everyone,

I drafted a sample of how a InlineContentParser can be implements (32478ca54a3b94df070e564f853e0109cd0e6c7e).

Basically what have done are first created a new InlineContentParse which have a parse() method. Which then implemented by AbstractInlineContentParse.

Then from this point onward a specific inlineContentParser class can be created to parse specific type of inlineItem. For this example, created a class call NewLineBreakParser to handle soft & hard LineBreak parsing which originally handled by InlineParserImpl internally.

HiuKwok avatar Apr 27 '18 09:04 HiuKwok

Hello everyone who might be interested in this!

I've been able to achieve the customization of inline parsing in my library recently. I won't say it is perfect but it is fairly simple and implemented in a way so it's relatively easy to keep it upstream. The implementation keeps all current logic and the way inline parsing is done in general but introduces InlineProcessor abstraction. Each InlineProcessor handles unique char and is a direct copy of what can be found in specific methods of InlineParserImpl. You can see the code here

At this point I don't think commonmark-java should jump at having one inline parsing implementation, extensible or not. But having an alternative is always a nice thing. Unfortunately InlineParserImpl is using some classes that are in the internal package (although they are public and can be accessed) which makes it questionable for replication. They are:

import org.commonmark.internal.Bracket;
import org.commonmark.internal.Delimiter;
import org.commonmark.internal.ReferenceParser;
import org.commonmark.internal.util.Escaping;
import org.commonmark.internal.util.Html5Entities;
import org.commonmark.internal.util.Parsing;
import org.commonmark.internal.inline.AsteriskDelimiterProcessor;
import org.commonmark.internal.inline.UnderscoreDelimiterProcessor;

There is also the StaggeredDelimiterProcessor that is not available outside of its package which forces possible replicators of InlineParserImpl to copy the source code.

So, I think that it would be a good thing to move mentioned above classes to public surface. At least it would make possible custom implementations of InlineParser

Any feedback on my solution is welcome and if someone finds it helpful I can convert it to a Java module (currently it is published as an Android library)

noties avatar Dec 11 '19 20:12 noties