cmark icon indicating copy to clipboard operation
cmark copied to clipboard

Implement parsing extensions

Open MathieuDuponchelle opened this issue 8 years ago • 41 comments

Reference discussion: https://github.com/jgm/cmark/issues/100

I just rebased the branch on master and tested it, make passes with no new warnings all along the branch, valgrinding a table test shows no leaks.

I applied the initial review comments from @jgm and @nwellnhof , which were to not expose cmark_strbuf, which is thus included in the sources of the reference "core extension", and to not expose API using cmark's bool type, which is replaced by integers, rendering installation of config.h unnecessary.

The design implemented is the second one I proposed in the above discussion, here's a brief summary:

  • A new structure is exposed, cmark_syntax_extension, which exposes hooks that match the block and inline parsing strategies outlined in http://spec.commonmark.org/0.25/#appendix-a-parsing-strategy .

  • This proposal does not cover the implementation of new block types, instead syntax extensions need to create blocks with types defined by cmark itself, including CMARK_NODE_CUSTOM_BLOCK and CMARK_NODE_CUSTOM_INLINE, this already covers a wide range of use cases.

    That is because in my opinion, we want cmark to guarantee that the output AST will always be correctly structured (with the exception of custom inlines and blocks), with no "impossible" or "maningless" block nesting situations.

    An interesting enhancement would be to refactor the custom inline and block types to let extensions provide virtual methods such as "can_contain" for them, but this will be somehow fragile as "inter-extension negotiation" will not be possible.

  • An important part of this proposal is that cmark implement and expose block types for which it does not yet have a formal syntax specification, such as tables or strikethroughs. I believe this is desirable because this will allow multiple syntax extensions to compete for a given desired output, without tying cmark's specification to a specific syntax, while still making sure the produced output is correct. It makes sense when looking at it as a testing ground if you will.

It is important to state that this design doesn't prevent us from implementing negotiation in the future if we so desire.

I think this branch is now approaching what I'd consider mergeable quality, with the exception of the plugin code, which I think is mostly correct on Linux, but will not work on Windows. If @nwellnhof is interested in tackling the abstraction for Windows, then all the best, otherwise the code can be merged without plugin loading / discovery for now, as it is already useful for people using cmark as a library (such as I).

Last thing, this last rebase round has been pretty involved, and I'd very much like (if we all agree that extensions are desirable) that merging this branch be made a priority, I will of course address reviews in a timely manner.

MathieuDuponchelle avatar Apr 26 '16 00:04 MathieuDuponchelle

yay MSVC \o/ I'll have a look at the issues tomorrow, hoping it does not just uncover a new layer :)

MathieuDuponchelle avatar Apr 26 '16 01:04 MathieuDuponchelle

Some general notes:

This proposal does not cover the implementation of new block types, instead syntax extensions need to create blocks with types defined by cmark itself, including CMARK_NODE_CUSTOM_BLOCK and CMARK_NODE_CUSTOM_INLINE, this already covers a wide range of use cases.

But that's not what your table and strikethrough extensions do. They define their own block types in the cmark core. If your extension nodes used the custom node types, it would be more convincing that this works for other extensions, too.

Regarding dynamic loading and plugin discovery: I think we should leave this out for now, and only have a function to register an extension with a parser.

It might also be a good idea to move everything related to extensions to a separate public header file. There are some functions like cmark_set_node_type or cmark_set_string_content that are unsafe in their current form and should only be used by extension authors. If we put such functions into cmark.h, people will start using them and complain if things start to break. The extension header file should come with a warning that this is a semi-public API only for extension authors.

nwellnhof avatar Apr 26 '16 09:04 nwellnhof

Thanks for the review, answering the general comment here:

But that's not what your table and strikethrough extensions do. They define their own block types in the cmark core. If your extension nodes used the custom node types, it would be more convincing that this works for other extensions, too.

That is what my extensions do, they don't define their own block types, cmark does, it so happens that the commits for adding some of them live in the same branch. For reference here's the code for a bunch of other extensions:

https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_gtkdoc_extension.c https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_include_scanner.c

If you look at https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_gtkdoc_extension.c#L256 for example, you will see that this extension instantiates a regular CMARK_NODE_CODE_BLOCK, it's a totally legit use case for the syntax extensions design I propose to create standard block types according to their own syntax rules, this makes them robust (block precedence is enforced by cmark's core) and concise (rendering for all possible output formats is handled by cmark's core)

The key element in this proposed design is that extensions do not define their own block types, as I believe there is no satisfying way to ensure inter-extension block precedence / can_contain rules. That is also why my extensions do not use the "CMARK_NODE_CUSTOM_*" types, as I am not interested in a fragile solution, or in implementing format-specific rendering code. There is also no actual correlation between the "convicingness" (is that a word?) of my proposal and the fact that my extensions (do not) use these types.

Note also that the two new block types I implemented in cmark are two extensions for which there is (or at least seems to be) a clear support intention for upstream in the future. I think it can't really hurt to implement the rendering code for them ahead of definitive syntax rules, quite the contrary actually.

Regarding dynamic loading and plugin discovery: I think we should leave this out for now, and only have a function to register an extension with a parser.

ack, I'll split that out

It might also be a good idea to move everything related to extensions to a separate public header file. There are some functions like cmark_set_node_type or cmark_set_string_content that are unsafe in their current form and should only be used by extension authors. If we put such functions into cmark.h, people will start using them and complain if things start to break. The extension header file should come with a warning that this is a semi-public API only for extension authors.

That could make sense, along with an "unstable" warning, and maybe a preprocessor directive to enable it (CMARK_ENABLE_EXTENSION_API or something like that)

I would be interested in @jgm's opinion on that last point

Last thing, @nwellnhof I noticed you made a few comments on outdated diffs, and intentionally didn't address these, can you make these comments again on the current diff?

I've had a lot of work today, and what little time I had to work on cmark I spent addressing this (very appreciated) review, I will make the agreed-upon changes tomorrow (I hope :)

Thanks!

MathieuDuponchelle avatar Apr 26 '16 22:04 MathieuDuponchelle

I apologize for the noise here (though github should apologize really), I stick to a rebase workflow because I prefer having a clean history to work with, instead of piling up commits on top of my branch.

EDIT: seems it sort of fixed itself, at the time I wrote that github was displaying the original commits and their multiple new versions

MathieuDuponchelle avatar Apr 27 '16 21:04 MathieuDuponchelle

I have addressed all the review comments, either correcting the issues that were pointed out or answering the comments.

This branch now installs a new public header "cmark_extension_api.h", separate from cmark.h, to be included and used by syntax extension writers.

Additionally, I've also separated the plugin loading and discovery code to its own commit, however after thinking back on it I'd really like @nwellnhof to come true on the windows port promise: this is a pretty important part of the functionality that you're asking me to take away from the branch, and you have pretty vehemently protested my proposal of using the GLib, which would have offered us a portable and well-tested plugin interface.

It would be a bit ugly to oppose this proposal, state that you can port the code to Windows, then just ask me to drop it when comes the time to actually do what you promised.

If you do not have time to work on that, I propose we merge the linux implementation to start with.

@jgm, do you have time for reviewing this branch these days?

MathieuDuponchelle avatar Apr 27 '16 23:04 MathieuDuponchelle

+++ Mathieu Duponchelle [Apr 27 16 16:08 ]:

[2]@jgm, do you have time for reviewing this branch these days?

It's crunch time here, but things should open up in another week or two, I think.

jgm avatar Apr 27 '16 23:04 jgm

@jgm , ack, can you prioritize reviewing and merging this if you're happy with the design over merging (significant) patches in cmark? Keeping this branch cleanly rebased on master does require a bit of extra work every now and then and I have more than enough on my plate already as well :)

MathieuDuponchelle avatar Apr 27 '16 23:04 MathieuDuponchelle

I just think you have a better chance at getting anything merged if you try to keep the initial pull request as small as possible. So

  • I'd leave out any commits that aren't related to extensions. These should be discussed separately.
  • I'd start with block extensions only (or inline only).
  • I'd leave out plugin discovery for now. This can and should be discussed separately.
  • Also, I still don't think that it's a good idea to have the code for "core" extensions in the cmark repo, at least for now.

If you look at https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_gtkdoc_extension.c#L256 for example, you will see that this extension instantiates a regular CMARK_NODE_CODE_BLOCK, it's a totally legit use case for the syntax extensions design I propose to create standard block types according to their own syntax rules, this makes them robust (block precedence is enforced by cmark's core) and concise (rendering for all possible output formats is handled by cmark's core)

I didn't say that extensions must use custom node types. If you just need a different syntax for "standard" nodes, that's fine. But relying on hardcoded extension node types isn't very extensible. What if someone wants to write an extension for a new node type that none of the existing types map to? And your solution for tables is even worse. You actually include struct definitions for your particular idea of table nodes in node.h. This is half-baked at best. It wouldn't allow any competing table extensions to add additional fields.

I also don't like that two additional pointers are added to struct cmark_node. That's 16 bytes on 64-bit systems that everyone has to pay for, whether they use extensions or not. I haven't given it much thought, but IMO, extension nodes should be implemented as follows:

Introduce a new node type CMARK_EXT_NODE. This node type is allowed anywhere and can have any kind of children. (We could allow more control about where extension blocks can be placed later, maybe starting with separating block and inline extension nodes. I don't think this is a priority for now.) This node type has the following struct embedded in the as union:

struct cmark_ext_node {
    struct cmark_ext_node_type *ext_type;
    void *ext_data;
};

ext_data points to extension-specific node data. cmark_ext_node_type contains information to identify extension nodes and a couple of functions pointers for node-specific operations (kind of a "class" metaobject):

struct cmark_ext_node_type {
    cmark_syntax_extension *extension;
    char *name;
    void (*free)(cmark_node *node);
    // Possibly some functions for HTML rendering
    void (*start_html)(cmark_node *node, cmark_strbuf *html);
    void (*end_html)(cmark_node *node, cmark_strbuf *html);
};

This doesn't require to bake any knowledge about extensions into the cmark core, should be truly extensible, and doesn't increase memory requirements for standard nodes. It also makes it possible to have custom renderers for extension nodes.

nwellnhof avatar Apr 29 '16 12:04 nwellnhof

@nwellnhof to be honest it feels like you're mostly paying attention to the form of this proposal, and not at all to its actual design. I can break up the branch, but it makes no sense if we don't agree on the design first.

I haven't given it much thought, but IMO, extension nodes should be implemented as follows

I think this is the issue here. Your proposed design is exactly what I first implemented, see https://github.com/MathieuDuponchelle/cmark/commit/a683b597c437df30370f014d18b6a903d6a9e942

I judged this solution inferior (and still do) after giving it a bit more thought, because of this issue, which you don't address at all in your proposal:

How can extensions implemented this way negotiate what block types the block types they define can contain and cannot be contained by, including block types provided by other extensions?

Short answer is: your proposed design cannot ensure this, and saying "these new block types can contain and be contained by anything" is just not satisfying.

Regarding the "half-bakedness" of my table proposal, your criticism is that "extensions cannot add additional fields" . I'm not sure what exactly you mean by "additional fields", but the same could be said of other block types, however I have found no trouble in providing extra syntax rules for them.

Before going any further in that discussion, I'll wait for your answer to the issue I exposed above.

MathieuDuponchelle avatar Apr 29 '16 14:04 MathieuDuponchelle

By the way just clearing things out:

I also don't like that two additional pointers are added to struct cmark_node

This is not the case, only one pointer is added to support extensions, the pointer to the extension that created the node, surely that's not such a huge price to pay for extensibility.

The other pointer is the pointer to a free function for the already existing user_data field, which IMHO should have been there in the first place.

MathieuDuponchelle avatar Apr 29 '16 15:04 MathieuDuponchelle

I can break up the branch, but it makes no sense if we don't agree on the design first.

Exactly. That's why we should discuss the design first, then find a way to break up the work into separate pieces which can be reviewed and merged one by one.

How can extensions implemented this way negotiate what block types the block types they define can contain and cannot be contained by, including block types provided by other extensions?

Short answer is: your proposed design cannot ensure this, and saying "these new block types can contain and be contained by anything" is just not satisfying.

If you allow extension nodes to tell that they should act like a certain core node type in a certain context, it's very well possible to make extensions work together, even if they don't know anything about each other. For example, a table node could specify that it can appear wherever a paragraph can, and a table cell could say that it can contain whatever a document can contain. Here's a sketch of how this could be implemented: https://gist.github.com/nwellnhof/89077f57e8da0cf13631bf4a4a62ba9f

This should work for all practical use cases. It wouldn't be possible to add, say, a table row from table-extension-A to a table from table-extension-B. For that, you'd really need a centrally defined notion of what a table, table row, or table cell is. But I can't imagine where something like this would be needed. Even if it turns out that we need some well-defined extension types in the cmark core, it wouldn't rule out my proposal.

But IMO, this is a rather unimportant feature that could even be left out completely. What's the worst that could happen? If someone adds a table cell to a node that isn't a table row, they'll only get invalid HTML from the renderer. It would be nice to detect cases like this, but I don't think it's crucial.

The other pointer is the pointer to a free function for the already existing user_data field, which IMHO should have been there in the first place.

The existing user_data field is used by external code which is typically the code that calls cmark_node_free and actually knows about the lifetime of nodes. The main motivation for the user data field was to simplify memory management for the Ruby and Perl bindings, but it could be used by C client code as well. This means that with your approach, you'll need a third field if you want to allow extra data for extensions. The free_func and extension pointer could be combined in a single type object like in my proposal, but the important point is that extensions should add as little overhead as possible if they're not used at all.

nwellnhof avatar May 02 '16 08:05 nwellnhof

For example, a table node could specify that it can appear wherever a paragraph can, and a table cell could say that it can contain whatever a document can contain.

This is just terrible design.

What's the worst that could happen? If someone adds a table cell to a node that isn't a table row, they'll only get invalid HTML from the renderer.

Well apart from a crash that's indeed the worst thing I think cmark could do, I'm not sure how you can discard it as benign tbh?

The existing user_data field is used by external code which is typically the code that calls cmark_node_free and actually knows about the lifetime of nodes.

I'm not sure why you make assumptions about usage of what seems like a general purpose API to me. Furthermore, there is a pretty bad flaw with the limited use case you're presenting, which is that there either needs to be only one user for this user_data, or that each user has to keep track of the nodes which user_data it has to free.

The main motivation for the user data field was to simplify memory management for the Ruby and Perl bindings, but it could be used by C client code as well. This means that with your approach, you'll need a third field if you want to allow extra data for extensions.

So you're telling me this "user_data" is actually really a garbage collection helper, not meant for anything else?

The free_func and extension pointer could be combined in a single type object like in my proposal, but the important point is that extensions should add as little overhead as possible if they're not used at all.

Given its premises, I'm not really interested in your proposal to be honest, I think I'll wait for @jgm 's input now.

MathieuDuponchelle avatar May 02 '16 14:05 MathieuDuponchelle

By the way a better API for user data would be equivalent to https://developer.gnome.org/gobject/unstable/gobject-The-Base-Object-Type.html#g-object-set-data and https://developer.gnome.org/gobject/unstable/gobject-The-Base-Object-Type.html#g-object-set-data-full , with nodes maintaining a reference map, this would make it actually robust, that's a more intrusive change and requires a hash table implementation though.

MathieuDuponchelle avatar May 02 '16 14:05 MathieuDuponchelle

I'll be in holidays for the next two weeks, going to https://wiki.gnome.org/Hackfests/GstSpringHackfest2016 , I'll check on that issue from time to time but I can't promise immediate reactivity :)

MathieuDuponchelle avatar May 07 '16 01:05 MathieuDuponchelle

@jgm , ping?

MathieuDuponchelle avatar May 31 '16 20:05 MathieuDuponchelle

Hello @jgm, I just rebased this branch on master, and this time it was a very painful experience. I'd really appreciate it if I could get some feedback :)

MathieuDuponchelle avatar Jun 08 '16 14:06 MathieuDuponchelle

I have tried recently to absorb your changes, but got stuck on rebasing, so thank you for doing that. You've dumped a huge number of changes into this PR, and I haven't had a correspondingly huge amount of free time. I'm afraid I may be in touch only intermittently over the next two weeks. After that I should have more time.

Do you have a narrative describing at a high level how the proposed extension system works?

I may misunderstand things, but at my present incomplete level of comprehension of your approach, I have some worries. For example, it seems that extensions need to add to cmark_node_type in cmark.h (71a4105, 2e8be01). Extensions will need to be more decoupled from the core than that, somehow.

I'd also hope for a way to do things that exposes fewer of the nitty-gritty details of the parser in the public API (things like cmark_parser_has_partially_consumed_tab).

Sorry, these impressionistic comments are all I have at the moment. I understand your frustration; I'll try to take a closer look in the next week if I can.

jgm avatar Jun 08 '16 15:06 jgm

@jgm, the idea with this branch is that I would like cmark, given an AST, to always produce a correct output, except if "custom nodes" are used.

The first solution I worked on in #100 couldn't meet this requirement, as it expected extensions to define new block types, and I could not see any solution to let the "can contain / can be contained by" rules work between block types defined by multiple extensions.

I have implemented multiple extensions using my (not up to date and bundled) fork in hotdoc, for example https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_gtkdoc_extension.c , and none of them actually needed to define new types, for example the gtk-doc syntax handles this syntax:

I am linking to this_function()

The new node types I have defined in cmark are types which we will want in the future, for which we haven't yet specified a syntax, and I figured it couldn't hurt to implement rendering support for them "preemptively".

An advantage of that solution is that extensions that wish to propose new syntax ruies for existing node types do not need to define duplicate node types, their containing rules and rendering methods for each of the format commonmark currently supports. This could be solved at application level by post processing the AST but that just isn't very elegant.

It's also unclear to me how much of the internals of each renderer we would need to expose (see for example "plain" in the html renderer).

It should be noted that extensions can define their own custom types already, through CUSTOM_* nodes, and if their API was slightly reworked (with on_start and on_end made into function pointers), they would be pretty adequate for users who do not consider correct output an absolute requirement. I considered this was out of scope for this proposal, and left it out.

So to answer your concern, extensions are decoupled from the core :)

Regarding the exposed implementation details, if we want extensions to participate in the parsing process, they do need to be exposed. I haven't yet used "partially_consumed_tab" in any of my extensions, but considered it was worth exposing, as I imagine it can inform the decisions made by a syntax extension.

As @nwellnhof suggested, I exposed this new API in a separate header, which we could advertise as unstable at the beginning, possibly by #ifndefing its contents with CMARK_UNSTABLE_API ?

MathieuDuponchelle avatar Jun 08 '16 16:06 MathieuDuponchelle

Fixed the leaks, @jgm note that the first commit in that branch should be merged ASAP if correct, as it fixes adding user-created nodes to the AST.

MathieuDuponchelle avatar Jun 08 '16 16:06 MathieuDuponchelle

ping again @jgm?

MathieuDuponchelle avatar Aug 20 '16 14:08 MathieuDuponchelle

There is now a PHP PECL extension of cmark (https://github.com/krakjoe/cmark), but in order to utilize it within https://github.com/thephpleague/commonmark (which is highly extensible), the underlying libcmark library needs to support this.

It would be nice to get some traction on this again.

markhalliwell avatar Apr 01 '18 16:04 markhalliwell

@markcarver I completely agree that this would be lovely, but we'd need to get @jgm back in here, and validate the design decision made by this PR, which was to implement support for output node types (eg tables) without an official agreed upon input syntax. I am still pretty convinced this is a safe enough decision, but well :)

MathieuDuponchelle avatar Aug 13 '18 22:08 MathieuDuponchelle

Any updates on this?

andrewshadura avatar Nov 26 '19 19:11 andrewshadura

ping @jgm

MathieuDuponchelle avatar Dec 02 '19 12:12 MathieuDuponchelle

Sorry, I just haven't had the bandwidth to devote to this. This extension mechanism has been implemented in github's fork of cmark. Those who need extensions can get them there. Since this library purports to be the reference implementation, and it needs to be updated every time the spec is changed, it makes sense to me to keep it simple and not worry about extensions for now at least....

jgm avatar Dec 04 '19 03:12 jgm

Well let's just close this then.

Edit: it's a bit annoying that a fork was the solution to this

MathieuDuponchelle avatar Dec 04 '19 08:12 MathieuDuponchelle

Please don't close this issue. I still believe it's needed as a primary feature in cmark, regardless of how long it may take.

Offering https://github.com/github/cmark-gfm as a "solution" is really much of a solution for https://github.com/krakjoe/cmark or https://github.com/thephpleague/commonmark

Having the ability to extend cmark should be a global ability, not some GFM implementation.

markhalliwell avatar Dec 04 '19 15:12 markhalliwell

I'd also favor keeping this open. I'm not saying the fork was the solution. I'm just saying that, for now, I don't have the time to deal with it, and keeping the profile of this project slimmer makes it easier for me to keep up with it.

jgm avatar Dec 04 '19 17:12 jgm

I'm open to merging this, really. But it's a huge patch and surveying it is not simple. And I'm reluctant to ask @MathieuDuponchelle to rebase it again so it can be reviewed, given my past footdragging on this. If someone else wants to try to get it in shape, I can take a look, but no guarantees.

jgm avatar Dec 04 '19 17:12 jgm

I think it would make more sense to split this PR up, i.e. the extension API(s) as a single PR and then the actual subsequent extension implementations into their own individual PRs?

markhalliwell avatar Dec 04 '19 17:12 markhalliwell