Support for syntax extensions
Transclusion Syntax
There's often lively discussion on how to do transclusion syntax in markdown. This old discussion on the commonmark forum gives a decent overview.
A few popular ideas are:
- path/to/file.md: Just pasting the URL (from iA Writer)
- Repurposing image syntax (from Obsidian)
- {{path/to/file.md}}: Using double braces (from MultiMarkdown)
- {{#include path/to/file.md}}: Double braces plus keyword (from mdBook, also see #34)
Supporting a plethora of different syntaxes in this repo is probably not maintainable, and would complicate development. An alternative would be to specify an API for defining include syntaxes and then to allow users to extend syntax support via that API.
Status Quo
Right now pandoc-include only supports the !include`<args>` path/to/file syntax. It has two advantages: it's unambiguous, and it allows passing arguments to the directive, like you would in reST. But it is a bit verbose and doesn't offer graceful degradation in other contexts.
The maintainer is open to a PR on this (see discussion in #54).
Block or Inline Includes?
Should this inclusion syntax only work for including "blocks" of content? Here, "block" means a paragraph containing only a single element that meets some criterion specified by the syntax implementation. Or should it also work for inline elements, like the image syntax?
I feel like only supporting blocks is reasonable, because the alternative would necessitate inlining the file contents, which seems kinda complicated for little to no benefit.
Background
It seems like one could implement support for multiple syntaxes by adding an is_include_line function^[1] for each new syntax to support.
Right now the function effectively has a signature of is_include_line(elem: Para) -> tuple[int, str, dict], where the int is the include type (may as well be an Enum), the str is the file name, and the dict contains config values.^[2]
[1]: Maybe it also makes sense to allow extending is_code_include, but that calls is_include_line internally, so maybe it's unnecessary.
[2]: is_code_include has the same signature and semantics as is_include_line.
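To make that signature concrete, here's a rough sketch of what an additional is_include_line for the double-brace syntax could look like. The Para stand-in, the INVALID/INCLUDE_FILE codes and the regex are all made up for illustration; the real function receives a panflute Para and uses pandoc-include's own constants.

```python
import re
from dataclasses import dataclass
from typing import Optional


# Stand-in for panflute's Para, reduced to its plain text so the sketch
# runs without panflute installed (assumption, not the real class).
@dataclass
class Para:
    text: str


# Hypothetical include-type codes; the real values live in pandoc-include.
INVALID, INCLUDE_FILE = 0, 1

# Hypothetical pattern for the MultiMarkdown-style {{path/to/file.md}} syntax.
# '#' is excluded so that {{#include ...}} is left to a different syntax.
RE_DOUBLE_BRACES = re.compile(r"^\{\{\s*(?P<path>[^{}#]+?)\s*\}\}$")


def is_include_line(elem: Para) -> tuple[int, Optional[str], Optional[dict]]:
    """Return (include type, file name, config) for a double-brace include."""
    match = RE_DOUBLE_BRACES.match(elem.text.strip())
    if match is None:
        return INVALID, None, None
    # This syntax carries no arguments, so the config dict stays empty.
    return INCLUDE_FILE, match.group("path"), {}
```

A paragraph that doesn't match is reported as invalid, so the caller can fall through to the next syntax or leave the element untouched.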
Extension mechanism
A few popular syntaxes could be supported by default and the rest specified by users, but how would pandoc-include know about a user-defined syntax?
External Packages
Users could implement new syntaxes in external packages.
We could then use dynamic imports to load third party is_include_line functions.
Extensions could be namespace packages that reside under a common prefix like pandoc-include-ext.pkgname.
This is quite common in Sphinx extensions, cf. the sphinxcontrib namespace.
A metadata option could then be used to activate an installed syntax.
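The dynamic import part could be as small as the following sketch. The load_syntax helper and its attr parameter are hypothetical names; only importlib.import_module is an actual stdlib call.

```python
import importlib
from typing import Callable


def load_syntax(module_name: str, attr: str = "is_include_line") -> Callable:
    """Import a third-party syntax module and return its inclusion function.

    module_name would come from a metadata option (assumption).
    """
    module = importlib.import_module(module_name)
    func = getattr(module, attr, None)
    if not callable(func):
        # Fail early with a clear message instead of crashing mid-filter.
        raise ImportError(f"{module_name!r} does not define a callable {attr!r}")
    return func
```

Checking for the function right after the import means a misconfigured syntax name is reported once, up front, rather than on the first paragraph the filter visits.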
Pass Syntax Rules as Options
Another approach would be to just pass all the necessary information via the metadata options.
It seems that the heart of the include-recognition logic is in is_include_line and extract_info, which heavily rely on the regexes RE_IS_INCLUDE_LINE, RE_IS_INCLUDE_HEADER, and RE_INCLUDE_PATTERN. Maybe it's possible to define a new syntax by just passing these three regexes.
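As a rough idea of how this might look, here's a simplified sketch that builds a recognizer from a single user-supplied regex instead of three. The make_recognizer helper and the convention of a named group called path are assumptions for illustration, not anything pandoc-include defines.

```python
import re
from typing import Callable, Optional


def make_recognizer(pattern: str) -> Callable[[str], Optional[str]]:
    """Build a function that extracts the include path from a line.

    pattern is a regex taken from the metadata (assumption); it must
    contain a named group 'path' that captures the file name.
    """
    regex = re.compile(pattern)
    if "path" not in regex.groupindex:
        raise ValueError("pattern must define a named group 'path'")

    def recognize(line: str) -> Optional[str]:
        # The whole line has to match, which keeps includes block-level.
        match = regex.fullmatch(line.strip())
        return match.group("path") if match else None

    return recognize


# Example: the mdBook-style syntax expressed as a user-supplied regex.
mdbook = make_recognizer(r"\{\{#include\s+(?P<path>[^}]+?)\s*\}\}")
```

Even this reduced version shows the trade-off: it's short, but all the behaviour hides inside a pattern string that users have to get exactly right.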
Both approaches have advantages and disadvantages.
External packages give much more control over the logic, but also require much more effort to implement. Passing regexes via metadata seems much simpler, both to develop and to use, but it has limited extensibility and seems like kind of an ugly solution (hard to debug and full of confusing regex black magic).
@DCsunset Are you okay with using the external package mechanism for extensions and limiting usage to block includes?
If so, I'd start looking into this.
Yes I'm okay with it. Thanks for your interest in looking into it!
So, it's been a while, but I finally got around to taking a look at this. I've pushed my results to the feat/transclusion_syntax branch on my fork.
I'd like for you to take a look when you have the time, and check if you're okay with my changes so far.
Overview
Changes so far:
- created namespace package
- refactored, type hinted, renamed and documented some stuff
- introduced an enum for inclusion types
- changed the API for the inclusion functions
- reworked the Env class in config module
- implement Obsidian-style transclusion syntax
- add include-syntax config-key
Details
Namespace Package
I have created a pandoc_include.syntax sub-package, and started with moving the default transclusion logic to a module in that package.
Docs
Afaics, the return type of the is_include_line and is_code_include functions is tuple[int, Optional[str], Optional[dict]], where the int encodes the include type. I've added these type hints to the functions.
IncludeType Enum
I've also taken the liberty of introducing an IntEnum called IncludeType, which replaces the global variables.
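For reference, such an enum could look roughly like this. The member names and values are hypothetical and may not match the ones on the branch.

```python
from enum import IntEnum


class IncludeType(IntEnum):
    """Kinds of include an inclusion function can report.

    Member names and values are placeholders for illustration.
    """

    INVALID = 0
    FILE = 1
    HEADER = 2
    CODE = 3


def describe(include_type: int) -> str:
    """Turn a raw int or an IncludeType into a readable name."""
    return IncludeType(include_type).name
```

Because IntEnum members compare equal to plain ints, code that still checks the old integer values should keep working during the transition.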
API change
It might be useful to have access to the Doc object in the inclusion functions, so I've added a doc: Optional[pf.Doc] = None argument to the functions. This should not be a problem, because we demand a default value of None (and shouldn't even execute the functions if there's no document to begin with).
Env class
Also, I fiddled around with the Env class a bit, because it's kinda weird. I turned it into a dataclass, got rid of the static nature and renamed its members.
include-syntax config-key
When someone sets include-syntax to the name of a package, that will be imported and its inclusion functions will be used.
Obsidian-style transclusion
Added logic and tests, but it won't work yet, because Pandoc doesn't return a Para(Image(...)), but a Figure(Plain(Image(...))) block instead. This is not caught by the if isinstance(elem, pf.Para) check in the action.
Any ideas how to proceed with this? I'd have modules register which Block elements they want as a result. E.g. the obsidian syntax module tells pandoc-include "Give me Figure and Para objects". But this would probably necessitate changing stuff in the action as well.
@DCsunset Last point is important and requires feedback
@FynnFreyer Your changes generally look good to me. As for the last point, I think one solution is to pass all types of elements to the syntax module so the module can decide how to handle them (it can simply return invalid if it's not an include element).
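A minimal sketch of that idea for the image-based syntax, with small stand-in classes instead of the real panflute elements so it runs on its own. The INVALID/INCLUDE_FILE codes and the helper name are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-ins for the panflute element types (assumption: reduced to the
# bare minimum needed here, not the real classes).
@dataclass
class Image:
    url: str


@dataclass
class Para:
    content: list


@dataclass
class Plain:
    content: list


@dataclass
class Figure:
    content: list


# Hypothetical include-type codes.
INVALID, INCLUDE_FILE = 0, 1


def _single_image(content: list) -> Optional[Image]:
    """Return the image if it is the only child, otherwise None."""
    if len(content) == 1 and isinstance(content[0], Image):
        return content[0]
    return None


def is_include_line(elem) -> tuple:
    """Accept any element and decide locally whether it is an include."""
    image = None
    if isinstance(elem, Para):
        image = _single_image(elem.content)
    elif (
        isinstance(elem, Figure)
        and len(elem.content) == 1
        and isinstance(elem.content[0], Plain)
    ):
        image = _single_image(elem.content[0].content)
    if image is None:
        # Not an include element: report invalid and let the action move on.
        return INVALID, None, None
    return INCLUDE_FILE, image.url, {}
```

With this shape the action no longer needs its own isinstance check; it hands every element to the syntax module and only acts on non-invalid results.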
It's okay to change stuff in the action function or other places as you see fit. This will be included in a major release so it's a good opportunity to refactor some old stuff (while trying to minimize the changes perceivable by end users).
For the namespace package part, I think we could allow external packages in the future by making the second parameter of import_module customizable as well. That way, the syntax package doesn't have to reside in this repo.
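In code, that could be sketched as below. The second parameter of importlib.import_module is package, which anchors a relative module name; the import_syntax wrapper and its default value are assumptions for illustration.

```python
import importlib
from types import ModuleType


def import_syntax(name: str, package: str = "pandoc_include.syntax") -> ModuleType:
    """Import the syntax module `name` relative to `package`.

    The default keeps built-in syntaxes in this repo; passing another
    package would let the module live elsewhere (assumption).
    """
    # A leading dot makes the name relative to the given package.
    return importlib.import_module(f".{name}", package=package)
```

For example, import_syntax("decoder", package="json") resolves to the stdlib module json.decoder, which shows the same mechanism with a package that is always available.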