wip: mdoc reader
There's a substantial amount of work left to do here, but since I'm going on vacation for four weeks on Monday and not bringing a computer with me, it seems reasonable to put up a draft PR. I welcome feedback on what I've done so far.
closes #9056
I have only had a very cursory look, but one question that comes to mind is why you have a new lexer with a new kind of token. Is lexRoff from T.P.Readers.Roff inadequate for mdoc? Why? Could it be improved instead of adding a new module that does the same thing?
Part of it was just that I wanted to figure out how to implement this without kitbashing the Roff lexer beyond recognition or having to keep the Man reader in sync with my changes, though I did end up extracting and reusing the escape sequences. But in a few ways the needs are fairly different.
The token type used by lexRoff in T.P.R.Roff is based on roff's native syntax, where control lines start with a request or a macro and any further arguments in the control line are simply arguments to that macro. Hence the token type constructor of ControlLine Text [Arg] SourcePos where the Text is the macro or request name and each Arg is handled as either a keyword or as literal text by the macro/request.
While the mdoc format inherits the superficial elements of roff syntax and in GNU groff is still implemented as a package of roff macros, mdoc macros themselves have a more complicated syntax. See MACRO SYNTAX in mandoc's mdoc(7) manual. The upshot is that the arguments to many macros are themselves parsed for macro calls, and in turn many macros can be called in argument position. (Cf. "Callable"/"Parsed" attributes of each macro.)
So the Mdoc.Lex lexer, instead of packaging all the arguments on a roff control line together, lexes each token from the control line individually and emits a totally linear token stream, which is more amenable to recursive parsing of macro arguments/multiple macros in one line. The lexer uses the rules for callable and parsed macros to decide whether to lex a control argument as a Macro token or as a Lit (non-macro text). It's especially handy to make this determination in the lexer because it directly takes care of escaping macro names in argument position: \&No gets lexed as Lit "No", because \& isn't a legal character to start a macro name.
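As a sketch of that classification step (with a stub callable-macro table standing in for the real one derived from mdoc(7), and handling only the `\&` case of escaping), the decision might look like:

```haskell
data MdocToken = Macro String | Lit String | Eol
  deriving (Eq, Show)

-- Stub table of callable macros; the real reader derives this from
-- the "Callable" attributes listed in mdoc(7).
isCallable :: String -> Bool
isCallable = (`elem` ["Sy", "Em", "No", "Ar"])

-- Classify one argument on a control line. A leading "\&" is a
-- zero-width escape: it protects the word from being read as a
-- macro call, because "\" isn't a legal first character of a
-- macro name.
classifyArg :: String -> MdocToken
classifyArg ('\\' : '&' : rest) = Lit rest
classifyArg w
  | isCallable w = Macro w
  | otherwise    = Lit w
```

(In the real lexer the decision also depends on whether the enclosing macro is parsed; this sketch ignores that.)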
For example:
.Sy hello Em world
I lex this as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol]. So a notional parseSy and parseEm (simpler than the ones in this branch) can boil down to this:
parseSy = do
  macro "Sy"
  args <- manyTill lit (lookAhead anyMacro <|> eol)
  return $ strong $ mconcat $ intersperse space (map toString args)

parseEm = do
  macro "Em"
  args <- manyTill lit (lookAhead anyMacro <|> eol)
  return $ emph $ mconcat $ intersperse space (map toString args)
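To show the same idea without pulling in a parser-combinator library, here is a hypothetical standalone walk over the flat token stream (names `spanLits` and `parseLine` are mine, not from this branch) that pairs each macro with its literal arguments:

```haskell
data Tok = Macro String | Lit String | Eol
  deriving (Eq, Show)

-- Collect literal arguments up to the next macro or end of line.
spanLits :: [Tok] -> ([String], [Tok])
spanLits (Lit s : ts) = let (ls, rest) = spanLits ts in (s : ls, rest)
spanLits ts           = ([], ts)

-- Pair each macro with its space-joined arguments; because the
-- stream is flat, several macros on one control line fall out of
-- plain recursion.
parseLine :: [Tok] -> [(String, String)]
parseLine (Macro m : ts) = let (args, rest) = spanLits ts
                           in (m, unwords args) : parseLine rest
parseLine (Eol : ts)     = parseLine ts
parseLine []             = []
```

Running this on the token stream above pairs "Sy" with "hello" and "Em" with "world".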
If my token stream were of the existing RoffToken type, I would need to do an intermediate step to transform a ControlLine into a flat structure where macros are distinguished from lits. That's seemingly straightforward enough: ControlLine "Sy" ["hello", "Em", "world"] could become a list of my token type via something like
roffTokenToMdocTokens (ControlLine nm args) =
  Macro nm : map litOrMacro args <> [Eol]
  where
    litOrMacro x
      | isParsedMacro nm && isCallableMacro x = Macro x
      | otherwise                             = Lit x
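Fleshed out with stub attribute tables (hypothetical stand-ins for the Parsed/Callable tables in mdoc(7)), that translation sketch is runnable:

```haskell
data RoffToken = ControlLine String [String]
data MdocToken = Macro String | Lit String | Eol
  deriving (Eq, Show)

-- Stub predicates; the real reader would consult the Parsed and
-- Callable attributes of each macro per mdoc(7).
isParsedMacro, isCallableMacro :: String -> Bool
isParsedMacro   = (`elem` ["Sy", "Em"])
isCallableMacro = (`elem` ["Sy", "Em"])

-- Flatten one roff control line into a linear mdoc token stream:
-- an argument becomes a Macro token only when the enclosing macro
-- is parsed and the argument names a callable macro.
roffTokenToMdocTokens :: RoffToken -> [MdocToken]
roffTokenToMdocTokens (ControlLine nm args) =
  Macro nm : map litOrMacro args <> [Eol]
  where
    litOrMacro x
      | isParsedMacro nm && isCallableMacro x = Macro x
      | otherwise                             = Lit x
```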
But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach. The following two lines get the same lex from the current lexRoff:
.Sy hello Em world
.Sy hello \&Em world
All of the above leaving aside the handling of delimiters required by mdoc but irrelevant to man, which is also convenient to deal with in the lexer.
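For reference, a minimal sketch of that delimiter classification, assuming the opening/closing delimiter sets as listed in mandoc's mdoc(7) (names here are mine, not from this branch):

```haskell
-- Closing delimiters attach to the preceding word with no
-- intervening space; opening delimiters attach to the following
-- word. Parsed macros treat these punctuation tokens specially.
data Delim = Open | Close | NotDelim
  deriving (Eq, Show)

classifyDelim :: String -> Delim
classifyDelim w
  | w `elem` ["(", "["]                               = Open
  | w `elem` [".", ",", ":", ";", ")", "]", "?", "!"] = Close
  | otherwise                                         = NotDelim
```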
Finally, the Roff lexer implements roff's macro definition requests, so it will actually expand any custom macros that are defined in a manual page read by the Man reader. This is very neat but I think it is an antifeature for mdoc documents, where use of raw roff requests at all, let alone custom macros, is discouraged and hopefully vanishingly rare in the wild. Only a subset of raw roff requests are supported by mandoc, and only about 3 are in use in mdoc manuals in the OpenBSD base system. So my intention was to not include that feature in the mdoc reader.
The bottom line of all this is that RoffToken and MdocToken are pretty different because the associated readers need different information from each control line. But all that being said, I guess it's plausible to at least base the lexers on some shared code by expanding on my (misnamed) RoffMonad typeclass found in T.P.R.Roff.Escape with functions like lexControlLine, lexTextLine. I'm not sure how much code would actually end up being shared though. Ultimately the MdocToken type I introduced is proving pretty adaptable to the things I need it to do and if I did try to reuse the existing lexRoff I'd probably still translate RoffToken to MdocToken for use in the parsers.
But! The existing roff lexer, by applying escapes, will have already destroyed some information we need if I take this approach.
I'd like to understand this better. I would have thought that low level roff stuff like escapes was common currency for man, ms, and mdoc. Can you explain further why we can't handle the escapes in the lexer as we were doing?
We do continue to handle the escapes in the lexer, and I'm reusing all the escaping code from T.P.R.Roff, now moved to T.P.R.Roff.Escape. There's just an interaction between applying escapes and tokenizing control lines that needs to be handled differently for mdoc. I'll hopefully make my example from before clearer. Consider these two control lines:
.Sy hello Em world
.Sy hello \&Em world
The Roff lexer lexes these as (the moral equivalent of) [ControlLine "Sy" ["hello", "Em", "world"], ControlLine "Sy" ["hello", "Em", "world"]]. (There are a couple more types involved in the argument list, but the contents boil down to Texts in this instance.)
Mdoc.Lex lexes these as [Macro "Sy", Lit "hello", Macro "Em", Lit "world", Eol, Macro "Sy", Lit "hello", Lit "Em", Lit "world", Eol]. The \&Em on the second line is escaped to Em, but it also tokenizes that Em as a literal rather than a macro call. (You can actually see the difference in github's syntax highlighting!)
So if we wanted to reuse the RoffToken type for mdoc we might have to stop processing escapes within lexRoff, because escape characters (by convention \& for zero-width space) are needed to protect strings that happen to be macro names from mdoc macro expansion. The concern doesn't exist for man because there are no man macros that expand further macros in the same control line.
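A tiny model of that interaction: if the `\&` escape (a zero-width character) is expanded before tokenization, the two spellings collapse to the same string and the macro/literal distinction is unrecoverable. This `applyEscapes` is a hypothetical reduction handling only `\&`, not the real escape machinery in T.P.R.Roff.Escape:

```haskell
-- Hypothetical escape expansion handling only "\&", which expands
-- to nothing visible: after it runs, "\&Em" and "Em" are the same
-- string, so a later tokenizer can no longer tell them apart.
applyEscapes :: String -> String
applyEscapes ('\\' : '&' : rest) = applyEscapes rest
applyEscapes (c : rest)          = c : applyEscapes rest
applyEscapes []                  = []
```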
@jgm I’ll be home in a couple days and hopefully returning to work on this very soon. My plan/goal is to complete coverage of every mdoc macro used by manual pages in the OpenBSD base system, so that pandoc -r mdoc can parse all those manuals without any parse errors or skipped content. If you have any more feedback on what I have so far let me know.
@jgm updated the description and marked ready, sorry it's +2000 lines of code 😅
@jgm I’d love to land this. Please let me know if you want me to split this PR up somehow or if I can talk you through any of it in more detail.
I'll take a look this week. It would help if you could rebase it into logical commits (maybe just one) with the sort of commit message that could help me in crafting the changelog.
All API changes should be marked with [API change].
The test failure is due to a duplicate skylighting-core in stack.yaml (my fault, now fixed in HEAD).
2000 lines of code is a lot. Here's a thought: would it make sense to create a separate mdoc parsing library that could be used by pandoc? That's what I did with typst and commonmark; they both have independent libraries with their own types, and pandoc just includes a small interface to the pandoc types.
(I don't want to imply that 2000 lines is a nonstarter. There are other writers that are that big, I think. But it's worth considering this alternative.)
EDIT: I suppose that because of the sharing with the other roff based parsers, it may make sense to keep this all in pandoc.
I wanted to at first! But I sat down with a blank page and didn't know how to start, or how to structure the AST. I only really managed to un-daunt myself by just starting it as a Pandoc reader instead. With the benefit of this experience I could probably go write a standalone mdoc library without getting instantly stuck, but I'd still feel compelled to do some new work to design an AST that retains more mdoc-specific stuff so that it has some value beyond pandoc.
The org reader is 3620 lines (by wc -l) with the benefit of being organized into multiple files. I could take a stab at sorting things out a bit if it would help with maintenance, but it might be better to do that after merge.
fwiw I am eager to maintain this reader over time especially if reports come in from the wild about reasonable markup I'm failing to parse. If there's refactoring to do here that will make the code better I think it's easier to do that after an initial merge of the feature-complete version, since it makes it easier to review what's actually improving.
OK, sounds good. You're right that the org reader is much bigger! As is the LaTeX reader.
When you've got this rebased into logical commits that don't recapitulate the development history so much, I will take a look; maybe this can go in the upcoming release.
Repushed as two commits: one that extracts and parameterizes the escaping functions for the preexisting Roff lexer, and one with everything else. I made some minor tweaks to the typeclass for the Roff escapes to remove default definitions and caught a regression I had introduced; everything else is functionally the same. Some commentary that had been living in my commit messages has been added as source comments.
This looks great. One thing that is needed: mdoc needs to be added to the list of legitimate input formats (under --from) in MANUAL.txt.
pushed the manual update on top!