pandoc Adding support for "role" attributes for the DocBook reader

As the title says, this PR adds support for the role attributes to be parsed by the DocBook reader. This enhancement has been outlined in https://github.com/jgm/pandoc/issues/9089. The PR is implementing a solution that was discussed in the issue by @jgm and @lifeunleaded in particular.

I haven't written any unit test for this, as I couldn't find a unit test file for the DocBook reader, and I wasn't ready to start one... I hope it won't be blocking for this PR to be merged!

This change didn't feel like it required a modification to the MANUAL.txt file.

Don't hesitate to tell me if I should adapt something. From my understanding, it is ready to be integrated to the main branch :)

Mar 04 '25 22:03 yanntrividic

Thanks for the PR! I wonder if we couldn't simplify things considerably by using the function https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Shared.hs#L293-L302 Using this, you could just addAttributes [("role", therole)] on any pandoc Inlines or Blocks. This might allow you to just have one place where the role is checked, instead of separate places for each type of element.

Mar 05 '25 17:03 jgm

Excellent addition, thank you! https://github.com/jgm/pandoc/pull/10183/commits/61ff730aad6d4d24da2e69ba32e7eae105680315 Should point you to DocBook reader tests, but I could take a stab at adding some if you don’t feel comfortable doing it.

Mar 06 '25 05:03 lifeunleaded

Hello, thank you both for your help!

@jgm, thank you for pointing me in this direction. Indeed it seems to be exactly what we need. I updated the code to work in this direction. Is it what you expected?

@lifeunleaded, I don't know how I missed those files... I'm not sure how we should tackle this though. Should we add in each "zone" of the code a test with a new node that has a role attribute? Tbh, if you feel like you are fine taking care of it, I wouldn't mind at all.

Mar 06 '25 15:03 yanntrividic

@yanntrividic I can take a look within a few days, as soon as I know what the changes do. I'll let @jgm determine whether that should block the PR. Otherwise I can submit a separate PR for tests.

Mar 06 '25 15:03 lifeunleaded

@yanntrividic I think it would be a good idea to just add a role attribute to some intended supported elements in test/docbook-reader.docbook and then run e.g. stack test --test-arguments='-p docbook -j4 --hide-successes' to get a view of what it expects. I just added a quick test role to the sect1 and got unexpected results. Not sure if my setup has issues or if the code needs adjusting. I'll try to look deeper soon.

Mar 06 '25 17:03 lifeunleaded

Building the branch and running ~/.local/bin/pandoc -f docbook -t native ~/sandbox/test.xml with this input:

<?xml version="1.0" encoding="utf-8" ?>                                                                                                                                                 
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:mml="http://www.w3.org/1998/Math/MathML" version="5.0">
    <title>Pandoc Test Suite</title>                                                                                                                                                    

    <section id="headers" role="sect1role">
      <title>title1</title>
      <para>Test</para>
      <section><title>title2</title>                                                                                                                                                    
      <para>Test2</para>                                                                                                                                                                
      </section>
    </section>
</article>

Gives me this:

[ Header
    1
    ( "headers"
    , []
    , [ ( "role" , "sect1role" ) , ( "role" , "sect1role" ) ]
    )
    [ Str "title1" ]
, Div
    ( ""
    , []
    , [ ( "wrapper" , "1" ) , ( "role" , "sect1role" ) ]
    )
    [ Para [ Str "Test" ] ]
, Header
    2
    ( "" , [] , [ ( "role" , "sect1role" ) ] )
    [ Str "title2" ]
, Div
    ( ""
    , []
    , [ ( "wrapper" , "1" ) , ( "role" , "sect1role" ) ]
    )
    [ Para [ Str "Test2" ] ]
]

It seems to put role twice in the actual element, and then on all the child elements. I assume that's not intended?

Mar 06 '25 19:03 lifeunleaded

Hey @lifeunleaded, thanks for taking the time to dig into some tests. You're right with what you mention in https://github.com/jgm/pandoc/pull/10665#issuecomment-2704776158, it is not intended... And I'm not sure how this recursion happens.

With the updated code, here is what we get:

<?xml version="1.0" encoding="utf-8" ?>
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:mml="http://www.w3.org/1998/Math/MathML" version="5.0">
    <title>Pandoc Test Suite</title>
    <section id="headers" role="sect1role">
      <title>title1</title>
      <para role="para1role">Test</para>
      <section role="sect2role"><title>title2</title>
      <para role="para2role">Test2</para>
      </section>
    </section>
</article>

[ Header
    1
    ( "headers" , [] , [ ( "role" , "sect1role" ) ] )
    [ Str "title1" ]
, Div
    ( ""
    , []
    , [ ( "role" , "sect1role" )
      , ( "wrapper" , "1" )
      , ( "role" , "para1role" )
      ]
    )
    [ Para [ Str "Test" ] ]
, Header
    2
    ( ""
    , []
    , [ ( "role" , "sect1role" ) , ( "role" , "sect2role" ) ]
    )
    [ Str "title2" ]
, Div
    ( ""
    , []
    , [ ( "role" , "sect1role" )
      , ( "role" , "sect2role" )
      , ( "wrapper" , "1" )
      , ( "role" , "para2role" )
      ]
    )
    [ Para [ Str "Test2" ] ]
]

I really don't get how the attribute propagates in the child elements, it just doesn't make sense to me. Any hints on your side? Maybe this is something obvious, and if so, I am sorry. I gotta tell you that I'm quite new to this, as those are my very first lines of Haskell 😅

Mar 09 '25 15:03 yanntrividic

Take a look at jgm/commonmark-hs , in particular commonmark-pandoc, which is where the addAttributes used in addPandocAttributes is defined. You'll see

instance HasAttributes (Cm a B.Blocks) where
  addAttributes attrs b = fmap (addBlockAttrs attrs) <$> b

instance HasAttributes (Cm a B.Inlines) where
  addAttributes attrs il = fmap (addInlineAttrs attrs) <$> il

So addAttributes, applied to a Blocks (which can be a sequence of Block elements), will set the attribute in every Block in the sequence. I think that's what is going on here, does it help?

Mar 09 '25 18:03 jgm

So, regarding this propagation of role attributes to the children, you think it is fine like this? Wouldn't we want to have only the role attribute on the element that holds it?

Mar 10 '25 10:03 yanntrividic

So, regarding this propagation of role attributes to the children, you think it is fine like this? Wouldn't we want to have only the role attribute on the element that holds it?

No, it isn't okay. It's important to figure out why that is happening and avoid it.

Mar 10 '25 15:03 jgm

Given that addPandocAttributes applies to Blocks (list of block), I'm not sure it can be applied so generally here (although I don't fully understand how it gets a Blocks in parseBlock.

If it were me, I would go through the list in parsedBlock and try to be a bit more granular. For the ones without attributes, there is divWith (like informalequation already uses), and for some there are other places to apply it. E.g. for sections, one could extend sect n to be:

sect n = sectWith(attrValue "id" e) [] (getRoleAttr e) n

This seems to pass the trivial beginning of test coverage I've tried here.

A more elegant solution would perhaps be polymorphic functions for adding the role attribute, defined separately for inlines and blocks with and without attributes, but that's not something I can whip up an example of here, if indeed it is even possible. That function name could then be used throughout.

Mar 10 '25 19:03 lifeunleaded

Worth noting that it looks like phrase and indexterm in parseInline already has handling for role. I wonder if that would also see strange effects from applying addPandocAttributes to the result.

Mar 10 '25 19:03 lifeunleaded

Here's why you're getting the role applied to all the children of section: https://github.com/jgm/pandoc/blob/17d3ad28fe856ac015366b8a71e1038b49ad35ec/src/Text/Pandoc/Readers/DocBook.hs#L1107

The sect function returns a Blocks that is a sequence of Block elements starting with the Header and including all the children.

Perhaps it should put all of this in a Div, either always or at least in the case where there is a role attribute.

Then the role will just be attached to the Div.

Pandoc sometimes uses this structure for sections:

Div with identifier and classes "section" and "level1" (or "level2" etc.)
- Header
- other contents

So that would be quite reasonable in this case.

Mar 10 '25 21:03 jgm

If I may be so bold: I think the DocBook writer is based on the role being in the Header attributes, so the roundtrip becomes a bit awkward if the reader creates an enclosing Div instead.

I would suggest moving the getRoleAttr into sect and use addPandocAttributes for the cases that do not use sect.

So sect becomes sect n = sectWith(attrValue "id" e) [] (getRoleAttr e) n and e.g. para uses this pattern in parseBlock:

parseBlock (Elem e) = do
  parsedBlock <- case qName (elName e) of
        "toc"   -> skip -- skip TOC, since in pandoc it's autogenerated                                                                                                                 
        "index" -> skip -- skip index, since page numbers meaningless                                                                                                                   
        "para"  -> addPandocAttributes (getRoleAttr e) <$> parseMixed para (elContent e)

This becomes more work, since it needs to be applied for each case in parseBlock not using sect, but the upside is that the behaviour can be more based on which element it is.

Mar 11 '25 17:03 lifeunleaded

The docbook writer can handle this structure just fine:

% pandoc -f native -t docbook
[ Div
    ( "" , [ "section" ] , [] )
    [ Header 1 ( "test" , [] , [] ) [ Str "Test" ]
    , Para [ Str "paragraph" ]
    ]
]
<section xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test">
  <title>Test</title>
  <para>
    paragraph
  </para>
</section>

Mar 11 '25 20:03 jgm

Oh! Excellent, thank you, I wasn't aware and didn't manage to test before your reply. Then I have nothing to add, enclosing sect in Div with the role sounds like the best way forward.

Mar 11 '25 20:03 lifeunleaded

Worth noting that it looks like phrase and indexterm in parseInline already has handling for role. I wonder if that would also see strange effects from applying addPandocAttributes to the result.

With those elements, the role attributes are parsed as classes. For consistency, wouldn't it make more sense just to remove this so that every role is treated as an attribute ?

Mar 12 '25 09:03 yanntrividic

With those elements, the role attributes are parsed as classes. For consistency, wouldn't it make more sense just to remove this so that every role is treated as an attribute ?

Yes, I agree that would be better. Unless we want to be very careful about not breaking existing filters out there that expect it in classes. If we do, I suppose there is no great problem with leaving them in classes and still adding them to attributes. @jgm , any opinion?

Mar 12 '25 11:03 lifeunleaded

With those elements, the role attributes are parsed as classes. For consistency, wouldn't it make more sense just to remove this so that every role is treated as an attribute ?

Yes, I agree that would be better. Unless we want to be very careful about not breaking existing filters out there that expect it in classes. If we do, I suppose there is no great problem with leaving them in classes and still adding them to attributes. @jgm , any opinion?

I have a working solution for this--I think--but I'll wait for the third opinion on the matter before committing anything!

Mar 12 '25 13:03 yanntrividic

Probably safest to keep them as classes and also add them as roles. But I don't know if anyone is actually relying on this behavior. If the duplication is bad, we could mark it as a potentially breaking change in the changelog.

Mar 12 '25 15:03 jgm

@yanntrividic pulled recent changes to start looking at tests, but it doesn't build right now:

pandoc                           > /home/erik/git/personal/pandoc/src/Text/Pandoc/Readers/DocBook.hs:1342:25: error: [GHC-83865]
pandoc                           >     • Couldn't match expected type ‘Inline’ with actual type ‘Element’
pandoc                           >     • In the expression: e :: Inline
pandoc                           >       In an equation for ‘inlineElement’: inlineElement = e :: Inline
pandoc                           >       In the expression:
pandoc                           >         do let inlineElement = ...
pandoc                           >            if not (isEmphElement inlineElement) then
pandoc                           >                addPandocAttributes (getRoleAttr e) parsedInline
pandoc                           >            else
pandoc                           >                parsedInline
pandoc                           >      |
pandoc                           > 1342 |     let inlineElement = e :: Inline
pandoc                           >      |                         ^

Mar 12 '25 18:03 lifeunleaded

@yanntrividic pulled recent changes to start looking at tests, but it doesn't build right now:

Yes sorry! That's what I tried to explain here: https://github.com/jgm/pandoc/pull/10665#discussion_r1991569081, I think I'm close to a solution (that's why I committed this and made this comment).

Mar 12 '25 20:03 yanntrividic

@yanntrividic pulled recent changes to start looking at tests, but it doesn't build right now:

Yes sorry! That's what I tried to explain here: #10665 (comment), I think I'm close to a solution (that's why I committed this and made this comment).

No worries. I see the problem and don't have an immediate solution. If you get stuck, maybe it's easier to apply addPandocAttributes (or use spanWith directly) in the case matches in parseInline and just omit it on "emphasis", if that was the idea.

Good luck and thank you for the effort.

Mar 12 '25 20:03 lifeunleaded

Also, if you don't yet, I recommend using stack test before pushing to identify issues quicker than the pipeline. E.g. stack test --test-arguments='-p docbook -j4 --hide-successes' will run fairly quickly (if you use haskell-stack)

Mar 12 '25 20:03 lifeunleaded

The problem is with

    let inlineElement = e :: Inline

e here does not have type Inline, it has type Element. Inline is a pandoc AST concept, Element is a type from the xml parser.

Mar 13 '25 02:03 jgm

No worries. I see the problem and don't have an immediate solution. If you get stuck, maybe it's easier to apply addPandocAttributes (or use spanWith directly) in the case matches in parseInline and just omit it on "emphasis", if that was the idea.

Okay! So I tried just that, and it seems to do what you proposed. I added a class emphasis on the Span so that this information is not just gone, as this is done with other DocBook elements. What do you think? Should we try to make it into an Emph element or is it fine as such?

Mar 13 '25 10:03 yanntrividic

Also, if you don't yet, I recommend using stack test before pushing to identify issues quicker than the pipeline. E.g. stack test --test-arguments='-p docbook -j4 --hide-successes' will run fairly quickly (if you use haskell-stack)

Ah yes, thank you! Up to this point, I was mainly doing a full build and trying out different files... It is a lot more efficient! (I'm using cabal, so it is cabal test --test-options='-p docbook -j4 --hide-successes').

If you two feel it is all right like this, I can continue the work further and adapt the unit tests.

Mar 13 '25 10:03 yanntrividic

No worries. I see the problem and don't have an immediate solution. If you get stuck, maybe it's easier to apply addPandocAttributes (or use spanWith directly) in the case matches in parseInline and just omit it on "emphasis", if that was the idea.

Okay! So I tried just that, and it seems to do what you proposed. I added a class emphasis on the Span so that this information is not just gone, as this is done with other DocBook elements. What do you think? Should we try to make it into an Emph element or is it fine as such?

I think it's important that the existing mappings that create Strong, Emph, and so on are left intact, yes.

Mar 13 '25 18:03 lifeunleaded

A quick test and a look at the latest change looks like it works as I would expect, so I think we could start writing tests.

@yanntrividic Would you like me to write up a few and push to your fork, or do you feel comfortable starting them yourself?

Mar 13 '25 19:03 lifeunleaded

While trying to write a few tests, I realized that what I did in a2321612dd4a7074b74fc424134bb5b58d24c5eb wasn't really working as expected, and Headers were getting the role attribute alongside the Div that wrapped them. It is now solved in fa815ce487d31de0faaa8ff978a6e6f81e8b58be.

And now that I am facing those tests for real, I am not sure were to start. Maybe @lifeunleaded you could show me the way and I will continue if needed?

Mar 14 '25 10:03 yanntrividic