pandoc icon indicating copy to clipboard operation
pandoc copied to clipboard

XML format equivalent to native, with Reader and Writer

Open massifrg opened this issue 9 months ago • 34 comments

A new xml format as described in issue #10746, and in xml.md.

massifrg avatar Apr 14 '25 19:04 massifrg

I realized that there's another, more elegant approach you could take to the round trip tests. Instead of guarding the instances (excluding those that aren't "good"), you could normalize.

my_property = forAll (normalize <$> arbitrary) (\doc -> reader (writer doc) == doc)

where normalize has type Pandoc -> Pandoc and does kind of what you were doing before: removing consecutive Space elements, collapsing consecutive Str, etc. This can be done with walk and a function of type [Inline] -> [Inline], and in fact I think you can do all the hard work just by using Text.Pandoc.Builder: (untested)

normalize :: Pandoc -> Pandoc
normalize = walk fixInlines
 where
  fixInlines :: [Inline] -> [Inline]
  fixInlines = B.toList . B.fromList

You could also try using forAllShrink instead of forAll in the above.

There are a couple advantages to normalizing rather than guarding. First, guarding will throw out a lot of cases, which can waste time and lead to worse coverage. Second, the normalizing code is actually simpler because of the walk instance for [Inline] -> [Inline] and the fact that Builder's mappend instance does the normalizing.

jgm avatar Apr 17 '25 16:04 jgm

Thanks a lot for the advice because, by excluding ill documents, I was getting this message:

Test suite test-pandoc: RUNNING...
pandoc tests
  XML
    p_xml_roundtrip: FAIL (0.06s)
      *** Gave up! Passed only 5 tests; 1000 discarded tests.

Too few tests pass the pre-selection (they don't go beyond 10).

I was trying to do the normalization myself when I saw your message.

Since the test log outputs only the auto-generated document, how can I log the normalized one?

massifrg avatar Apr 17 '25 17:04 massifrg

Since the test log outputs only the auto-generated document, how can I log the normalized one?

Hm, my understanding is that it should log the normalized one. The (normalize <$> arbitary) will generate a normalized document that will be passed to the function.

Have you confirmed that it is logging the unnormalized one? (Or could it be that the normalization function isn't working as expected?)

jgm avatar Apr 17 '25 17:04 jgm

Here's a snippet of the current test:

p_xml_roundtrip :: Pandoc -> Property
p_xml_roundtrip d = isValidPandoc d ==> d'' == d'
  where
    d' = normalize d
    xml = purely (writeXML def) d'
    d'' = purely (readXML def) xml

normalize :: Pandoc -> Pandoc
normalize = walk (fix' . fixInlines)
  where
    fixInlines :: [Inline] -> [Inline]
    fixInlines = B.toList . B.fromList
    fix' :: [Inline] -> [Inline]
    fix' (Space : Space : xs) = fix' $ Space : xs
    fix' (Str s1 : Str s2 : xs) = fix' $ Str (s1 <> s2) : xs
    fix' (LineBreak : SoftBreak : xs) = fix' $ LineBreak : xs
    fix' (SoftBreak : LineBreak : xs) = fix' $ LineBreak : xs
    fix' (x : xs) = x : fix' xs
    fix' [] = []

How do I apply that (normalize <$> arbitary) you are suggesting me?

massifrg avatar Apr 17 '25 21:04 massifrg

You'd modify your code like this (untested):

p_xml_roundtrip :: Property
p_xml_roundtrip d = forAll (normalize <$> arbitrary)
  (\doc -> purely (writeXML def doc >>= readXML def) == doc)

jgm avatar Apr 17 '25 23:04 jgm

Thank you for the tip.

p_xml_roundtrip :: Property
p_xml_roundtrip d = forAll (normalize <$> arbitrary)
  (\doc -> purely (writeXML def doc >>= readXML def) == doc)

The purely (writeXML def doc >>= readXML def) does not work, but my limited haskell knowledge can't explain why.

Anyway this works:

p_xml_roundtrip :: Property
p_xml_roundtrip = forAll (normalize <$> arbitrary) testdoc
  where
    testdoc d =
      if isValidPandoc d
        then purely (readXML def) (purely (writeXML def) d) == d
        else discard

Now I'm discarding less (only Str that contain spaces) and normalizing more.

I must normalize recursively, because a simplification makes another one possible, so the recursion ends when the normalized document is the same.

These are the simplifications I'm doing, before the xml roundtrip:

  • consecutive Str: Str "ab", Str "cd" -> Str "abcd"
  • consecutive Emph, Strong, Underline, Strikeout, SmallCaps, Superscript, Subscript are compacted
  • consecutive Space are compacted to one Space
  • consecutive breaks (SoftBreak, LineBreak)
  • Space around breaks are removed
  • Str "" are removed
  • Para []. Plain [], Header _ _ [] are removed; is this right?

Now the test fails on tables. I must see if they need normalization or if there's something to fix in the xml reader and writer.

massifrg avatar Apr 18 '25 09:04 massifrg

The current draft PR is failing also because of not conformant commit messages.

Also, I'm bringing in some of my local commit history that is not necessary.

I prepared another branch (xmlformat), where only the modified files are there in a few commits. Once the tests of the xml branch are OK, you could discard this PR and I could make another draft PR with the xmlformat branch. Just let me know.

massifrg avatar Apr 18 '25 09:04 massifrg

purely (writeXML def doc >>= readXML def)

Right. Looks like purely takes a function as argument. So, this should do it:

purely (writeXML def >=> readXML def) doc

>=> is in Control.Monad.

As for the normalization: I would say that any normalization needs a good justification. After all, the aim is a 1-1 match with the AST. If you had Str and Space elements in your XML representation, then I don't see why any normalization would be needed at all.

I assume you don't want to have these because they are too verbose?

Would it be possible to introduce them while still keeping the convention that XML text sections get carved into Str and Space elements (using Text.Pandoc.Builder.text)?

So, for example, you could have

<Para>
<Str content=""><Space>and here is some normal text
and <Str content="consecutive"><Str "strings"><Space>
</Para>

to handle the odd cases. This would give you the ability to represent the AST 1-1 without any loss, but without sacrificing readability in normal cases.

In the case of other normalizations you're doing, like collapsing consecutive Emph elements, I don't see why they are helpful for your purposes (testing the XML reader/writer). After all, you can represent consecutive Emph elements. Or is the issue that your reader is using the Text.Pandoc.Builder mappend, which fuses these?

You could try the following normalize function which just mirrors what the Builder mappend will do: (untested)

normalize :: Pandoc -> Pandoc
normalize = walk fixInlines
 where
   fixInlines = B.toList . mconcat . map B.singleton

jgm avatar Apr 18 '25 16:04 jgm

purely (writeXML def doc >>= readXML def)

Right. Looks like purely takes a function as argument. So, this should do it:

purely (writeXML def >=> readXML def) doc

>=> is in Control.Monad.

Thanks, I'll implement that.

As for the normalization: I would say that any normalization needs a good justification. After all, the aim is a 1-1 match with the AST. If you had Str and Space elements in your XML representation, then I don't see why any normalization would be needed at all.

I assume you don't want to have these because they are too verbose?

Would it be possible to introduce them while still keeping the convention that XML text sections get carved into Str and Space elements (using Text.Pandoc.Builder.text)?

That could be a writer option, that would be also used while testing, so no normalization needed.

In the case of other normalizations you're doing, like collapsing consecutive Emph elements, I don't see why they are helpful for your purposes (testing the XML reader/writer). After all, you can represent consecutive Emph elements. Or is the issue that your reader is using the Text.Pandoc.Builder mappend, which fuses these?

I suspect the native reader does some normalization, since I can't find it in the reader. For example, spaces around LineBreak and SoftBreak are kept in the xml, but then they are suppressed in the native format.

massifrg avatar Apr 18 '25 17:04 massifrg

I suspect the native reader does some normalization, since I can't find it in the reader. For example, spaces around LineBreak and SoftBreak are kept in the xml, but then they are suppressed in the native format.

As I suggested, if you are using Builder to create the AST, it does some normalization in appending.

One option would be to stop using Builder and have the parser return Inline elements directly. Then you'd avoid this normalization.

jgm avatar Apr 18 '25 17:04 jgm

That could be a writer option, that would be also used while testing, so no normalization needed.

Maybe it doesn't need to be an option. After all, the extra elements would only be present in odd corner cases, not in regular documents. Why not leave it enabled always?

jgm avatar Apr 18 '25 17:04 jgm

One option would be to stop using Builder and have the parser return Inline elements directly. Then you'd avoid this normalization.

Or: keep using the Builder functions, but instead of using mconcat to go from a list of Inlines to an Inlines, use something like

concatMany :: [Many a] -> Many a
concatMany = Many . mconcat . map unMany

(Remember that Inlines = Many Inline and Blocks = Many Block, so this will work for both.) This should avoid all the normalization built into mappend instances for Inlines and Blocks.

jgm avatar Apr 18 '25 23:04 jgm

Or: keep using the Builder functions, but instead of using mconcat to go from a list of Inlines to an Inlines, use something like

concatMany :: [Many a] -> Many a
concatMany = Many . mconcat . map unMany

This made things much more predictable, and now tests are passing. I'm still using normalization, but I could restrict it to Str and Space Inlines, and to empty Para, Plain, Header and list items blocks.

In the xml reader, I'm now using the Table constructor and not the builder, so tables with an inconsistent number of rows/cells pass through the roundtrip conversion unaltered.

I'm going to play with <Space /> and <Str content="..." /> XML elements, as you suggested, but just as a way to make normalization unnecessary. "Well-done" documents should not need them, because their AST should not contain empty or consecutive Strs, or Str with spaces; or initial, final or consecutive Spaces.

I don't like those elements, because they make the XML less readable and make life a little harder, when you think about working on XML documents with XML tools. But I find them a good way to achieve consistency with native and JSON formats, without any caveat.

massifrg avatar Apr 22 '25 21:04 massifrg

to empty Para, Plain, Header and list items blocks.

Empty list items and headers are possible in pandoc's Markdown, and empty Para and Plain can be produced by some other formats (e.g. <p></p> in HTML). Why aren't they representable in your XML?

jgm avatar Apr 22 '25 21:04 jgm

It's not a deliberate choice. I checked that the xml writer does represent <Header />, <Para /> and <item /> in lists. So it's the reader to be fixed.

I wanted to get to a point where the roundtrip test passes. Starting from that, I can now fix the remaining inconsistencies. For some corner cases, I don't know if they are legitimate in Pandoc, so it's good to know from you.

For empty list items and headers I checked it's actually as you said.

Empty Para (and Plain), instead, are discarded by html and markdown writers. But they are kept by native and json. If I want to keep native/json equivalence, I should keep them in xml too.

massifrg avatar Apr 23 '25 09:04 massifrg

Empty Header, Para, Plain and list items are now fixed in the xml reader. The normalization in the xml test is now only about Space and Str.

massifrg avatar Apr 23 '25 10:04 massifrg

I'm going to encode Space and Str xml elements this way:

  • <Space /> is a single space, multiple spaces can be specified as <Space count="..." />, so <Space /> is equivalent to <Space count="1" />
  • <Str content="blah blah" />, where <Str /> is equivalent to <Str content="" />

Str and Space usually won't be coded as elements, just as normal text with " " spaces. The xml elements will be useful when you have these situations in the AST:

  • empty strings Str ""
  • consecutive Str: [ ..., Str "foo", Str "bar", ... ]
  • Str containing spaces: Str "foo bar"
  • consecutive spaces [ ..., Space, Space, ... ]
  • (spaces at the start or at the end of a block; this may be useful when the writer will have indentation, to discriminate between indentation and deliberate space at the start or at the end of a paragraph; but let's leave it there, for now)

massifrg avatar Apr 23 '25 10:04 massifrg

I'm going to encode Space and Str xml elements this way:

* `<Space />` is a single space, multiple spaces can be specified as `<Space count="..." />`, so `<Space />` is equivalent to `<Space count="1" />`

* `<Str content="blah blah" />`, where `<Str />` is equivalent to `<Str content="" />`

This is done. As a result, the test needs no more normalization:

module Tests.XML (tests) where

import Control.Monad ((>=>))
import Test.Tasty (TestTree)
import Test.Tasty.QuickCheck
import Tests.Helpers
import Text.Pandoc
import Text.Pandoc.Arbitrary ()

p_xml_roundtrip :: Pandoc -> Bool
p_xml_roundtrip d = d == purely (writeXML def >=> readXML def) d

tests :: [TestTree]
tests = [testProperty "p_xml_roundtrip" p_xml_roundtrip]

The xml format looks now 1:1 equivalent to native and json formats.

massifrg avatar Apr 23 '25 13:04 massifrg

I learnt how to make a benchmark, with

cabal bench

or

cabal bench --benchmark-options="-p xml"

The writer benchmarks are encouraging, because xml is faster than any other writer but json:

    json: OK
      514  μs ±  11 μs
    xml: OK
      786  μs ±  21 μs
    native: OK
      17.4 ms ± 1.5 ms

The reader benchmarks are:

    json: OK
      3.46 ms ± 226 μs
    xml: OK
      4.34 ms ± 155 μs
    native: OK
      45.2 ms ± 3.7 ms

It's not the second best, but still one of the fastest.

massifrg avatar Apr 24 '25 14:04 massifrg

I would guess that the XML representations are also considerably more compact than the JSON ones (because of the implicit representation of String, Space sequences).

jgm avatar Apr 24 '25 16:04 jgm

I would guess that the XML representations are also considerably more compact than the JSON ones (because of the implicit representation of String, Space sequences).

$ pandoc -f native -t json test/testsuite.native |wc
      1     128   52574
$ pandoc -f native -t xml test/testsuite.native |wc
    721    1827   22232

massifrg avatar Apr 24 '25 16:04 massifrg

I changed some attributes of the XML format for better consistency with Pandoc types definition:

  • Image src is now url
  • Link href is now url
  • Cell rowspan and colspan are now row-span and col-span

I've added a DTD and a RelaxNG definitions to validate XML docs against (e.g. with xmllint). Unfortunately, the DTD can't specify an element having any attribute, as it's needed by pandoc types with an Attr.

I started with a DTD, which I'm more familiar with, then converted it with trang, and then refined it to:

  • let arbitrary attributes in elements representing Pandoc types with an Attr
  • specify some constraints on some attributes, like Header's level, OrderedList's start, ColSpec's col-width, Cell's col-span and row-span

I put the pandoc-xml.dtd and pandoc-xml.rng in the tools directory. Is it the right place?

What's left to include the xml format in pandoc (maybe marked as experimental or draft)? How can I help with that?

massifrg avatar Apr 27 '25 10:04 massifrg

I renamed the Target url of Image and Link back to "src" and "href" respectively. The documentation says "url", but it's not the name of a haskell type or constructor, so I feel more free not to follow it. There's still the difference of naming between Image and Link Targets, but I have no better alternative.

I'd make a real pull request, but I prefer making a PR with a more compact history (and shorter commit messages), instead of setting this one as "ready for review".

massifrg avatar May 05 '25 13:05 massifrg

You can always rebase this draft PR so that it has the structure of commits (and messages) that you like.

jgm avatar May 06 '25 19:05 jgm

You can always rebase this draft PR so that it has the structure of commits (and messages) that you like.

I made a first rebase to shorten the history of my commits, and I hopefully resolved the resulting conflicts.

Then I tried a rebase to reword long commits, but the rebase is too deep, and I'm afraid of messing things up (in case I did not do it yet with the previous rebase).

So I just marked the PR as ready for review.

massifrg avatar May 07 '25 10:05 massifrg

I realized that my rebase efforts produced some mess, because in my "xml" branch I periodically merged the main branch.

The process of re-basing forced me to solve conflicts on commits that came from those merges, on files that are not on the scope of the xml format I worked on.

I think the best way to solve that is now to discard this PR and make a new one with only the files I added/modified for xml format support.

massifrg avatar May 07 '25 14:05 massifrg

I would make a copy of this branch how it is (in case you need to go back to it), then fix the rebasing here. It shouldn't be too hard get rid of the irrelevant changes, you can use git checkout HEAD^ while editing any commit point with unrelated stuff mixed in to clear those bits."

alerque avatar May 07 '25 14:05 alerque

I would make a copy of this branch how it is (in case you need to go back to it), then fix the rebasing here. It shouldn't be too hard get rid of the irrelevant changes, you can use git checkout HEAD^ while editing any commit point with unrelated stuff mixed in to clear those bits."

Thank you for your help @alerque.

I have also a "xmlformat" branch on my pandoc fork, with only a few commits and just what is needed for the xml format. It's a bit out of sync with "main", but it's usable, in case.

massifrg avatar May 07 '25 15:05 massifrg

Where do we stand with this?

jgm avatar May 28 '25 20:05 jgm

Today I synced my pandoc repository and tried to merge (locally) the "xml" branch into "main" with success.

I'm afraid there's still the problem of commit messages' length, that I don't know how to solve; when I tried, I messed up things, so I'm not going to try again.

There are no tests except the roundtrip one, which is challenging, but not something similar to examples to learn from.

There's a documentation file in the doc directory, but no references in the main documentation files (user guide, man page).

I'm going to work on the indentation in the xml writer, but the xml format could be released as experimental anyway, to let people test it.

massifrg avatar May 29 '25 08:05 massifrg