Markdown reader - support new table features

Open mb21 opened this issue 5 years ago • 42 comments

Add support for (at least some of) the new table features introduced in pandoc-types/pull/66.

It would be good if at least one of pandoc markdown's table syntaxes supported them; grid tables seem like the obvious candidate. Something like:

+---------------+---------------+--------------------+
| Fruit         | Price         | Advantages         |
+===============+===============+====================+
| rowspan                       | - built-in wrapper |
|                               | - bright color     |
+---------------+---------------+--------------------+
| subheader     | Price         | Advantages         |
+===============+===============+====================+
| Oranges       | colspan       | - cures scurvy     |
|               |               | - tasty            |
+---------------+               +--------------------+
|| Row header   |               | - cures scurvy     |
||              |               | - tasty            |
+---------------+---------------+--------------------+
| Table foot    | Price         | Advantages         |
+===============+===============+====================+

This would roughly tick off the following of the new table features:

  • [x] rowspan, colspan (note that pandoc markdown's grid tables already support this now)
  • [x] table head and foot (note that pandoc markdown's grid tables already support this now)
  • [x] multiple header lines
  • [x] row headers
  • [ ] table attributes
  • [ ] captions that allow block-level content and include an optional short caption

It does have the disadvantage that if the last rows look like header rows, they are simply treated as the table foot.
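To make the head/foot rule above concrete, here is a minimal Python sketch. It is an illustration only, not pandoc's parser. It assumes that every line starting with `+` is a separator, that a separator containing `=` marks the row before it as a header row, and that a header-style row at the very end of the table is read as the foot. The function name `classify_rows` is made up for this example, and row and column spans are not resolved.

```python
def classify_rows(table: str) -> list[tuple[str, str]]:
    """Label each grid-table row as 'header', 'body' or 'foot'."""
    rows = []      # (content lines, kind of the separator that closes the row)
    current = []
    for line in table.strip().splitlines():
        if line.startswith("+"):
            # Assumption: any line starting with '+' is a separator, and it
            # is a header-style separator if it contains '='.
            if current:
                rows.append((current, "=" if "=" in line else "-"))
                current = []
        else:
            current.append(line)

    labelled = []
    for index, (content, kind) in enumerate(rows):
        if kind != "=":
            label = "body"
        elif index == len(rows) - 1:
            # A header-looking row at the very end is treated as the foot.
            label = "foot"
        else:
            label = "header"
        labelled.append((label, content[0]))
    return labelled
```

With this rule there is no way to end a table with a header-style row that is not a foot, which is exactly the disadvantage mentioned above.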

mb21 avatar Apr 24 '20 06:04 mb21

For captions and table attributes, inspired by https://github.com/jgm/pandoc/issues/3177#issuecomment-421261363, we could use the syntax of a native div wrapping nothing but a table:

::: {#tableId}

+---------------+---------------+
| Fruit         | Price         |
+===============+===============+
| Bananas       | $1.34         |
|               |               |
+---------------+---------------+

: long caption is backward-compatible
:
: but now, just like with blockquotes, it can contain blocks.
and it can wrap lazily

:::

This would be mostly backwards-compatible with pandoc-crossref I think? @lierdakil ?

Placement of the short caption is trickier though...

mb21 avatar Apr 24 '20 06:04 mb21

I would have to modify pandoc-crossref to work with the new AST anyway, so might as well adapt to the new syntax, whatever it ends up being.

That said, I'm not exactly a fan of overloading the native div syntax, it can lead to some surprising behaviour, and will likely break some workflows.

Perhaps we could use something like this instead?

: {#tableId}
+---------------+---------------+
| Fruit         | Price         |
+===============+===============+
| Bananas       | $1.34         |
|               |               |
+---------------+---------------+

: long caption is backward-compatible
:
: but now, just like with blockquotes, it can contain blocks.
and it can wrap lazily

The lack of an empty line between : {#tableId} and the table itself should, I believe, avoid ambiguity with respect to table captions above tables, and the syntax is similar, but less noisy.

lierdakil avatar Apr 24 '20 12:04 lierdakil

I would have to modify pandoc-crossref to work with the new AST anyway, so might as well adapt to the new syntax, whatever it ends up being.

But users wouldn't have to change their markdown files, would they? Or am I mistaken, or is this a rare case anyway?

mb21 avatar Apr 24 '20 12:04 mb21

Internally, pandoc-crossref represents a table-with-attributes as a table-in-a-div, and that works on the syntax level, too. However, I believe most users use the short-cut syntax of adding {#tableId} to the end of the caption. Which isn't the most elegant thing in the world, but it worked for a while, and I'm not going to remove it, at least not until the next major release (which will take a while).

As for table-in-a-div, it's debatable whether to keep it or not, but probably I'll keep it as a variant syntax for the foreseeable future, because backward-compatibility is a thing I think about sometimes.

lierdakil avatar Apr 24 '20 12:04 lierdakil

There is also the simple table and multiline table syntax, which is independent of the syntax for the overall table attributes and caption. I posted this in my pull request before, but something like this:

        Item
--------------------------  ---------
Animal    Description           Price
--------- ----------------  ---------
Gnat      per-gram              13.65
          each                   0.01
Gnu       stuffed               92.50
Emu       stuffed               33.33
Armadillo frozen                 8.99

which should be parsed like an existing simple table, except that multiple header lines are allowed, and the alignments of columns are determined by the last header line. The parser would have to go back and fill in the cell dimensions after header parsing, but if the existing rule that cells cannot cross column boundaries were kept for the other header lines, then this would be easier. That would mean this table:

 h1     h2
----   ----
   large
-----------
1
2
3

might have a second header row with two cells larg and e, and two columns, the first right-aligned and the second left-aligned (and full of empty cells in the body). This depends on the exact rules, but it would be similar to what the existing parser does in the body.
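The splitting of "large" into larg and e can be reproduced with a small Python sketch. It assumes that each column starts where a run of dashes starts and extends up to the start of the next run, so the gap between two runs belongs to the column on its left. The helper names `column_starts` and `split_row` are hypothetical, and this is not pandoc's actual code.

```python
import re


def column_starts(dash_line: str) -> list[int]:
    """Return the offset at which each run of dashes begins."""
    return [match.start() for match in re.finditer(r"-+", dash_line)]


def split_row(line: str, dash_line: str) -> list[str]:
    """Cut a row at the column start offsets taken from a dashed line."""
    starts = column_starts(dash_line)
    ends = starts[1:] + [None]   # the last column runs to the end of the line
    return [line[start:end].strip() for start, end in zip(starts, ends)]
```

Calling `split_row("   large", "----   ----")` cuts the text at offset 7, which is why the word ends up in two cells, and a body line such as `1` yields an empty second cell.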

This (and the multiline table version) would allow for multiple table head lines and row spans in the table head, in addition to whatever table caption or attribute syntax is allowed.

despresc avatar Apr 24 '20 13:04 despresc

There are some suggestions for extensions to pipe table syntax in the commonmark forum: see especially

  • https://talk.commonmark.org/t/tables-in-pure-markdown/81/134?u=jgm
  • https://talk.commonmark.org/t/tables-in-pure-markdown/81/137?u=jgm
  • https://talk.commonmark.org/t/tables-in-pure-markdown/81/139?u=jgm
  • https://talk.commonmark.org/t/tables-in-pure-markdown/81/145?u=jgm

Extending grid table syntax as suggested above makes sense. For the caption, I think we'd want a syntax that can allow arbitrary block-level content. Making it like definition list definitions might make sense (with the 4-space indent).

:   My caption is here.

    Second paragraph of caption.

        indented code inside caption.

But I am also somewhat tempted by the "overloading fenced div" approach, which gives us a uniform way to add table attributes and also degrades nicely. (Everything after the table itself could be considered the caption.)

If there's going to be a special way to add attributes to the table, why not just

{#id .class}

on a line by itself right before the table? (NB in my commonmark-hs I've implemented an extension allowing attributes to be placed on any block level element this way.)
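A standalone attribute line like this is cheap to recognise. The Python sketch below is an illustration under stated assumptions, not the commonmark-hs implementation: it accepts `#id`, `.class` and `key=value` tokens separated by whitespace, and it does not handle quoted values containing spaces. The name `parse_attr_line` is hypothetical.

```python
import re

ATTR_LINE = re.compile(r"^\{(.*)\}\s*$")


def parse_attr_line(line: str):
    """Return (identifier, classes, key-value pairs) or None if not attributes."""
    match = ATTR_LINE.match(line.strip())
    if not match:
        return None
    identifier = ""
    classes = []
    keyvals = []
    for token in match.group(1).split():
        if token.startswith("#"):
            identifier = token[1:]
        elif token.startswith("."):
            classes.append(token[1:])
        elif "=" in token:
            key, value = token.split("=", 1)
            keyvals.append((key, value.strip('"')))
        else:
            return None   # anything else means this is ordinary text in braces
    return identifier, classes, keyvals
```

A reader could try this on the line directly before a table and fall back to treating the line as a paragraph when it returns `None`.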

We need a solution for short captions. A simple thing would be to take the first sentence of the caption, but that's probably not robust enough.

jgm avatar Apr 24 '20 14:04 jgm

If there's going to be a special way to add attributes to the table, why not just {#id .class}

Works for me, if it works. I was just being wary of potential ambiguities, but now that I think about it, those are probably not an issue.

But I am also somewhat tempted by the "overloading fenced div" approach

It's not a great solution, because then there's no concise way to have a table in a div, which might be used for styling purposes or for marking parts for filters. Most notably, this breaks syntactical backward compatibility -- granted, probably only for a minority of edge cases, but I would argue it's a bad idea overall to tack unintuitive contextual semantics onto an existing syntax that has (in theory) a very specific meaning. In my experience, it will just lead to surprises down the line, and not the good kind.

Everything after the table itself could be considered the caption.

This would be especially painful in some cases. FWIW, I do this for code blocks in pandoc-crossref (with some limitations), but that's because it's one of the few bad options I have, and not because it's a good idea.

lierdakil avatar Apr 24 '20 14:04 lierdakil

because then there's no concise way to have a table in a div

One way to reduce this impact would be to require the table divs to be marked up somehow, e.g. with class table.

jgm avatar Apr 24 '20 16:04 jgm

One way to reduce this impact would be to require the table divs to be marked up somehow, e.g. with class table.

Which we're generally trying to avoid due to i18n concerns IIRC. So it'd be at best a stopgap.

lierdakil avatar Apr 24 '20 19:04 lierdakil

Making [captions] like definition list definitions might make sense

Yeah, or like blockquotes, but with : instead of >. Blockquotes are arguably a markdown feature more familiar to most users, and should be mostly the same except for the indentation rules?

[attributes] on a line by itself right before the table? (NB in my commonmark-hs I've implemented an extension allowing attributes to be placed on any block level element this way.)

ah yes, if that's a general principle that works, that's great as well.

About overloading the div syntax: I guess to make a final decision, that should be done as part of the figure syntax? #3177

For me, we could also decide to go ahead with implementing the grid table I posted in the original post of this issue, and worry about attributes and long captions later. Or should we do this directly in commonmark-hs? I'm not up to date on the state of progress there...

mb21 avatar Apr 25 '20 06:04 mb21

Yes, if someone wants to work on allowing col/rowspans in grid table syntax, that's fine and it can be done without deciding about captions and identifiers. The syntax you propose looks okay to me. I agree that the issues about captions and identifiers should be thought about in connection with figures.

commonmark-hs currently has pipe tables but I haven't tried to implement grid tables there. It would be good to do this, though!

jgm avatar Apr 26 '20 16:04 jgm

Just keep in mind that grid tables are really bad for multi-line cells. Pipe tables (à la AsciiDoc) are probably a better approach.

lrosenthol avatar Apr 30 '20 16:04 lrosenthol

See above for a link to some suggestions for pipe tables, which pandoc supports too. There's no reason we couldn't find a way to do col/rowspans in both kinds of tables.

jgm avatar Apr 30 '20 18:04 jgm

Just for the record the correct word for "row header" is stub.

bpj avatar Apr 30 '20 21:04 bpj

Any plans to support the new table features in the markdown writer?

rickywu avatar May 06 '20 01:05 rickywu

Yes, of course we'll need to support whatever formats we decide on in the writer too. I opened a new issue for that.

jgm avatar May 06 '20 17:05 jgm

To be honest, tables are some of the most annoying issues in Markdown, in particular if the table gets complex:

  • arbitrary markdown in the cells
  • need some control over the rendition.

I think there are contradicting requirements:

  • table shall be powerful
  • table shall appear as tabular in the source text

I therefore propose to support at least one table format that does not require the table to appear tabular in the source text, and that instead uses a more appropriate representation such as:

  • html
  • CALS-Table (in xml)
  • equivalent representation of the table as Yaml

bwl21 avatar Jun 15 '20 08:06 bwl21

I think there are contradicting requirements:

* table shall be powerful

* table shall appear as tabular in the source text

I tend to agree. While the original impetus of Markdown might have been to have a format that is simple enough to publish as-is, Pandoc Markdown is also meant to capture sufficient complexity to be the authoring format for conversion into multiple formats.

That having been said, pipe_tables (unlike grid_tables and simple_tables) allows for "compressed" or "non-aligned" tables, and so is easy enough to write as it doesn't require a "tabular"-looking table. And unlike a format like CSV, which is also easy to write, pipe_tables has the potential to allow for cell-level alignment, multiple header rows, colspans/rowspans, captions, and multi-line cells (to support unnumbered and numbered lists).

In particular, I like this proposal on a sufficiently-complex pipe_tables format, and think discussion around it would be beneficial: https://talk.commonmark.org/t/tables-in-pure-markdown/81/145

I also wouldn't be opposed to Pandoc Markdown natively supporting HTML5 tables syntax, since those too are simple to write and most end tags aren't required: https://talk.commonmark.org/t/tables-in-pure-markdown/81/124

I think it is also noteworthy that column spans and row spans are normally discouraged if your document is to be rendered accessibly by screen readers. So complex tables should generally be avoided whenever accessibility is a concern (as it usually should be).

the-solipsist avatar Sep 22 '20 12:09 the-solipsist

Here are some other thoughts of mine on the issue of pipe table extensions: https://talk.commonmark.org/t/tables-in-pure-markdown/81/134

jgm avatar Sep 22 '20 16:09 jgm

@jgm thanks for the pointer to some other thoughts.

I see that tables have been the subject of a long discussion, but I do not see any practical progress in this respect. How bad ...

So I really wish pandoc would support native HTML5 tables with markdown as table cell content. Then we would have a solution to the issue until the discussion converges.

bwl21 avatar Sep 22 '20 17:09 bwl21

Here are some other thoughts of mine on the issue of pipe table extensions: https://talk.commonmark.org/t/tables-in-pure-markdown/81/134

As far as I can see, the main feature-level differences (i.e., non-syntactical difference) between @jgm's proposal and aoudad's proposal are that aoudad's proposal provides for:

  • per-cell alignment
  • row headers

Features that neither proposal has:

The various features they both have in common are:

  • colspans
  • rowspans
  • table caption (different syntax: you don't explicitly define a caption format, but I presume table_captions would be your preference)
  • column header demarcation with support for multi-line headers
  • support for multi-line cells (different syntax)
  • per-column text alignment (left/centre/right), but not per-row alignment (top/middle/bottom)

It seems to me that syntactically they are mostly similar, with a couple of differences: multi-line cells (: vs. ! / +), and table captions ([Caption Text] (underneath) / |Caption Text| (in first cell) vs. : / Table: (above or underneath)).

I hope I didn't miss any important differences.

Do folks think it is worth having per-cell alignment and row headers?

At any rate, would it make sense to have a feature-rich non-graphical table syntax (such as HTML5's, which seems to be both easy to type since it can do away with most end tags and has all the required features) be readily understood by Pandoc such that it is convertible into multiple formats without needing a separate filter to accomplish this?

the-solipsist avatar Sep 22 '20 17:09 the-solipsist

As for a more powerful grid/pipe table syntax to me it is important that there is an easy way to mark a column as a stub (often erroneously called "row header") column or more generally to mark a cell as what in HTML terms is a TH element. I'm thinking perhaps replace the pipe(s) to the left (or to the right in an RTL document) with (a) bang(s).

|        | Head 1 | Head 2 | Head 3
|--------|--------|--------|--------
! Stub 1 |        |        |
! Stub 2 |        |        |
! Stub 3 |        |        |

Ideally the broken bar character ¦ U+00A6 could be used or even the double vertical line ‖ U+2016 to the right. Personally I see no problem with using non-ASCII — at least Latin-1 — punctuation for syntax but I can understand that there might be disagreement; I have all Latin-1 punctuation characters available on my Swedish Linux keyboard but not everyone may be so lucky.

Whichever characters are used for syntax it is important that they can be backslash escaped inside cell content.
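The two requirements here, a leading bang for stub rows and backslash escaping of the delimiter, can be sketched in a few lines of Python. This is only an illustration of the proposed (not implemented) syntax: the name `parse_row` is made up, the split happens on every `|` that is not preceded by a backslash, and an escaped backslash in front of a pipe is not handled.

```python
import re

# A pipe counts as a cell delimiter only if it is not preceded by a backslash.
UNESCAPED_BAR = re.compile(r"(?<!\\)\|")


def parse_row(line: str) -> tuple[bool, list[str]]:
    """Return (is_stub_row, cells) for one row of the proposed pipe syntax."""
    line = line.strip()
    is_stub = line.startswith("!")
    if line[:1] in ("|", "!"):
        line = line[1:]                      # drop the leading delimiter
    cells = [
        cell.strip().replace("\\|", "|")     # unescape pipes in cell content
        for cell in UNESCAPED_BAR.split(line)
    ]
    return is_stub, cells
```

For the example table above, the header row and each `!` row both come out with four cells, with the first cell of a `!` row flagged as a stub.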

bpj avatar Sep 22 '20 18:09 bpj

As for more powerful syntaxes which clash with the "tables-should look like tables" principle the most common requirement is probably the ability to write a table as a list of lists. I have written a filter which converts lists of lists into tables. Note that it currently only works with pandoc < 2.10 (if anybody understands the pandoc 2.10 table model a pull request is most welcome! :-), but it shows that the filter approach to this works well.

bpj avatar Sep 22 '20 18:09 bpj

Any news here?

ssfdust avatar Jan 08 '21 03:01 ssfdust

I think that a more powerful grid table format would be a good first step. Something like https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#grid-tables (perhaps extended to support multiple headers etc.). We could use the same parser to support rst grid tables with row/colspans. I still like the idea of extending pipe table syntax, but the grid table syntax is less controversial.

jgm avatar Jul 09 '21 18:07 jgm

I agree with using the HTML format; then we could use tui.editor (https://github.com/nhn/tui.editor) to render it. Just an idea.

rickywu avatar Jul 12 '21 01:07 rickywu

Hi @jgm. Since the pipe_tables format extension isn't yet settled, and grid_tables format needs extending too, how about the HTML5 suggestion?

I also wouldn't be opposed to Pandoc Markdown natively supporting HTML5 tables syntax, since those too are simple to write and most end tags aren't required: https://talk.commonmark.org/t/tables-in-pure-markdown/81/124

So I really wish pandoc would support native HTML5 tables with markdown as table cell content. Then we would have a solution to the issue until the discussion converges.

At any rate, would it make sense to have a feature-rich non-graphical table syntax (such as HTML5's, which seems to be both easy to type since it can do away with most end tags and has all the required features) be readily understood by Pandoc such that it is convertible into multiple formats without needing a separate filter to accomplish this?

the-solipsist avatar Jul 12 '21 13:07 the-solipsist

HTML5 tables: it's an interesting idea, but one must think about how this would interact with the way raw HTML currently works in pandoc's markdown.

The current expectation is that raw HTML will be passed through verbatim to HTML (and other formats that accept HTML, like markdown and epub), and that it will be ignored by other formats. Parsing HTML tables as native Table elements would violate that expectation and could lead to problems (e.g. for people who include both an HTML and a LaTeX version of a table to cover both formats).

There's also the issue of how it would interact with markdown_in_html_blocks (enabled by default), which allows text nodes in tables to be interpreted as markdown.

Just to throw out an idea that would avoid these issues, one could introduce an explicit fencing syntax that means: parse the following chunk of HTML (or whatever other format) using the appropriate pandoc reader, and include the result into the AST.

This would differ from our current "raw attribute" syntax, which always creates a RawBlock.

Example:

+++ html
 <table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table> 
+++

Of course, this would not degrade well in implementations that didn't support the special syntax. A sneakier approach would be to use HTML comments or processing instructions:

<?read?>
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table> 

The "read" instruction would tell pandoc to try to parse a following raw block (which could be raw latex, raw html, or raw anything using a fence and a raw attribute) and parse it from its native format. The advantage of this is that the instruction would just be ignored by implementations that don't support this feature (e.g. on GitHub), so you could at least get the HTML table out in HTML output, while with pandoc you'd have the increased power of being able to convert it to any format.
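A minimal Python sketch of the lookup step might look like this. It is purely hypothetical, since no such instruction exists in pandoc: it assumes `<?read?>` stands alone on a line and that the raw chunk it applies to runs until the next blank line. The function name `find_read_targets` is invented here, and the actual parsing of each chunk with the matching reader is left out.

```python
def find_read_targets(text: str) -> list[str]:
    """Return the raw chunks that directly follow a `<?read?>` line."""
    targets = []
    lines = text.splitlines()
    index = 0
    while index < len(lines):
        if lines[index].strip() == "<?read?>":
            index += 1
            chunk = []
            # Assumption: the raw block ends at the first blank line.
            while index < len(lines) and lines[index].strip():
                chunk.append(lines[index])
                index += 1
            targets.append("\n".join(chunk))
        else:
            index += 1
    return targets
```

An implementation that does not know the instruction would simply see an unknown processing instruction followed by an ordinary raw HTML block, which is the graceful degradation described above.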

jgm avatar Jul 12 '21 16:07 jgm

Alternatively we could have a special attribute in the HTML, e.g.

<table data-parse="1">
...
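Detecting such an opt-in attribute is straightforward. As a sketch only, the Python below uses the standard library's `html.parser` to report, for each `<table>` start tag, whether it carries `data-parse="1"`. The attribute name comes from the example above; the class and function names are made up for illustration, and nothing here reflects how pandoc's readers work.

```python
from html.parser import HTMLParser


class ParseFlagFinder(HTMLParser):
    """Record, for every <table> start tag, whether data-parse="1" is set."""

    def __init__(self):
        super().__init__()
        self.flagged = []   # one bool per <table>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.flagged.append(dict(attrs).get("data-parse") == "1")


def tables_to_parse(html: str) -> list[bool]:
    finder = ParseFlagFinder()
    finder.feed(html)
    finder.close()
    return finder.flagged
```

Tables without the attribute would keep today's behaviour and pass through as raw HTML.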

jgm avatar Jul 12 '21 16:07 jgm

For me it would be important that table cells can contain markdown (with lists, multiple paragraphs, even images).

Nested tables are IMHO less important.

I like the approach with the processing instruction.

<?pandoc table="parse-markdown"?>
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
    <th>Bio</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
   <td>
    Jill was born and had a good childhood. Then she
    * went to school
    *  went to university
    * got familiar with Pandoc

 now she is a happy user of [pandoc](www.pandoc.org)
  </td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
    <td>
    Eve was born and had a good childhood. Then she
    * went to school
    *  went to university
    * got familiar with Pandoc

 now she is a happy user of [pandoc](www.pandoc.org)
   </td>
  </tr>
</table>

<table data-pandoc="parse markdown"> is also fine.

bwl21 avatar Jul 12 '21 17:07 bwl21