pandoc Table row and column groups from Markdown

Describe your proposed improvement and the problem it solves.

It would be good to be able to create HTML's row groups (multiple tbodys) and column groups (colgroup) from Markdown. They are useful because such groups carry semantics that styling can reflect, e.g. with (stronger) divider lines, which makes orientation easier in larger tables.

The internal Pandoc representation of tables allows for the presence of multiple table bodies, but as far as I can tell there is no way to create such a structure from Markdown input.

An obvious approach would be to extend grid tables to support a third kind of divider character besides - and =, but ASCII does not contain any suitable horizontally oriented characters. One possibility is to modify the edge character, +. For example, a divider in which the first and last edge are * instead of + would separate different table bodies.

Another structural feature of HTML tables which does not seem to be represented by Pandoc internally is column groups. If they were to be implemented, column groups could be created from Markdown in an analogous way: the uppermost and lowermost + of a vertical divider would be * instead of +.

Example:

+----------*----------+----------+
| Header 1 | Header 2 | Header 3 |
+==========+==========+==========+
|   Row 1  |   Data   |   Data   |
+----------+----------+----------+
|   Row 2  |   Data   |   Data   |
*----------+----------+----------*
|   Row 3  |   Data   |   Data   |
+----------+----------+----------+
|   Row 4  |   Data   |   Data   |
+==========+==========+==========+
|  Footer  |   Data   |   Data   |
+----------*----------+----------+

In this table there are two column groups, containing 1 and 2 columns respectively, and two row groups, containing 2 rows each.
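To make the proposed reading of the example concrete, here is a minimal Python sketch of how the `*` markers could be interpreted. It is not pandoc code and not an existing pandoc feature; the function names `column_groups` and `row_groups` are hypothetical, and the sketch assumes a well-formed grid in which every table row occupies exactly one text line and the body lies between the two `=` dividers.

```python
EXAMPLE = """\
+----------*----------+----------+
| Header 1 | Header 2 | Header 3 |
+==========+==========+==========+
|   Row 1  |   Data   |   Data   |
+----------+----------+----------+
|   Row 2  |   Data   |   Data   |
*----------+----------+----------*
|   Row 3  |   Data   |   Data   |
+----------+----------+----------+
|   Row 4  |   Data   |   Data   |
+==========+==========+==========+
|  Footer  |   Data   |   Data   |
+----------*----------+----------+
"""


def column_groups(lines):
    """Column counts per group, read from '*' on the top and bottom borders."""
    top, bottom = lines[0], lines[-1]
    # Positions of all corner/junction characters on the top border.
    junctions = [i for i, ch in enumerate(top) if ch in "+*"]
    groups, count = [], 0
    for pos in junctions[1:]:
        count += 1
        # A group ends at the right edge or where both borders carry '*'.
        if pos == junctions[-1] or (top[pos] == "*" and bottom[pos] == "*"):
            groups.append(count)
            count = 0
    return groups


def row_groups(lines):
    """Row counts per table body, split at dividers that start and end with '*'."""
    # Dividers drawn with '=' delimit head and foot (assumption of this sketch).
    eq = [i for i, ln in enumerate(lines) if ln[0] in "+*" and "=" in ln]
    start = eq[0] if eq else 0
    end = eq[1] if len(eq) > 1 else len(lines) - 1
    groups, count = [], 0
    for i in range(start + 1, end + 1):
        ln = lines[i]
        if ln.startswith("|"):
            count += 1  # assumes one text line per table row
        elif i == end or (ln[0] == "*" and ln[-1] == "*"):
            groups.append(count)
            count = 0
    return groups


lines = EXAMPLE.strip().splitlines()
print(column_groups(lines), row_groups(lines))  # prints [1, 2] [2, 2]
```

The result matches the description: column groups of 1 and 2 columns, and two row groups of 2 rows each.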

Describe alternatives you've considered.

The alternative would be to directly include HTML code.

Jun 07 '24 19:06 allefeld

(...) first and last edge are * instead of + would separate different table bodies.

(...) [for] column groups (...) uppermost and lowermost + of a vertical divider would be * instead of +

Interesting, but the problem could be visibility. Maybe 'o' instead of '*'?

Also, for more complex tables, you eventually have to consider the (sub)header in a pandoc Table Tbody, for which the '=' separator is a good candidate.

Also (bis), in principle two tbodies could have different Row Head Column numbers.

(...) a third kind of divider character besides - and =, but ASCII does not contain any suitable horizontally oriented characters

I've played with a tentative syntax in my own experimental md Table reader/writer during the pandemic, where for the subtable (tbody) division I use the '~' separator.

If you don't mind, I could give examples below (although I think this could be in Discussions section).

Jun 07 '24 22:06 kysko

o instead of * would be fine, too.

I didn't think of ~ for the horizontal, makes sense, but what about the vertical?

I wasn't aware of intermediate heads.

Please feel free to add examples, or to start a discussion referencing this.

Jun 08 '24 17:06 allefeld

Firstly, I see I have not really addressed your colgroups issue. I mentioned Row Head Columns (RHC), which are a kind of column grouping within a TBody, but that's not the same thing.

As you say, multiple colgroups are not represented internally, although they could be encoded somehow in the Table attributes.

Some support for grid table attributes was added in pandoc 3.1.11.1, at least for an ID, placed at the end of a grid table caption; if it could be extended to arbitrary attributes, that would be an option for expressing colgroups even if they are never explicitly represented visually in the grid table itself.

I wasn't aware of intermediate heads

These subheads can complicate the syntax.

Let's say the usual = separator is chosen for TBody subhead. Consider then the following, using either your "bookend syntax" (if I may call it that) (I'll use o for visibility), or the ~ separator:

+-----+   or  +-----+
| A   |       | A   |
+=====+       +=====+
| B   |       | B   |
o-----o       +~~~~~+
| C   |       | C   |
+-----+       +-----+

In either syntax above, we would have two TBodies. But, do we have a first TBody with subhead A and body B, or do we have two TBodies, B and C, with A as TableHead?

Any syntax must be able to distinguish those two cases. Let's illustrate this by some possible solutions.

With the ~ separator, an ugly solution I had come up with was to introduce a double separator when there was a TableHead, which would give:

: A is a THead          or     : A is Not a THead
                               
+-----+                        +-----+   <- TBody 1              
| A   |   <- THead             | A   |      <- subhead of TBody 1
+=====+                        +=====+                           
+~~~~~+                        | B   |      <- body of TBody 1   
| B   |   <- TBody 1           +~~~~~+                           
+~~~~~+                        | C   |   <- TBody 2              
| C   |   <- TBody 2           +-----+                           
+-----+

(The distinct = separator for the THead gives a distinct line on which to place the global column alignments.)

But now I see that your idea would solve it in a cleaner way, since we can consider the THead as some kind of special TBody:

: A is a THead          or     : A is Not a THead
                               
+-----+                        +-----+   <- TBody 1
| A   |   <- THead             | A   |      <- subhead of TBody 1
o=====o                        +=====+
| B   |   <- TBody 1           | B   |      <- body of TBody 1
o-----o                        o-----o
| C   |   <- TBody 2           | C   |   <- TBody 2
+-----+                        +-----+

(However, it would "steal" the alignment locations from the cells immediately below (which only matters if individual cell alignment on the grid is ever officially implemented).)

I'd prefer a distinct separator for subtables, but I admit getting rid of that double separator in the case above is satisfying.

Another possible solution: maybe use your bookend indicators for the subtables, and use the ~ separator for the subheads?

: A is a THead          or     : A is Not a THead
                               
+-----+                        +-----+   <- TBody 1
| A   |   <- THead             | A   |      <- subhead of TBody 1
+=====+                        +~~~~~+
| B   |   <- TBody 1           | B   |      <- body of TBody 1
o-----o                        o-----o
| C   |   <- TBody 2           | C   |   <- TBody 2
+-----+                        +-----+
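As a rough check that this last variant really distinguishes the two cases, here is a small Python sketch of a reader for single-column tables under these rules: `=` closes the THead, `~` closes a TBody subhead, and a divider bookended by `o` starts a new TBody. The function `parse_single_column` and its output shape are hypothetical, made up only for this illustration, and it assumes one text line per row.

```python
def parse_single_column(table):
    """Return (head_rows, bodies) for a one-column grid table."""
    head = []
    bodies = [{"subhead": [], "rows": []}]
    pending = []  # rows seen since the last divider
    for raw in table.strip().splitlines()[1:]:  # skip the top border
        ln = raw.strip()
        if ln.startswith("|"):
            pending.append(ln.strip("| "))
        elif "=" in ln:
            # '=' divider: the rows above form the table head.
            head, pending = pending, []
        elif "~" in ln:
            # '~' divider: the rows above form the current body's subhead.
            bodies[-1]["subhead"], pending = pending, []
        else:
            # '-' divider: flush ordinary rows into the current body.
            bodies[-1]["rows"].extend(pending)
            pending = []
            if ln.startswith("o") and ln.endswith("o"):
                bodies.append({"subhead": [], "rows": []})
    return head, bodies


WITH_THEAD = """
+-----+
| A   |
+=====+
| B   |
o-----o
| C   |
+-----+
"""

WITH_SUBHEAD = WITH_THEAD.replace("=", "~")

print(parse_single_column(WITH_THEAD))
print(parse_single_column(WITH_SUBHEAD))
```

In the first call A ends up in the head and both bodies have an empty subhead; in the second the head is empty and A is the subhead of the first body, so the two readings no longer collide.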

Anyways... just throwing ideas out there...

(There was a similar problem with the TableFoot, but tarleb solved it by imposing a last separator with =.)

what about the vertical?

A possible character for vertical separator is § (for RHC's), but it's admittedly a bit ugly, and not ASCII (and if we accept non-ASCII, there is a better candidate in the box-drawing group (U+2551)).

Elsewhere, I think some have suggested double pipes (like U+2551, but as two characters).

Jun 09 '24 02:06 kysko

The current table syntax already leads to subtle mistakes, such as #9740. Adding more syntax to grid tables is very likely going to lead to more such problems.

The tilde ~ isn't being used yet in table syntax, but it looks similar to a dash in most fonts, which would add one more potential source of problems for authors.

I tend to think that a lightweight markup format like Markdown is just not suited to expressing this level of detail, and that the preferable solution is just to resort to raw HTML.

Jun 10 '24 09:06 tarleb

What about a +html_tables extension allowing the Markdown parser to parse HTML tables, presumably with a custom markdown=parse attribute on the <table> element to allow and parse Markdown syntax inside cells and caption?

I'm thinking of writing a new list-to-table filter with potentially three list levels where the top level is - head - body - foot, probably subject to an attribute on the enclosing div.
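A minimal sketch of what such a three-level list could map to, written in plain Python rather than as an actual pandoc filter. Everything here is an assumption for illustration: the function name `list_to_table`, the two-space indentation, the placeholder text on row items, and the rule that each top-level `body` item becomes its own table body.

```python
def list_to_table(text):
    """Turn a three-level bullet list (section > row > cell) into a dict."""
    table = {"head": [], "bodies": [], "foot": []}
    rows = None  # row list of the section currently being filled
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip())) // 2  # two spaces per level
        item = line.strip()[2:].strip()                # drop the "- " marker
        if depth == 0:
            rows = []
            if item == "body":
                table["bodies"].append(rows)  # every "body" is a new tbody
            elif item in ("head", "foot"):
                table[item] = rows
            else:
                raise ValueError(f"unknown section: {item!r}")
        elif depth == 1:
            rows.append([])        # second level: start a new row
        else:
            rows[-1].append(item)  # third level: a cell
    return table


SOURCE = """\
- head
  - row
    - Name
    - Value
- body
  - row
    - a
    - 1
- body
  - row
    - b
    - 2
"""

table = list_to_table(SOURCE)
print(table)
```

With this input the result has one head row and two separate bodies, which is the multiple-tbody structure the issue asks for.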

Jun 10 '24 09:06 bpj

The native_divs and native_spans extensions control somewhat similar functionality, so there's a case to be made for native_tables and native_figures extensions. But maybe this functionality should be left to filters, I'm not sure.

Jun 10 '24 12:06 tarleb

I see the appeal of html_tables, but anything that combines HTML parsing with markdown parsing gets to be a giant pain.

I think some version of "list tables" might be a better approach for complex layouts.

Jun 10 '24 15:06 jgm

I admit that this is stretching the abilities of readable Markdown, and using tables in HTML markup is an alternative, if they are parsed and transformed for non-HTML output. "List tables" would be fine, too.

Downgrading my request: Could we get colgroup representation in the AST, read from HTML, and with support for Word and LaTeX output?

More generally speaking, when preparing tables for a paper, I had the idea that a dedicated external file format for them could be useful. In an academic workflow, figures are created separately and then included and combined with captions etc. Why don't we do the same with tables? The table-file format could e.g. simply be HTML, stripped down to what the AST supports. They would then be included with the same syntax as figures, ![caption](table-file.html).

Jun 10 '24 16:06 allefeld

I thought about this some more, and I now agree that Markdown is not the right format for this. Even though list tables might be useful, or at least easier to edit, they already push the limits of the "readable in source" maxim of Markdown. If they were to represent the full complexity of the HTML table model, or even just the simplified Pandoc AST table model, one would have unmanageable blobs of code inside a document which is supposed to be readable. I'll therefore pursue the idea of an external file format outlined in my last comment.

At the risk of being annoying:

Could we get support for multiple colgroups in the AST? Alternatively, support for classes on colspecs?

The reason:

Some of the tables I'd like to create have "header columns" and "footer columns", sometimes called the stub: the horizontal equivalent to TableHead and TableFoot. They could be rendered e.g. with bold font and / or separated by a line from the rest of the table.

[Image: illustration of the stub concept]

A simple example would be a table of correlations between variables, where variables occur as both column names and row names, or a distance table, where places occur as both column names and row names.

AST support is important to get this structure across different output formats.

Nov 03 '24 03:11 allefeld

We already have AST support for this. https://hackage.haskell.org/package/pandoc-types-1.23.1/docs/Text-Pandoc-Definition.html#t:TableBody RowHeadColumns specifies the number of "stub" columns. This may not yet be supported in all the writers.
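For anyone wanting to inspect this value, here is a small Python sketch that reads RowHeadColumns out of pandoc's JSON output. It assumes the JSON layout used with pandoc-types 1.23, where a Table's content is a six-element array (attr, caption, colspecs, head, bodies, foot) and each TableBody is serialized as `[attr, rowHeadColumns, headRows, bodyRows]`; please verify against your pandoc version. The function name `row_head_columns` is made up for this example, and only top-level tables are examined.

```python
import json


def row_head_columns(doc):
    """List the RowHeadColumns value of every body of every top-level table."""
    result = []
    for block in doc["blocks"]:
        if block["t"] == "Table":
            # Assumed layout: [attr, caption, colspecs, head, bodies, foot]
            _attr, _caption, _colspecs, _head, bodies, _foot = block["c"]
            # Assumed layout per body: [attr, rowHeadColumns, headRows, bodyRows]
            result.append([body[1] for body in bodies])
    return result


# Hand-written stand-in for `pandoc -t json` output: one empty table whose
# single body declares one stub column.
EMPTY_ATTR = ["", [], []]
SAMPLE = json.dumps({
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [{
        "t": "Table",
        "c": [
            EMPTY_ATTR,
            [None, []],                  # caption
            [],                          # colspecs
            [EMPTY_ATTR, []],            # head
            [[EMPTY_ATTR, 1, [], []]],   # bodies
            [EMPTY_ATTR, []],            # foot
        ],
    }],
})

print(row_head_columns(json.loads(SAMPLE)))  # prints [[1]]
```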

Nov 04 '24 18:11 jgm

Thanks!

I missed that, probably because I was looking in ColSpec, or anything related to "col".

Nov 04 '24 19:11 allefeld

Possibly relevant to this discussion:

Quarto can and by default does process HTML tables within raw HTML blocks and translates them so they can be output to other formats. I have not yet checked whether that supports features beyond Markdown tables. See Tables / HTML Tables in the Quarto documentation.

Nov 08 '24 21:11 allefeld
