pandoc
Don't emit unnecessary classes in HTML tables
It is currently not possible to prevent pandoc from adding attributes to the HTML output from a Markdown input (e.g. .header, .odd, .even in the ReprEx below). It is only possible to drop attributes using filters.
Since neither the CommonMark nor the GitHub Flavored Markdown spec mentions default attributes in HTML output, shouldn't this be opt-in? Or at least possible to opt out?
ReprEx
Using e.g. this input:
| foo | bar |
| --- | --- |
| baz | bim |
| baz | bim |
And converting to HTML using pandoc --from gfm --to html5, we get:
<table>
<thead>
<tr class="header">
<th>foo</th>
<th>bar</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>baz</td>
<td>bim</td>
</tr>
<tr class="even">
<td>baz</td>
<td>bim</td>
</tr>
</tbody>
</table>
These are harmless; just don't add CSS rules that do things with them.
I don't think adding the option to avoid these is worth the increase in complexity.
Further note: the commonmark spec says
Note that not every feature of the HTML samples is mandated by the spec. For example, the spec says what counts as a link destination, but it doesn’t mandate that non-ASCII characters in the URL be percent-encoded. To use the automatic tests, implementers will need to provide a renderer that conforms to the expectations of the spec examples (percent-encoding non-ASCII characters in URLs). But a conforming implementation can use a different renderer and may choose not to percent-encode non-ASCII characters in URLs.
The spec is not about HTML output, it's about specifying how the commonmark document should be parsed into a structured document.
These are harmless; just don't add CSS rules that do things with them.
I agree this is not a big deal. However, these are common class names that are likely to be used elsewhere in a project. One would have to either strip them in order to reuse the class names, or use less meaningful class names.
Agreed -- we could use something like table-header, even-row, odd-row. Of course, it would be a backwards-incompatible change, so I'm not sure it's a good idea.
My concern was for the sake of dropping "unnecessary" classes to prevent name clashes, since we can easily select .header using thead and .odd/.even using variations of tbody tr:nth-child(2n).
Anyway, if you think these classes are useful, you can close the issue as is.
Thanks a lot for your work on Pandoc!
My concern was for the sake of dropping "unnecessary" classes to prevent name clashes, since we can easily select .header using thead and .odd/.even using variations of tbody tr:nth-child(2n).
This is true, though it wasn't in earlier versions of pandoc (when nth-child wasn't supported in CSS and we didn't put the header in thead!).
I think it might be worth stopping using these classes, and using the alternative you suggest instead.
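To illustrate why the classes carry no extra information, here is a minimal Python sketch (standard library only) that walks the table from the ReprEx and derives the same labels purely from structure. Rows inside thead correspond to .header, and rows inside tbody alternate odd/even by position, which is what `thead tr`, `tbody tr:nth-child(odd)` and `tbody tr:nth-child(even)` select in CSS. The class name `RowLabeler` and its attributes are made up for this example, and only the single-tbody case shown above is covered.

```python
from html.parser import HTMLParser


class RowLabeler(HTMLParser):
    """Collect (class emitted by pandoc, label derived from position) per <tr>."""

    def __init__(self):
        super().__init__()
        self.section = None  # "thead" or "tbody" while inside one
        self.body_row = 0    # 1-based row index within the current tbody
        self.rows = []       # list of (emitted_class, derived_label)

    def handle_starttag(self, tag, attrs):
        if tag in ("thead", "tbody"):
            self.section = tag
            self.body_row = 0
        elif tag == "tr":
            emitted = dict(attrs).get("class")
            if self.section == "thead":
                derived = "header"
            else:
                self.body_row += 1
                derived = "odd" if self.body_row % 2 else "even"
            self.rows.append((emitted, derived))

    def handle_endtag(self, tag):
        if tag in ("thead", "tbody"):
            self.section = None


# The HTML from the ReprEx, condensed onto fewer lines.
html = """<table>
<thead>
<tr class="header"><th>foo</th><th>bar</th></tr>
</thead>
<tbody>
<tr class="odd"><td>baz</td><td>bim</td></tr>
<tr class="even"><td>baz</td><td>bim</td></tr>
</tbody>
</table>"""

labeler = RowLabeler()
labeler.feed(html)
for emitted, derived in labeler.rows:
    print(emitted, derived)
```

For this input, every emitted class equals the label derived from position, so a stylesheet can target the rows without relying on the class names.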
Could this be a "good first issue"?
Yes, it would be an easy one -- just have to change the HTML writer, the styles.html template, and some tests I think.
Could you point me to the HTML writer please? (I'll have a look but I don't know Pandoc internals nor haskell...)
I think that if you can't find the HTML writer yourself, you're unlikely to be able to fix this issue, so I'll leave that as an exercise to the reader. :)
For what it's worth, I just ran into this issue- I was using the classname header and I didn't want or expect the class to be on every table's first tr*. I would've used a lua filter to omit it, but I can't find a way to remove classes via a filter- I suspect that's related to #684?
But uh, my use case is incredibly silly.
I'm building my site with Pandoc, and I want it to be able to render readably on the Nintendo DS Browser. That browser throws out CSS rules for elements it doesn't recognize, and it doesn't recognize most semantic html elements including header- it does recognize CSS rules for classes though, so I (reasonably I thought) assigned a class header to the element header, and then moved all my header style rules to that class. And that worked great, until I spotted every first tr* with its content squashed to the left. Anyway, I'll just rename my header for now.
Funny enough, my silly usecase is exactly the one where I'd still want the odd/even/header classes to style, since I explicitly want to target older browsers that lack the CSS.
I'm only familiar with pandoc as a user, not a dev, but why not shunt these classes into an extension rather than remove them entirely?
*: corrected
Can you give an example of your (markdown?) source? If I understand you correctly you are adding a class .header to your heading elements like
## Heading {.header}
which conflicts with the header class which Pandoc adds automatically to <thead> elements in HTML?
It does indeed seem like your options are
1. to use another custom class name like .heading,[^1] or
2. to post-process your HTML, removing the header class from <thead> elements.
Obviously 1 is the easier option if possible.
[^1]: A class .heading has the advantage of being terminologically correct: tables have headers but sections have headings. Pandoc calling its heading class Header is a misnomer (which it is too late to change!)
This minimal Lua filter will change the class on all "Header" elements.
-- Predicate function to filter out 'header' class
local is_not_header_class = function(x)
  return 'header' ~= x
end

-- Global function to process heading elements
Header = function(head)
  -- -- Restrict to heading levels 2 through 4 (<h2>, <h3>, <h4>)
  -- if 2 > head.level or 4 < head.level then
  --   return nil -- leave unchanged
  -- end
  if head.classes:includes('header') then
    -- Make sure not to remove other classes!
    head.classes = head.classes:filter(is_not_header_class)
    head.classes:insert(1, 'heading')
    return head
  end
  -- Else leave unchanged
  return nil
end
If I understand you correctly you are adding a class .header to your heading elements
Ah, sorry, I'm adding a class to my header element, which is not one that pandoc emits. I'm substituting my generated markdown into an HTML template to make the base of every page, since there's a fair amount of web specific stuff that markdown doesn't want or need to do.
My template, trimmed some:
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<!-- metadata -->
</head>
<div class="body">
<div class="-header">
<!-- initial page content -->
</div>
<div class="article">
$for(include-before)$ $include-before$ $endfor$ $body$
$for(include-after)$ $include-after$ $endfor$
</div>
<div class="footer">
<!-- end of page content -->
</div>
</div>
</html>
Example page:
---
title: test page!
date: 2024-05-30
author: gregdan3
description: a secret test page for all my formatting
---
# Tables
| center aligned | left aligned | right aligned | default alignment |
| :--------------------: | :------------ | ------------: | ----------------- |
| Item1.1 | Item2.1 | Item3.1 | Item4.1 |
| **_bold italic item_** | Item2.2 | Item3.2 | `mono item` |
| Item1.3 | **bold item** | Item3.3 | Item4.3 |
| Item1.4 | Item2.4 | Item3.4 | Item4.4 |
Gluing these together:
cat pages/test.md | pandoc --lua-filter=pandoc/filters.lua --from=markdown+yaml_metadata_block+wikilinks_title_after_pipe-definition_lists-smart \
--template=templates/default.html \
--metadata="directory:test.md" \
-o build/test.html
And the result:
<!DOCTYPE html>
<html lang="en-US">
<head>
<!-- metadata -->
</head>
<div class="body">
<div class="-header">
<!-- initial page content -->
</div>
<div class="article">
<h1 id="tables">Tables</h1>
<table>
<thead>
<tr class="header">
<th style="text-align: center;">center aligned</th>
<th style="text-align: left;">left aligned</th>
<th style="text-align: right;">right aligned</th>
<th>default alignment</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Item1.1</td>
<td style="text-align: left;">Item2.1</td>
<td style="text-align: right;">Item3.1</td>
<td>Item4.1</td>
</tr>
<tr class="even">
<td style="text-align: center;"><strong><em>bold italic
item</em></strong></td>
<td style="text-align: left;">Item2.2</td>
<td style="text-align: right;">Item3.2</td>
<td><code>mono item</code></td>
</tr>
<tr class="odd">
<td style="text-align: center;">Item1.3</td>
<td style="text-align: left;"><strong>bold item</strong></td>
<td style="text-align: right;">Item3.3</td>
<td>Item4.3</td>
</tr>
<tr class="even">
<td style="text-align: center;">Item1.4</td>
<td style="text-align: left;">Item2.4</td>
<td style="text-align: right;">Item3.4</td>
<td>Item4.4</td>
</tr>
</tbody>
</table>
</div>
<div class="footer">
<!-- end of page content -->
</div>
</div>
</html>
Also, I was mistaken before; it was the first tr being given the class header, not thead.
I think you need to post-process the HTML.
I would do it with either of
- Perl and Mojo::DOM (but be warned: this depends on all of Mojolicious!)
- Python and BeautifulSoup.
These two have pretty similar interfaces:
Perl code:
use 5.016;
use utf8;
use strict;
use warnings;
use warnings FATAL => 'utf8';
use autodie;

use Path::Tiny qw[path];
use Mojo::DOM;

my $file = path 'test.html';
my $html = $file->slurp_utf8;
my $dom  = Mojo::DOM->new($html);

my $fix_classes = sub {
    my($elem) = @_;
    if ( 'header' eq $elem->{class} ) {
        delete $elem->{class};
    }
    else {
        $elem->{class} =~ s!\bheader\b!!;
    }
};

$dom->find('tr.header')->each($fix_classes);
$file->spew_utf8($dom);
Python code:
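from bs4 import BeautifulSoup

# Read the HTML that pandoc produced
with open('test.html', mode='r', encoding='UTF-8') as fh:
    text = fh.read()
soup = BeautifulSoup(text, 'html.parser')

for tr in soup.select('tr.header'):
    if 1 == len(tr['class']):
        # 'header' is the only class, so drop the attribute entirely
        del tr['class']
    else:
        # Keep any other classes
        tr['class'] = [c for c in tr['class'] if 'header' != c]

# Write the result back; note that prettify() re-indents the markup
with open('test.html', mode='w', encoding='UTF-8') as fh:
    fh.write(soup.prettify())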
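If pulling in BeautifulSoup or Mojolicious is not an option, the same clean-up can be sketched with Python's standard library alone. This is a rough regex-based version that assumes the `<tr class="...">` shape pandoc emits in the output above; it is not a general HTML rewriter. The function name `strip_row_classes` and the `UNWANTED` set are choices made for this example, and unlike the two scripts above it also removes odd and even by default.

```python
import re

# Classes pandoc adds to table rows, per the output shown earlier in the thread.
UNWANTED = {"header", "odd", "even"}

# Matches an opening <tr ...> tag that carries a double-quoted class attribute.
TR_CLASS = re.compile(r'<tr\b([^>]*?)\s+class="([^"]*)"([^>]*)>')


def strip_row_classes(html, unwanted=UNWANTED):
    """Remove the unwanted class names from every <tr>, keeping any others."""

    def fix(match):
        before, classes, after = match.groups()
        kept = [c for c in classes.split() if c not in unwanted]
        # Drop the attribute entirely when no class is left.
        attr = ' class="{}"'.format(" ".join(kept)) if kept else ""
        return "<tr{}{}{}>".format(before, attr, after)

    return TR_CLASS.sub(fix, html)


sample = (
    '<tr class="header">\n<th>foo</th>\n</tr>\n'
    '<tr class="odd custom">\n<td>baz</td>\n</tr>'
)
cleaned = strip_row_classes(sample)
print(cleaned)
```

To strip only header, as the scripts above do, pass `unwanted={"header"}`. Reading test.html and writing the result back works the same way as in the BeautifulSoup version.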