
Warn about unprocessed Markdown in rendered HTML

Open ralfbiedert opened this issue 3 years ago • 12 comments

Bug Report

Environment

Zola version: 0.13

Background

I recently upgraded from Zola 0.9 to 0.13. Along the way Markdown processing must have changed:

0.9: [screenshot of the rendered page under 0.9]

0.13: [screenshot of the same page under 0.13]

Since the site is fairly long and there were no errors, it took me a few days to realize this had happened.

Expected Behavior

First of all, I'm not even sure if the new Markdown processing is a bug or a feature. I'm assuming it's a feature, and I can respect that authors might have to adjust their page a bit along the way.

However, it would be nice if Zola could warn that something broke that previously worked.

A minimum feature could be to simply warn if the rendered HTML contains `[xxx](yyy)`, `_xxx_` or `*xxx*` sequences, maybe as part of `zola check` (maybe even behind an extra option like `zola check --xxx`).
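To make that concrete, here is a rough sketch of what such a check might look for (this is not Zola code; the function name and patterns are just placeholders and would need tuning):

```rust
use regex::Regex;

/// Rough sketch: does a rendered HTML fragment still contain sequences
/// that look like unprocessed Markdown? Patterns are placeholders.
fn looks_like_unprocessed_markdown(rendered_html: &str) -> bool {
    let patterns = [
        r"\[[^\]\n]+\]\([^)\n]+\)", // leftover [xxx](yyy) links
        r"(^|\s)\*[^*\n]+\*(\s|$)", // leftover *xxx* emphasis
        r"(^|\s)_[^_\n]+_(\s|$)",   // leftover _xxx_ emphasis
    ];
    patterns
        .iter()
        .any(|p| Regex::new(p).unwrap().is_match(rendered_html))
}

fn main() {
    assert!(looks_like_unprocessed_markdown(
        "<p>see [the cheat sheet](https://cheats.rs)</p>"
    ));
    assert!(!looks_like_unprocessed_markdown(
        "<p>see <a href=\"https://cheats.rs\">the cheat sheet</a></p>"
    ));
}
```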

Steps to reproduce

Clone https://github.com/ralfbiedert/cheats.rs, render with 0.13 or 0.9.

Update

On a side note, starting a line with a shortcode like `{{ note() }}` was what caused the Markdown to render differently:

{{ note(note="1") }} Technically `async` transforms following code into anonymous, compiler-generated state machine type; `f()` instantiates that machine. <br>

ralfbiedert avatar Mar 24 '21 22:03 ralfbiedert

That's from a change in 0.12 to how shortcodes are integrated, I believe. In that case it replaces the shortcode with a <pre>, and since the line then starts with HTML, the Markdown parser ignores the rest of it.
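To illustrate the CommonMark behavior at play, here is a standalone sketch using pulldown-cmark (which Zola uses); the inputs are made up for the example and this is not the actual Zola code path:

```rust
use pulldown_cmark::{html, Parser};

fn main() {
    // A block that starts with <pre> is treated as a raw HTML block, so the
    // inline Markdown after it on the same line is passed through untouched.
    let inputs = [
        "<pre>1</pre> Technically `async` transforms the *following* code.",
        "Technically `async` transforms the *following* code.",
    ];
    for input in inputs {
        let mut out = String::new();
        html::push_html(&mut out, Parser::new(input));
        println!("{out}");
        // First input: the backticks and asterisks survive verbatim.
        // Second input: they are rendered as <code> and <em>.
    }
}
```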

I'm not sure how to fix that, tbh; the alternative was shortcodes randomly breaking depending on indentation. I believe you can pass the text to the shortcode and use the markdown filter in there.

A minimum feature could be to simply warn if the rendered HTML contains `[xxx](yyy)`, `_xxx_` or `*xxx*` sequences, maybe as part of `zola check` (maybe even behind an extra option like `zola check --xxx`).

That would have to be an option; otherwise every blog post talking about Markdown would fail it.

Keats avatar Mar 25 '21 09:03 Keats

I'm not sure if it would fall under this issue or a separate one, but it'd also be nice to have a warning (or a flag to enable a warning) to alert on malformed shortcodes instead of silently ignoring them.

ian-fox avatar Jun 12 '21 02:06 ian-fox

alert on malformed shortcodes

How do we know whether it's a malformed shortcode though?

Keats avatar Jun 12 '21 18:06 Keats

I'll do some testing, but at a glance I think alerting on the error from https://github.com/getzola/zola/blob/master/components/rendering/src/shortcode.rs#L126 might be helpful. We could also do a search afterwards for {%, as was mentioned in the original issue.
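Something like this for the search part, as a rough sketch (the name and pattern are placeholders):

```rust
use regex::Regex;

/// Rough sketch: flag anything in the rendered output that still looks like a
/// Tera/shortcode delimiter, which usually means a shortcode was malformed or
/// never expanded.
fn leftover_shortcode_delimiters(rendered_html: &str) -> Vec<String> {
    let delim = Regex::new(r"\{%|%\}|\{\{|\}\}").unwrap();
    delim
        .find_iter(rendered_html)
        .map(|m| m.as_str().to_owned())
        .collect()
}
```

This would also flag legitimate {{ or {% inside code samples, so it would have to stay opt-in.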

Will look into what errors Tera throws as well; possibly the problem could be addressed at that level.

ian-fox avatar Jun 16 '21 04:06 ian-fox

I still don't know if there is a good way to do that.

Keats avatar Dec 05 '21 20:12 Keats

How about a zola check --assert-no-strange-patterns (or a similar setting in config.toml; name TBD) which, via regex, checks that there are none of the following patterns?

  • [...](...)
  • `...`
  • *...*
  • _..._
  • whatever a malformed shortcode looks like, e.g. {{ } and { }}

Here, ... would mostly be alphanumeric characters and whitespace; in particular, any <> inside ... would make the test pass. That list could be expanded later, and people authoring blogs about Markdown would simply not enable the option. The logic could even be inverted right from the start, e.g. --ignore-strange-patterns, since most people write about things other than Markdown.
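For instance, a minimal sketch of that <> exclusion (the pattern is a placeholder):

```rust
use regex::Regex;

fn main() {
    // Sketch of the "mostly alphanumeric and whitespace inside" rule: only the
    // inner class [A-Za-z0-9 ]+ is allowed, so anything already containing
    // rendered HTML (i.e. < or >) does not match.
    let strict_emphasis = Regex::new(r"\*[A-Za-z0-9 ]+\*").unwrap();

    assert!(strict_emphasis.is_match("leftover *some text* here")); // flagged
    assert!(!strict_emphasis.is_match("prose with *<em>markup</em>* in it")); // passes
}
```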

ralfbiedert avatar Dec 05 '21 21:12 ralfbiedert

The issue is that you can also have HTML or any kind of script in the Markdown file; doing it via regex is going to be very error-prone.

Keats avatar Dec 06 '21 12:12 Keats

I might be missing something, but I've sketched what I mean here:

https://github.com/ralfbiedert/rust_issues/tree/zola_unprocessed_markdown

The idea is to process the generated HTML and search for text nodes containing broken patterns. I've tested this with two versions of my site: the clean one is the vanilla version, and the broken one has a simulated issue like the one in my original ticket.
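The gist of it, heavily condensed (this is not the actual code from that branch; it uses the scraper and regex crates here purely to show the approach):

```rust
use regex::Regex;
use scraper::Html;

/// Condensed sketch: walk the text nodes of a rendered page and report the
/// ones that still look like raw Markdown links.
fn suspicious_text_nodes(rendered_html: &str) -> Vec<String> {
    let leftover_link = Regex::new(r"\[[^\]\n]+\]\([^)\n]+\)").unwrap();
    let doc = Html::parse_document(rendered_html);
    doc.root_element()
        .text() // iterates over descendant text nodes only, not tags
        .filter(|node| leftover_link.is_match(node))
        .map(|node| node.trim().to_owned())
        .collect()
}
```

Text inside <script> blocks is still visited here, so JS snippets would need to be excluded to avoid false positives.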

ralfbiedert avatar Dec 09 '21 21:12 ralfbiedert

https://github.com/ralfbiedert/rust_issues/commit/6cd1f1b370ba908f5f8d5893739978c897898f12#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fcR65-R68 are all valid. It could be that the user decided to put <p>*some text*</p> in their Markdown. Or even a shortcode outputting some JS like

/* does foo */
function() {}

wouldn't pass that test. I don't think there is a way to detect unprocessed content reliably, so I wouldn't try it.

Keats avatar Dec 09 '21 22:12 Keats

I buy that I might have a niche use case, that the effort of implementing these checks is (currently) high, and that they shouldn't be enabled by default.

However, if you ever implement a modular checking system where such checks can be added and enabled one by one, I'd happily enable most of them, as they apparently (see the link above) would have been an effective measure against some bugs on my site.

Now that I think about it, I see these more as a "clippy"-style system of quality lints to opt into; they may have false positives, but people can adjust them to the needs of their site. Anyway, no worries, it would just be a nice-to-have feature for me, not a critical one.

ralfbiedert avatar Dec 09 '21 22:12 ralfbiedert

Can that be done with ripgrep on the public/ directory in the meantime? The issue with false-positive lints is that you need a way to skip some, so that means adding some new concept to the markdown parser just for that.

Keats avatar Dec 13 '21 19:12 Keats

Can that be done with ripgrep on the public/ directory in the meantime?

Yes, and I think it must be done on public/ as it needs both an HTML parser like visdom to walk the text nodes and a regex to check node content (at least if you want to minimize false positives).

so that means adding some new concept to the markdown parser just for that.

I don't really know Zola's architecture, but extending the lint approach, it would be fantastic if Zola could also check the generated output, which would allow you to:

  • verify the HTML is well-formed (I've shipped sites with missing closing tags a few times already)
  • check that local src or meta attributes point to existing resources (same here, especially for the more subtle cases like CSS and fonts)

In other words, there might be some feature and correctness benefits to looking at more than just the Markdown files.
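To make the second check concrete, a rough sketch (the crate choice and names here are placeholders, and it only looks at site-absolute paths):

```rust
use scraper::{Html, Selector};
use std::path::Path;

/// Placeholder sketch: collect local, site-absolute src/href values in a
/// rendered page that don't resolve to a file under the output directory.
fn dangling_local_refs(rendered_html: &str, public_dir: &Path) -> Vec<String> {
    let doc = Html::parse_document(rendered_html);
    let sel = Selector::parse("[src], [href]").unwrap();
    let mut missing = Vec::new();
    for element in doc.select(&sel) {
        for attr in ["src", "href"] {
            if let Some(value) = element.value().attr(attr) {
                // Only check references like "/fonts/foo.woff2"; skip
                // protocol-relative and external URLs.
                if value.starts_with('/') && !value.starts_with("//") {
                    let target = public_dir.join(value.trim_start_matches('/'));
                    if !target.exists() {
                        missing.push(value.to_owned());
                    }
                }
            }
        }
    }
    missing
}
```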

ralfbiedert avatar Dec 13 '21 21:12 ralfbiedert