
malformed characters in RSS and Atom feeds?

Open aetherknight opened this issue 1 month ago • 9 comments

Both the RSS and Atom feeds, starting with today's (November 19, 2025) issue, seem to be broken and fail XML validation. My FreshRSS instance, as well as Firefox, complains about the XML not being well-formed.

For example, Firefox reports:

XML Parsing Error: not well-formed Location: https://this-week-in-rust.org/atom.xml Line Number 121, Column 322:

Firefox shows:

Image Image

If I view source, I see this oddity:

Image

aetherknight avatar Nov 20 '25 03:11 aetherknight

I suspect it's caused by https://github.com/rust-lang/this-week-in-rust/pull/7276/files

aetherknight avatar Nov 21 '25 01:11 aetherknight

I can confirm I also see this issue. Just got the website building on my machine, I'll see what I can find.

chris-t-jansen avatar Nov 21 '25 22:11 chris-t-jansen

Removing the contents of #7276 does fix the RSS feed issue on my machine, I'll see if I can fix the contents to prevent the RSS feed issue.

chris-t-jansen avatar Nov 21 '25 22:11 chris-t-jansen

I believe I've fixed this in #7305. I would love it if someone else could confirm that they can reproduce the original issue and that my solution works.

chris-t-jansen avatar Nov 21 '25 23:11 chris-t-jansen

I mentioned this in the PR, but I do think there should probably be an automated test to catch this sort of thing if it's subtle enough to slip under the radar but severe enough to cause the RSS feed to crash. I'm not smart enough to know where that would go, but if someone can point me in the right direction, I'm happy to try my hand at it.

chris-t-jansen avatar Nov 21 '25 23:11 chris-t-jansen

Thank you for the debugging. I wonder when the change can be merged in.

However, Markdown should not be able to generate invalid RSS/Atom. But I am not sure what kind of dialect this is, as your change in #7305 seems to be around a header-less table.

I can see in requirements.txt that pelican==4.7.1 is used and that https://github.com/getpelican/pelican/releases/tag/4.11.0 was released January 2025. Perhaps the bug is fixed in the new version? I tried to look at pelican's commits, but I did not find any smoking gun.

per-oestergaard avatar Nov 24 '25 16:11 per-oestergaard

Good investigation, all. We would welcome someone submitting a PR with a test to catch this in the future.

nellshamrell avatar Nov 25 '25 01:11 nellshamrell

I've done some testing, and this bug is an interesting one. Don't have all the details, but figured I'd share what I've found and hopefully someone smarter than me can figure out what's happening.

To put it in absolute terms, the issue is caused by a Markdown link with bolded link text inside an HTML comment that follows parsed Markdown on the same line. I realize that's a word salad, so let me explain.

Replication

To replicate, pick any line of text from a TWiR newsletter and add the following HTML comment anywhere in the middle or end:

<!-- [**TEXT**](LINK) -->

For example, picking one of the first lines from the 2025-11-19 issue, it might look like this:

This is a weekly summary of its progress and community. <!-- [**TEXT**](LINK) -->

This results in the error seen in the screenshots at the start of this issue. On my browser, the error reads: "This page contains the following errors: error on line 4 at column 86: PCDATA invalid Char value 2."

Underlying Cause

@aetherknight hit the nail on the head with their last screenshot. Looking at the page source reveals this weird string: klzzwxh:0001, which looks like this in the context of the above example:

This is a weekly summary of its progress and community. &lt;!-- &lt;a href="LINK"&gt;klzzwxh:0001&lt;/a&gt; --&gt;

[!IMPORTANT] In case you're not familiar, &lt; and &gt; are HTML entity names for < and > respectively. You'll see them again in this investigation, so keep that in mind.

In my testing, the four-digit number there changes, but the letters before it are always the same: klzzwxh:.

Depending on your machine, the characters on either end of that strange string may render differently; for me they render as question marks on GitHub, but they can also render as a space, a very tiny STX, or nothing at all. Regardless of how it renders though, the Unicode character is the same: U+0002: Start Of Text. This "Start Of Text" (STX) character is the offending one crashing the RSS feed.

If you're like me, you may immediately jump to putting that HTML comment into a Unicode inspector to find where I've hidden the STX, but it's not in there. It gets inserted somewhere in the process of turning the Markdown files that underlie each TWiR issue into RSS-compatible XML.
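Since the offending character only exists in the generated output, one way to catch it is to scan the built feed rather than the Markdown source. The following is a minimal sketch using only the Python standard library; the function names and the hand-made `sample` string are hypothetical and not part of the TWiR repository, and the character class only covers control characters plus U+FFFE/U+FFFF, not every code point XML 1.0 forbids.

```python
import re
import xml.etree.ElementTree as ET

# Characters XML 1.0 does not allow in a document: C0 control characters
# other than tab (U+0009), LF (U+000A) and CR (U+000D), plus U+FFFE/U+FFFF.
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def find_illegal_xml_chars(text):
    """Return (line, column, code point) for each XML-illegal character."""
    hits = []
    # split("\n") is used instead of splitlines(), because splitlines()
    # treats some control characters as line breaks and would hide them.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            hits.append((line_no, match.start() + 1, code_point))
    return hits


def is_well_formed(xml_text):
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hand-made sample imitating the broken feed entry; the "\x02" stands in
# for the STX character described above.
sample = (
    "<summary>This is a weekly summary of its progress and community. "
    '&lt;!-- &lt;a href="LINK"&gt;\x02klzzwxh:0001&lt;/a&gt; --&gt;</summary>'
)

print(find_illegal_xml_chars(sample))
print(is_well_formed(sample))
```

Run against the generated `atom.xml` or RSS file after a build, a check like this would report where the stray character sits even though it never appears in the Markdown source.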

This point on is mostly conjecture, but I think it's along the right lines. Since Markdown is ultimately a text-to-HTML syntax, it usually becomes HTML at the end of the day for actual rendering/display, which is why just inlining HTML works in Markdown (e.g. line breaks <br />). This means that Markdown syntax is effectively just shorthand for HTML tags, and usually gets processed that way (e.g. * becomes <li>, # becomes <h1>, etc.).

However, the exception to that should be HTML comments (<!-- ... -->). Since the contents of the comment never get rendered, they shouldn't get parsed into syntax, and should be left as they're written. We see this working properly for most of the comments in the TWiR template (of which there are helpfully many!), such as this one in the CFP - Events section where this all started:

&lt;!-- CFPs go here, use this format: * [**event name**](URL to CFP)| Date CFP closes in YYYY-MM-DD | city,state,country | Date of event in YYYY-MM-DD --&gt;

You can see the Markdown link syntax in that comment ([link text](url)), and you'll notice that it hasn't been parsed into an HTML-appropriate <a> tag.

However, looking back at the output from our replication of the issue, we can see that the Markdown link in the comment has tried to be converted into HTML:

This is a weekly summary of its progress and community. &lt;!-- &lt;a href="LINK"&gt;klzzwxh:0001&lt;/a&gt; --&gt;

Or, with the < and > substituted back in:

This is a weekly summary of its progress and community. <!-- <a href="LINK">klzzwxh:0001</a> -->

Fixes

As I said at the top of this comment, this only occurs with Markdown links that have bolded link text and sit inside an HTML comment that doesn't start at the beginning of the line. As such, the easy ways to fix this are to remove the bold on the link text, or to move the comment to its own line. Both of the following variations on the original example are perfectly acceptable and cause no errors.

This is a weekly summary of its progress and community. <!-- [TEXT](LINK) -->
This is a weekly summary of its progress and community.
<!-- [**TEXT**](LINK) -->

Going back to the commit that originally started this investigation (#7276), let's look at the line the commit added:

* [**Rustikon 2026**](https://sessionize.com/rustikon-2026/) \| CFP closes: 2025-11-24 23:59 \| Warsaw, Poland \| Event: 2025-03-19–2025-03-20 [Event website](https://www.rustikon.dev/)<!-- CFPs go here, use this format: * [**event name**](URL to CFP)| Date CFP closes in YYYY-MM-DD | city,state,country | Date of event in YYYY-MM-DD -->

Knowing what we know now, we can trivially fix this by simply removing the comment, or by applying one of the simple fixes I mentioned:

* [**Rustikon 2026**](https://sessionize.com/rustikon-2026/) \| CFP closes: 2025-11-24 23:59 \| Warsaw, Poland \| Event: 2025-03-19–2025-03-20 [Event website](https://www.rustikon.dev/)<!-- CFPs go here, use this format: * [event name](URL to CFP)| Date CFP closes in YYYY-MM-DD | city,state,country | Date of event in YYYY-MM-DD -->
* [**Rustikon 2026**](https://sessionize.com/rustikon-2026/) \| CFP closes: 2025-11-24 23:59 \| Warsaw, Poland \| Event: 2025-03-19–2025-03-20 [Event website](https://www.rustikon.dev/)
<!-- CFPs go here, use this format: * [**event name**](URL to CFP)| Date CFP closes in YYYY-MM-DD | city,state,country | Date of event in YYYY-MM-DD -->

Next Steps

Ultimately, this seems like a bug with one (or multiple) of the parsers that convert the Markdown for the TWiR issue into RSS-compatible XML. This issue should probably be forwarded to them, and then TWiR can update that dependency and avoid this problem altogether.

In the meantime, writing a test for this could be tricky, as the offending character isn't in the source code, and the bug that causes it seems to require that four separate conditions be met simultaneously. Checking for all four, and then advising authors to change one of the four would probably result in pretty vague feedback. As such, I'd probably recommend a test that simply requires HTML comments that contain Markdown links with bolded link text to start on their own line. It's still a bit messy, but hopefully less vague. I'm open to other solutions, though!
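The source-level check recommended above could look roughly like this. It is a sketch under stated assumptions, not a tested addition to the TWiR tooling: the function name and both regular expressions are hypothetical, it only handles comments that open and close on a single line, and it deliberately over-reports by flagging every such comment that has any text before it on the line.

```python
import re

# An HTML comment that opens and closes on the same line.
COMMENT = re.compile(r"<!--(.*?)-->")
# A Markdown link whose link text is bolded, e.g. [**TEXT**](LINK).
BOLD_LINK = re.compile(r"\[\*\*.+?\*\*\]\([^)]*\)")


def find_risky_comments(markdown_text):
    """Return the 1-based line numbers of comments that should be moved.

    A comment is reported when it contains a bolded Markdown link and
    anything other than whitespace precedes it on the same line.
    """
    problems = []
    for line_no, line in enumerate(markdown_text.split("\n"), start=1):
        for match in COMMENT.finditer(line):
            starts_line = line[:match.start()].strip() == ""
            if not starts_line and BOLD_LINK.search(match.group(1)):
                problems.append(line_no)
    return problems


broken = (
    "This is a weekly summary of its progress and community. "
    "<!-- [**TEXT**](LINK) -->"
)
fixed = (
    "This is a weekly summary of its progress and community.\n"
    "<!-- [**TEXT**](LINK) -->"
)

print(find_risky_comments(broken))
print(find_risky_comments(fixed))
```

Reporting the line number along with a single instruction ("move this comment to its own line") keeps the feedback to authors concrete, which is the main reason to prefer this over checking all four conditions separately.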

If you have any questions, or if I haven't explained something clearly, feel free to let me know. I'm just a random dude doing his best to pitch in.

chris-t-jansen avatar Nov 25 '25 17:11 chris-t-jansen

Looks like this issue might get resolved by Python-Markdown/markdown#1572, at which point TWiR can update the dependency and not have this problem.

chris-t-jansen avatar Dec 01 '25 14:12 chris-t-jansen