Certain characters in inline code incorrectly parsed (e.g., `&`)

Open nschloe opened this issue 2 years ago • 5 comments

MWE:

import mistune

# Build a parser that returns the AST (a list of token dicts) instead of HTML.
markdown = mistune.create_markdown(renderer="ast")

md = r"`&<>`"
tokens = markdown(md)

print(tokens)

Output:

[{'type': 'paragraph', 'children': [{'type': 'codespan', 'raw': '&amp;&lt;&gt;'}]}]

nschloe avatar Nov 21 '23 08:11 nschloe

We also encountered this. The cause is 8452faf345a152a149e7f79243fd9693a06ed0e9, more specifically this change, I think. Putting HTML escaping into the parser stage, independently of the output format, is incorrect.
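To illustrate why escaping at parse time is a problem, here is a minimal sketch using only the standard library. It does not call mistune: `html.escape` stands in for mistune's own escape helper, and the token dicts and the `render_html` function are hypothetical, shaped like the AST output in the MWE above.

```python
import html

raw_code = "&<>"

# Escaping at parse time: the AST token already carries HTML entities,
# so every consumer of the AST (not only HTML output) sees "&amp;&lt;&gt;".
token_escaped_early = {"type": "codespan", "raw": html.escape(raw_code)}

# Escaping deferred to the output stage: the AST keeps the source text.
token_raw = {"type": "codespan", "raw": raw_code}


def render_html(token):
    """Hypothetical HTML renderer: escape only when producing HTML."""
    return "<code>" + html.escape(token["raw"]) + "</code>"


print(token_escaped_early)
print(token_raw)
print(render_html(token_raw))
```

With the second approach the HTML output is still escaped, while an AST consumer receives the original characters.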

torokati44 avatar May 29 '24 12:05 torokati44

I also encountered this, originally found in https://github.com/omnilib/sphinx-mdinclude/issues/19. I did some digging and found that the issue indeed lies in mistune.util.escape_url, as @torokati44 points out, but only because its behavior differs depending on whether html is installed (this is decided at import time, at least in mistune 2.0.5): https://github.com/lepture/mistune/blob/cb580e89e67ef9b4827daf43afd2058bdf5e58d2/mistune/util.py#L22-L31

Some code to reproduce:

from urllib.parse import quote
import html

link = "https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status"

# code from: def escape_url(link):
safe = (
    ':/?#@'           # gen-delims - '[]' (rfc3986)
    '!$&()*+,;='      # sub-delims - "'" (rfc3986)
    '%'               # leave already-encoded octets alone
)
out_nonhtml = quote(link.encode('utf-8'), safe=safe)
out_withhtml = html.escape(quote(html.unescape(link), safe=safe))
out_withhtml_noescape = quote(html.unescape(link), safe=safe)

print(out_nonhtml)
print(out_withhtml)
print(out_withhtml_noescape)

This gives different results. The first one is returned if html is not installed, the second one if html is installed. The third one is correct again:

https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&amp;metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status

Therefore, what I think should fix the issue is to remove the html.escape() from the escape_url function so the behaviour is always consistent and according to expectations.
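A minimal sketch of that proposal, assuming the same safe set as in the reproduction above. The name `escape_url_sketch` is hypothetical and this is not mistune's actual code; it only shows percent-encoding without the trailing html.escape():

```python
from urllib.parse import quote
import html

# Same characters as the `safe` tuple in the reproduction above.
SAFE = ":/?#@" "!$&()*+,;=" "%"


def escape_url_sketch(link: str) -> str:
    """Hypothetical variant: percent-encode the URL, but leave HTML
    escaping to whichever renderer actually emits HTML."""
    return quote(html.unescape(link), safe=SAFE)


link = "https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status"
result = escape_url_sketch(link)
print(result)

# An input that already contains an HTML entity is normalized to a plain "&".
already_escaped = escape_url_sketch("https://example.com/?a=1&amp;b=2")
print(already_escaped)
```

An HTML renderer would then still need to escape the URL when writing it into an attribute, which is the trade-off of moving the escaping out of this function.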

Update for newer mistune versions: the escape_url function looks different in the master branch (>3.0.0): https://github.com/lepture/mistune/blob/93fd197c6c24353011974110378b738596cde5a1/src/mistune/util.py#L32-L39. In this case, omitting escape does the trick.

veenstrajelmer avatar Jul 05 '24 10:07 veenstrajelmer

As of today, & still gets converted to &amp; in the master branch. I am not sure whether escape should simply be removed from escape_url, since that does not seem to make sense given the function name. However, this was the previous behavior when html was not installed (return quote(link.encode('utf-8'), safe=safe)), so it might well be the fix. Either way, it would be great if & in markdown URLs were not converted.

veenstrajelmer avatar Oct 04 '24 09:10 veenstrajelmer

I'm going down the rabbit hole from jupyter-book > sphinx > myst > ??? here, maybe? I'm getting &amp; in markdown links that include query parameters. Any workaround?

itcarroll avatar Oct 23 '24 16:10 itcarroll

I think what you are encountering is indeed caused by the bug described in this issue. No workaround as far as I know. @lepture, could you share your thoughts on this discussion?

veenstrajelmer avatar Oct 23 '24 17:10 veenstrajelmer

@itcarroll actually, Jupyter's MyST uses markdown-it-py, not Mistune.

mentalisttraceur avatar Oct 31 '24 16:10 mentalisttraceur

@veenstrajelmer I think you've spotted a separate issue. Similar, but independent code paths. The premature HTML-escaping of inline code just uses escape, totally bypassing escape_url.

mentalisttraceur avatar Oct 31 '24 16:10 mentalisttraceur

@mentalisttraceur ok fair, but will this issue be resolved? I see limited response from the maintainers of this repository, so I am not sure what to expect. If it is a different issue, does it help if I create a separate issue, or will it not matter too much? I am a bit hesitant to put even more investigation time into this package because of the limited response. My issue is just a single readme badge that does not work in the docs of all my packages (e.g. https://deltares.github.io/dfm_tools). I can relatively easily convert all the readmes from markdown to rst, but I only raised this issue since I prefer markdown.

veenstrajelmer avatar Nov 01 '24 12:11 veenstrajelmer

@veenstrajelmer "not having this discussion here", would matter at least for us that are "actively" participating on this issue 🤷 You can also clearly see that someone decided to step up and fix this issue (after this issue being stale for a while).

I would be optimistic - as long as issues don't get "more tangled".

stdedos avatar Nov 01 '24 13:11 stdedos

I was not intending to tangle issues; I just noticed that this issue seems to be pretty much what I am running into. I do not see how either problem is caused, and therefore I also do not see how they are unrelated. To me they both come down to incorrect character escaping in mistune, but apparently they have different causes. Either way, I have created another issue here: https://github.com/lepture/mistune/issues/394.

veenstrajelmer avatar Nov 01 '24 13:11 veenstrajelmer