Certain characters in inline code incorrectly parsed (e.g., `&`)
MWE:
import mistune
from mistune.core import BlockState
markdown = mistune.create_markdown(renderer="ast")
md = r"`&<>`"
tokens = markdown(md)
print(tokens)
Output:
[{'type': 'paragraph', 'children': [{'type': 'codespan', 'raw': '&amp;&lt;&gt;'}]}]
We also encountered this. The cause is 8452faf345a152a149e7f79243fd9693a06ed0e9, more specifically this change, I think. Putting HTML escaping into the parser stage, independently of the output format, is incorrect.
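To illustrate why escaping in the parser is a problem, here is a minimal standard-library sketch. It does not use mistune's real code: `parse_codespan`, `render_html` and `render_plain` are hypothetical stand-ins for a parser step and two renderers. It only shows that escaping early leaks HTML entities into non-HTML output, and that an HTML renderer that escapes on its own would then escape twice.

```python
import html


def parse_codespan(text, escape_in_parser):
    """Hypothetical parser step: build an AST-like token for inline code."""
    raw = html.escape(text) if escape_in_parser else text
    return {"type": "codespan", "raw": raw}


def render_html(token):
    """Hypothetical HTML renderer: escapes at output time."""
    return "<code>" + html.escape(token["raw"]) + "</code>"


def render_plain(token):
    """Hypothetical non-HTML renderer: emits the raw text unchanged."""
    return token["raw"]


early = parse_codespan("&<>", escape_in_parser=True)   # escaping in the parser
late = parse_codespan("&<>", escape_in_parser=False)   # escaping left to the renderer

print(render_plain(early))  # &amp;&lt;&gt;  (entities leak into non-HTML output)
print(render_plain(late))   # &<>
print(render_html(late))    # <code>&amp;&lt;&gt;</code>
print(render_html(early))   # <code>&amp;amp;&amp;lt;&amp;gt;</code>  (double-escaped)
```

With escaping deferred to the renderer, the AST keeps the text exactly as written and each output format applies only the escaping it needs.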
I also encountered this, originally found in https://github.com/omnilib/sphinx-mdinclude/issues/19. I did some digging and found that the issue indeed lies in mistune.util.escape_url, as @torokati44 points out, but only because its behavior differs depending on whether the html module is available (this is decided at import time, at least in mistune 2.0.5):
https://github.com/lepture/mistune/blob/cb580e89e67ef9b4827daf43afd2058bdf5e58d2/mistune/util.py#L22-L31
Some code to reproduce:
from urllib.parse import quote
import html
link = "https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status"
# code from: def escape_url(link):
safe = (
    ':/?#@' # gen-delims - '[]' (rfc3986)
    '!$&()*+,;=' # sub-delims - "'" (rfc3986)
    '%' # leave already-encoded octets alone
)
out_nonhtml = quote(link.encode('utf-8'), safe=safe)
out_withhtml = html.escape(quote(html.unescape(link), safe=safe))
out_withhtml_noescape = quote(html.unescape(link), safe=safe)
print(out_nonhtml)
print(out_withhtml)
print(out_withhtml_noescape)
This gives different results. The first one is returned if html is not installed, the second one if html is installed. The third one is correct again:
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&amp;metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status
Therefore, what I think should fix the issue is to remove the html.escape() from the escape_url function so the behaviour is always consistent and according to expectations.
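As a sketch of that proposal (not a patch against mistune itself), the function below reuses the safe set from the snippet above and simply drops the html.escape() call. The name `escape_url_without_html_escape` is made up for illustration, and the sketch assumes that any HTML attribute escaping is left to whatever renders the link.

```python
from urllib.parse import quote
import html

# Same safe set as in the snippet above.
SAFE = (
    ':/?#@'       # gen-delims - '[]' (rfc3986)
    '!$&()*+,;='  # sub-delims - "'" (rfc3986)
    '%'           # leave already-encoded octets alone
)


def escape_url_without_html_escape(link):
    """Hypothetical variant of escape_url: percent-encode only, no html.escape()."""
    return quote(html.unescape(link), safe=SAFE)


link = "https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status"

proposed = escape_url_without_html_escape(link)
with_escape = html.escape(proposed)  # what the html.escape() branch returns

print(proposed)     # '&' is preserved
print(with_escape)  # '&' becomes '&amp;'
```

Because html.unescape() is still applied first, a link that already contains `&amp;` is normalized back to `&` rather than being escaped a second time.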
Update for newer mistune versions
The escape_url function looks different in the master branch (>3.0.0):
https://github.com/lepture/mistune/blob/93fd197c6c24353011974110378b738596cde5a1/src/mistune/util.py#L32-L39
In this case omitting escape does the trick.
As of today, in the master branch, & still gets converted to &amp;. I am not sure whether escape should simply be removed from escape_url, since that does not seem to fit the function name. However, this was the previous behavior when html was not available (return quote(link.encode('utf-8'), safe=safe)), so it might well be the fix. Either way, it would be great if & in markdown URLs were not converted.
I'm going down the rabbit hole from jupyter-book > sphinx > myst > ??? here, maybe? I'm getting &amp; in markdown links that include query parameters. Any workaround?
I think what you are encountering is indeed caused by the bug described in this issue. No workaround as far as I know. @lepture, could you share your thoughts on this discussion?
@itcarroll actually, Jupyter's MyST uses markdown-it-py, not Mistune.
@veenstrajelmer I think you've spotted a separate issue. Similar, but independent code paths. The premature HTML-escaping of inline code just uses escape, totally bypassing escape_url.
@mentalisttraceur ok fair, but will this issue be resolved? I see limited response from the maintainers of this repository, so I am not sure what to expect. If it is a different issue, does it help if I create a separate issue, or will it not matter much? I am a bit hesitant to put even more investigation time into this package because of the limited response. My issue is just a single readme badge that does not work in the docs of all my packages (e.g. https://deltares.github.io/dfm_tools). I can relatively easily convert all the readmes from markdown to rst, but I only raised this issue since I prefer markdown.
@veenstrajelmer "not having this discussion here", would matter at least for us that are "actively" participating on this issue 🤷 You can also clearly see that someone decided to step up and fix this issue (after this issue being stale for a while).
I would be optimistic - as long as issues don't get "more tangled".
I was not intending to tangle issues; I just noticed that this issue seems pretty much like what I am running into. I do not see how they are caused, and therefore also do not see how they are unrelated. To me they both come down to incorrect character escaping in mistune, but apparently they have different causes. Either way, I have created another issue here: https://github.com/lepture/mistune/issues/394.