Merge Author Preview not rendering html entities correctly (e.g. quote, apostrophe)
Problem
The descriptive text for an author record is shown differently between the author page and the merge authors page. On the latter, the rendering of escaped characters is defective, and the former is truncated.
Reproducing the bug
- Go to https://openlibrary.org/search/authors?q=Shakesp%2A&sort=work_count+desc
- Examine the first description
- Click “Merge authors”
- Compare the first description
-
Expected behavior: They should be the same.
-
Actual behavior: They differ, suggesting an opportunity to DRY up.
Context
- Browser (Chrome, Safari, Firefox, etc): Safari
- OS (Windows, Mac, etc): iPadOS
- Logged in (Y/N): Y
- Environment (prod, dev, local): prod
Breakdown
Requirements Checklist
- [ ]
Related files
Stakeholders
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
I would like to pick this up! I am currently working on a proposal for the GSoc Ideas for OpenLibrary and this would be a good way to get a hang of the codebase @mekarpeles
Hi @gsreeja11 , go for it :+1:
The way to tackle this is to update https://github.com/internetarchive/openlibrary/blob/baf8abe5010821203d6e728ab52108a0f5fe6514/openlibrary/templates/merge/authors.html#L90 to pass in html_unescape(a.bio) instead of the raw a.bio.
This makes use of the built-in python method, https://docs.python.org/3/library/html.html#html.unescape
Note you will need to add a new entry for html_unescape to this list here to make it available in templates:
https://github.com/internetarchive/openlibrary/blob/f56841a768070747992b5c9275154af8fbba6fe5/openlibrary/plugins/upstream/utils.py#L1669-L1677
Thanks a lot @cdrini for the suggestions, will raise a PR tomorrow, I have the local set up done :)
@cdrini a question here, when I add the unescape to the temple like web.template.Template.globals.update(
{
'HTML': HTML,
'request': Request(),
'logger': logging.getLogger("openlibrary.template"),
'sum': sum,
'websafe': web.websafe,
'html_unescape': unescape
}
And use it as below
<p>${ macros.TruncateString(html_unescape(a.bio), 250) }</p>
I still see the ;quot etc. rendering issue.
However if I just use
<p>${ html_unescape(a.bio) }</p>
I see everything correctly as expected , quotes are properly rendered etc.
Is there something I need to tackle in macros.TruncateString ? (or perhaps because of Tuples?) or somewhere where it is getting escaped again
Has this issue been addressed? I've been perusing the comments looking for a trail to a pr that maybe have addressed it. If not, I'm interested in pursuing!
Hi @rivtechprojects , @gsreeja11 is working on this one :+1:
Apologies for the delay @gsreeja11 . Ah that's a tricky case, I'm not sure why it's getting un-escaped again 🤔 I thought the issue was in a.bio itself, but it seems like the encoding breaks for another reason while moving through the different passes.
Hmm, maybe we can do something like this:
<p>$:html_safe_unescape(macros.TruncateString(html_unescape(a.bio), 250))</p>
Where html_safe_unescape is a new method that does:
@public
def html_safe_unescape(s: str) -> str:
"""Un-escapes everything except < and > to avoid XSS"""
return html.escape(html.unescape(s), quote=False)
That should hopefully do the trick! And you can add this method to openlibrary/plugins/upstream/utils.py
Actually you might be able to just switch to using:
<p>$:macros.TruncateString(a.bio, 250)</p>
By using $:, we output raw HTML. This might actually work... Just run a test to make sure we don't introduce an XSS vulnerability. You can test this by adding some javascript to an author description, like <script>alert('hi')</script>. If you the alert shows when the description is rendered, then it didn't work, and we'll need to instead do
<p>$:html_safe_unescape(macros.TruncateString(a.bio, 250))</p>
Hi! I see this issue was started by @gsreeja11 but hasn't had activity since May. I'd like to pick this up and complete the work.
@cdrini provided a clear solution - I'll implement the html_safe_unescape approach and test for XSS vulnerabilities as suggested. Will have a PR ready soon!
Sweet, go for it @Anuradha-Agrawal-07 ! Although not I believe this was trickier than it appeared ; I thought there was a draft pull request about this somewhere but can't seem to find it! Let us know if you hit any issues, @Anuradha-Agrawal-07 !
Hi @cdrini, I've been following this thread and noticed there hasn't been any recent activity. I'd love to pick this up for my open-source class if it's still available. Thanks!