openlibrary icon indicating copy to clipboard operation
openlibrary copied to clipboard

Merge Author Preview not rendering html entities correctly (e.g. quote, apostrophe)

Open LeadSongDog opened this issue 9 months ago • 7 comments

Problem

The descriptive text for an author record is shown differently between the author page and the merge authors page. On the latter, the rendering of escaped characters is defective, and the former is truncated.

Image Image

Reproducing the bug

  1. Go to https://openlibrary.org/search/authors?q=Shakesp%2A&sort=work_count+desc
  2. Examine the first description
  3. Click “Merge authors”
  4. Compare the first description
  • Expected behavior: They should be the same.

  • Actual behavior: They differ, suggesting an opportunity to DRY up.

Context

  • Browser (Chrome, Safari, Firefox, etc): Safari
  • OS (Windows, Mac, etc): iPadOS
  • Logged in (Y/N): Y
  • Environment (prod, dev, local): prod

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.

LeadSongDog avatar Mar 28 '25 16:03 LeadSongDog

I would like to pick this up! I am currently working on a proposal for the GSoc Ideas for OpenLibrary and this would be a good way to get a hang of the codebase @mekarpeles

gsreeja11 avatar Apr 06 '25 08:04 gsreeja11

Hi @gsreeja11 , go for it :+1:

The way to tackle this is to update https://github.com/internetarchive/openlibrary/blob/baf8abe5010821203d6e728ab52108a0f5fe6514/openlibrary/templates/merge/authors.html#L90 to pass in html_unescape(a.bio) instead of the raw a.bio.

This makes use of the built-in python method, https://docs.python.org/3/library/html.html#html.unescape

Note you will need to add a new entry for html_unescape to this list here to make it available in templates:

https://github.com/internetarchive/openlibrary/blob/f56841a768070747992b5c9275154af8fbba6fe5/openlibrary/plugins/upstream/utils.py#L1669-L1677

cdrini avatar Apr 07 '25 16:04 cdrini

Thanks a lot @cdrini for the suggestions, will raise a PR tomorrow, I have the local set up done :)

gsreeja11 avatar Apr 07 '25 19:04 gsreeja11

@cdrini a question here, when I add the unescape to the temple like web.template.Template.globals.update(

        {
            'HTML': HTML,
            'request': Request(),
            'logger': logging.getLogger("openlibrary.template"),
            'sum': sum,
            'websafe': web.websafe,
            'html_unescape': unescape
        }

And use it as below <p>${ macros.TruncateString(html_unescape(a.bio), 250) }</p>

I still see the ;quot etc. rendering issue.

However if I just use <p>${ html_unescape(a.bio) }</p> I see everything correctly as expected , quotes are properly rendered etc.

Is there something I need to tackle in macros.TruncateString ? (or perhaps because of Tuples?) or somewhere where it is getting escaped again

gsreeja11 avatar Apr 10 '25 18:04 gsreeja11

Has this issue been addressed? I've been perusing the comments looking for a trail to a pr that maybe have addressed it. If not, I'm interested in pursuing!

deyshift avatar Apr 29 '25 15:04 deyshift

Hi @rivtechprojects , @gsreeja11 is working on this one :+1:

Apologies for the delay @gsreeja11 . Ah that's a tricky case, I'm not sure why it's getting un-escaped again 🤔 I thought the issue was in a.bio itself, but it seems like the encoding breaks for another reason while moving through the different passes.

Hmm, maybe we can do something like this:

   <p>$:html_safe_unescape(macros.TruncateString(html_unescape(a.bio), 250))</p>

Where html_safe_unescape is a new method that does:

@public
def html_safe_unescape(s: str) -> str:
    """Un-escapes everything except < and > to avoid XSS"""
    return html.escape(html.unescape(s), quote=False)

That should hopefully do the trick! And you can add this method to openlibrary/plugins/upstream/utils.py

cdrini avatar May 08 '25 14:05 cdrini

Actually you might be able to just switch to using:

 <p>$:macros.TruncateString(a.bio, 250)</p> 

By using $:, we output raw HTML. This might actually work... Just run a test to make sure we don't introduce an XSS vulnerability. You can test this by adding some javascript to an author description, like <script>alert('hi')</script>. If you the alert shows when the description is rendered, then it didn't work, and we'll need to instead do

 <p>$:html_safe_unescape(macros.TruncateString(a.bio, 250))</p> 

cdrini avatar May 08 '25 14:05 cdrini

Hi! I see this issue was started by @gsreeja11 but hasn't had activity since May. I'd like to pick this up and complete the work.

@cdrini provided a clear solution - I'll implement the html_safe_unescape approach and test for XSS vulnerabilities as suggested. Will have a PR ready soon!

Anuradha-Agrawal-07 avatar Oct 07 '25 06:10 Anuradha-Agrawal-07

Sweet, go for it @Anuradha-Agrawal-07 ! Although not I believe this was trickier than it appeared ; I thought there was a draft pull request about this somewhere but can't seem to find it! Let us know if you hit any issues, @Anuradha-Agrawal-07 !

cdrini avatar Oct 13 '25 20:10 cdrini

Hi @cdrini, I've been following this thread and noticed there hasn't been any recent activity. I'd love to pick this up for my open-source class if it's still available. Thanks!

linhkhanhhoang avatar Nov 17 '25 23:11 linhkhanhhoang