
🚨 Jsoup - Entities are not recognized properly and `&shy` is not treated like other entities

Open Muthukirthan opened this issue 1 year ago • 0 comments

Case 1

Input: `<p>a&nbspc</p>`. Browser result: `a c`. `&nbsp` is recognized as the `&nbsp;` HTML entity.

jsoup parsed content: `<p>a&amp;nbspc</p>`. Browser result: `a&nbspc`. `&nbsp` is not recognized, so the browser renders a different result.


Case 2

Input: `<p>a&nbsp&shyc</p>`. Browser result: `a ­c`. `&nbsp` and `&shy` are recognized as the `&nbsp;` and `&shy;` HTML entities, respectively.

jsoup parsed content: `<p>a&nbsp;&amp;shyc</p>`. Browser result: `a &shyc`. `&nbsp` is recognized (possibly because it is followed by a `&` character), but `&shy` is not recognized as `&shy;`, so the browser renders a different result.


Case 3

Input: `<p>a&shyc&nbsp</p>`. Browser result: `a­c `. `&nbsp` and `&shy` are recognized as the `&nbsp;` and `&shy;` HTML entities, respectively.

jsoup parsed content: `<p>a&amp;shyc&nbsp;</p>`. Browser result: `a&shyc `. `&nbsp` is recognized (possibly because it is followed by `<` rather than a letter), but `&shy` is not recognized as `&shy;`, so the browser renders a different result.


All three cases produce unexpected results. Additionally, `&shy` gives different results compared to the other entities.
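To make the expected (browser) behaviour concrete, here is a minimal, standard-library-only sketch of longest-prefix matching for entities written without a trailing semicolon. It is not jsoup's implementation and not the full HTML algorithm: the class name `LegacyEntitySketch`, the method `decode`, and the five-entry table are assumptions made for illustration, and the special handling that applies inside attribute values is ignored.

```java
import java.util.Map;

public class LegacyEntitySketch {
    // Illustrative subset of the named references that HTML accepts
    // without a trailing ';'. The real table is much larger.
    private static final Map<String, String> LEGACY = Map.of(
            "nbsp", "\u00A0",
            "shy", "\u00AD",
            "amp", "&",
            "lt", "<",
            "gt", ">");

    // Decodes entity references in text content, matching the longest
    // known entity name that is a prefix of the run following '&'.
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Collect the letters and digits that follow the '&'.
            int end = i + 1;
            while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) {
                end++;
            }
            String run = text.substring(i + 1, end);
            // Try the longest prefix first, then progressively shorter ones.
            String match = null;
            for (int len = run.length(); len > 0; len--) {
                String candidate = run.substring(0, len);
                if (LEGACY.containsKey(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                out.append('&'); // not an entity in this sketch: keep the '&' literally
                i++;
                continue;
            }
            out.append(LEGACY.get(match));
            i += 1 + match.length();
            // An optional ';' directly after the name is part of the reference.
            if (i < text.length() && text.charAt(i) == ';') {
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The text content of the three inputs from the cases above.
        System.out.println(decode("a&nbspc").equals("a\u00A0c"));
        System.out.println(decode("a&nbsp&shyc").equals("a\u00A0\u00ADc"));
        System.out.println(decode("a&shyc&nbsp").equals("a\u00ADc\u00A0"));
    }
}
```

Under these assumptions, `&nbspc` decodes to a no-break space followed by `c`, and `&shyc` to a soft hyphen followed by `c`, which matches the browser results reported in all three cases.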

Parser: HTML parser. Escape mode: the result is the same for both `base` and `extended`. In `xhtml` escape mode the `nbsp` entity is replaced by `&#xa0;`, but the result is otherwise the same.

I also raised a related question about entities: https://github.com/jhy/jsoup/issues/2206

@jhy

Muthukirthan, Oct 06 '24 07:10