selectolax

text() always decodes HTML entities

Open kiwijam opened this issue 5 years ago • 5 comments

As far as I can tell, there's no easy way to extract text but preserve HTML entity encoding at the moment.

Having that option would be handy!

from selectolax.parser import HTMLParser
from html import escape

html = HTMLParser('<div>&#x3C;test&#x3E;</div>')
print(html.text())
print(escape(html.text()))
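
To illustrate why re-escaping is not the same as preserving the original encoding, here is a minimal standard-library sketch. It assumes that `text()` behaves like `html.unescape` for this input; the variable names are only illustrative.

```python
from html import escape, unescape

original = '&#x3C;test&#x3E;'

# Decoding turns the numeric character references into literal characters.
decoded = unescape(original)

# Re-escaping produces named entities, not the numeric form from the source.
re_encoded = escape(decoded)

print(decoded)                 # <test>
print(re_encoded)              # &lt;test&gt;
print(re_encoded == original)  # False
```

The round trip gives equivalent markup, but the original spelling of the references is lost.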

kiwijam avatar Jun 11 '20 13:06 kiwijam

I don't think I can control this, since Modest performs some preprocessing, but I could be wrong.

rushter avatar Jun 12 '20 15:06 rushter

@kiwijam @rushter

In Modest we have buffer positions for attributes in tokens. You can use these to get the raw data.

lexborisov avatar Jun 13 '20 15:06 lexborisov

Added limited support for this in 0.2.7.

>>> html_parser = HTMLParser('<div>&#x3C;test&#x3E;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&#x3C;test&#x3E;'

This is limited to text nodes only for now.
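
For cases that `raw_value` does not cover, one possible workaround outside selectolax is Python's standard-library `html.parser` with `convert_charrefs=False`, which reports character and entity references without decoding them. The sketch below is not part of selectolax or Modest; `RawTextCollector` and `raw_text` are hypothetical names chosen for illustration, and it assumes every reference in the input ends with a semicolon.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class RawTextCollector(StdlibHTMLParser):
    """Collects text content while keeping references undecoded."""

    def __init__(self):
        # convert_charrefs=False routes references to the handlers below
        # instead of decoding them into handle_data().
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text between tags.
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp; (name == 'amp').
        self.parts.append(f'&{name};')

    def handle_charref(self, name):
        # Numeric reference such as &#x3C; (name == 'x3C').
        self.parts.append(f'&#{name};')


def raw_text(markup):
    """Return the text of `markup` with references left as written."""
    collector = RawTextCollector()
    collector.feed(markup)
    collector.close()
    return ''.join(collector.parts)


print(raw_text('<div>&#x3C;test&#x3E;</div>'))  # &#x3C;test&#x3E;
```

This walks nested elements as well, but it is a different parser from Modest, so results on malformed HTML may differ from what selectolax produces.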

rushter avatar Aug 15 '20 17:08 rushter

Thanks for the work you've done. How can I join in maintaining the library? I would like to help so that more features can be added.

ichux avatar Aug 15 '20 17:08 ichux

Well, it's open source. You are welcome to propose new features or improve existing ones.

You can improve the new raw_value feature to support arbitrary nodes. That's a pretty easy task, but you will need to be familiar with the C language and the Modest library.

rushter avatar Aug 15 '20 17:08 rushter