selectolax text() always decodes HTML entities

As far as I can tell, there is currently no easy way to extract text while preserving the original HTML entity encoding.

Having that option would be handy!

from selectolax.parser import HTMLParser
from html import escape

html = HTMLParser('<div>&#x3C;test&#x3E;</div>')
print(html.text())
print(escape(html.text()))
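
To illustrate why re-escaping is not a substitute, here is a minimal standard-library sketch. It assumes `unescape` is a fair stand-in for the decoding that `text()` performs, and the variable names are made up for the illustration:

```python
from html import escape, unescape

original = '&#x3C;test&#x3E;'

# Stand-in for the entity decoding done during text extraction (assumption).
decoded = unescape(original)

# Re-escaping yields named entities, not the numeric ones from the source.
re_escaped = escape(decoded)

print(decoded)
print(re_escaped)
print(re_escaped == original)
```

The round trip loses the original spelling of the entities, which is why access to the raw source bytes would be needed.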

Jun 11 '20 13:06 kiwijam

I don't think I can control this, since Modest performs some preprocessing, but I could be wrong.

Jun 12 '20 15:06 rushter

@kiwijam @rushter

In Modest, tokens store buffer positions for attributes. You can use these to get the raw data.

Jun 13 '20 15:06 lexborisov

Added limited support for this in 0.2.7.

>>> html_parser = HTMLParser('<div>&#x3C;test&#x3E;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&#x3C;test&#x3E;'

This is limited to text nodes only for now.

Aug 15 '20 17:08 rushter

Thanks for your work. How can I help maintain the library? I would like to contribute so that more features can be added.

Aug 15 '20 17:08 ichux

Well, it's open source. You are welcome to propose new features or improve existing ones.

You can improve the new raw_value feature to support arbitrary nodes. That's a pretty easy task, but you will need to be familiar with the C language and the Modest library.

Aug 15 '20 17:08 rushter
