selectolax
text() always decodes HTML entities
As far as I can tell, there's no easy way to extract text but preserve HTML entity encoding at the moment.
Having that option would be handy!
from html import escape
from selectolax.parser import HTMLParser
html = HTMLParser('<div>&lt;test&gt;</div>')
print(html.text())  # the entities come back decoded
print(escape(html.text()))  # workaround: re-escape the decoded text
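To show why re-escaping is only a partial workaround, here is a minimal stdlib-only sketch. It assumes `html.unescape` is a fair stand-in for the decoding that `text()` performs. The helper name `round_trip` is made up for this illustration and is not part of selectolax.

```python
from html import escape, unescape


def round_trip(source: str) -> str:
    """Decode entities, then re-escape, mimicking escape(parser.text())."""
    decoded = unescape(source)  # assumption: stands in for text()
    return escape(decoded)      # the workaround from the snippet above


# Three different encodings of the same text.
sources = ["&lt;test&gt;", "&#x3C;test&#x3E;", "&#60;test&#62;"]

for source in sources:
    # Each input is printed next to what survives the round trip.
    print(f"{source!r} -> {round_trip(source)!r}")
```

All three inputs come back as `&lt;test&gt;`, so the original spelling of the entities is lost once the text has been decoded. That is why the request is for access to the raw value rather than a re-escaping step.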
I don't think I can control this, since Modest performs some preprocessing, but I could be wrong.
@kiwijam @rushter
In Modest, tokens carry buffer positions for attributes. You can use these to get the raw data.
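The buffer-position idea can be illustrated with a toy Python sketch. This is not Modest's actual token layout: `TextToken`, `text_tokens`, and `raw_slice` are hypothetical names, and the scanner only handles trivial markup. The point is that a token which remembers where it started in the input buffer can always give back the undecoded bytes.

```python
from dataclasses import dataclass
from html import unescape
from typing import List


@dataclass
class TextToken:
    # Hypothetical token: an offset and length into the original input buffer.
    begin: int
    length: int


def text_tokens(buf: bytes) -> List[TextToken]:
    """Toy scanner: record the byte range of each text run between tags."""
    tokens = []
    pos = 0
    while pos < len(buf):
        if buf[pos:pos + 1] == b"<":
            # Skip over a tag; stop scanning if it is never closed.
            end = buf.find(b">", pos)
            pos = len(buf) if end == -1 else end + 1
        else:
            # A text run extends to the next tag or the end of the buffer.
            end = buf.find(b"<", pos)
            if end == -1:
                end = len(buf)
            tokens.append(TextToken(begin=pos, length=end - pos))
            pos = end
    return tokens


def raw_slice(buf: bytes, token: TextToken) -> bytes:
    """Return the untouched bytes that the token points at."""
    return buf[token.begin:token.begin + token.length]


buf = b"<div>&#x3C;test&#x3E;</div>"
token = text_tokens(buf)[0]
print(raw_slice(buf, token))                    # raw bytes, entities intact
print(unescape(raw_slice(buf, token).decode())) # decoded text
```

Because the decoded text and the raw slice both come from the same offsets, a parser can expose both without storing the text twice.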
Added limited support for this in 0.2.7.
>>> html_parser = HTMLParser('<div>&lt;test&gt;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&lt;test&gt;'
This is limited to text nodes only for now.
Thanks for your work. How can I help maintain the library? I would like to contribute so that more features can be added.
Well, it's open source. You are welcome to propose new features or improve existing ones.
For example, you could extend the new raw_value feature to support arbitrary nodes.
That's a pretty easy task, but you will need to be familiar with C and the Modest library.