parsel
Add option to retrieve text content
As a scrapy user, I often want to extract the text content of an element. The default option in parsel is to either use the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this creates unwanted behavior. E.g.:
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']
With a basic understanding of XML and XPath, this behavior is expected. But working around it adds overhead, and it often frustrates new users. There is a series of questions on Stack Overflow as well as on the scrapy bug tracker:
- https://stackoverflow.com/questions/33088402/extracting-text-within-em-tag-in-scrapy
- https://stackoverflow.com/questions/23156780/how-can-i-get-all-the-plain-text-from-a-website-with-scrapy
- https://stackoverflow.com/questions/39511122/extract-nested-tags-with-other-text-data-as-string-in-scrapy
- https://github.com/scrapy/scrapy/issues/3488
lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:
- Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
- Or add a parameter to .extract*()/.get(), similar to the proposal in #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.
Would such an addition be welcome? I could prepare a patch.
Hey @frederik-elwert! This is being worked on here: https://github.com/scrapy/parsel/pull/127 :)
Please consider this a basic feature and add it.
+1
Any progress on this issue?
Not much, but I've merged master to #127 yesterday, so the PR is up-to-date now. I think feature-wise it is ready; I'm happy with the implementation. But it needs some cleanup - more docs and tests.
Any progress on this issue?
This still hasn't been addressed?
One working option is to chain css calls, applying a *::text query to the selector that contains the text we want to scrape.
Applied to the example HTML from the issue description, this looks like:
from parsel import Selector
text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''
sel = Selector(text=text)
# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']
print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']
print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']
It is not perfect, but (at least for the use cases I had) it is already enough to cover this and similar cases (without digging deep into lxml internals).
lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes...
I just realized that Selector.root is lxml's HTML element object, created by its create_root_node method. This means that if the parser type is html, the mentioned text_content (as well as any other lxml method) can be applied here:
print(sel.root.text_content())
'''
This is the new trend!
Published by newbieon Sept 17
'''
Cases where a Selector query returns a SelectorList are a bit more complicated:
print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']
print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']
Binding lxml's text_content to the Selector and SelectorList types looks like the most practical approach here.
As far as I understand, both options mentioned above were technically applicable in 2018, when this ticket was created.
Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its text() method. It's got deep, separator, and strip parameters. It's also incredibly fast. The major drawback is that it doesn't support XPath.