parsel Add option to retrieve text content

As a Scrapy user, I often want to extract the text content of an element. The default options in parsel are the ::text pseudo-element and XPath text(). Both have the downside that they return each text node as a separate item and skip text that sits inside child elements. When the element contains child elements, this produces unwanted results. E.g.:

<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>

>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']

With a basic understanding of XML and XPath, this behavior is expected. Working around it takes extra code, though, and it often frustrates new users. There is a series of questions about it on Stack Overflow as well as on the Scrapy bug tracker.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:

Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
Or add a parameter to .extract*()/.get(), similar to the proposal in #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.

Would such an addition be welcome? I could prepare a patch.

Nov 16 '18 20:11 frederik-elwert

Hey @frederik-elwert! This is being worked on here: https://github.com/scrapy/parsel/pull/127 :)

Nov 17 '18 12:11 kmike

Please consider this a basic feature and add it.

Feb 07 '20 09:02 kamrankausar

+1

May 23 '21 03:05 joecabezas

Any progress on this issue?

Feb 04 '22 10:02 bblanchon

Not much, but I've merged master to #127 yesterday, so the PR is up-to-date now. I think feature-wise it is ready; I'm happy with the implementation. But it needs some cleanup - more docs and tests.

Feb 10 '22 11:02 kmike

Any progress on this issue?

Aug 21 '22 18:08 celsofranssa

This still hasn't been addressed?

May 03 '23 19:05 mhillebrand

One working option is to chain css calls, applying a *::text query to the selector that contains the text we want to scrape. Applied to the example HTML from the issue description, it looks like this:


from parsel import Selector

text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''

sel = Selector(text=text)

# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']

print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']

print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']

It is not perfect, but at least for the use cases I had, it is enough to cover this and similar cases without digging deep into lxml internals.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes...

I just realized that Selector.root is the lxml HTML element created by parsel's create_root_node. This means that if the parser type is html, the text_content method mentioned above can be applied here (as can any other lxml method):

print(sel.root.text_content())
'''

This is the new trend!
Published by newbieon Sept 17


'''

Cases where the Selector query returns a SelectorList are a bit more complicated:



print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']

print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']

Binding lxml's text_content to the Selector and SelectorList types looks like the most practical approach here.

As far as I understand, both options mentioned above were already technically available in 2018, when this ticket was created.

May 05 '23 07:05 GeorgeA92

Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its text() method. It's got deep, separator, and strip parameters. It's also incredibly fast. The major drawback is that it doesn't support XPath.

Oct 13 '23 21:10 mhillebrand
