
Add option to retrieve text content

Open frederik-elwert opened this issue 6 years ago • 9 comments

As a scrapy user, I often want to extract the text content of an element. The default option in parsel is to either use the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this creates unwanted behavior. E.g.:

<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']

With a basic understanding of XML and XPath, this behavior is expected. But it requires overhead to work around it, and it often creates frustrations with new users. There is a series of questions on stackoverflow as well as on the scrapy bug tracker:

lxml.html has the convenience method .text_content(), which collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I can imagine two ways to approach the required API:

  • Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
  • Or add a parameter to .extract*()/.get(), similar to the proposal in #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.

Would such an addition be welcome? I could prepare a patch.

frederik-elwert avatar Nov 16 '18 20:11 frederik-elwert

Hey @frederik-elwert! This is being worked on here: https://github.com/scrapy/parsel/pull/127 :)

kmike avatar Nov 17 '18 12:11 kmike

Please consider this a basic feature and add it.

kamrankausar avatar Feb 07 '20 09:02 kamrankausar

+1

joecabezas avatar May 23 '21 03:05 joecabezas

Any progress on this issue?

bblanchon avatar Feb 04 '22 10:02 bblanchon

Not much, but I merged master into #127 yesterday, so the PR is up to date now. I think it is ready feature-wise, and I'm happy with the implementation. But it needs some cleanup: more docs and tests.

kmike avatar Feb 10 '22 11:02 kmike

Any progress on this issue?

celsofranssa avatar Aug 21 '22 18:08 celsofranssa

This still hasn't been addressed?

mhillebrand avatar May 03 '23 19:05 mhillebrand

One working option is to chain css() calls, applying a *::text query to the selector that contains the text we want to scrape. Applied to the sample HTML from the issue description, it looks like this:


from parsel import Selector

text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''

sel = Selector(text=text)

# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']

print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']

print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']

It is not perfect, but for the use cases I have had it is enough to cover this and similar cases without digging deep into lxml internals.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes...

I just realized that Selector.root is the lxml HTML element created by parsel's create_root_node. This means that if the parser type is html, the text_content method mentioned above can be called on it (as can any other lxml method):

print(sel.root.text_content())
'''

This is the new trend!
Published by newbieon Sept 17


'''

Cases where a Selector query returns a SelectorList are a bit more complicated:



print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']

print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']


Binding lxml's text_content to the Selector and SelectorList types looks like the most practical approach here.

As far as I understand, both options mentioned above were already technically possible in 2018, when this ticket was created.

GeorgeA92 avatar May 05 '23 07:05 GeorgeA92

Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its text() method. It's got deep, separator, and strip parameters. It's also incredibly fast. The major drawback is that it doesn't support XPath.

mhillebrand avatar Oct 13 '23 21:10 mhillebrand