trafilatura Add option to provide XPaths for content extraction

Open klvbdmh opened this issue 9 months ago • 2 comments

I can't parse comments on Reddit pages (old.reddit.com). The likely cause is that the comments are wrapped in form elements, which the parser may not handle correctly.

Steps to reproduce

Run trafilatura -u "https://old.reddit.com/r/programming/comments/1cnvy7y/how_stripe_prevents_double_payment_using/"

Expected Behavior

trafilatura should successfully parse and extract all visible Reddit comments.

Actual Behavior

Only user names, points, and number of children are extracted:

[–]SittingWave 665 points666 points667 points (73 children)
[–]barbouk 55 points56 points57 points (2 children)
[–]caltheon 15 points16 points17 points (0 children)
[–]TheSameTrain 19 points20 points21 points (0 children)
[–]WannaBeRichieRich 95 points96 points97 points (67 children)
[...]

Adding a --recall flag doesn't change anything.

Is it possible to manually specify which elements should be parsed?
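
To illustrate what I mean by specifying elements manually, here is a minimal sketch of the behavior I'm after. It uses lxml directly and is not trafilatura's API. The sample markup, the XPath expression, and the helper name extract_by_xpath are my own assumptions for illustration; the real old.reddit.com structure may differ.

```python
from lxml import html

# Hypothetical, simplified stand-in for old.reddit.com comment markup.
# The real page structure is an assumption and may differ.
SAMPLE = """
<html><body>
<div class="comment">
  <p class="tagline">alice 5 points</p>
  <form class="usertext">
    <div class="usertext-body"><div class="md"><p>First comment.</p></div></div>
  </form>
</div>
<div class="comment">
  <p class="tagline">bob 2 points</p>
  <form class="usertext">
    <div class="usertext-body"><div class="md"><p>Second comment.</p><p>More text.</p></div></div>
  </form>
</div>
</body></html>
"""

# Assumed XPath: comment bodies live in <div class="md"> inside a <form>.
COMMENT_XPATH = '//form//div[@class="md"]'


def extract_by_xpath(page_source, xpath):
    """Return the whitespace-normalized text of every node matching xpath."""
    tree = html.fromstring(page_source)
    results = []
    for node in tree.xpath(xpath):
        # Join text fragments with spaces so adjacent paragraphs don't fuse.
        text = " ".join(" ".join(node.itertext()).split())
        if text:
            results.append(text)
    return results


comments = extract_by_xpath(SAMPLE, COMMENT_XPATH)
for comment in comments:
    print(comment)
```

Something equivalent exposed as an option, where the user supplies the XPath and trafilatura handles the text cleaning and output formatting, would cover this use case.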

May 16 '24 01:05 klvbdmh
