trafilatura
Add option to provide XPaths for content extraction
I can't extract comment text from Reddit pages (old.reddit.com). The likely cause is that each comment body is wrapped in a form element, which the parser may not handle correctly.
Steps to reproduce
Run trafilatura -u "https://old.reddit.com/r/programming/comments/1cnvy7y/how_stripe_prevents_double_payment_using/"
Expected Behavior
trafilatura should successfully parse and extract all visible Reddit comments.
Actual Behavior
Only user names, points, and number of children are extracted:
[–]SittingWave 665 points666 points667 points (73 children)
[–]barbouk 55 points56 points57 points (2 children)
[–]caltheon 15 points16 points17 points (0 children)
[–]TheSameTrain 19 points20 points21 points (0 children)
[–]WannaBeRichieRich 95 points96 points97 points (67 children)
[...]
Adding the --recall flag doesn't change anything.
Is it possible to manually specify which elements should be parsed?
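To make the request concrete, here is a rough sketch of what XPath-driven selection could look like. It is not trafilatura's API: `extract_by_xpath` is a hypothetical helper, the markup in `SAMPLE` is a simplified, assumed stand-in for one old.reddit.com comment (body inside a `form` element), and the standard-library `xml.etree.ElementTree` is used only because it runs without dependencies. It supports a limited XPath subset and needs well-formed markup, so real pages would need a lenient HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified stand-in for a single old.reddit.com comment.
# The real page structure may differ; this only mirrors the "body inside a form" layout.
SAMPLE = """
<div class="comment">
  <p class="tagline">[-]example_user 12 points (0 children)</p>
  <form class="usertext">
    <div class="usertext-body">
      <div class="md"><p>First paragraph.</p><p>Second paragraph.</p></div>
    </div>
  </form>
</div>
"""


def extract_by_xpath(markup, xpath):
    """Hypothetical helper: return the normalized text of every node matching xpath."""
    root = ET.fromstring(markup)
    results = []
    for node in root.findall(xpath):
        # Gather all nested text fragments and collapse runs of whitespace.
        text = " ".join(" ".join(node.itertext()).split())
        if text:
            results.append(text)
    return results


# Select comment bodies explicitly, even though they sit inside a form element.
comments = extract_by_xpath(SAMPLE, ".//form//div[@class='md']")
print(comments)
```

An option along these lines would let the user point the extractor at the comment bodies directly instead of relying on the heuristics that currently skip them.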