trafilatura
trafilatura copied to clipboard
Fixed bug with @data-testid and removed some classes
Hey @adbar,
I was looking the benchmark and I found some attributes that can be improved.
Also, I found a bug in my last pull request with "@data-testid", I found out Reuters is using it in the whole website and when I set favor_precision=True
it is getting the wrong content.
It will be great if you could release a minor release with that fix.
- class 'permission' remove unwanted content from The Independent
- class 'lightbox' is usually means modals
- class ' hide' remove content that contains the class hide in the end.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 97.83%. Comparing base (
d2bcb1c
) to head (d44b40a
).
Additional details and impacted files
@@ Coverage Diff @@
## master #540 +/- ##
=======================================
Coverage 97.83% 97.83%
=======================================
Files 21 21
Lines 3514 3514
=======================================
Hits 3438 3438
Misses 76 76
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Thanks, I'll have a look at it later. I have a few fixes in the works, once a few PRs are merged I'll make a new release.
@felipehertzer I added a small script and updated the guidelines in tests/README.rst. If you want to have a look, it's how I test such cases.
Hey @adbar, sorry I should have check it, this tag data-testid is new to me.
I have a few questions:
- Can I add more sites to the check?
- Can we have it run automatically on pull request and post a comment here?
- I saw that you are testing Reuters, but the HTML is different, to the live version, and does not contain the new tag, do you have a script to download the HTML into the eval folder? or how can we stay updated and keep metrics correct?
Thanks.
No worries! I'll work on the documentation but here is how I'd answer your questions:
- Yes you could add more websites if there are not already in the benchmark and not all in English (there are too many English-only benchmarks), there is interesting work on it in #197 but the PR seems to be stale.
- I'm not sure how to automatically run it but if you do you could add the evaluation to the CI/CD.
- That's tricky, the pages change through time but it's also good to have old pages in the benchmark because people are also using the extractor on web archives. I'd say the new tag in the Reuters page is rather a case for
tests/realworld_tests.py
(you could add a new function and test problematic markup there).
Alright, I have a few Australians and I can help with some Brazillian Portuguese as well.
I can have a look in the Github action, you may need setup a env variable, but I will test it first.
Maybe we need to start adding real world tests when working on xpaths so we can prevent this from happening again
@felipehertzer You'll see that I annotated metadata for some of the documents in the benchmark. Since you're interested in certain metadata, you could maybe add a small evaluation for it as well?
I find a slightly better recall for @data-testid
alone instead of @data-testid="AuthorCard"
. The rest are either not visible or have negative effects.
Maybe the XPaths are not always the best option to address such cases or maybe there is a better combination to be found.
@felipehertzer Are you still working on the PR?
Hey @adbar, I changed the data-testid to be a little less specific. For the other task I will open a new pull request. Thanks.
Thanks! There is something wrong with the tests but it's unrelated. I need to check it before merging.
I ran a few tests, the overall result is slightly worse on my data but it's still an acceptable change. The lightbox rules clearly harms accuracy so I removed it. Does the current form work as expected on your data?
The author XPath would be a case for AUTHOR_XPATHS
or AUTHOR_DISCARD_XPATHS
below, what do you think?
Hey @adbar,
Sorry, I totally missed your last comment.
I'm fine with the 'lightbox' I can use the custom xpath on trafilatura.
Yes, it can be 'AUTHOR_XPATHS', in that case was a card on the bottom of the article with the author info. Or maybe AuthorURL in that case:
<span class="_1a6f5d6fad2874362309" data-testid="AuthorCard">
<span>
<strong>
<a class="d1dba0c3091a3c30ebd6" data-testid="AuthorURL" href="/by/p535y1">AUTHOR NAME</a></strong> is a senior reporter with The Australian....
</span>
</span>
I'll add new evaluation data shortly and then test again to make sure nothing breaks with this PR. You can continue working on it if you find something else.
I just added the author xpath and test above to the extractor in #567.