trafilatura icon indicating copy to clipboard operation
trafilatura copied to clipboard

Fixed bug with @data-testid and removed some classes

Open felipehertzer opened this issue 10 months ago • 14 comments

Hey @adbar,

I was looking the benchmark and I found some attributes that can be improved.

Also, I found a bug in my last pull request with "@data-testid", I found out Reuters is using it in the whole website and when I set favor_precision=True it is getting the wrong content.

It will be great if you could release a minor release with that fix.

  • class 'permission' remove unwanted content from The Independent
  • class 'lightbox' is usually means modals
  • class ' hide' remove content that contains the class hide in the end.

felipehertzer avatar Apr 03 '24 13:04 felipehertzer

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.83%. Comparing base (d2bcb1c) to head (d44b40a).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #540   +/-   ##
=======================================
  Coverage   97.83%   97.83%           
=======================================
  Files          21       21           
  Lines        3514     3514           
=======================================
  Hits         3438     3438           
  Misses         76       76           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Apr 03 '24 13:04 codecov[bot]

Thanks, I'll have a look at it later. I have a few fixes in the works, once a few PRs are merged I'll make a new release.

adbar avatar Apr 03 '24 16:04 adbar

@felipehertzer I added a small script and updated the guidelines in tests/README.rst. If you want to have a look, it's how I test such cases.

adbar avatar Apr 04 '24 16:04 adbar

Hey @adbar, sorry I should have check it, this tag data-testid is new to me.

I have a few questions:

  • Can I add more sites to the check?
  • Can we have it run automatically on pull request and post a comment here?
  • I saw that you are testing Reuters, but the HTML is different, to the live version, and does not contain the new tag, do you have a script to download the HTML into the eval folder? or how can we stay updated and keep metrics correct?

Thanks.

felipehertzer avatar Apr 05 '24 03:04 felipehertzer

No worries! I'll work on the documentation but here is how I'd answer your questions:

  • Yes you could add more websites if there are not already in the benchmark and not all in English (there are too many English-only benchmarks), there is interesting work on it in #197 but the PR seems to be stale.
  • I'm not sure how to automatically run it but if you do you could add the evaluation to the CI/CD.
  • That's tricky, the pages change through time but it's also good to have old pages in the benchmark because people are also using the extractor on web archives. I'd say the new tag in the Reuters page is rather a case for tests/realworld_tests.py (you could add a new function and test problematic markup there).

adbar avatar Apr 05 '24 10:04 adbar

Alright, I have a few Australians and I can help with some Brazillian Portuguese as well.

I can have a look in the Github action, you may need setup a env variable, but I will test it first.

Maybe we need to start adding real world tests when working on xpaths so we can prevent this from happening again

felipehertzer avatar Apr 05 '24 10:04 felipehertzer

@felipehertzer You'll see that I annotated metadata for some of the documents in the benchmark. Since you're interested in certain metadata, you could maybe add a small evaluation for it as well?

adbar avatar Apr 05 '24 10:04 adbar

I find a slightly better recall for @data-testid alone instead of @data-testid="AuthorCard". The rest are either not visible or have negative effects. Maybe the XPaths are not always the best option to address such cases or maybe there is a better combination to be found.

adbar avatar Apr 05 '24 14:04 adbar

@felipehertzer Are you still working on the PR?

adbar avatar Apr 17 '24 16:04 adbar

Hey @adbar, I changed the data-testid to be a little less specific. For the other task I will open a new pull request. Thanks.

felipehertzer avatar Apr 22 '24 05:04 felipehertzer

Thanks! There is something wrong with the tests but it's unrelated. I need to check it before merging.

adbar avatar Apr 22 '24 10:04 adbar

I ran a few tests, the overall result is slightly worse on my data but it's still an acceptable change. The lightbox rules clearly harms accuracy so I removed it. Does the current form work as expected on your data?

The author XPath would be a case for AUTHOR_XPATHS or AUTHOR_DISCARD_XPATHS below, what do you think?

adbar avatar Apr 23 '24 10:04 adbar

Hey @adbar,

Sorry, I totally missed your last comment.

I'm fine with the 'lightbox' I can use the custom xpath on trafilatura.

Yes, it can be 'AUTHOR_XPATHS', in that case was a card on the bottom of the article with the author info. Or maybe AuthorURL in that case:

<span class="_1a6f5d6fad2874362309" data-testid="AuthorCard">
<span>
<strong>
<a class="d1dba0c3091a3c30ebd6" data-testid="AuthorURL" href="/by/p535y1">AUTHOR NAME</a></strong> is a senior reporter with The Australian....
</span>
</span>

felipehertzer avatar May 03 '24 07:05 felipehertzer

I'll add new evaluation data shortly and then test again to make sure nothing breaks with this PR. You can continue working on it if you find something else.

I just added the author xpath and test above to the extractor in #567.

adbar avatar May 22 '24 16:05 adbar