
new fallback for extracting json

Open Vitiell0 opened this pull request 5 years ago • 11 comments

Fixes #143

Vitiell0 avatar Jul 27 '20 20:07 Vitiell0

Codecov Report

Merging #144 into master will decrease coverage by 0.55%. The diff coverage is 76.19%.


@@            Coverage Diff             @@
##           master     #144      +/-   ##
==========================================
- Coverage   89.15%   88.60%   -0.56%     
==========================================
  Files          12       12              
  Lines         535      553      +18     
  Branches      121      124       +3     
==========================================
+ Hits          477      490      +13     
- Misses         52       56       +4     
- Partials        6        7       +1     
Impacted Files       Coverage           Δ
extruct/jsonld.py    87.80% <76.19%>    (-12.20%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update bd49a5f...50221f3.

codecov[bot] avatar Jul 27 '20 20:07 codecov[bot]

Could you cover your changes with tests?

Gallaecio avatar Jul 28 '20 10:07 Gallaecio

@Gallaecio I believe all the tests are still passing? The new helper function _extract_json_objects doesn't have a test, but it's pretty straightforward.

Vitiell0 avatar Jul 28 '20 16:07 Vitiell0
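For context, the helper itself is not shown in this thread, but a common way to implement this kind of fallback is to scan the script text with json.JSONDecoder.raw_decode and yield whatever decodes cleanly. A minimal sketch, which may differ from the actual implementation in the PR:

```python
import json

def _extract_json_objects(text, decoder=json.JSONDecoder()):
    """Yield every JSON object that can be decoded from ``text``,
    skipping over regions that fail to parse."""
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            obj, end = decoder.raw_decode(text, start)
        except ValueError:
            # Not valid JSON starting at this brace; keep scanning.
            pos = start + 1
        else:
            yield obj
            pos = end
```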

The point of tests is to avoid regressions. So, ideally, your changes should include a test that “proves” they work.

It’s quite important for long-term maintenance, and I think adding tests makes a lot of sense specifically in this case. For instance, it makes sure that in the future we do not break support for the scenario you are fixing here while trying to support a different scenario. And we are bound to see such change attempts; people can be very creative in the ways they break JSON 🙂

Gallaecio avatar Jul 28 '20 18:07 Gallaecio
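A regression test along the lines being requested might look roughly like this; the test name, the HTML fixture, and the assumption that a broken block no longer aborts extraction are illustrative rather than taken from the PR:

```python
import extruct

def test_broken_jsonld_block_does_not_hide_valid_one():
    # The first block contains unescaped quotes (invalid JSON); the second is valid.
    html = """
    <html><head>
      <script type="application/ld+json">
      {"@type": "NewsArticle", "description": "an unescaped "quote" breaks this"}
      </script>
      <script type="application/ld+json">
      {"@type": "Person", "name": "Jane"}
      </script>
    </head></html>
    """
    data = extruct.extract(html, syntaxes=["json-ld"])
    assert {"@type": "Person", "name": "Jane"} in data["json-ld"]
```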

lgtm

advance512 avatar Aug 03 '20 21:08 advance512

I don't know if you are willing to merge this pull request, as it seems to have been here for around 4 months now, but it solves the issue that the current HTML parser causes with ld+json scripts that use non-English characters. Example:

{ "@context": "http://schema.org", "@type": "NewsArticle", "headline": "We found an acute just here: á", "author": { "@type": "Person", "name": "Lastname with tilde: ñ" } }

I am pretty new to Python, so I am not sure how to implement the test in this language yet. If you can complete and merge this pull request, it would be awesome.

kennylajara avatar Dec 08 '20 02:12 kennylajara

Thanks for the feedback @kennylajara. Although the PR still needs more work IMO, it's great to know that it resolves problems for you, so we know it's important.

It looks like the problem you encountered is different from the original issue. Could you please share the URL where you have seen such markup?

lopuhin avatar Dec 08 '20 08:12 lopuhin

Sure @lopuhin . You can see an example here: https://www.diariolibre.com/actualidad/gobierno-comenzara-a-entregar-el-bono-navideno-de-rd-1500-desde-este-11-de-diciembre-NG23168270

EDIT: I just noticed that the problem with the markup of the previous URL is not the tilde or the acute (the current code seems to work OK with other URLs that contain similar markup).

The real problem is that the JSON-LD markup of that URL is broken due to unescaped quotes in the news article description. But at least this code allows the scraper to ignore the broken element and extract the valid markup instead of just crashing.

The alternative solution is to wrap extruct inside a try/except, but this will ignore other important markup (like Open Graph) as well, while the code here will extract the Open Graph data and the valid part of the JSON-LD markup.

kennylajara avatar Dec 08 '20 12:12 kennylajara
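The wrapper being described is roughly the following; extract_or_empty is just an illustrative name, not part of extruct:

```python
import extruct

def extract_or_empty(html, base_url=None):
    # Wrapping the whole call means a single broken JSON-LD block
    # throws away every format, Open Graph included.
    try:
        return extruct.extract(html, base_url=base_url)
    except Exception:
        return {}
```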

The alternative solution is to wrap extruct inside a try/except, but this will ignore other important markup (like Open Graph) as well, while the code here will extract the Open Graph data and the valid part of the JSON-LD markup.

Currently, this can be partly achieved with the errors="ignore" or errors="log" argument to extruct.extract - at least it would continue with the other formats even if one of them fails.

lopuhin avatar Dec 08 '20 13:12 lopuhin
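For reference, the argument mentioned above goes straight into extruct.extract; a short usage sketch (the file name is only a placeholder):

```python
import extruct

with open("page.html", encoding="utf-8") as f:
    html = f.read()

# errors="ignore" skips any syntax that fails to parse (errors="log" also logs
# the exception), so Open Graph and the rest still come back if JSON-LD is broken.
data = extruct.extract(html, errors="ignore")
print(data["opengraph"])
```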

Got it, you are right. errors="ignore" is working for me now.

Is there any other documentation? I don't see those options in the README file.

I just want to add one more comment:

The alternative solution is to wrap extruct inside a try/except, but this will ignore other important markup (like Open Graph) as well, while the code here will extract the Open Graph data and the valid part of the JSON-LD markup.

Currently, this can be partly achieved with the errors="ignore" or errors="log" argument to extruct.extract - at least it would continue with the other formats even if one of them fails.

Yes, almost... errors="ignore" helps me keep the scraper running, but I still find this pull request useful as it will not completely ignore the JSON-LD metadata. This code partially extracts the metadata, ignoring only the item containing the error.

kennylajara avatar Dec 08 '20 14:12 kennylajara

Is there any other documentation? I don't see those options in the README file.

Only here for now: https://github.com/scrapinghub/extruct/blob/c1e48037469d83579da0d20128b49e591423d716/extruct/_extruct.py#L17-L43

I still find this pull request useful as it will not completely ignore the JSON-LD metadata. This code partially extracts the metadata, ignoring only the item containing the error.

Right 👍

lopuhin avatar Dec 08 '20 14:12 lopuhin