New fallback for extracting JSON
Fixes #143
Codecov Report
Merging #144 into master will decrease coverage by 0.55%. The diff coverage is 76.19%.
```diff
@@            Coverage Diff             @@
##           master     #144      +/-   ##
==========================================
- Coverage   89.15%   88.60%   -0.56%
==========================================
  Files          12       12
  Lines         535      553      +18
  Branches      121      124       +3
==========================================
+ Hits          477      490      +13
- Misses         52       56       +4
- Partials        6        7       +1
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| extruct/jsonld.py | 87.80% <76.19%> (-12.20%) | :arrow_down: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update bd49a5f...50221f3.
Could you cover your changes with tests?
@Gallaecio I believe all the tests are still passing? The new helper function `_extract_json_objects` doesn't have a test, but it's pretty straightforward.
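For reference, a lenient extractor along these lines can be sketched with the standard library's `json.JSONDecoder.raw_decode` (this is only an illustration of the approach; the actual helper in the PR may differ):

```python
import json


def extract_json_objects(text, decoder=json.JSONDecoder()):
    """Yield every parseable top-level JSON object found in ``text``,
    skipping over any surrounding content that is not valid JSON."""
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            # raw_decode parses one JSON value and reports where it ended
            obj, end = decoder.raw_decode(text, start)
            yield obj
            pos = end
        except json.JSONDecodeError:
            # not a valid object starting here; keep scanning past this brace
            pos = start + 1


print(list(extract_json_objects('noise {"a": 1} {oops} {"b": 2}')))
# → [{'a': 1}, {'b': 2}]
```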
The point of tests is to avoid regressions. So, ideally, the changes should include a test that “proves” that your changes work.
It’s quite important for long-term maintenance, and I think specifically in this case adding tests makes a lot of sense. For instance, it makes sure that in the future we do not break support for the scenario that you are fixing here while trying to support a different scenario. And we are bound to see such change attempts; people can be very creative in the way they break JSON :slightly_smiling_face:
lgtm
I don't know if you are willing to merge this pull request, as it seems to have been here for around 4 months now, but it solves the issue that the current HTML parser causes with ld+json scripts that use non-English characters. Example:
```json
{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "headline": "We found an acute just here: á",
  "author": {
    "@type": "Person",
    "name": "Lastname with tilde: ñ"
  }
}
```
I am pretty new to Python, so I am not sure how to implement the test in this language yet. If you can create and merge this pull request, it would be awesome.
Thanks for the feedback @kennylajara. Although the PR still needs more work IMO, it's great to know that it resolves problems for you, so we know it's important.
It looks like the problem you encountered is different from the original issue, could you please share the URL where you have seen such markup?
Sure @lopuhin . You can see an example here: https://www.diariolibre.com/actualidad/gobierno-comenzara-a-entregar-el-bono-navideno-de-rd-1500-desde-este-11-de-diciembre-NG23168270
EDIT: I just noticed that the problem with the markup at the previous URL is not the tilde or the acute (the current code seems to work OK with other URLs that contain similar markup).
The real problem is that the JSON-LD markup at that URL is broken due to unescaped quotes in the news article description. But at least this code allows the scraper to ignore the broken element and extract the valid markup instead of just crashing.
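To see why such markup crashes a strict parser, here is a minimal reproduction with the standard library (the snippet is illustrative, not the page's actual markup):

```python
import json

# Unescaped inner quotes, like those in the article description,
# make the whole script block invalid JSON.
broken = '{"@type": "NewsArticle", "description": "He said "hello" to everyone"}'
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    print("invalid JSON-LD:", exc.msg)
```

A strict `json.loads` on the whole block raises, which is why a per-item fallback (or skipping the broken item) is needed.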
The alternative solution is to wrap extruct inside a try/except, but that would ignore other important markup (like opengraph), while the code here extracts the opengraph and the valid part of the JSON-LD markup.
Currently, this can be partly achieved with the `errors="ignore"` or `errors="log"` argument to `extruct.extract`; at least it will continue with the other formats even if one of them fails.
Got it, you are right. `errors="ignore"` is working for me now.
Is there any other documentation? I don't see those options in the README file.
I just want to add one more comment:

> The alternative solution is to wrap extruct inside a try/except, but that would ignore other important markup (like opengraph), while the code here extracts the opengraph and the valid part of the JSON-LD markup.

> Currently, this can be partly achieved with the `errors="ignore"` or `errors="log"` argument to `extruct.extract`; at least it will continue with the other formats even if one of them fails.
Yes, almost... `errors="ignore"` helps me keep the scraper running, but I still find this pull request useful, as it will not completely ignore the ld+json metadata. This code partially extracts the metadata, skipping only the item containing the error.
> Is there any other documentation? I don't see those options in the README file.
Only here for now: https://github.com/scrapinghub/extruct/blob/c1e48037469d83579da0d20128b49e591423d716/extruct/_extruct.py#L17-L43
> I still find this pull request useful, as it will not completely ignore the ld+json metadata. This code partially extracts the metadata, skipping only the item containing the error.
Right 👍