parsel
parsel copied to clipboard
Support JMESPath now
Support jpath based on JMESPath, It can make extract more efficient, especially on Json mixed HTML, and i write test file too, your could test it!
To-do:
- [x] Documentation
Closes #25
Codecov Report
Merging #181 (f7aa1c1) into master (5b28b54) will decrease coverage by
1.15%
. The diff coverage is91.79%
.
:exclamation: Current head f7aa1c1 differs from pull request most recent head 53d5146. Consider uploading reports for the commit 53d5146 to get more accurate results
@@ Coverage Diff @@
## master #181 +/- ##
==========================================
- Coverage 91.69% 90.54% -1.15%
==========================================
Files 5 5
Lines 349 455 +106
Branches 69 93 +24
==========================================
+ Hits 320 412 +92
- Misses 23 32 +9
- Partials 6 11 +5
Impacted Files | Coverage Δ | |
---|---|---|
parsel/selector.py | 89.47% <91.79%> (-1.44%) |
:arrow_down: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
Do you think you can include tests that cover the lines marked in yellow/red in https://codecov.io/gh/scrapy/parsel/pull/181/diff?
@Gallaecio Thank you for your continued guidance, I will fix the problem on the weekend, sorry for the late reply
@EchoShoot Do you plan to get back to this? Otherwise, would you mind if I take over?
@EchoShoot Do you plan to get back to this? Otherwise, would you mind if I take over?
I have been too busy working lately, and I ca n’t find time to do it. I will be happy to entrust it to you.
To do:
- [x] Allow
type
to bejson
. - [x] Allow to specify a
type
for nested XML or HTML structures. - [x] Replace
data
withroot
, giventype
allows to tell them apart. - [x] When using jmespath with an invalid JSON string (
json.loads
raisesValueError
), implement a behavior similar to a XML/HTML counterpart. - [x] Handle calls to xpath/css/jmespath when
type
is not compatible - [x] Handle Pylint’s
too-many-instance-attributes
- [x] Handle cases where the
Selector
gets bothtext
androot
(currentlydata
) for JSON - [x] Complete test coverage
That's nice! Good work here, @EchoShoot and @Gallaecio.
Do you know if is it possible to use a Selector to parse code like this with the current implementation?
<script type="text/javascript">
var initialState = {
"foo": "bar"
};
</script>
Maybe using a regular expression...
I can see there's a test case covering this other case:
<script id="initial-state" type="application/json">
{
"foo": "bar"
}
</script>
But this is not always what we find in websites out there.
I think that would be out of the scope of this pull request, but given js2xml, it does not sound too crazy to implement support for it in the future.
Hello people, Why is this PR still on-hold?
I understand that your last changes @Gallaecio fixed the issues with Scrapy. Can someone like me help you with anything?
I made my own extension to allow jsonpath-ng on my private Scrapy project because of a project cross-compatibility issue but now I'm able to migrate to JMESPath and I would like to help to have this PR merged instead of patching parsel on my own project.
At this point I think the main blocker is deciding the name or names of the method:
-
jmespath
(current proposal) -
jmespath
andjmes
(what @pauloromeira suggested, which I don’t like, I think having a single name would be more Pythonic) -
jmes
(what I would be OK with, even though I have a slight preference forjmespath
)
If a third, external opinion is helpful 🤔
I'd choose to keep only one name, as you say is more Pythonic. Also, as a developer that works really close to people that only work on the selectors part of a business application built with Parsel, I have experienced firsthand that multiple options to do the same is almost every time worse than only one way of doing something.
If I would have to choose one I would go with jmespath
. If you google "jmes" you get more noise than if you google "jmespath", and as I said before, for the final user, simpler is better.
I hope this helps you decide guys, I would love to have this issue solved asap. And of course, thanks for your amazing work maintaining these libraries. 👏🏼
Hi 1 month later, any update on this?
I performed a private survey with people that use Parsel professionally, and choices were 3:1 towards jmespath
(no one opted for both), so I would say we will probably go with the current proposal. Now it’s just a matter of a maintainer finding time to review the proposal, and merge.
Scrapy 2.5.0 fixed https://github.com/scrapy/scrapy/issues/4961, so this change will not break Scrapy 2.5.0+.
For future reference, changes to implement in Scrapy once parsel is released with this change:
- https://github.com/Gallaecio/scrapy/commit/d642881e25b6a7ce7d8f640b354459b9a4921790
- https://github.com/Gallaecio/scrapy/commit/8bd79fcd669ad756c6dc4dca0850f446352e9a5a
I got feedback from @tcurvelo, who pointed out the lack of documentation. I will work on that when I get a chance, we should definitely not merge without it.
I got feedback from @tcurvelo, who pointed out the lack of documentation. I will work on that when I get a chance, we should definitely not merge without it.
@Gallaecio I can do that, no problems.
~~For the record, there is an ongoing pull request to merge into this one at https://github.com/EchoShoot/parsel/pull/4~~ Merged.
Thank you everyone!
Great job @EchoShoot @Gallaecio @felipeboffnunes!