parsel Support JMESPath now

Support jpath based on JMESPath, It can make extract more efficient, especially on Json mixed HTML, and i write test file too, your could test it!

To-do:

[x] Documentation

Closes #25

Jan 02 '20 02:01 EchoShoot

Codecov Report

Merging #181 (f7aa1c1) into master (5b28b54) will decrease coverage by 1.15%. The diff coverage is 91.79%.

:exclamation: Current head f7aa1c1 differs from pull request most recent head 53d5146. Consider uploading reports for the commit 53d5146 to get more accurate results

@@            Coverage Diff             @@
##           master     #181      +/-   ##
==========================================
- Coverage   91.69%   90.54%   -1.15%     
==========================================
  Files           5        5              
  Lines         349      455     +106     
  Branches       69       93      +24     
==========================================
+ Hits          320      412      +92     
- Misses         23       32       +9     
- Partials        6       11       +5

Impacted Files	Coverage Δ
parsel/selector.py	`89.47% <91.79%> (-1.44%)`	:arrow_down:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Jan 02 '20 02:01 codecov[bot]

Do you think you can include tests that cover the lines marked in yellow/red in https://codecov.io/gh/scrapy/parsel/pull/181/diff?

Feb 19 '20 16:02 Gallaecio

@Gallaecio Thank you for your continued guidance, I will fix the problem on the weekend, sorry for the late reply

Feb 19 '20 23:02 EchoShoot

@EchoShoot Do you plan to get back to this? Otherwise, would you mind if I take over?

Apr 28 '20 08:04 Gallaecio

@EchoShoot Do you plan to get back to this? Otherwise, would you mind if I take over?

I have been too busy working lately, and I ca n’t find time to do it. I will be happy to entrust it to you.

Apr 29 '20 01:04 EchoShoot

To do:

[x] Allow type to be json.
[x] Allow to specify a type for nested XML or HTML structures.
[x] Replace data with root, given type allows to tell them apart.
[x] When using jmespath with an invalid JSON string (json.loads raises ValueError), implement a behavior similar to a XML/HTML counterpart.
[x] Handle calls to xpath/css/jmespath when type is not compatible
[x] Handle Pylint’s too-many-instance-attributes
[x] Handle cases where the Selector gets both text and root (currently data) for JSON
[x] Complete test coverage

Jun 02 '20 09:06 Gallaecio

That's nice! Good work here, @EchoShoot and @Gallaecio.

Do you know if is it possible to use a Selector to parse code like this with the current implementation?

<script type="text/javascript">
var initialState = {
  "foo": "bar"
};
</script>

Maybe using a regular expression...

I can see there's a test case covering this other case:

<script id="initial-state" type="application/json">
{
  "foo": "bar"
}
</script>

But this is not always what we find in websites out there.

Jun 17 '20 23:06 victor-torres

I think that would be out of the scope of this pull request, but given js2xml, it does not sound too crazy to implement support for it in the future.

Jun 18 '20 04:06 Gallaecio

Hello people, Why is this PR still on-hold?

I understand that your last changes @Gallaecio fixed the issues with Scrapy. Can someone like me help you with anything?

I made my own extension to allow jsonpath-ng on my private Scrapy project because of a project cross-compatibility issue but now I'm able to migrate to JMESPath and I would like to help to have this PR merged instead of patching parsel on my own project.

Mar 20 '21 09:03 xPi2

At this point I think the main blocker is deciding the name or names of the method:

jmespath (current proposal)
jmespath and jmes (what @pauloromeira suggested, which I don’t like, I think having a single name would be more Pythonic)
jmes (what I would be OK with, even though I have a slight preference for jmespath)

Mar 20 '21 09:03 Gallaecio

If a third, external opinion is helpful 🤔

I'd choose to keep only one name, as you say is more Pythonic. Also, as a developer that works really close to people that only work on the selectors part of a business application built with Parsel, I have experienced firsthand that multiple options to do the same is almost every time worse than only one way of doing something.

If I would have to choose one I would go with jmespath. If you google "jmes" you get more noise than if you google "jmespath", and as I said before, for the final user, simpler is better.

I hope this helps you decide guys, I would love to have this issue solved asap. And of course, thanks for your amazing work maintaining these libraries. 👏🏼

Mar 20 '21 22:03 xPi2

Hi 1 month later, any update on this?

Apr 14 '21 19:04 xPi2

I performed a private survey with people that use Parsel professionally, and choices were 3:1 towards jmespath (no one opted for both), so I would say we will probably go with the current proposal. Now it’s just a matter of a maintainer finding time to review the proposal, and merge.

Apr 15 '21 07:04 Gallaecio

Scrapy 2.5.0 fixed https://github.com/scrapy/scrapy/issues/4961, so this change will not break Scrapy 2.5.0+.

May 12 '21 16:05 Gallaecio

For future reference, changes to implement in Scrapy once parsel is released with this change:

https://github.com/Gallaecio/scrapy/commit/d642881e25b6a7ce7d8f640b354459b9a4921790
https://github.com/Gallaecio/scrapy/commit/8bd79fcd669ad756c6dc4dca0850f446352e9a5a

Mar 08 '22 10:03 Gallaecio

I got feedback from @tcurvelo, who pointed out the lack of documentation. I will work on that when I get a chance, we should definitely not merge without it.

Mar 24 '22 18:03 Gallaecio

I got feedback from @tcurvelo, who pointed out the lack of documentation. I will work on that when I get a chance, we should definitely not merge without it.

@Gallaecio I can do that, no problems.