[Feature Request] Add support for JMESPath

Open voith opened this issue 9 years ago • 23 comments

Building a Selector based on JMESPath in parsel would make parsing JSON easier. It would also help scrapy add methods like add_json and get_json to the ItemLoader. I got this idea from scrapy/scrapy#1005. From what I understand, the Selector in parsel is built on lxml; how about using jmespath to build a JsonSelector?

I am not sure whether this feature belongs in this library, as Parsel describes itself as a parser for XML/HTML. But adding it would add great value to this project.

PS: If the maintainers would like to have this feature in, then I'd like to contribute it myself.

voith avatar Jan 27 '16 06:01 voith

That's an interesting idea. We were just talking about perhaps adding a JsonResponse to Scrapy in https://github.com/scrapy/scrapy/issues/1729

I'd be okay with adding a JsonSelector completely separate from the already existing Selector, and then providing a factory function selector_for(response_text) that would do something like:

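def selector_for(text):
    # JsonSelector and NotAJsonError are hypothetical names from this proposal.
    try:
        return JsonSelector(text)
    except NotAJsonError:
        # Not valid JSON, so fall back to the HTML/XML selector.
        return Selector(text)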

@dangra @kmike what do you think, fellows?

eliasdorneles avatar Jan 27 '16 12:01 eliasdorneles

Alright I'll prototype this idea and see how it goes.

voith avatar Jan 27 '16 15:01 voith

@eliasdorneles : do you recall XPathHtmlSelector, XPathXMLSelector, CSSHtmlSelector...?

I am not fond of using a different class for JMESPath; we already ditched that approach in favour of a single class with different methods per selection type.

Off the top of my head, the main reason I recall is simpler nesting of selection methods: response.css('div').xpath('.//script').jmespath(...)

dangra avatar Jan 27 '16 17:01 dangra

@dangra I see. Hm, my thinking was that the input for both would be different (a selector supporting JMESPath wants JSON, not HTML/XML).

Do we have a use case for response.xpath().jmes() or response.jmes().xpath()? I suppose it could be useful when one has escaped HTML inside an AJAX JSON response or JSON inside an HTML attribute -- are those the ones you have in mind?

eliasdorneles avatar Jan 27 '16 17:01 eliasdorneles

There are useful use cases for chaining (e.g. processing data- attributes), but I don't think they are worth the extra complexity we may introduce to support them.

response.jmespath(...) or jmespath.search(...) covers most use cases nicely and is much easier to implement and understand.

kmike avatar Jan 27 '16 18:01 kmike

@kmike curious how you're thinking about the implementation. You mentioned response.jmespath -- we don't have response in Parsel, did you mean it as a method for the Selector class itself?

eliasdorneles avatar Jan 27 '16 18:01 eliasdorneles

There are useful use cases for chaining (e.g. processing data- attributes), but I don't think they are worth the extra complexity we may introduce to support them.

This is exactly what Parsel provides: it moves the implementation complexity away from users.

Do we have a use case for response.xpath().jmes() or response.jmes().xpath()? I suppose it could be useful when one has escaped HTML inside an AJAX JSON response or JSON inside an HTML attribute -- are those the ones you have in mind?

Both examples of chaining JSON and HTML are valid, and making chaining easy is part of Parsel's philosophy.

dangra avatar Jan 28 '16 16:01 dangra

This is exactly what Parsel provides: it moves the implementation complexity away from users.

Users who try to subclass the selectors will end up facing these complexities. So far, the implementation has been inviting for subclassing.

Digenis avatar Jan 28 '16 16:01 Digenis

@Digenis I don't think there is such complexity for users extending the Selector class. I can understand there was a bit when the CSS selection method was added, because behind the scenes it translates to XPath and reuses it. But JMESPath is going to be a completely new method; it doesn't interfere with the existing methods at all.

I think we have two options:

  1. Add a "trailing" selection method, like .re(), named .jmespath(), as a thin wrapper around jmespath.search(). Pro: simple to implement, and it can be chained after xpath()/css(). Con: chaining JSON -> XPath is not possible (although I think that is the less common case).
  2. Implement the full-fledged selection interface: the selection method returns a SelectorList() instance and extract() returns a list of unicode text. Pro: chaining all the way is possible. Con: we may need a trailing method anyway to parse JSON.

dangra avatar Jan 28 '16 20:01 dangra

Ok! I must admit option 2 is complex because we are parsing the DOM in the Selector constructor,

but option 1 is still compelling, isn't it? :)

dangra avatar Jan 28 '16 21:01 dangra

I agree that option (1) looks easy enough to implement, but has anyone had a real use case for it? If I understood @voith properly, he wanted to parse JSON using Parsel (no XML/HTML involved at all), not to query some JSON data extracted from XML/HTML element attributes.

kmike avatar Jan 28 '16 21:01 kmike

Yes, I opened this issue with the intention of being able to parse JSON with Parsel. It would be great to have chained parsing as well, but implementing jmespath under an XML/HTML selector sounds complex, as the inputs are different.

voith avatar Jan 29 '16 15:01 voith

We can delay the parsing of the DOM until the first selection method is called. That will trigger json, xml, html parsing on demand.
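
A minimal sketch of what on-demand parsing could look like, using only the standard library. LazyDocument and its methods are hypothetical names for illustration, not parsel's API, and xml.etree.ElementTree stands in for lxml here.

```python
import json
import xml.etree.ElementTree as ET


class LazyDocument:
    """Hypothetical stand-in for a selector that parses only when asked."""

    def __init__(self, text):
        self.text = text
        self._cache = {}  # parsed results, keyed by format name

    def _parsed(self, kind, parser):
        # Parse on first request for this format, then reuse the result.
        if kind not in self._cache:
            self._cache[kind] = parser(self.text)
        return self._cache[kind]

    def json(self):
        # A JSON selection method would start here.
        return self._parsed("json", json.loads)

    def xml(self):
        # An XPath/CSS selection method would start here.
        return self._parsed("xml", ET.fromstring)


doc = LazyDocument('{"total": 4, "status": "ok"}')
print(doc._cache)           # nothing has been parsed yet
print(doc.json()["total"])  # first call triggers json.loads
```

With this shape the constructor never needs to know the input format; the first selection method called decides which parser runs.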

dangra avatar Jan 29 '16 19:01 dangra

Offering .json()/.jmespath()/.jsonpath() for a Selector instantiated with a JSON string, with type="json"? Why not. Being able to chain JSON selectors? Why not as well.

But I don't see a compelling use case for chaining .xpath()/.css() and .json()/.jmespath()/.jsonpath().

Internally, in the current parsel implementation, once the input is parsed, chaining navigates inside the same parsed document tree; it does not re-parse to build a new document.

Take, say, some HTML document containing comments which themselves contain HTML code, think facebook's view-source:https://www.facebook.com/JustinBieber/

<code class="hidden_elem" id="u_0_15">
<!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->
</code>

parsel does not support something like selector.css('code#u_0_15').xpath('string(comment())').xpath('//@id') for the previous example:

>>> selector = parsel.Selector(text=u'''<code class="hidden_elem" id="u_0_15">
... <!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->
... </code>''')

>>> selector.css('#u_0_15').xpath('comment()').extract_first()
u'<!-- <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> -->'

>>> selector.css('#u_0_15').xpath('string(comment())').extract_first()
u' <div class="_5ay5"><div class="_4-u2 _4-u8"><div id="u_0_14"></div></div></div> '

>>> selector.css('#u_0_15').xpath('string(comment())').xpath('//@id')
[]

You still have to reinject into another selector to work on the embedded HTML:

>>> parsel.Selector(
...         selector.css('#u_0_15').xpath('string(comment())').extract_first()
...     ).xpath('//@id').extract()
[u'u_0_14']
>>> 

redapple avatar Jan 29 '16 22:01 redapple

Finding JSON mixed with XML/HTML is not rare when crawling.

I have submitted pull request #181. It implements a method named jpath that can be used like xpath and css, including chaining.

Here are some examples.

When there is JSON inside HTML:

<div>
    <h1>Information</h1>
    <content>
            {
              "user": [
                        { "name": "A", "age": 18},
                        {"name": "B","age": 32},
                        {"name": "C","age": 22},
                        {"name": "D","age": 25}
              ],
              "total": 4,
              "status": "ok"
            }
    </content>
</div>
  • extract it with this syntax:
>>> sel.xpath('//div/content').jpath('user[*].name').getall()
['A', 'B', 'C', 'D']
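
For comparison, here is a rough sketch of the same extraction done by hand, without any chained method, using only the standard library. It assumes the markup is well-formed enough for xml.etree.ElementTree (a real HTML page would need an HTML parser such as lxml), and the sample is shortened to two users.

```python
import json
import xml.etree.ElementTree as ET

html = """<div>
    <h1>Information</h1>
    <content>
        {"user": [{"name": "A", "age": 18}, {"name": "B", "age": 32}],
         "total": 2, "status": "ok"}
    </content>
</div>"""

# Step 1: locate the element that holds the JSON text.
root = ET.fromstring(html)
raw = root.find("content").text

# Step 2: parse that text as JSON and pick out the names by hand.
data = json.loads(raw)
names = [user["name"] for user in data["user"]]
print(names)  # ['A', 'B']
```

The hand-off between the two steps (extract a string, parse it again) is the part a chained method would hide.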

When there is HTML inside JSON:

{
    "content": [
                        { "name": "A", "value": "a" },
                        {"name": {"age": 18}, "value": "b"},
                        {"name": "C", "value": "c"},
                        {"name": "<a>D</a>", "value": "<div>d</div>"}
                    ],
    "html": [
                  "<div><a>AAA<br>Test</a>aaa</div><div><a>BBB</a>bbb<b>BbB</b><div/>"
                 ]
}
  • extract it with this syntax:
>>> sel.jpath('html').xpath('//div/a/text()').getall()
['AAA', 'Test', 'BBB']

By the way, it calls json.loads() inside the selector, which means it can be used normally, as in Selector(text='{"A":"a"}'). It will also make it easier to implement response.jpath('...') in scrapy, rather than Selector(json=response.json()).
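
A minimal sketch of how calling json.loads() on the input could double as format detection. detect_root is a hypothetical helper written for illustration; it is not taken from the pull request.

```python
import json


def detect_root(text):
    """Return ("json", parsed) if text is valid JSON, else ("html", text)."""
    try:
        return "json", json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Not JSON: leave the text untouched for the HTML/XML parser.
        return "html", text


print(detect_root('{"A": "a"}'))
print(detect_root("<div>a</div>"))
```

One caveat with this approach: any text that happens to be valid JSON, such as a bare number like 123, is treated as JSON, so an explicit type argument may still be needed for ambiguous input.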

EchoShoot avatar Jan 02 '20 04:01 EchoShoot

Hey guys! I think we need to discuss which name is better: jsonpath, jpath, or jmespath?

  • In my opinion, jpath is relatively short and easy for developers from all over the world to remember. It may be confusing at first, but it will gradually become mainstream over time.
  • jsonpath is also a good name, but it is a bit long, which is inconvenient for chained calls.
  • jmespath is difficult to remember, especially for someone whose first language is not English.

EchoShoot avatar Jan 03 '20 01:01 EchoShoot

JMESPath and JSONPath are different JSON query languages. If we use jpath, it is unclear which one we are using, and things can get worse if a new JSON query language is ever implemented with that name (JPath).

Moreover, just as we support 2 different HTML/XML query languages (CSS and XPath), at some point we may support multiple JSON query languages (e.g. JMESPath, JSONPath and jq); so I really believe that jpath is a bad choice in the long run.

Yesterday I found out that Parsel used to have a select method, probably back when only one of CSS and XPath was supported. Care to guess which one it used? :)

Gallaecio avatar Jan 03 '20 12:01 Gallaecio

You convinced me. I agree with you now and have decided to adopt jmespath. Thank you very much for your help. ^0^

EchoShoot avatar Jan 03 '20 14:01 EchoShoot

How is this going? I'm trying to implement this myself on top of the parsel selector in my own project, but I'm sure you know how to do it better.

xPi2 avatar Jul 28 '20 10:07 xPi2

I’m not entirely against it, but given that we use selector.xpath() instead of selector.x(), I think jmespath is more coherent, and it is not that long.

Gallaecio avatar Jul 28 '20 11:07 Gallaecio

Not to derail this, but I'd argue that implementing JSONPath[1] would actually be more fitting for parsel, as it is XPath-like. For example, JMESPath doesn't support recursive queries (like XPath's //node) while JSONPath does (as $..node); the whole protocol structure is also much more similar to that of XPath.

Ideally it would be great to have both! More and more of the web uses JSON, and it would be great to have one good parser for both HTML and JSON.
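
To make "recursive query" concrete, here is a small hand-written sketch of what a recursive-descent lookup such as $..name has to do. descend is a hypothetical helper for illustration only; it is neither JSONPath nor the jsonpath-ng API, and the result order of a real implementation may differ.

```python
def descend(node, key):
    """Yield every value stored under `key` at any depth of a JSON value."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep searching below this value, whatever its key was.
            yield from descend(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from descend(item, key)


data = {
    "name": "root",
    "children": [{"name": "a", "children": [{"name": "b"}]}],
}
print(list(descend(data, "name")))  # ['root', 'a', 'b']
```

A JMESPath expression has to spell out each level it visits, so matching a key at an unknown depth like this is not expressible in a single query.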

1 - https://github.com/h2non/jsonpath-ng jsonpath implementation in Python

Granitosaurus avatar Nov 16 '20 10:11 Granitosaurus

I’ve added JMESPath support to a real-life project, and I must say @Granitosaurus you are completely right. The lack of the concept of parent nodes in JMESPath can be quite limiting, just as in CSS. It feels like JMESPath is to JSONpath what CSS is to XPath.

So, once this is fixed, I agree we should aim to extend support to JSONpath. Hopefully it won’t be too hard at that point.

Gallaecio avatar Feb 21 '21 16:02 Gallaecio

I think JMESPath should be supported first, because it has been actively maintained over the years and has plenty of resources and documentation, so many developers can find a way to get started. Then we can wait for a better and more robust JSON parser to appear. The two don't conflict, just as CSS doesn't conflict with XPath; parsel supports both at the same time.

EchoShoot avatar Mar 16 '22 02:03 EchoShoot