graphql-scraper

regex sub-selection

Open konsumer opened this issue 6 years ago • 4 comments

I often have to do a bunch of regexes to get the text I actually need. If I could put it in my query, that'd be even better.

Here is an example:

I have a lil query that grabs some data from the Google Play store. I pre-process the input with JS template strings in the query, and I'd like to do my post-processing in the query itself:

const details = (id, country='US', lang='en') => graphql(schema, `{
  page(url: "https://play.google.com/store/apps/details?id=${id}&hl=${lang}_${country}"){
    title: text(selector: "[itemprop='name']")
    icon: attr(selector: "[itemprop='image']", name: "src")
    developerName: text(selector: "a[href^='/store/apps/dev']")
    developerUrl: attr(selector: "a[href^='/store/apps/dev']", name: "href")
    developerId: attr(selector: "a[href^='/store/apps/dev']", name: "href", search: "/store/apps/dev[?]id=(.+)")
  }
}`)

In this example, I am pulling developerId from the same place I get developerUrl, extracting the id from the regex search. I'm not quite sure how to handle multiple matches, but it would be pretty useful, even if it just returned the 1st match:

{
  "data": {
    "page": {
      "title": "Hello Neighbor",
      "icon": "https://lh3.googleusercontent.com/r1wx-kmI9I_zxv8UIF_0_YvmhoLOx25mjT23GCO4bse6H-pgqfjZ5Tvz3HRJ0i2HdEoQ=s100",
      "developerName": "tinyBuild",
      "developerUrl": "/store/apps/dev?id=4988311280735374056",
      "developerId": "4988311280735374056"
    }
  }
}
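
A minimal sketch of the post-processing step described above, assuming a hypothetical helper named `applySearch` (not part of graphql-scraper) that a resolver could call on the already-extracted text or attribute value. It returns the first capture group of the first match, falls back to the whole match when the pattern has no groups, and returns null when nothing matches. The `?` is written as `[?]` because an unescaped `?` in a regex makes the preceding `v` optional instead of matching a literal question mark.

```javascript
// Hypothetical helper, not part of graphql-scraper: applies a regex to an
// already-extracted string and returns a single string (or null).
function applySearch(value, pattern) {
  if (value == null) return null;
  // No search argument given: leave the extracted value untouched.
  if (pattern == null) return value;
  const match = new RegExp(pattern).exec(value);
  if (match === null) return null;
  // Prefer the first capture group; fall back to the whole match.
  return match[1] !== undefined ? match[1] : match[0];
}

const href = "/store/apps/dev?id=4988311280735374056";
console.log(applySearch(href, "/store/apps/dev[?]id=(.+)"));
// prints 4988311280735374056
```

Returning only the first match keeps the field's existing String type; returning every match would need a list type instead.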

Is there interest in this? Should I make a PR?

konsumer, Oct 17 '19 04:10

Hey, sorry I'm just seeing this (been busy with a lot of non open source stuff ha) - this sounds like a really cool idea! You could have an argument called match (& possibly also test?) which returns the matched result. Probably makes sense to keep it close to the JS regex method names. If you're still keen on trying this out, I'd definitely accept a PR :)
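
For reference, here is how the two built-in JavaScript methods named above behave. This only shows standard `String.prototype.match` and `RegExp.prototype.test`; how graphql-scraper would expose them as arguments is still open, and the pattern used here is just an illustration.

```javascript
const href = "/store/apps/dev?id=4988311280735374056";
const pattern = /dev\?id=(\d+)/;

// match: returns an array (full match first, then capture groups) or null.
const matched = href.match(pattern);
const developerId = matched ? matched[1] : null;

// test: returns only a boolean.
const isDeveloperLink = pattern.test(href);

console.log(developerId, isDeveloperLink);
// prints 4988311280735374056 true
```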

lachenmayer, Jul 07 '20 11:07

> You could have an argument called match

You mean like renaming search above? Sounds good.

> (& possibly also test?)

Like as a subfield? Not sure how to make the String types return Boolean and still be compatible. I am also really trying to think about how to do multiple matches, which I think would be important for a lot of use cases. Any ideas? It's sort of the same problem (I don't want to change the signature from String to [String]).

konsumer, Jul 08 '20 02:07

Maybe I should start with just search and single returns per parent (as a param to text, attr, etc.).

Or maybe I should make a new kind of parent element, like search or match, that returns [String] and another, maybe called test, that returns Boolean.

Maybe both use-cases could be met (sort of) by something that works like queryAll/query but runs regex match on the matched html, something like this:

page(url: "https://play.google.com/store/apps/details?id=com.tinybuildgames.helloneighbor&hl=en_US"){
  developer: regex(selector: "a[href^='/store/apps/dev']") {
    id: match(regex: "/store/apps/dev[?]id=(.+)\"")
  }
}

It's not perfect, as it confusingly mixes regex and attrib-grabbing in a weird way. I'm going to keep thinking on it.
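
One way to picture the separate-parent idea is as a pair of resolver-style functions that receive the HTML already selected by the parent. This is only a sketch: `createRegexNode`, `match`, and `test` are hypothetical names, not existing graphql-scraper API, and wiring them into a GraphQL schema is left out. Here `match` collects one entry per match (a candidate for `[String]`) and `test` returns a Boolean, so neither would change the existing String signature of text or attr.

```javascript
// Hypothetical sketch: a "regex node" wrapping the HTML selected by the parent.
function createRegexNode(html) {
  return {
    // Would map to a [String] field: one entry per match.
    match({ regex }) {
      const results = [];
      for (const m of html.matchAll(new RegExp(regex, "g"))) {
        // First capture group if present, otherwise the whole match.
        results.push(m[1] !== undefined ? m[1] : m[0]);
      }
      return results;
    },
    // Would map to a Boolean field.
    test({ regex }) {
      return new RegExp(regex).test(html);
    },
  };
}

const node = createRegexNode(
  '<a href="/store/apps/dev?id=4988311280735374056">tinyBuild</a>'
);
console.log(node.match({ regex: '/store/apps/dev[?]id=([^"]+)' }));
// prints [ '4988311280735374056' ]
console.log(node.test({ regex: "tinyBuild" }));
// prints true
```

The capture group uses `[^"]+` rather than `.+` so that it stops at the closing quote of the attribute instead of running on to the last quote in the HTML.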

konsumer, Jul 08 '20 21:07

I think a completely separate type for these definitely makes sense. Let me know once you've found a schema for this that you're happy with! :)

lachenmayer, Jul 13 '20 14:07