Add an `extract` method

Open fb55 opened this issue 2 years ago • 26 comments

One common use-case for cheerio is to extract multiple values from a document and store them in an object. Doing so manually currently isn't a great experience. Several packages built on top of cheerio improve this, for example https://github.com/matthewmueller/x-ray and https://github.com/IonicaBizau/scrape-it. Commercial scraping providers also allow bulk extractions: https://www.scrapingbee.com/documentation/data-extraction/

We should add an API to make this use-case easier. The API should be along the lines of:

$.extract({
  // To get the `textContent` of an element, we can just supply a selector
  title: "title",

  // If we want to get results for more than a single element, we can use an array
  headings: ["h1, h2, h3, h4, h5, h6"],
  // If an array has more than one child, all of them will be queried for.
  // Complication: This should follow document-order.
  headings: ["h1", "h2", "h3", "h4", "h5", "h6"], // (equivalent to the above)

  // To get a property other than the `textContent`, we can pass a string to `out`. This will be passed to the `prop` method.
  links: [{ selector: "a", out: "href" }],

  // We can map over the received elements to customise the behaviour
  links: [
    {
      selector: "a",
      out(el, key, obj) {
        const $el = $(el);
        return { name: $el.text(), href: $el.prop("href") };
      },
    },
  ],

  // To get nested elements, we can pass a nested extract object. Selectors inside the nested extract object will be relative to the current scope.
  posts: [
    {
      selector: ".post",
      out: {
        title: ".title",
        body: ".body",
        link: { selector: "a :has(> .title)", out: "href" },
      },
    },
  ],

  // We can skip the selector in nested extract objects to reference the current scope.
  links: [
    {
      selector: ".post a",
      out: {
        href: { out: "href" },
        name: { out: "textContent" },
        // Equivalent — get the text content of the current scope
        name: "*",
      },
    },
  ],
});

fb55 avatar May 06 '22 12:05 fb55

I'm building something to do this right now. I dig the direction you're going. Some ideas...

What if an array meant that an array should be returned? In addition to being more ergonomic, I think it will also ensure that the result's type is properly inferred. Also, I think there's a pretty easy way to allow objects to mean objects, even when there are config objects in the mix...

const res = $.extract({
  singleStr: 'h1', // throws if more than one element selected
  arrayOfStr: ['h1, h2'], // uses multiple selectors
  tupleOfStr: ['h1', 'h2'], // literally a tuple, throws if either selector returns more than one item
  arrayOfObj: [{
    // an object is a config if it exactly matches config type, otherwise object return is expected
    int: {selector: '#length'},
  }],
});

That object idea does leave room for ambiguity, and it will be a bit annoying to type. What about support for nested $.extract()? Also, I really like your ideas on scoped sub-selectors ({out: {prop, prop, prop}}), so what about $.extract('#scope', {})?

const res = $.extract({
  // nested object
  meta: $.extract('#scope', {
    deep: $.extract({}),
  }),
});

Regarding scoping for performance, do you think there's major gains to be had from scanning the entire extract tree to optimize all the selectors automatically? In my scraping, I trawl every bit of the dom for data redundancy, but the selectors are grouped by the desired bit of data, not their position in the dom. I often wonder if I'm missing out, but I haven't had a chance to test it.


Most scraping also includes data processing. How about first-class support for funcs? Here too, you can infer types, including for more than just strings...

const res = $.extract({
  singleStr: ($) => $('h1').text().trim(),
  arrayOfStr: [
    ($) => $('h1, h2').toArray().map((el) => $(el).text().trim()),
  ],
  arrayOfObj: [{
    int: ($) => parseInt($('#length').text(), 10) || 0,
  }],
});

The last thing I'll comment on is out. I think it might be a little overloaded. Maybe better would be...

type SelectConfig = {
  selector?: string;
  // XOR these...
  parse?: <T>($: CheerioAPI) => T;
  content?: 'text' | 'html';
  prop?: keyof HTMLElement;
  attr?: string;
  data?: string; 
  style?: string;
}

Anyway, really cool you're thinking about this direction! This really must be a huge portion of what Cheerio users are doing.

mikestopcontinues avatar May 15 '22 13:05 mikestopcontinues

Thanks for the feedback! See some responses below.


> What if an array meant that an array should be returned?

That's the idea!

> tupleOfStr: ['h1', 'h2'], // literally a tuple, throws if either selector returns more than one item

An individual selector should stand for the first match; I've added a limit option to cheerio-select that will enable us to implement this: https://github.com/cheeriojs/cheerio-select/pull/307

The idea with multiple array elements was to allow users to extract different properties, e.g.:

$.extract({
	titles: [
		// The document's `<title>` tag. Will use the `textContent`.
		'title',
		// The Open Graph `title` property. Will use the `content` attribute.
		{ selector: 'meta[property="og:title"]', out: 'content' }
	]
})

Ideally, there should still be a way to limit the number of elements retrieved. That way, we could support use-cases such as https://github.com/microlinkhq/metascraper/blob/b3379a9300ad1ed6de155592866b1e555e1f5382/packages/metascraper-title/index.js

> what about $.extract('#scope', {})

I tried to model this by allowing `out` to be an object. From the example above:

$.extract({
  posts: [
    {
      selector: ".post",
      out: {
        title: ".title",
        body: ".body",
        link: { selector: "a :has(> .title)", out: "href" },
      },
    },
  ],
})

This extracts the title, body and link for every post; all the nested selectors are relative to .post.

> do you think there's major gains to be had from scanning the entire extract tree to optimize all the selectors automatically?

Yes, although this is quite complicated to do and won't be a part of the initial version of this.

> How about first-class support for funcs?

I tried to achieve this by allowing functions for the out property. It might make sense to allow functions for the selector as well.

> The last thing I'll comment on is out. I think it might be a little overloaded.

There are currently three different values: (1) a string that will be passed to Cheerio's prop method, (2) an object that will be used as a nested object, and (3) a function that will be called with the object.

If we didn't overload the object, the alternative would be runtime errors for users that don't use TS. Removing that potential issue seems worth the added complexity.
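To make the three cases concrete, here is a minimal sketch of how a runtime check could tell them apart. It is only an illustration of the idea described above, not cheerio's implementation; `OutFn`, `ExtractMap`, `OutValue`, `OutKind` and `classifyOut` are hypothetical names made up for this example.

```typescript
// Hypothetical names throughout: none of these types or functions are part
// of cheerio's API. They only model the three meanings of `out`.
type OutFn = (el: unknown, key: string, obj: Record<string, unknown>) => unknown;

interface ExtractMap {
  [key: string]: unknown;
}

type OutValue = string | ExtractMap | OutFn;
type OutKind = "prop" | "nested" | "function";

// Decide at runtime which of the three meanings of `out` applies.
function classifyOut(out: OutValue): OutKind {
  if (typeof out === "string") return "prop"; // (1) name handed to `prop`
  if (typeof out === "function") return "function"; // (3) user callback
  return "nested"; // (2) nested extract object
}
```

Because each case is distinguishable with `typeof`, a plain JavaScript caller never has to pick between separate option names, which is the trade-off argued for above.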

As for using prop: This is a neat way of allowing the most common extractions. It supports attributes, serialisation types (innerHTML, outerHTML, textContent, innerText), and it is able to resolve links (as of #2510).

fb55 avatar May 16 '22 09:05 fb55

> $.extract({
> 	titles: [
> 		// The document's `<title>` tag. Will use the `textContent`.
> 		'title',
> 		// The Open Graph `title` property. Will use the `content` attribute.
> 		{ selector: 'meta[property="og:title"]', out: 'content' }
> 	]
> })

Just to clarify, I like this solution. Two selectors within an array meaning two values. I'm not sure if it's just me, but I still read your initial spec to mean that ['h1, h2'] === ['h1', 'h2'].

> Ideally, there should still be a way to limit the number of elements retrieved. That way, we could support use-cases such as https://github.com/microlinkhq/metascraper/blob/b3379a9300ad1ed6de155592866b1e555e1f5382/packages/metascraper-title/index.js

FWIW, this is exactly how I scrape data now. No knowing when Amazon is going to change their DOM, so I have a bunch of selectors for each bit of data, plus a test that picks the best match. I know .extract won't go that far, but I figure it's worth raising a use-case.
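A rough sketch of that fallback pattern, simplified to "first candidate that passes a check" rather than a real best-match test. It assumes the candidate values were already pulled out by separate selectors, listed in priority order; `pickFirstValid` is a hypothetical helper and not part of cheerio.

```typescript
// Hypothetical helper, not part of cheerio: given values already extracted by
// several candidate selectors (in priority order), return the first trimmed
// value that passes a caller-supplied check.
function pickFirstValid(
  candidates: ReadonlyArray<string | undefined>,
  isValid: (value: string) => boolean = (value) => value.length > 0,
): string | undefined {
  for (const candidate of candidates) {
    const value = candidate?.trim();
    if (value !== undefined && isValid(value)) return value;
  }
  // No candidate passed the check.
  return undefined;
}
```

Passing a stricter `isValid` (for example, one that only accepts digits for a price field) is one way to approximate the per-field test mentioned above.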

> what about $.extract('#scope', {})

The one thing about {selector, out: {title, body, link}} is that it requires all nested objects to have scope. Unless selector is optional, of course. Given performance considerations, there's value to the extract config paralleling the DOM structure. But I guess the question is if the API should push in that direction.

I'm still inclined to want to keep all my selectors grouped by the data they return (a la the above example) because it makes it much easier to process in the next step. (Otherwise I need to maintain two mappings, rather than just one.)

> I tried to achieve this by allowing functions for the out property. It might make sense to allow functions for the selector as well.

I don't think there'd be any benefit if you still had to nest the function. A string selector with an out func accomplishes the same thing. I was just thinking about streamlining the interface a bit. It's okay either way.

> There are currently three different values: (1) a string that will be passed to Cheerio's prop method, (2) an object that will be used as a nested object, and (3) a function that will be called with the object.

I just took a look at the prop API. It's much cooler than I'd realized. Maybe the only thing I'll suggest then is that the prop name be changed from out to value. Semantically, value feels like a better fit for something that holds prop/parsing functionality.

mikestopcontinues avatar May 16 '22 21:05 mikestopcontinues

Hi and thanks for the great library!

I noticed the extract method in the docs https://cheerio.js.org/interfaces/CheerioAPI.html#extract and in the tests https://github.com/cheeriojs/cheerio/blob/dec7cdc9ad21a1fc5667a2ed015aba9ee3b47e5f/src/api/extract.spec.ts, but when I try to use it:

const $ = cheerio.load('<div>hello</div>')
$.extract({
  div: 'div',
})

cheerio blows up with

$.extract({
  ^

TypeError: $.extract is not a function

I'm using version 1.0.0-rc.12.

mvasin avatar Dec 15 '22 22:12 mvasin

This was just merged and a new release hasn't been issued yet. I'm working through my list for remaining changes, so this hopefully won't take long.
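Until a release that includes the method is installed, code can check for it before calling it. The following is a minimal sketch of such a check; `hasExtract` is a hypothetical helper name, and the check is purely structural (it only looks for a function-valued `extract` property on whatever is passed in).

```typescript
// Hypothetical helper, not part of cheerio: returns true when the given value
// (for example the `$` returned by `cheerio.load`) exposes a callable `extract`.
function hasExtract(
  api: unknown,
): api is { extract: (map: object) => unknown } {
  const isObjectLike =
    typeof api === "function" || (typeof api === "object" && api !== null);
  return (
    isObjectLike &&
    typeof (api as { extract?: unknown }).extract === "function"
  );
}
```

With this guard, a script can call `$.extract(...)` when `hasExtract($)` is true and fall back to manual `$(selector).text()` calls otherwise, instead of failing with the `TypeError` reported above.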

fb55 avatar Dec 15 '22 22:12 fb55

Hi, it looks like this isn't released yet. Any timing updates?

b6t avatar Feb 25 '23 02:02 b6t

image

Appears to not match documentation.

sroussey avatar Feb 28 '23 17:02 sroussey

Any estimation for the new release?

Carleslc avatar Mar 10 '23 14:03 Carleslc

What happened to this feature? It's exactly what I needed and seems to be documented, but it doesn't seem to be available.

anthonycmain avatar Mar 21 '23 20:03 anthonycmain

Liked what's been discussed here.

Needed this and grew tired of waiting for Cheerio, so I just published my implementation of these ideas + own takes: https://www.npmjs.com/package/cheerio-json-mapper

Might be useful to others as well.

denkan avatar Mar 28 '23 15:03 denkan

> Liked what's been discussed here.
>
> Needed this and grew tired of waiting for Cheerio, so I just published my implementation of these ideas + own takes: https://www.npmjs.com/package/cheerio-json-mapper
>
> Might be useful to others as well.

Thanks for writing this and sharing, @denkan. I've been playing with it this evening and it's exactly what I need. I will report any bugs I find in your own GitHub repo.

anthonycmain avatar Mar 29 '23 20:03 anthonycmain

Any update? It's still in the docs but not in the code.

quentinlamamy avatar Aug 20 '23 20:08 quentinlamamy

> This was just merged and a new release hasn't been issued yet. I'm working through my list for remaining changes, so this hopefully won't take long.

Life... am I right? Great work so far on this, y'all. Much needed library for sure. Keep up the good work.

archae0pteryx avatar Sep 04 '23 15:09 archae0pteryx

This should not be documented in the user guide, if it's not actually released yet: https://cheerio.js.org/docs/advanced/extract

dRoskar avatar Sep 14 '23 09:09 dRoskar

> This should not be documented in the user guide, if it's not actually released yet: https://cheerio.js.org/docs/advanced/extract

Since the website is also here in this repo, perhaps it would be better to have each release with a corresponding tag. And only the latest released version of the website (with relevant docs) would get actually deployed to the web. Just suggesting. But otherwise cheerio looks solid. Thanks to the contributors!

ivanakcheurov avatar Sep 14 '23 13:09 ivanakcheurov

Ugh, why is this feature documented if it's not actually released yet? 😢

adamreisnz avatar Sep 30 '23 03:09 adamreisnz

Super confusing and time consuming to read docs added by this commit https://github.com/cheeriojs/cheerio/pull/2950/commits/976b087d3ba8ed2f5b3beea6fffc83a003195f21 for a proposed feature with no apparent implementation work evident in the repo. A new user like me, while not wanting to be mistaken for an ungrateful or entitled whiner, is left wondering if this kind of thing is representative of what I should expect from the rest of cheerio or if this is a rare exception.

christo avatar Nov 09 '23 07:11 christo

@christo https://github.com/cheeriojs/cheerio/blob/main/src/api/extract.ts

fb55 avatar Nov 09 '23 18:11 fb55

Remove it from the docs if it's not in the latest release.

bluescorpian avatar Nov 20 '23 11:11 bluescorpian

Where is the extract function? There is none on Root 😭

piscopancer avatar Jan 10 '24 08:01 piscopancer

May 2024, still not implemented and still on the docs? Or why am I getting TypeError: $.extract is not a function? Very confusing!

sebagr avatar May 22 '24 12:05 sebagr