[metascraper-amazon] Image selector matches incorrect image

Open agchou opened this issue 7 years ago • 20 comments

I'm running into an issue where the image value returned by metascraper-amazon is not the main product image. There are actually multiple .a-dynamic-image classes on the page, as seen in the attached screenshot. Can we create some rules that take priority, like wrapUrl($ => $('#landingImage').attr('src')) or wrapUrl($ => $('.a-dynamic-image').first().attr('src'))?

[Screenshot: screen shot 2018-01-17 at 8 53 27 pm]
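
A rough sketch of what prioritized rules could look like, assuming the rules are tried in order and the first non-empty value wins. `firstMatch` and `imageRules` are hypothetical names used only for illustration; in the real package each rule would be wrapped in `wrapUrl` and would receive a cheerio instance as `$`.

```javascript
// Hypothetical helper: try each rule in order and return the first non-empty string.
const firstMatch = (rules, $) => {
  for (const rule of rules) {
    const value = rule($);
    if (typeof value === 'string' && value.trim() !== '') return value;
  }
  return undefined;
};

// Candidate image rules, most specific first (both selectors come from this issue).
const imageRules = [
  $ => $('#landingImage').attr('src'),
  $ => $('.a-dynamic-image').first().attr('src')
];
```

With this ordering, `#landingImage` is preferred whenever it exists, and the first `.a-dynamic-image` is only used as a fallback.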

agchou avatar Jan 18 '18 04:01 agchou

yeah, of course, just add the right rule here: https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-amazon/index.js#L51

Can you specify the URL so we can create a unit test?

Kikobeats avatar Jan 18 '18 08:01 Kikobeats

Hey @agchou, I think you created your own package to support this new custom rule.

Can you share it with us? I want to improve this in the metascraper-amazon package 😄

Kikobeats avatar Jan 22 '18 10:01 Kikobeats

Happy to accept improvements to metascraper-amazon. I'm going to close the issue since it's old. If the package looks outdated to you, please ping me!

Kikobeats avatar Jul 27 '19 06:07 Kikobeats

Yeah @Kikobeats, I get just a tiny 1-pixel image every time. How do we go about fixing this? Can the rule be overridden?

swolidity avatar Aug 10 '19 08:08 swolidity

@andyk2177 we need to add a specific rule to cover that case.

Please share the URL that is causing this behavior.

We can add a code guard to ignore images smaller than N pixels.
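
A minimal sketch of such a guard, assuming each candidate image comes with its `width` and `height` (for example, read from the element's attributes). `MIN_PIXELS`, `isLargeEnough` and `pickImage` are hypothetical names, and the threshold of 50 is an arbitrary placeholder for N.

```javascript
// Assumed threshold standing in for "N pixels"; tune it per use case.
const MIN_PIXELS = 50;

// Hypothetical check: both sides must reach the minimum.
// Candidates with missing or non-numeric dimensions fail the check.
const isLargeEnough = ({ width, height }, minPixels = MIN_PIXELS) =>
  Number(width) >= minPixels && Number(height) >= minPixels;

// Return the src of the first candidate that passes the guard, if any.
const pickImage = (candidates, minPixels = MIN_PIXELS) => {
  const match = candidates.find(c => isLargeEnough(c, minPixels));
  return match ? match.src : undefined;
};
```

A 1x1 tracking pixel would be skipped by this check, so the next candidate in the list gets a chance.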

Kikobeats avatar Aug 10 '19 09:08 Kikobeats

Well, it seems to happen with any Amazon link for me, but here is an example: https://www.amazon.com/JNH-Lifestyles-Canadian-Hemlock-Infrared/dp/B00F2Y5B6W?tag=profiledotim-20

swolidity avatar Aug 10 '19 09:08 swolidity

I think we might just need a more specific class name to grab?

swolidity avatar Aug 10 '19 09:08 swolidity

@andyk2177 yes, you're right. The problem is that Amazon has a lot of different product views, so we need to set up the rules in a way that maximizes our chances of getting the proper image.

Can you make a PR? All you need to do is add the specific image selector here.

Kikobeats avatar Aug 10 '19 09:08 Kikobeats

OK, sure. Why are there two selectors, though? Which one is prioritized? For example, with my URL above I get this image back, but the page does have a data-old-hires attribute, so I'm not sure why that one wasn't prioritized.

swolidity avatar Aug 10 '19 09:08 swolidity

The best way to determine that is to add a test for every link and make sure the output is what you expect.

Kikobeats avatar Aug 10 '19 09:08 Kikobeats

I'm getting "robot check" for every Amazon product link I've tried. Anyone else seeing this?

Example URL: https://www.amazon.com/dp/B07SY4C5QF/ref=cm_sw_r_tw_apa_i_2qJLDbGGS3H0Q

bobber205 avatar Oct 03 '19 17:10 bobber205

@bobber205 it's probably because your User-Agent header looks automated, as if it's coming from a script (it is), but you should be able to set it to anything you want. I'm setting it to a browser's like this:

try {
  // Forward the incoming request's User-Agent so the fetch looks like a browser.
  const { body: html } = await got(url, {
    headers: {
      "User-Agent": req.headers["user-agent"]
    }
  });
  data = await metascraper({ url, html });
  statusCode = 200;
} catch (err) {
  statusCode = 401;
  data = {
    message: `Scraping the open graph data from "${url}" failed.`,
    suggestion:
      "Make sure your URL is correct and the webpage has open graph data, meta tags or twitter card data."
  };
}

swolidity avatar Oct 03 '19 18:10 swolidity

@bobber205

What kind of data are you interested in?

It looks like almost all the data is there using the Microlink API:

https://api.microlink.io/?url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FB07SY4C5QF

Kikobeats avatar Oct 03 '19 18:10 Kikobeats

Good advice on setting the user agent!

I've set it to

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36

That's what Google says is the latest User-Agent for Chrome. I don't see "Robot Check" anymore, but I do get https://fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:144-1080801-5689911:ADD1XNCC3BW7K9PR531T$uedata=s:%2Fdp%2FB07SY4C5QF%2Fref%3Dcm_sw_r_tw_apa_i_2qJLDbGGS3H0Q%2Fuedata%2Fnvp%2Funsticky%2F144-1080801-5689911%2FNoPageType%2Fntpoffrw%3Fstaticb%26id%3DADD1XNCC3BW7K9PR531T%26pty%3DDetail%26spty%3DGlance%26pti%3DB0798MSV1F:1000 (a large black image) for the image. :(
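
One possible workaround, sketched under the assumption that the `fls-na.amazon.com` URL above is a logging or beacon endpoint rather than a product image: reject image URLs on such hosts before accepting them. `BLOCKED_HOST_PREFIXES` and `isLikelyProductImage` are hypothetical names, and the `fls-` prefix is a guess based only on this one URL.

```javascript
// Assumption: hosts starting with "fls-" do not serve product images.
const BLOCKED_HOST_PREFIXES = ['fls-'];

// Hypothetical check: true when the URL is absolute and its host is not blocked.
const isLikelyProductImage = imageUrl => {
  let hostname;
  try {
    hostname = new URL(imageUrl).hostname;
  } catch (err) {
    return false; // not a parseable absolute URL
  }
  return !BLOCKED_HOST_PREFIXES.some(prefix => hostname.startsWith(prefix));
};
```

A rule could run this check on its result and return nothing when it fails, letting a later rule supply the image instead.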

bobber205 avatar Oct 03 '19 19:10 bobber205

@Kikobeats I'm mostly looking for the image. The rest is coming through great now that I've set the user agent.

bobber205 avatar Oct 03 '19 20:10 bobber205

Hey! So, I'm getting everything required except the product's image from this amazon URL.

Looked at the selectors that are used by metascraper; those parts exist in the html but seem empty. The actual image that should be extracted doesn't have a class or an id. It can be found within a div that has the "digitalMusicProductImage_feature_div" id.
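
For illustration, a rule for this case might look something like the sketch below. `digitalMusicImage` is a hypothetical name, `$` is assumed to be a cheerio instance, and the assumption that the first `img` inside that div is the cover image has only been checked against this one page.

```javascript
// Hypothetical rule: first <img> inside the container named above.
const digitalMusicImage = $ =>
  $('#digitalMusicProductImage_feature_div img').first().attr('src');
```

In the package it would presumably be wrapped in `wrapUrl` like the existing image rules.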

Example URL: https://www.amazon.de/Vienna-Bolling-Project-»Classic-Jazz«/dp/B003604LHE

Is there anything that can be done about this, @Kikobeats?

Thanks!

pdesmarais avatar Oct 28 '19 17:10 pdesmarais

@pdesmarais perhaps https://microlink.io/docs/mql/getting-started/overview can help us out here?

swolidity avatar Oct 29 '19 16:10 swolidity

@pdesmarais Have you tried setting the useProxy init variable to true?

bobber205 avatar Oct 29 '19 16:10 bobber205

@bobber205 where do you set that? I don't see it in the docs.

swolidity avatar Oct 29 '19 16:10 swolidity

Ah, I was confusing this with the opengraph paid product. Sorry :(

bobber205 avatar Oct 29 '19 17:10 bobber205