metascraper
[metascraper-amazon] Image selector matches incorrect image
I'm running into an issue where the image value returned by metascraper-amazon is not the main product image. There are actually multiple .a-dynamic-image elements on the page, as seen in the attached photo. Can we create some rules with priority over this, like wrapUrl($ => $('#landingImage').attr('src')) or wrapUrl($ => $('.a-dynamic-image').first().attr('src'))?

yeah, of course, just add the right rule here: https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-amazon/index.js#L51
Can you specify the URL for creating a unit test?
Hey @agchou, I think you created your own package to support this new custom rule.
Can you share with us? I want to improve this in the metascraper-amazon package 😄
Happy to accept improvements to metascraper-amazon. I'm going to close the issue since it's old. If the package looks outdated to you, please ping me!
Yea @Kikobeats I get just a tiny 1 pixel image every time. How do we go about fixing this? Can the rule be overridden?
@andyk2177 we need to add a specific rule to handle that case.
Please, share the URL that is causing this behavior.
We can add a code guard to ignore images smaller than N pixels.
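A minimal sketch of such a size guard, assuming the candidate images have already been collected together with their declared dimensions. The `pickImage` helper and the `{ url, width, height }` shape are hypothetical and not part of metascraper or metascraper-amazon:

```javascript
// Hypothetical helper: not part of metascraper or metascraper-amazon.
// Each candidate is { url, width, height }; candidates with missing or
// non-numeric dimensions are rejected, because NaN >= minPixels is false.
function pickImage(candidates, minPixels = 50) {
  const bigEnough = candidates.filter(
    ({ width, height }) =>
      Number(width) >= minPixels && Number(height) >= minPixels
  );
  // Keep the original order so earlier (higher-priority) selectors still win.
  return bigEnough.length > 0 ? bigEnough[0].url : undefined;
}
```

With a 1x1 pixel listed first and a 500x500 product photo second, this returns the product photo; if only the 1x1 pixel is present, it returns undefined, so a later rule can take over.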
Well, seems to be any Amazon link for me that is doing it but here is an example https://www.amazon.com/JNH-Lifestyles-Canadian-Hemlock-Infrared/dp/B00F2Y5B6W?tag=profiledotim-20
I think we might just need a more specific class name to grab maybe?
@andyk2177 yes, you're right, the problem is that Amazon has a lot of different product views; we need to set up the rules in a way that maximizes the chance of getting the proper image.
Can you make a PR? All you need to do is add the specific image selector here.
Ok sure, why are there two selectors though? Which one is prioritized? So for example with my url above I get this image back but the page does have a data-old-hires attribute so not sure why that one wasn't prioritized?
The best way to determine that is to add a test for every link and make sure the output is what you expect.
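On the priority question, a sketch of how ordered rules could be checked. The selectors are the ones mentioned in this thread, but their order here is an assumption, and `firstMatch` is a hypothetical runner written only to illustrate "first non-empty value wins"; it is not metascraper's own code:

```javascript
// Rules written as functions that receive { htmlDom, url }.
// The selectors come from this thread; the order is an assumption.
const imageRules = [
  ({ htmlDom: $ }) => $("#landingImage").attr("data-old-hires"),
  ({ htmlDom: $ }) => $("#landingImage").attr("src"),
  ({ htmlDom: $ }) => $(".a-dynamic-image").first().attr("src")
];

// Hypothetical runner: returns the first truthy value, trying rules in order.
function firstMatch(rules, context) {
  for (const rule of rules) {
    const value = rule(context);
    if (value) return value;
  }
  return undefined;
}
```

In a per-link test, `htmlDom` would be the loaded page (for example a cheerio instance), and the assertion would compare `firstMatch(imageRules, { htmlDom, url })` against the image expected for that link. An empty data-old-hires attribute falls through to the next rule because an empty string is falsy.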
Getting "robot check" on every link I've tried for an Amazon product -- anyone else seeing this?
Example URL: https://www.amazon.com/dp/B07SY4C5QF/ref=cm_sw_r_tw_apa_i_2qJLDbGGS3H0Q
@bobber205 it's probably because your User-Agent header looks like it is automated and coming from a script (it is), but you should be able to set it to anything you want. I'm setting it to a browser one like this:
try {
  const { body: html } = await got(url, {
    headers: {
      // Forward the caller's User-Agent so the request looks like it comes from a browser.
      "User-Agent": req.headers["user-agent"]
    }
  });
  data = await metascraper({ url, html });
  statusCode = 200;
} catch (err) {
  statusCode = 401;
  data = {
    message: `Scraping the open graph data from "${url}" failed.`,
    suggestion:
      "Make sure your URL is correct and the webpage has open graph data, meta tags or twitter card data."
  };
}
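It may also help to detect the block page before handing the HTML to metascraper. A small sketch, assuming the block page can be recognised by "Robot Check" appearing in its title, as reported above; `looksLikeRobotCheck` is a hypothetical helper and the title text is not a documented contract:

```javascript
// Hypothetical helper: returns true when the page title contains
// "robot check" (case-insensitive). Assumes the block page is
// identifiable by its <title>, which is only what this thread reports.
function looksLikeRobotCheck(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match !== null && /robot check/i.test(match[1]);
}
```

Calling it on the fetched `html` before the metascraper call makes it possible to report "blocked by Amazon" separately from "no metadata found".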
@bobber205
What kind of data are you interested in?
It looks like almost all the data is there using the Microlink API
https://api.microlink.io/?url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FB07SY4C5QF
Good advice on setting the user agent!
I've set it to
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
That's what Google says is the latest User Agent for Chrome. I don't see "Robot Check" anymore, but I do get https://fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:144-1080801-5689911:ADD1XNCC3BW7K9PR531T$uedata=s:%2Fdp%2FB07SY4C5QF%2Fref%3Dcm_sw_r_tw_apa_i_2qJLDbGGS3H0Q%2Fuedata%2Fnvp%2Funsticky%2F144-1080801-5689911%2FNoPageType%2Fntpoffrw%3Fstaticb%26id%3DADD1XNCC3BW7K9PR531T%26pty%3DDetail%26spty%3DGlance%26pti%3DB0798MSV1F:1000 (a large black image) for the image. :(
@Kikobeats I'm looking for the image mostly. The rest is coming through great once I've set the user agent
Hey! So, I'm getting everything required except the product's image from this amazon URL.
Looked at the selectors that are used by metascraper; those parts exist in the html but seem empty. The actual image that should be extracted doesn't have a class or an id. It can be found within a div that has the "digitalMusicProductImage_feature_div" id.
Example URL: https://www.amazon.de/Vienna-Bolling-Project-»Classic-Jazz«/dp/B003604LHE
Is there anything to do with this @Kikobeats ?
Thanks!
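For this case, a rule scoped to the container could look like the sketch below. The div id is the one quoted above; the assumption is that the first img inside it is the cover image, which has only been observed on this one page:

```javascript
// Rule as a function receiving { htmlDom, url }; the id comes from the
// page described above. Picking the first <img> inside it is an assumption.
const digitalMusicImage = ({ htmlDom: $ }) =>
  $("#digitalMusicProductImage_feature_div img").first().attr("src");
```

It would sit alongside the existing image selectors, ideally with the example URL added as a test case.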
@pdesmarais perhaps https://microlink.io/docs/mql/getting-started/overview can help us out here?
@pdesmarais Have you tried setting the useProxy init variable to true?
@bobber205 where do you set that? I don't see it in the docs?
ah I was confusing this with the opengraph paid product. Sorry :(