opensea-scraper
[BUG] `offersByScrolling()` and `offersByScrollingByUrl()` not properly working
I noticed that the functions offersByScrolling() and offersByScrollingByUrl() are not working properly. Most of the offers are not scraped (approximately 75% of them are skipped for some reason and never saved). This also leaves the function stuck for a long time, since reaching the desired number of offers takes much longer when 75% of them are dropped.
If anyone experiences this too and relies on this function, please comment below so I know it's urgent 📝
This is exactly the problem I'm running into right now. Should I close the other issue? I haven't found a fix so far, but I'll also keep looking into this. Also, it still doesn't occur when choosing "total_volume" instead of the other options.
Oh yeah you're right, somehow I didn't realize this is the same bug you reported; I just randomly noticed it during testing. Closing the other issue #34 as it's the same.
@SKreutz do you need to scrape multiple pages or are the first 100 sufficient? Because there is a way of getting the top 100 elements without scrolling, just run this script:
const nextDataStr = document.getElementById("__NEXT_DATA__").innerText;
const nextData = JSON.parse(nextDataStr);
const top100 = nextData.props.relayCache[0][1].json.data.rankings.edges.map(obj => obj.node);
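If you want to run it outside the browser console, a rough wrapper via puppeteer's page.evaluate could look like this (just a sketch, not part of the library; Cloudflare may still block a headless browser):
// Sketch only: run the snippet above inside the page via puppeteer
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://opensea.io/rankings?sortBy=one_day_volume");
  // run the extraction in the page context
  const top100 = await page.evaluate(() => {
    const nextDataStr = document.getElementById("__NEXT_DATA__").innerText;
    const nextData = JSON.parse(nextDataStr);
    return nextData.props.relayCache[0][1].json.data.rankings.edges.map(obj => obj.node);
  });
  console.log(top100.length); // should print up to 100
  await browser.close();
})();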
This is way faster and more efficient than scrolling and scraping the data from the DOM. I will integrate this in the repository soon and add the following functions:
OpenseaScraper.rankings("24h"); // https://opensea.io/rankings?sortBy=one_day_volume
OpenseaScraper.rankings("7d"); // https://opensea.io/rankings?sortBy=seven_day_volume
OpenseaScraper.rankings("30d"); // https://opensea.io/rankings?sortBy=thirty_day_volume
OpenseaScraper.rankings("total"); // https://opensea.io/rankings?sortBy=total_volume
// ❌ currently not working: scrape more than 100 items from rankings page
OpenseaScraper.rankingsByScrolling();
@dcts I only want to scrape the first 100 slugs, yes. Where do I put the 3 lines of code you provided? Thank you for your help, I really appreciate it!
@SKreutz I added this new method and updated the repository, just update to the latest version 6.0.0 and then you can do:
// scrape all slugs, names and ranks from the top collections from the rankings page
// "type" is one of the following:
// "24h": ranking of last 24 hours: https://opensea.io/rankings?sortBy=one_day_volume
// "7d": ranking of last 7 days: https://opensea.io/rankings?sortBy=seven_day_volume
// "30d": ranking of last 30 days: https://opensea.io/rankings?sortBy=thirty_day_volume
// "total": scrapes all time ranking: https://opensea.io/rankings?sortBy=total_volume
const type = "24h"; // possible values: "24h", "7d", "30d", "total"
const ranking = await OpenseaScraper.rankings(type, options);
@dcts your fix seems to work fine! Really appreciate your help. It's even a lot faster than before. This bug can be closed.
How come the issue has been closed? Has the OpenseaScraper.offersByScrolling() method been fixed?
It seems to me that the issue first expressed in this ticket is still happening, but you found a workaround for the rankings case. Is there something I am not interpreting correctly?
Not sure if it is the same issue, but when running our script locally we get "stats":{"totalOffers":416} even though the offers field only contains 410 elements after calling scraper.offersByScrolling. In production on GCP, we get an empty result that looks like
offers: []
stats: {}
Something is definitely wrong with this method... What can we do to help investigate the issue?
@mlarcher I just checked and yes, you are absolutely right, the issue was never resolved. Thanks for reporting!
I need to take a closer look at the code; something happened that broke it.
I just tried to reproduce the issue. When I check, for example, "slotienft", which currently has 390 items on "buy now", the offers method works fine:
=== actions ===
new page created
opening url https://opensea.io/collection/slotienft?search[sortAscending]=true&search[sortBy]=PRICE&search[toggles][0]=BUY_NOW
🚧 waiting for cloudflare to resolve...
extracting wired variable
closing browser...
extracting offers and stats from wired variable
total Offers: 390
top 3 Offers [
  {
    name: 'Slotie #4606',
    tokenId: '4606',
    displayImageUrl: 'https://lh3.googleusercontent.com/6YxBtVI9cA4Y2kEMujrGodnXk55lEiJXRCdLDnGbwQRmpBI26Va7_BU7tmBvWYJz1YQz1lwGRuCZP_UtKHndL14Zj4qXwpy-Jfc8',
    assetContract: '0x5fdb2b0c56afa73b8ca2228e6ab92be90325961d',
    offerUrl: 'https://opensea.io/assets/0x5fdb2b0c56afa73b8ca2228e6ab92be90325961d/4606',
    floorPrice: { amount: 0.685, currency: 'ETH' }
  }
  ...
Scraping offers by scrolling also works fine for me.
✅ === OpenseaScraper.offersByScrolling(slug, 40) ===
=== scraping started ===
Scraping Opensea URL: https://opensea.io/collection/slotienft?search[sortAscending]=true&search[sortBy]=PRICE&search[toggles][0]=BUY_NOW

=== options ===
debug : false
logs : true
browserInstance: default

=== actions ===
new page created
🚧 waiting for cloudflare to resolve
expose all helper functions
scrape offers until target resultsize reached or bottom of page reached
closing browser...
total Offers: 390
all scraped offers (max 40): [
I also tried different collections. Everything works fine for me. I am using macOS Monterey 12.0.1 and Node v16.13.1, and I just downloaded the latest version of opensea-scraper.
Let me know if you need further information.
Here's what I get:
server_1 | 2022-03-17T22:00:43.174Z debug: Start scraping prices
server_1 | === scraping started ===
server_1 | Scraping Opensea URL: https://opensea.io/collection/chumbivalleyofficial?search[sortAscending]=true&search[sortBy]=PRICE&search[toggles][0]=BUY_NOW
server_1 |
server_1 | === options ===
server_1 | debug : false
server_1 | logs : true
server_1 | browserInstance: default
server_1 |
server_1 | === actions ===
server_1 | new page created
server_1 | 🚧 waiting for cloudflare to resolve
server_1 | expose all helper functions
server_1 | scrape offers until target resultsize reached or bottom of page reached
server_1 | closing browser...
server_1 | 2022-03-17T22:11:17.853Z debug: Prices scraping done [{"foundOffersCount":408,"stats":{"totalOffers":412}}]
I'm on macOS Monterey 12.3, in a docker container running node:16.14.0-alpine3.14.
@mlarcher I published a fix, can you test and let me know if it works now? Be sure to use version 6.0.2 :)
@SKreutz thanks for testing! I think it might have looked like everything works on your end, but in fact a lot of the offers were missing when using the offersByScrolling method. The bug was that 80% of the offers were skipped; only ~20% got scraped. This is particularly bad because sometimes it might seem that everything works when it actually does not, and other times it just breaks.
But now it should be fixed, at least the demo is working again (for me) with all relevant offers scraped. You can test it with
npm run demo
@dcts it's @SKreutz who said "Scraping offers by scrolling also works fine for me", not me...
I just tested the 6.0.2 version and got [{"foundOffersCount":412,"stats":{"totalOffers":413}}], so one offer is still missing in the offers array. I'm running it a second time to be sure, but I see 413 on opensea right now, so there's probably still something going on.
The second run got me [{"foundOffersCount":405,"stats":{"totalOffers":413}}], so we're not good yet :/
Also, is there any chance it works on GCP with the current version, or is it an unrelated problem that I get empty results in production?
@mlarcher can you post what collection you scraped that got you these results?
here it is @dcts :
server_1 | Scraping Opensea URL: https://opensea.io/collection/chumbivalleyofficial?search[sortAscending]=true&search[sortBy]=PRICE&search[toggles][0]=BUY_NOW
When I run the following:
const res = await OpenseaScraper.offersByScrolling("chumbivalleyofficial", 40, options);
I get correct results; in fact, they are identical to running OpenseaScraper.offers("chumbivalleyofficial", options).
Can you try to run it locally (not on GCP)?
Also, is there any chance it works on GCP with the current version, or is it an unrelated problem that I get empty results in production?
To answer your question: yes, it's an unrelated problem that has nothing to do with the scraper, but with the environment. Cloud setups for scraping are always difficult because you don't have full control over the environment, IPs, etc. Also, services like Cloudflare can detect a cloud environment (through IP lists) and handle it differently (block it). See issues #40 #39. In case I find a solution for the cloud I will certainly share it, but as of now I don't plan to work on that. I encourage everybody to share working cloud setups though, because it is a common need that a lot of people would certainly appreciate.
@dcts thanks for the information.
GCP is not at issue here, as we get absolutely no result at all there (even though it used to work at some point). I'll check if I can do anything to change the script's external IP.
The results I posted are from a docker container on my machine.
Your test got me thinking, and I tried directly on the host machine with no docker container involved and got the same issue: [{"foundOffersCount":419,"stats":{"totalOffers":422}}]
In your test you are limiting the results to 40, which is a way of avoiding the issue, but we want a much larger result set. There are about 420 items on sale, not 40... Maybe you could try on your machine with the limit set to 500?
Please let me know what else we can do to help investigate the issue.
@mlarcher I tried the same with 500 and could replicate the inconsistency. Here are my results:
const res = await OpenseaScraper.offersByScrolling("chumbivalleyofficial", 500, options);
console.log(res.offers.length); // => 420
console.log(res.stats.totalOffers); // => 428
So yes, there's still an issue. But can you confirm that you at least get the algorithm running and get most of the offers (even if it's not all of them)? You were able to get 419 offers out of 422, is that right? 🤔
I think some offers don't get fetched because of how the scraping algorithm is designed:
- the algorithm keeps scrolling as long as possible
- scrolling triggers fetching of new data, which changes the DOM
- then the algorithm reads the data from the DOM

This is obviously not a great design, as it's very error prone. What if the DOM is checked before the new data has been inserted? Or what if the fetching fails? In those cases the algorithm simply skips those offers and continues (see the simplified sketch below).
I am sure there is a better solution, and I agree it would be great to have one, but on the other hand I have not yet come up with an idea for how to solve this problem better.
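For illustration, the current approach boils down to something like this (a simplified sketch, not the library's actual implementation; the selector is a placeholder):
// Simplified sketch of the scroll-and-scrape loop described above.
// Not the actual code; "article a" is a placeholder selector.
async function offersByScrollingSketch(page, target) {
  const seen = new Set();
  let bottomReached = false;
  while (!bottomReached && seen.size < target) {
    // 1) scroll, which triggers OpenSea to fetch the next batch
    bottomReached = await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight);
      return window.scrollY + window.innerHeight >= document.body.scrollHeight;
    });
    // 2) read the DOM right away; this is the race: cards that are
    //    not inserted yet (or whose fetch failed) are silently skipped
    const hrefs = await page.evaluate(() =>
      [...document.querySelectorAll("article a")].map((a) => a.href)
    );
    hrefs.forEach((href) => seen.add(href));
  }
  return [...seen];
}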
I also think it's not possible to fetch 100% because of the way OpenSea displays the items: as you mentioned, the DOM changes. When scrolling manually and looking at the HTML, the DOM adds the elements as they appear. Sometimes OpenSea is very slow, or the NFTs are gifs instead of jpegs, which take even longer to load, and I think that's why some items are skipped.
The only way to „fix“ this would in my opinion be to place a sleep of a few seconds after each „scroll“ so the items have more time to display, roughly as sketched below. But I don't know how the code works exactly, and even that would not be a nice solution: it would make the code slow.
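A minimal sketch of that idea (assuming a puppeteer page object; the 3-second delay is an arbitrary value, not anything from the library):
// Sketch only: scroll, then sleep so the newly fetched cards have
// time to render before the DOM is read again.
async function scrollWithPause(page, pauseMs = 3000) {
  await page.evaluate(() => window.scrollBy(0, window.innerHeight));
  await page.waitForTimeout(pauseMs); // arbitrary delay; slows scraping down
}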
So yes, there's still an issue. But can you confirm that you at least get the algorithm running and get most of the offers (even if it's not all of them)? You were able to get 419 offers out of 422, is that right? 🤔
Yes, that's it when run locally or in the docker container on my home machine. On GCP I get no result at all, but as we saw it's not the same issue.
The only way to „fix“ this would in my opinion be to place a sleep of a few seconds after each „scroll“ so the items have more time to display.
Perhaps a timeout after the last scroll only, somehow?
I'll check if there is a better way to know when the DOM is "stabilized"...
Perhaps you could use something like https://developer.mozilla.org/fr/docs/Web/API/MutationObserver to monitor DOM changes, scroll, and debounce an ending function until nothing moves anymore?
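Roughly along these lines, as an untested sketch (browser-side code that could be passed to puppeteer's page.evaluate; the quiet period and timeout values are arbitrary assumptions):
// Untested sketch: resolve once the DOM has stopped mutating for
// `quietMs`, with a hard timeout as a safety net.
function waitForDomToSettle(quietMs = 1000, timeoutMs = 15000) {
  return new Promise((resolve) => {
    const finish = () => {
      observer.disconnect();
      resolve();
    };
    const observer = new MutationObserver(() => {
      clearTimeout(timer); // debounce: restart the quiet-period timer
      timer = setTimeout(finish, quietMs);
    });
    let timer = setTimeout(finish, quietMs);
    setTimeout(finish, timeoutMs); // safety net if mutations never stop
    observer.observe(document.body, { childList: true, subtree: true });
  });
}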
@mlarcher Yes this is a good idea, I tried this at some point but could not make it work, maybe worth a revisit.
Also, what could be even more efficient is scrolling while simply monitoring puppeteer's network activity, like this:
// taken from => https://stackoverflow.com/a/55478226/6272061
const textRegex = /javascript|html/; // this regex was missing in the original snippet
page.on('response', (response) => {
  const headers = response.headers();
  // example test: check if content-type contains javascript or html
  const contentType = headers['content-type'];
  if (textRegex.test(contentType)) {
    console.log(response.url());
  }
});
Once new data needs to be fetched, the GraphQL API is called, and if we intercept that request we get the data in this format:
{
"node": {
"assetCount": null,
"imageUrl": "https://lh3.googleusercontent.com/seJEwLWJP3RAXrxboeG11qbc_MYrxwVrsxGH0s0qxvF68hefOjf5qrPSKkIknUTYzfvinOUPWbYBdM8VEtGEE980Qv2ti_GGd86OWQ=s120",
"name": "DeadFellaz",
"slug": "deadfellaz",
"isVerified": true,
"id": "Q29sbGVjdGlvblR5cGU6OTM2MTIx",
"description": "10,000 undead NFTs on the Ethereum blockchain. Join the horde.\n\nAdditional official collections:\n\n[Halloween S1](https://opensea.io/collection/deadfellaz-infected-s1) | [Nifty Gateway Betty Pop Horror](https://opensea.io/collection/betty-pop-horror-by-deadfellaz) | [Deadfrenz Lab Access Pass](https://opensea.io/collection/deadfrenz-lab-access-pass) | [Deadfrenz Collection](https://opensea.io/collection/deadfrenz-collection)"
}
}

I think that's a nice solution and it should be fairly easy to develop 🎉 Added it to the roadmap 🚔!
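Roughly, the interception part could look like this (a sketch only; that the endpoint URL contains "graphql" and returns plain JSON is an untested assumption):
// Sketch: collect GraphQL payloads while scrolling, instead of
// reading the DOM. Assumes an existing puppeteer `page`.
const intercepted = [];
page.on("response", async (response) => {
  if (!response.url().includes("graphql")) return; // assumed endpoint filter
  try {
    intercepted.push(await response.json()); // store the raw payload
  } catch (err) {
    // ignore responses without a readable JSON body
  }
});
// ...then scroll as before; every scroll that triggers a fetch adds
// a payload (containing "node" objects like the one above) to the list.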
Side note: at that point it might be worth trying to use the OpenSea GraphQL API directly, but I never could make it work, and I heard from people that it's a pain to use.
Oops, just realized that I posted the collection information above; the information for every single item (offer) looks like this:
{
"assetContract": {
"address": "0x2acab3dea77832c09420663b0e1cb386031ba17b",
"chain": "ETHEREUM",
"id": "QXNzZXRDb250cmFjdFR5cGU6MzAyOTQ1",
"openseaVersion": null
},
"collection": {
"isVerified": true,
"relayId": "Q29sbGVjdGlvblR5cGU6OTM2MTIx",
"id": "Q29sbGVjdGlvblR5cGU6OTM2MTIx",
"displayData": {
"cardDisplayStyle": "CONTAIN"
},
"imageUrl": "https://lh3.googleusercontent.com/seJEwLWJP3RAXrxboeG11qbc_MYrxwVrsxGH0s0qxvF68hefOjf5qrPSKkIknUTYzfvinOUPWbYBdM8VEtGEE980Qv2ti_GGd86OWQ=s120",
"slug": "deadfellaz",
"isAuthorizedEditor": false,
"name": "DeadFellaz"
},
"relayId": "QXNzZXRUeXBlOjM2Nzg2ODY0",
"tokenId": "3036",
"backgroundColor": null,
"imageUrl": "https://lh3.googleusercontent.com/RQlR9mw-oJyhrj_GtwRZfRJdqk-fjtbJK4tElqpas4R1XksLXqnklhvnbw40LHsVliYoDO3z9rWE7OczRKp_qhDqSS_ZNzyRa9kG",
"name": "DeadFellaz #3036",
"id": "QXNzZXRUeXBlOjM2Nzg2ODY0",
"isDelisted": false,
"animationUrl": null,
"displayImageUrl": "https://lh3.googleusercontent.com/RQlR9mw-oJyhrj_GtwRZfRJdqk-fjtbJK4tElqpas4R1XksLXqnklhvnbw40LHsVliYoDO3z9rWE7OczRKp_qhDqSS_ZNzyRa9kG",
"decimals": 0,
"favoritesCount": 23,
"isFavorite": false,
"isFrozen": false,
"hasUnlockableContent": false,
"orderData": {
"bestAsk": {
"relayId": "T3JkZXJWMlR5cGU6MzUyMjU2ODkzMQ==",
"orderType": "BASIC",
"maker": {
"address": "0x28705f64c07079822c7afd66e43975b7c6095ef6",
"id": "QWNjb3VudFR5cGU6MTQ1NjA1MTQy"
},
"closedAt": "2022-04-05T05:44:18",
"dutchAuctionFinalPrice": null,
"openedAt": "2022-03-17T21:48:42",
"priceFnEndedAt": null,
"quantity": "1",
"decimals": null,
"paymentAssetQuantity": {
"quantity": "2690000000000000000",
"asset": {
"decimals": 18,
"imageUrl": "https://openseauserdata.com/files/6f8e2979d428180222796ff4a33ab929.svg",
"symbol": "ETH",
"usdSpotPrice": 2946.32,
"assetContract": {
"blockExplorerLink": "https://etherscan.io/address/0x0000000000000000000000000000000000000000",
"chain": "ETHEREUM",
"id": "QXNzZXRDb250cmFjdFR5cGU6MjMzMQ=="
},
"id": "QXNzZXRUeXBlOjEzNjg5MDc3"
},
"id": "QXNzZXRRdWFudGl0eVR5cGU6Mjg3MDE4NzA3OTcyNTgyMjM1NjM1NTg1MDc0MTcxNjgyNzE3ODc4",
"quantityInEth": "2690000000000000000"
}
},
"bestBid": {
"orderType": "BASIC",
"paymentAssetQuantity": {
"asset": {
"decimals": 18,
"imageUrl": "https://openseauserdata.com/files/accae6b6fb3888cbff27a013729c22dc.svg",
"symbol": "WETH",
"usdSpotPrice": 2946.32,
"assetContract": {
"blockExplorerLink": "https://etherscan.io/address/0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2",
"chain": "ETHEREUM",
"id": "QXNzZXRDb250cmFjdFR5cGU6MjMzOA=="
},
"id": "QXNzZXRUeXBlOjQ2NDU2ODE="
},
"quantity": "1502841336452599400",
"id": "QXNzZXRRdWFudGl0eVR5cGU6MjEzNTc0NjA3Mzk2MzM3NzU2NjY4MTkxMzczOTUxNTUwMzAwMDE0"
}
}
},
"isEditable": {
"value": false,
"reason": "Unauthorized"
},
"isListable": true,
"ownership": null,
"creator": {
"address": "0xe9d30eddd11dea8433cf6d2b2c22e9cce94113dc",
"id": "QWNjb3VudFR5cGU6NjEyNTkxNTA="
},
"ownedQuantity": null,
"assetEventData": {
"lastSale": {
"unitPriceQuantity": {
"asset": {
"decimals": 18,
"imageUrl": "https://openseauserdata.com/files/6f8e2979d428180222796ff4a33ab929.svg",
"symbol": "ETH",
"usdSpotPrice": 2946.32,
"assetContract": {
"blockExplorerLink": "https://etherscan.io/address/0x0000000000000000000000000000000000000000",
"chain": "ETHEREUM",
"id": "QXNzZXRDb250cmFjdFR5cGU6MjMzMQ=="
},
"id": "QXNzZXRUeXBlOjEzNjg5MDc3"
},
"quantity": "1300000000000000000",
"id": "QXNzZXRRdWFudGl0eVR5cGU6MjQxMDUyNDMxOTA1OTU2ODY0MDMxNjQ3MTYzMjQyMzYyNTQ4MTkw"
}
}
}
}
@dcts hooking into the graphql API sounds like a wonderful idea. It could drastically improve the performance and avoid some DOM related pitfalls 👍
Side note: at that point it might be worth trying to use the OpenSea GraphQL API directly, but I never could make it work, and I heard from people that it's a pain to use.
Using the API would be nice, but from what I heard they don't hand out API tokens very easily, and even if granted an API key you would be facing some limits/restrictions.
Also, it seems the query they use on the site (AssetSearchQuery) is not documented, and it requires an API key and a CSRF token that changes on every call, so I can see why it could be a pain to use...
Using page.on('response', ...) sounds great though, as it would combine the best of both worlds. Any idea when you'll have time to give it a go?
@mlarcher I'm currently working on it, but I'm not sure how long it will take to implement; it could be done today or maybe next weekend. But obviously no guarantees ^^