crawlee
Integrate adblocker functionality
Interesting tip from HN (for Dashblock): Maybe you already do it, but I think integrating adblocker functionality when loading JS sites would be desirable to reduce load time. And if ads are what the API user is interested in, perhaps add a flag for whether or not one wants ads to load. Recommendation: https://github.com/cliqz-oss/adblocker Should be the fastest adblocker library (used by Ghostery, Cliqz and Brave)
This could be integrated into the Apify.launchPuppeteer() function as a useAdBlock: true option.
https://sdk.apify.com/docs/api/apify#module_Apify.launchPuppeteer
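To make the proposal concrete, here is a minimal sketch of how such a flag could gate the blocker. It is not the SDK's implementation: `openPage`, `launch` and `createBlocker` are hypothetical names, and the browser launcher and the blocker factory are passed in by the caller so the sketch does not assume any particular library API.

```javascript
// Hypothetical helper, not part of the Apify SDK. `launch` and `createBlocker`
// are injected by the caller so the sketch does not depend on a specific library.
async function openPage({ launch, createBlocker, useAdBlock = false }) {
    const browser = await launch();
    const page = await browser.newPage();

    if (useAdBlock) {
        // The blocker is only created when requested, so the default path
        // does no extra work and behaves exactly as before.
        const blocker = await createBlocker();
        await blocker.enableBlockingInPage(page);
    }

    return { browser, page };
}
```

With the recommended library, `createBlocker` would presumably wrap its Puppeteer blocker factory. Which call that is exactly is an assumption to verify against the library's documentation.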
Greetings. So the idea would be to implement an ad blocker to increase the speed of the scrape/crawl? I could work on this 🙏
Yes, exactly. It could boost the speed, especially for websites that are heavy on ads (news sites). But it would be great to test this assumption first. Would you also be interested in trying this out, using the Apify SDK to run a scraper with and without an ad blocker against some websites?
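For the comparison itself, a small summary helper may be enough once load times have been collected for both configurations. This is only a sketch: `median` and `compareRuns` are hypothetical names, and the inputs are assumed to be arrays of page load durations in milliseconds measured by whatever timing the test uses.

```javascript
// Median is less sensitive than the mean to a single slow page load.
function median(values) {
    if (values.length === 0) {
        throw new Error('median() needs at least one measurement');
    }
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 0
        ? (sorted[mid - 1] + sorted[mid]) / 2
        : sorted[mid];
}

// Compares load times (ms) measured without and with ad blocking.
// A speedup above 1 means the ad-blocking runs were faster.
function compareRuns(baselineMs, adBlockMs) {
    const baseline = median(baselineMs);
    const blocked = median(adBlockMs);
    return { baseline, blocked, speedup: baseline / blocked };
}
```

Running each site several times in both modes and comparing medians keeps a single slow response from deciding the result.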
Sure! I can set up a test with some timing debug output and run it to check this first, then attach the results here for you to see. Thank you 🚀
Interesting. I manually block all the common ad networks using blockRequests; this would offload the task to the extension.
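For comparison, the manual approach boils down to matching each request's host against a hand-maintained list. The sketch below shows only that matching step. `isBlocked` is a hypothetical helper, and the two hosts are illustrative examples, not a real blocklist.

```javascript
// Illustrative entries only; a real list of ad networks would be much longer.
const BLOCKED_HOSTS = ['doubleclick.net', 'googlesyndication.com'];

// Returns true when the request URL's host is a blocked host or one of its subdomains.
function isBlocked(requestUrl, blockedHosts = BLOCKED_HOSTS) {
    let hostname;
    try {
        hostname = new URL(requestUrl).hostname;
    } catch {
        return false; // not a parseable absolute URL; let the request through
    }
    return blockedHosts.some(
        (host) => hostname === host || hostname.endsWith(`.${host}`),
    );
}
```

A filter-list based blocker replaces this hand-maintained list with community-maintained rules, which is the offloading described above.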
Makes sense for a lot of users I guess, but FYI it's an explicit anti-feature with a use-case-killing effect for me. I'd need this off, with zero side effects on current behavior.
In the small POC I proposed a while ago (https://github.com/apify/apify-js/pull/600), the feature is completely disabled by default and only does some work when blocking is enabled by the user.
Yeah, sorry @remusao. We still have not figured out if the performance will improve or not. I apologize.
Of course, no worries at all, I just wanted to make clear to @matjaeck that there should be a way to integrate such a feature without any overhead when it's disabled.