🙏 How to block DD RUM on email scanners, crawlers and bots in general?
Hi 👋
I wanted to check what the common way is to avoid DD RUM sessions when the visitor is an email scanner, crawler, or any other kind of bot.
I'm already filtering by the user agent as recommended in the docs here, but this hasn't been enough. There are still a lot of sessions coming from email scanners. Are there other ways to filter them out?
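For reference, the user-agent filtering we have in place looks roughly like this (a simplified sketch; the bot regex and the init options are illustrative placeholders, not our exact configuration):

```ts
import { datadogRum } from '@datadog/browser-rum'

// Illustrative only: a simple user-agent heuristic along the lines of what the docs suggest.
const looksLikeBot = /bot|crawler|spider|headless|scanner/i.test(window.navigator.userAgent)

// Only start RUM when the user agent does not look like a bot.
if (!looksLikeBot) {
  datadogRum.init({
    applicationId: '<APPLICATION_ID>', // placeholder
    clientToken: '<CLIENT_TOKEN>', // placeholder
    site: 'datadoghq.com',
    sessionSampleRate: 100,
  })
}
```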
Hello @giopetris, can you tell me which field you are looking at to tell whether the session comes from an email scanner? If it is something you can check on the client side, you could use it to filter out these sessions. In any case, for now, we don't have more than what is recommended in the docs here. However, in the future we might provide a way to filter out RUM events, maybe something similar to the exclusion filters from our Logs product.
Hey, @giopetris's coworker here. Thanks for the reply @amortemousque! I totally understand that this may not be solvable today, but I can provide some additional context. In our environment, email scanners are the biggest bot culprit. There are two scenarios we see most often:
- We send an email with a link that is behind a login wall, so the replay is under a second and just shows a loading screen.
- We send an email with a link that is not behind a login wall but is typically a private, hard-to-guess URL; the replay loads the site, maybe checks a few links or inputs, then exits.
The sessions themselves last more than 5 seconds.
We notice that the majority of the email scanners come from the Azure public cloud network. We can identify this via the geo.as.domain (microsoft.com) lookup that Datadog performs on the backend after collecting the IP address, but that field is not available on the client. For us, at least, the majority of our frontend traffic should be organic, coming through normal ISPs rather than from cloud providers or through a VPN hosted on one.
Exclusion filters would be neat as a way to use geo.as.domain and drop anything with microsoft.com.
The user agent is unfortunately not useful and looks like just a normal browser.
Thanks for the additional context! This feedback is valuable for helping us provide the best solution.
We discovered that the bots scraping our site consistently visited the same routes and always had the same viewport and user agent, so we wrote a function to filter on that. It is very specific to our site, but it has worked well in general.
Ideally, we'd like to be able to filter out those events in the UI by adding certain IP addresses to a do-not-allow list, but this has worked in the meantime.
```ts
// botRoutes is a site-specific list of paths that the bots consistently visit (defined elsewhere).
const viewport: boolean = window.innerWidth === 1004 && window.innerHeight === 676;
const userAgent: boolean = /(.+)(Windows|Linux)(.+)Chrome\/(108|91)(.+)/.test(window.navigator.userAgent);
const route: boolean = botRoutes.some((route) => route === window.location.pathname);
```
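For completeness, here is a minimal, self-contained sketch of how checks like these could be wired into the SDK. The botRoutes values and init options are hypothetical placeholders, and dropping the session sample rate to zero for suspected bots is just one possible way to use the flag:

```ts
import { datadogRum } from '@datadog/browser-rum'

// Hypothetical list of routes the bots consistently visit; adjust for your own site.
const botRoutes: string[] = ['/signup/confirm', '/invite']

// Same heuristics as above: fixed viewport, known user agents, known routes.
const viewport: boolean = window.innerWidth === 1004 && window.innerHeight === 676
const userAgent: boolean = /(.+)(Windows|Linux)(.+)Chrome\/(108|91)(.+)/.test(window.navigator.userAgent)
const route: boolean = botRoutes.some((r) => r === window.location.pathname)
const isLikelyBot: boolean = viewport && userAgent && route

datadogRum.init({
  applicationId: '<APPLICATION_ID>', // placeholder
  clientToken: '<CLIENT_TOKEN>', // placeholder
  site: 'datadoghq.com',
  // Suspected bots get a 0% session sample rate, so no session is collected for them.
  // Note: this option is named `sampleRate` in older versions of the SDK.
  sessionSampleRate: isLikelyBot ? 0 : 100,
})
```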