firecrawl
Remove 'cookies' text when removing headers/footers, etc
Remove any cookie text when removing headers and footers. Many sites in Europe display a cookie acceptance message. Sometimes, this is the only text returned.
Sometimes it captures something like:
"Skip to main content\n\nCookies \n------------------------------\n\nWe use some essential cookies to make this service work.\n\nWe\u2019d also like to use analytics cookies so we can understand how you use the service and make improvements.\n\nAccept analytics cookies Reject analytics cookies How we use cookies\n\nYou can change your cookie settings\n at any time.\n\nHide cookie message\n\n"
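As a stopgap, output like the sample above can be filtered after scraping. The following is a minimal sketch, not part of firecrawl: the function name `strip_cookie_blocks`, the regex, and the `max_len` cutoff are all assumptions made for illustration. It drops short blank-line-separated blocks that mention cookies and leaves longer paragraphs alone, so an article that genuinely discusses cookies is less likely to be removed.

```python
import re

# Hypothetical heuristic: any short block mentioning "cookie(s)" is treated
# as consent-banner text. This is an assumption, not firecrawl behaviour.
COOKIE_PATTERN = re.compile(r"\bcookies?\b", re.IGNORECASE)


def strip_cookie_blocks(markdown: str, max_len: int = 400) -> str:
    """Drop short blank-line-separated blocks that mention cookies."""
    blocks = re.split(r"\n\s*\n", markdown)
    kept = []
    for block in blocks:
        text = block.strip()
        if not text:
            continue  # skip empty fragments left over from the split
        if len(text) <= max_len and COOKIE_PATTERN.search(text):
            continue  # looks like banner text, drop it
        kept.append(text)
    return "\n\n".join(kept)


sample = (
    "Skip to main content\n\nCookies \n------------------------------\n\n"
    "We use some essential cookies to make this service work.\n\n"
    "Hide cookie message\n\n"
)
print(strip_cookie_blocks(sample))
```

This only cleans text that was already captured. It does not help when a consent banner blocks the page from loading at all.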
Huge! @tractorjuice can you send us an example of a URL where this shows up?
Good example sites are:
https://www.advent-im.co.uk
https://baserock.co.uk
https://aaseya.com/
@tractorjuice Thanks! That's very helpful.
Actually, cookie banners are preventing the crawler from accessing and crawling certain websites at all. This has been observed on multiple sites, including public institutions and news websites. For example:
Public institution: https://www.salzburg.gv.at/
Newspaper: https://www.derstandard.at/
When the crawler attempts to access these sites, it encounters cookie consent banners that block further actions. As a result, the crawler cannot navigate past the initial page and cannot gather any content from the website.
We just pushed a removeTags feature that helps with this! Closing this for now!
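The thread does not show how removeTags is configured, so the snippet below does not use it. It is only a standard-library sketch of the general idea of dropping elements before text extraction. The class name `CookieStrippingExtractor`, the helper `extract_text`, and the `cookie`/`consent` markers are hypothetical, and the sketch assumes reasonably well-formed HTML with matching close tags.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CookieStrippingExtractor(HTMLParser):
    """Collect page text, skipping subtrees whose id/class looks like a banner."""

    def __init__(self, markers=("cookie", "consent")):
        super().__init__()
        self.markers = markers   # assumed substrings, purely illustrative
        self.skip_depth = 0      # > 0 while inside a skipped subtree
        self.chunks = []

    def _is_banner(self, attrs):
        for name, value in attrs:
            if name in ("id", "class") and value:
                lowered = value.lower()
                if any(marker in lowered for marker in self.markers):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1      # nested element inside a banner
        elif self._is_banner(attrs):
            self.skip_depth = 1       # entering a banner subtree

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = CookieStrippingExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = ('<body><div id="cookie-banner"><p>We use cookies.</p>'
        '<button>Accept</button></div><main><p>Hello</p></main></body>')
print(extract_text(page))
```

Matching on id and class substrings is a guess at how banners are commonly marked up. Sites that render the banner with unrelated class names would need their own selectors.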