
Remove 'cookies' text when removing headers/footers, etc

Open tractorjuice opened this issue 10 months ago • 4 comments

Remove any cookie-consent text when removing headers and footers. Many sites in Europe display a cookie acceptance message, and sometimes this is the only text returned.

Sometimes it captures something like:

"Skip to main content\n\nCookies \n------------------------------\n\nWe use some essential cookies to make this service work.\n\nWe\u2019d also like to use analytics cookies so we can understand how you use the service and make improvements.\n\nAccept analytics cookies Reject analytics cookies How we use cookies\n\nYou can change your cookie settings\n at any time.\n\nHide cookie message\n\n"

tractorjuice avatar Apr 27 '24 10:04 tractorjuice
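
Until the scraper handles this, one rough workaround is to post-filter the returned text. The sketch below is a minimal heuristic, not part of firecrawl: the function name `strip_cookie_blocks` and the keyword pattern are hypothetical. It splits the output on blank lines and drops every block that mentions cookies.

```python
import re

# Hypothetical keyword pattern (an assumption); tune it for the sites you crawl.
COOKIE_PATTERN = re.compile(r"\bcookies?\b", re.IGNORECASE)


def strip_cookie_blocks(markdown: str) -> str:
    """Drop blank-line-separated blocks that mention cookies."""
    blocks = re.split(r"\n\s*\n", markdown)
    # Keep non-empty blocks that do not match the cookie pattern.
    kept = [b for b in blocks if b.strip() and not COOKIE_PATTERN.search(b)]
    return "\n\n".join(kept)
```

Applied to the captured text quoted above, only "Skip to main content" would survive. The obvious cost is that genuine content mentioning cookies is dropped as well, so this is only suitable as a stopgap.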

Huge! @tractorjuice can you send us an example of a URL where this shows up?

nickscamara avatar Apr 27 '24 18:04 nickscamara

Good example sites are:

https://www.advent-im.co.uk
https://baserock.co.uk
https://aaseya.com/

tractorjuice avatar Apr 28 '24 07:04 tractorjuice

@tractorjuice Thanks! That's very helpful.

nickscamara avatar Apr 28 '24 19:04 nickscamara

Actually, cookie banners are preventing the crawler from accessing and crawling certain websites at all. This has been observed on multiple sites, including public institutions and news websites. For example:

Public institution: https://www.salzburg.gv.at/
Newspaper: https://www.derstandard.at/

When the crawler attempts to access these sites, it encounters a cookie consent banner that blocks further actions. As a result, it cannot navigate past the initial page and gathers no content from the website.

fhederdos avatar May 17 '24 09:05 fhederdos

We just pushed a removeTags feature that helps with this! Closing this for now!

nickscamara avatar Jun 13 '24 20:06 nickscamara
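
For readers wondering what removing tags before extraction amounts to, here is a minimal standard-library sketch of the general idea. It is not firecrawl's implementation, and the selector syntax (bare tag names, `.class`, `#id`) and the names `TagRemover` and `visible_text` are assumptions made for illustration. It also assumes reasonably well-formed HTML in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagRemover(HTMLParser):
    """Collect visible text, skipping elements that match the given selectors."""

    def __init__(self, selectors):
        super().__init__(convert_charrefs=True)
        self.tags = {s for s in selectors if s[0] not in ".#"}
        self.classes = {s[1:] for s in selectors if s.startswith(".")}
        self.ids = {s[1:] for s in selectors if s.startswith("#")}
        self.skip_depth = 0  # > 0 while inside a removed element
        self.text = []

    def _matches(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        return (tag in self.tags
                or a.get("id") in self.ids
                or any(c in self.classes for c in classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        # Once inside a removed element, every nested element deepens the skip.
        if self.skip_depth or self._matches(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.text.append(data.strip())


def visible_text(html, selectors):
    parser = TagRemover(selectors)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text)
```

A call such as `visible_text(page_html, ["#cookie-banner", "footer"])` would skip the banner and footer subtrees and return the remaining text. Note that this only addresses banner text leaking into the output; it does nothing for sites where the consent banner blocks the crawl itself.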