
[Feat] Strip non-content tags, headers, footers

Open oliviermills opened this issue 10 months ago • 5 comments

The markdown would be much more useful if you stripped headers, footers, and other elements such as filters that are not core content (i.e. low value for RAG/context). This could be done with tag- or class-based removal from the HTML, with something like Mozilla's Readability, or both! Highly opinionated class-based removal is risky, but it produces higher-value content with less noise.

For example a language selector in a header gets produced and should be stripped:

[Skip to main content](#main-content)

Select LanguageEnglishAfrikaansAlbanianArabicArmenianAzerbaijaniBasqueBelarusianBengaliBosnianBulgarianCatalanCebuanoChinese (Simplified)Chinese (Traditional)CroatianCzechDanishDutchEsperantoEstonianFilipinoFinnishFrenchGalicianGeorgianGermanGreekGujaratiHaitian CreoleHausaHebrewHindiHmongHungarianIcelandicIgboIndonesianIrishItalianJapaneseJavaneseKannadaKhmerKoreanLaoLatinLatvianLithuanianMacedonianMalayMalteseMaoriMarathiMongolianNepaliNorwegianPersianPolishPortuguesePunjabiRomanianRussianSerbianSlovakSlovenianSomaliSpanishSwahiliSwedishTamilTeluguThaiTurkishUkrainianUrduVietnameseWelshYiddishYorubaZulu

Here is a starter list. It should probably be tested against a couple thousand random pages, using an LLM like Haiku with vision as the judge.

const exclude = [
  'header', '.header', '.top', '.navbar', '#header',
  'footer', '.footer', '.bottom', '#footer',
  '.sidebar', '.side', '.aside', '#sidebar',
  '.modal', '.popup', '#modal', '.overlay',
  '.ad', '.ads', '.advert', '#ad',
  '.lang-selector', '.language', '#language-selector',
  '.social', '.social-media', '.social-links', '#social',
  '.menu', '.navigation', 'nav', '#nav',
  '.breadcrumbs', '#breadcrumbs',
  '.form', 'form', '#search-form',
  'script', 'noscript'
];
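To illustrate how a list like this could be applied, here is a minimal sketch. It assumes a toy page represented as plain objects (`tag`, `id`, `classes`, `children`) rather than real DOM nodes, and it only understands simple tag, `.class`, and `#id` selectors. The names `matchesSelector`, `stripExcluded`, `textOf`, and `excludeSubset` are hypothetical and not part of firecrawl; a real implementation would more likely run the selectors through an HTML parsing library.

```javascript
// Hypothetical sketch: nodes are plain objects, not real DOM elements.
// Only a subset of the starter list is used to keep the example short.
const excludeSubset = [
  'header', '.header', '#header',
  'footer', '.footer', '#footer',
  'nav', '.lang-selector', 'script',
];

// Returns true when a simple selector (tag, .class, or #id) matches the node.
function matchesSelector(node, selector) {
  if (selector.startsWith('.')) {
    return (node.classes || []).includes(selector.slice(1));
  }
  if (selector.startsWith('#')) {
    return node.id === selector.slice(1);
  }
  return node.tag === selector;
}

// Recursively drops every node matched by any selector and keeps the rest.
function stripExcluded(node, selectors) {
  if (selectors.some((s) => matchesSelector(node, s))) return null;
  const children = (node.children || [])
    .map((child) => (typeof child === 'string' ? child : stripExcluded(child, selectors)))
    .filter((child) => child !== null);
  return { ...node, children };
}

// Collects the remaining text content in document order.
function textOf(node) {
  return (node.children || [])
    .map((child) => (typeof child === 'string' ? child : textOf(child)))
    .join('');
}

const page = {
  tag: 'body',
  children: [
    { tag: 'header', children: [{ tag: 'div', classes: ['lang-selector'], children: ['Select Language'] }] },
    { tag: 'main', id: 'main-content', children: [{ tag: 'p', children: ['Core content.'] }] },
    { tag: 'div', classes: ['footer'], children: ['Copyright'] },
  ],
};

const cleaned = stripExcluded(page, excludeSubset);
console.log(textOf(cleaned)); // prints "Core content."
```

Note that removing a matched node also removes everything nested inside it, which is exactly why an overly broad selector such as `.top` or `.form` can take real content with it.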

oliviermills avatar Apr 16 '24 18:04 oliviermills

So, we've defaulted towards removing less because (like you said) highly opinionated removal is risky, and it's easy to do further cleaning on the output with regex.
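As a rough idea of what that regex cleanup on the markdown output could look like, here is a small sketch. The patterns and the `cleanMarkdown` helper are illustrative assumptions based on the example in the issue, not something firecrawl ships.

```javascript
// Hypothetical post-processing step; the patterns are illustrative assumptions.
const noisePatterns = [
  /^\[Skip to main content\]\(#[^)]*\)[ \t]*$/gim, // accessibility skip links
  /^Select Language.*$/gim, // flattened language selectors
];

// Removes lines matching any noise pattern, then tidies leftover blank lines.
function cleanMarkdown(markdown) {
  let result = markdown;
  for (const pattern of noisePatterns) {
    result = result.replace(pattern, '');
  }
  // Collapse the runs of blank lines left behind by the removed lines.
  return result.replace(/\n{3,}/g, '\n\n').trim();
}

const sample = [
  '[Skip to main content](#main-content)',
  '',
  'Select LanguageEnglishAfrikaansAlbanian',
  '',
  '# Title',
  '',
  'Core content.',
].join('\n');

console.log(cleanMarkdown(sample));
```

For the sample above, only the heading and the paragraph survive. The trade-off is that each pattern is site-specific, so this scales poorly compared with removing the elements before conversion.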

I like the idea of Readability as an option. Great suggestion!

calebpeffer avatar Apr 16 '24 19:04 calebpeffer

@oliviermills thank you for this. Just merged an option to remove non-content tags. #14

This is just a start and I think there is room for other improvements here.

nickscamara avatar Apr 18 '24 01:04 nickscamara

Let me know if you have any feedback!

nickscamara avatar Apr 18 '24 01:04 nickscamara

I suggest a cleaner function per my PR #16. It's slightly less aggressive but needs integration testing (#15) to see if it affects the Markdown conversion. I checked turndown and any customizations within the code base here, and it doesn't use style, so that should be OK.

oliviermills avatar Apr 18 '24 03:04 oliviermills

Awesome, thanks @oliviermills! Will be checking it out soon.

nickscamara avatar Apr 18 '24 16:04 nickscamara

Closing this one (#273 solves this issue).

rafaelsideguide avatar Jun 14 '24 12:06 rafaelsideguide