firecrawl
[Feat] Strip non-content tags, headers, footers
The markdown would be much more useful if you stripped headers, footers, and other tags such as filters that are not core content (i.e. low value for RAG/context). This could use tag- or class-based removal from the HTML, something like Mozilla's Readability, or both. Highly opinionated class-based removal is risky, but it produces high-value content with less noise.
For example, a language selector in a header ends up in the output and should be stripped:
[Skip to main content](#main-content)
Select LanguageEnglishAfrikaansAlbanianArabicArmenianAzerbaijaniBasqueBelarusianBengaliBosnianBulgarianCatalanCebuanoChinese (Simplified)Chinese (Traditional)CroatianCzechDanishDutchEsperantoEstonianFilipinoFinnishFrenchGalicianGeorgianGermanGreekGujaratiHaitian CreoleHausaHebrewHindiHmongHungarianIcelandicIgboIndonesianIrishItalianJapaneseJavaneseKannadaKhmerKoreanLaoLatinLatvianLithuanianMacedonianMalayMalteseMaoriMarathiMongolianNepaliNorwegianPersianPolishPortuguesePunjabiRomanianRussianSerbianSlovakSlovenianSomaliSpanishSwahiliSwedishTamilTeluguThaiTurkishUkrainianUrduVietnameseWelshYiddishYorubaZulu
Here is a starter list. It should probably be tested against a couple thousand random pages, using an LLM like Haiku with vision as the judge.
const exclude = [
  'header', '.header', '.top', '.navbar', '#header',
  'footer', '.footer', '.bottom', '#footer',
  '.sidebar', '.side', '.aside', '#sidebar',
  '.modal', '.popup', '#modal', '.overlay',
  '.ad', '.ads', '.advert', '#ad',
  '.lang-selector', '.language', '#language-selector',
  '.social', '.social-media', '.social-links', '#social',
  '.menu', '.navigation', 'nav', '#nav',
  '.breadcrumbs', '#breadcrumbs',
  '.form', 'form', '#search-form',
  'script', 'noscript'
];
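For illustration, here is a minimal sketch of how a selector list like this could be applied. It assumes a standard DOM `Document` (in a browser, or from a server-side DOM library such as jsdom). `stripExcluded` and `excludeSelectors` are hypothetical names, not part of firecrawl, and the list is shortened to keep the example small.

```javascript
// Shortened, illustrative subset of the starter list above.
const excludeSelectors = [
  'header', 'footer', 'nav',
  '.sidebar', '.lang-selector',
  'script', 'noscript'
];

// Hypothetical helper: removes every element matching any selector in the list.
// Assumes `doc` is a standard DOM Document (browser, or a DOM library on the server).
function stripExcluded(doc, selectors = excludeSelectors) {
  // A comma-joined selector list matches an element if any one selector matches.
  const matches = doc.querySelectorAll(selectors.join(', '));
  let removed = 0;
  for (const el of matches) {
    el.remove(); // detaches the element together with its subtree
    removed += 1;
  }
  return removed; // number of matched elements, useful for logging or tests
}
```

Returning the count makes it easy to log how aggressive a given list is on a page, which could help when comparing lists across a large sample of pages.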
So, we've defaulted towards removing less, because (like you said) highly opinionated removal is risky and it's easy to do further cleaning on the output with regex.
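As a rough sketch of what that kind of regex cleanup on the markdown output might look like: `cleanMarkdown` is a hypothetical helper, and the two patterns only target the noise lines from the language-selector example in the original request. They are illustrative, not exhaustive.

```javascript
// Hypothetical post-processing pass over scraped markdown.
// Drops whole lines that match known low-value patterns.
const noisePatterns = [
  /^\[Skip to (main )?content\]\(#[^)]*\)\s*$/i, // accessibility skip link
  /^Select Language/                             // flattened language dropdown
];

function cleanMarkdown(markdown) {
  return markdown
    .split('\n')
    .filter((line) => !noisePatterns.some((pattern) => pattern.test(line)))
    .join('\n');
}

const sample = [
  '[Skip to main content](#main-content)',
  'Select LanguageEnglishAfrikaansAlbanian',
  '# Title',
  'Body text'
].join('\n');

console.log(cleanMarkdown(sample));
```

Line-level patterns like these are easy to audit, but they are site-specific, so they suit per-source cleanup better than a global default.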
Like the idea of readability as an option. Great suggestion!
@oliviermills thank you for this. Just merged an option to remove non-content tags. #14
This is just a start and I think there is room for other improvements here.
Let me know if you have any feedback!
I suggest a cleaner function per my PR #16. It's slightly less aggressive but needs integration testing (#15) to see whether it affects the md conversion. I checked turndown and any customizations within the code base here, and it doesn't use style, so that should be OK.
Awesome, thanks @oliviermills! Will be checking it out soon.
Closing this one (#273 solves this issue).