firecrawl
[Feat] Add an option to exclude images in the final markdown
Add an option to the /scrape, /crawl, and /search endpoints to not include images in the final markdown - per user request via email.
@nickscamara @rafaelsideguide - We may want to think of a simpler way for users to customize the final markdown without us having to add an API parameter for each.
Maybe add an array of regex, that affects the markdown parsing?
Maybe add an array of html semantic tags (less complicated but less customizable)?
What about both?^
What about excluding via xpath or css selectors?
@nickscamara Is the idea to filter late, filter the markdown before returning? Or early, filter the html, before creating the markdown?
@mattjoyce Hmm, not sure yet. Ideally this should not even be in Firecrawl, as people can just regex it after getting the markdown, but we have been getting so many requests that it might make sense to implement it.
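For reference, the "just regex it after getting the markdown" approach could look roughly like the sketch below. The helper name `stripMarkdownImages` is hypothetical and is not part of Firecrawl; it only handles inline images of the form `![alt](src)` and assumes the URL contains no closing parenthesis.

```typescript
// Hypothetical client-side helper, not a Firecrawl API.
// Removes inline Markdown images such as ![alt](src "title").
function stripMarkdownImages(markdown: string): string {
  // "!" followed by [alt] and (target). The alt text may be empty.
  // Assumption: the target contains no ")" character.
  const inlineImage = /!\[[^\]]*\]\([^)]*\)/g;
  return markdown.replace(inlineImage, "");
}

// Ordinary links are left untouched; only the image is removed.
const cleaned = stripMarkdownImages(
  "See [docs](https://example.com/docs). ![logo](https://example.com/logo.png)"
);
console.log(cleaned);
```

Note that a linked image such as `[![alt](src)](href)` would be reduced to an empty link `[](href)`, which is one reason a filter on the HTML side may be cleaner than post-processing.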
To some degree the expectation is set as firecrawl does offer filtering, 'only main content' is a form of filtering out, and returning only links is filtering in.
It would be useful to understand the use case a bit. In my case I simply want to reduce the volume of returned matter. Links, especially those with external URLs, are a significant source of bloat.
I would support a nolinks option. Strip all reference and image links from markdown.
Noting that firecrawl is associated with LLM use, and therefore has to be sensitive to token count.
For instance, offering ways to reduce prompt tokens in Scrape reduces costs and improves LLM responses.
Inevitably, folk will want to use other LLMs which may have smaller context windows.
Makes a lot of sense @mattjoyce. Agreed. We should do it.
@nickscamara, I was thinking about this and poking about in the code. Given that it will be another option, there are a couple of choices to settle.
- Remove the entire link and title.
- Remove the link and leave the title.
I feel there is a reasonable case for both, but leaving the title is the better option, as it does not damage the content.
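The two choices above can be compared with a small sketch. This is an illustration on already-generated markdown using a regex, not the turndown-based change discussed in this thread; the function name `stripMarkdownLinks`, the `LinkMode` type, and the example.com URLs are all made up for the example.

```typescript
// Hypothetical illustration of the two options; not Firecrawl code.
type LinkMode = "drop" | "keepText";

function stripMarkdownLinks(markdown: string, mode: LinkMode): string {
  // Matches [text](target), optionally preceded by "!" (an image).
  // Assumption: the target contains no ")" character.
  const inlineLinkOrImage = /(!?)\[([^\]]*)\]\([^)]*\)/g;
  return markdown.replace(
    inlineLinkOrImage,
    (match: string, bang: string, text: string) => {
      if (bang === "!") return match; // leave images alone
      // "keepText" keeps the link title; "drop" removes the whole link.
      return mode === "keepText" ? text : "";
    }
  );
}

const productList =
  "- [product1](https://example.com/1)\n- [product2](https://example.com/2)";

console.log(stripMarkdownLinks(productList, "drop"));     // list items become empty
console.log(stripMarkdownLinks(productList, "keepText")); // titles survive
```

On a list where every item is a link, the "drop" mode leaves only empty bullets, while "keepText" preserves the readable content.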
As to where this adjustment could or should be made: it would be a fairly simple turndown rule in the HTML-to-Markdown process. Alternatively, a regex could be applied before that process. A turndown rule seems like an entirely appropriate use of that library, but perhaps not quite the separation of concerns you want.
thoughts?
fwiw, I have a turndown patch I will test this week.
My initial thought is to remove the whole thing (titles, alt tags, href, src), both for images and links.
Btw, we have had about 5 requests for this issue in the past few days, so I think it makes sense to prioritize it. @mattjoyce feel free to tackle it if you want. If not, I will try to work with @rafaelsideguide on it tomorrow.
OK, I have added a pageOption attribute, noLinks: boolean. This triggers a turndown rule which removes href and src links but leaves the titles. I tried removing everything, but it caused too much damage: for instance, on pages that have lists of product items where each item is a link, the whole list got removed. So now they will end up like this:
- product1
- product2
- product3
Pull Request here : https://github.com/mendableai/firecrawl/pull/251
#273 solves this issue. Now you can use pageOptions.removeTags: ['img']
or specify any specific id or class.
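As a rough sketch of how the option mentioned above might be passed, the snippet below builds a request body containing `pageOptions.removeTags`. Only `pageOptions.removeTags: ['img']` comes from this thread; the `ScrapeRequest` and `PageOptions` type names, the `buildScrapeRequest` helper, and the `url` field are assumptions made for illustration, and the exact syntax for targeting an id or class is not specified here.

```typescript
// Hypothetical types for illustration; only pageOptions.removeTags
// is taken from the thread.
interface PageOptions {
  removeTags?: string[];
}

interface ScrapeRequest {
  url: string; // assumed field name for the page to scrape
  pageOptions?: PageOptions;
}

// Builds a request body that asks for the given tags to be removed.
function buildScrapeRequest(url: string, removeTags: string[]): ScrapeRequest {
  return { url, pageOptions: { removeTags } };
}

// Exclude images from the final markdown.
const body = JSON.stringify(buildScrapeRequest("https://example.com", ["img"]));
console.log(body);
```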