[Feat] Add an option to exclude images in the final markdown

Open calebpeffer opened this issue 10 months ago • 8 comments

Add an option to the /scrape, /crawl, and /search endpoints to not include images in the final markdown - per user request via email.

@nickscamara @rafaelsideguide - We may want to think of a simpler way for users to customize the final markdown without us having to add an API parameter for each.

calebpeffer avatar Apr 26 '24 23:04 calebpeffer

Maybe add an array of regexes that affect the markdown parsing?

nickscamara avatar May 06 '24 18:05 nickscamara

Maybe add an array of html semantic tags (less complicated but less customizable)?

nickscamara avatar May 06 '24 18:05 nickscamara

What about both?^

nickscamara avatar May 06 '24 18:05 nickscamara

What about excluding via xpath or css selectors?

nickscamara avatar May 06 '24 18:05 nickscamara

@nickscamara Is the idea to filter late (filter the markdown before returning it) or early (filter the HTML before creating the markdown)?

mattjoyce avatar May 27 '24 04:05 mattjoyce

@mattjoyce Hmm, not sure yet. Ideally this should not even be in Firecrawl, as people can just regex it after getting the markdown, but we have been getting so many requests that it might make sense to implement it.
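
For reference, the "regex it after getting the markdown" approach could look roughly like the sketch below. It is not Firecrawl code: the pattern and the `strip_images` helper are illustrative assumptions, and they only cover inline images of the form `![alt](src)`, not reference-style images or images wrapped in links.

```python
import re

# Matches inline Markdown images such as ![alt](src "title").
# Assumption: the URL part contains no closing parenthesis.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_images(markdown: str) -> str:
    """Remove inline image syntax from a Markdown string (hypothetical helper)."""
    return IMAGE_PATTERN.sub("", markdown)


sample = "Intro ![logo](https://example.com/logo.png) text with a [link](https://example.com)."
print(strip_images(sample))
```

Ordinary links are left untouched because the pattern requires the leading `!`.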

nickscamara avatar May 28 '24 18:05 nickscamara

To some degree the expectation is already set, since firecrawl does offer filtering: 'only main content' is a form of filtering out, and returning only links is filtering in.

It would be useful to understand the use case a bit. In my case I simply want to reduce the volume of returned matter. Links, especially those with external URLs, are a significant source of bloat.

I would support a nolinks option that strips all reference and image links from the markdown.

Note that firecrawl is associated with LLM use and therefore has to be sensitive to token count.
For instance, including ways to reduce prompt tokens in Scrape reduces costs and improves LLM responses.

Inevitably, folk will want to use other LLMs which may have smaller context windows.

mattjoyce avatar May 28 '24 21:05 mattjoyce

Makes a lot of sense @mattjoyce. Agreed. We should do it.

nickscamara avatar May 28 '24 21:05 nickscamara

@nickscamara, I was thinking about this and poking about in the code. Given that it will be another option, there are a couple of choices to land.

  1. Remove the entire link and title.
  2. Remove the link and leave the title.

I feel there is a reasonable case for both, but leaving the title is the better option, as it does not damage the content.
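
To make the difference between the two choices concrete, here is a small sketch that applies each one to already-generated markdown. It uses a plain regex rather than a turndown rule, and the function names and pattern are assumptions made for illustration only.

```python
import re

# Inline links [title](url) and images ![alt](src); the optional "!" covers both.
# Assumption: URLs contain no closing parenthesis and titles contain no "]".
LINK_PATTERN = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")


def remove_links_entirely(markdown: str) -> str:
    """Choice 1: drop both the title and the URL."""
    return LINK_PATTERN.sub("", markdown)


def keep_titles_only(markdown: str) -> str:
    """Choice 2: drop the URL but keep the title (or alt text)."""
    return LINK_PATTERN.sub(r"\1", markdown)


listing = "- [product1](https://example.com/p/1)\n- [product2](https://example.com/p/2)"
print(repr(remove_links_entirely(listing)))
print(repr(keep_titles_only(listing)))
```

On a list whose items are nothing but links, the first choice leaves empty bullets, while the second keeps the readable titles.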

As to where this adjustment could or should be made: it would be a fairly simple turndown rule in the HTML-to-Markdown process, or a regex applied before that process. A turndown rule seems like an entirely appropriate use of that library, but perhaps not quite the separation of concerns you want.

thoughts?

fwiw, I have a turndown patch I will test this week.

mattjoyce avatar Jun 04 '24 09:06 mattjoyce

My initial thought is to remove the whole thing (titles, alt tags, href, src), both for images and links.

Btw, we had like 5 requests for this issue in the past few days, so I think it makes sense to prioritize it. @mattjoyce feel free to tackle it if you want. If not, I will try to work with @rafaelsideguide on it tomorrow.

nickscamara avatar Jun 06 '24 20:06 nickscamara

OK, I have added a pageOption attribute, noLinks: boolean. This triggers a turndown rule which removes href and src links but leaves the titles. I tried removing everything, but it caused too much damage. For instance, on pages that have lists of product items, they are often like this.

The whole list got removed. So now they will end up like this:

  • product1
  • product2
  • product3

mattjoyce avatar Jun 07 '24 11:06 mattjoyce

Pull request here: https://github.com/mendableai/firecrawl/pull/251

mattjoyce avatar Jun 07 '24 11:06 mattjoyce

#273 solves this issue. Now you can use pageOptions.removeTags: ['img'] or specify any specific id or class.
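
As a rough illustration, a request body using this option might be assembled as below. Only `pageOptions.removeTags` and the `'img'` value come from the comment above; the `url` field name and the `build_scrape_payload` helper are assumptions, so check the API documentation for the exact request shape.

```python
import json


def build_scrape_payload(url: str, remove_tags: list[str]) -> dict:
    """Build a scrape request body (hypothetical helper; field layout partly assumed)."""
    return {
        "url": url,  # assumed field name for the page to scrape
        "pageOptions": {"removeTags": remove_tags},  # option named in the comment above
    }


payload = build_scrape_payload("https://example.com", ["img"])
print(json.dumps(payload, indent=2))
```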

rafaelsideguide avatar Jun 14 '24 16:06 rafaelsideguide