firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Replaced the exclude tag list with a function that does a nicer and safer cleanup. Resolves #1. Added basic tests for the function. _Important_: should add an **integration** test with...
Add the ability to filter related websites by regex, for instance: "https://www.archdaily.com/1015605/bandhan-residential-school-of-business-abin-design-studio"
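A minimal sketch of what such a regex filter could look like, written in Python purely for illustration. The `filter_urls` helper and the `ARTICLE_PATTERN` expression are hypothetical and are not part of firecrawl's API; the pattern assumes the goal is to keep article pages shaped like the example URL (numeric id followed by a slug).

```python
import re

def filter_urls(urls, pattern):
    """Keep only the URLs that match the given regular expression."""
    compiled = re.compile(pattern)
    return [url for url in urls if compiled.search(url)]

# Hypothetical pattern: article pages with a numeric id followed by a slug.
ARTICLE_PATTERN = r"^https://www\.archdaily\.com/\d+/[\w-]+/?$"

candidates = [
    "https://www.archdaily.com/1015605/bandhan-residential-school-of-business-abin-design-studio",
    "https://www.archdaily.com/search/projects",
    "https://www.archdaily.com/about",
]

# Only the first candidate has the id-plus-slug shape, so only it is kept.
matching = filter_urls(candidates, ARTICLE_PATTERN)
print(matching)
```

Anchoring the pattern with `^` and `$` keeps the filter strict; a crawler that should follow looser matches could drop the anchors and rely on `search` alone.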
When scraping, and especially when crawling, provide the ability to have all relative URLs changed to absolute URLs (for further processing or link extraction). E.g. `[The PDF file][/assets/file.pdf]` => `[The PDF...
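One way the relative-to-absolute rewrite could be approached, sketched in Python with the standard library only. The `absolutize_links` helper, the regular expression, and the `example.com` base URL are assumptions for illustration, not firecrawl code. The sketch handles only simple inline links of the form `[text](target)`; link titles, targets with spaces, and nested parentheses are out of scope.

```python
import re
from urllib.parse import urljoin

# Matches simple inline Markdown links and images: [text](target) or ![alt](target).
INLINE_LINK = re.compile(r"(!?\[[^\]]*\])\(([^)\s]+)\)")

def absolutize_links(markdown, base_url):
    """Rewrite relative link targets so they are absolute against base_url."""
    def replace(match):
        label, target = match.group(1), match.group(2)
        # urljoin leaves already-absolute targets unchanged.
        return f"{label}({urljoin(base_url, target)})"
    return INLINE_LINK.sub(replace, markdown)

# Hypothetical page URL that the Markdown was scraped from.
page = "https://example.com/docs/guide/"
text = "See [The PDF file](/assets/file.pdf) and [the next page](next.html)."
result = absolutize_links(text, page)
print(result)
```

Resolving against the URL of the scraped page (rather than the site root) matters for targets like `next.html`, which are relative to the page's own directory.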
Consider adding Haiku, or replacing the current model with Haiku, for images in [utils/gptVision.ts](https://github.com/mendableai/firecrawl/blob/main/apps/api/src/scraper/WebScraper/utils/gptVision.ts). The same prompt will work well. Also, you should probably shift to `gpt-4-turbo`, which is now [recommended](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) instead of...
The markdown would be much more useful if you stripped headers/footers and other elements, such as filters, that are not core content (i.e. low value for RAG/context). Either using tag...
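A tag-based version of this idea can be sketched with Python's standard `html.parser`. The `BOILERPLATE_TAGS` set and the `CoreTextExtractor` class are hypothetical names chosen for this example, and the tag list is an assumption; real sites often need class- or id-based rules as well. The sketch extracts plain text rather than Markdown to stay short.

```python
from html.parser import HTMLParser

# Assumed set of low-value containers; adjust per site.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class CoreTextExtractor(HTMLParser):
    """Collect text that is not nested inside any boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many boilerplate containers we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside every boilerplate container.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_core_text(html):
    parser = CoreTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample = """
<html><body>
<header><a href="/">Home</a></header>
<nav>Filters: A B C</nav>
<main><h1>Title</h1><p>Core paragraph.</p></main>
<footer>Copyright</footer>
</body></html>
"""
core = extract_core_text(sample)
print(core)
```

Tracking a depth counter instead of a boolean keeps the extractor correct when boilerplate containers are nested, for example a `nav` inside a `header`.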
When tweaking and growing the HTML cleanup and HTML-to-Markdown conversion, I highly recommend adding integration tests using either live webpages (to also test the GET/network path and dynamic websites) OR at...
These are viktor-invented categories. [_source_](https://github.com/szepeviktor/debian-server-tools/blob/master/.gitignore#L15)