Matt Joyce comments

Results 91 comments of



[sdk] ci pipeline for publishing python/node sdk

ok, here is a draft, or example, of how the script would be used. Completely theoretical.

```
name: CI/CD
on:
  push:
    branches:
      - main
jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    ...
```

[sdk] Timeout param in crawl is confusing

The purpose of the parameter is to set the period between checks, so... (check, poll, status) x (period, interval, frequency):

- check-period, check-interval, check-frequency
- poll-period, **poll-interval**, poll-frequency
- status-period, status-interval, status-frequency
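To make the naming question concrete, here is a minimal sketch of a polling loop in which the period between checks and the overall time limit are two separate parameters. This is not the SDK's implementation: `wait_until_done`, `poll_interval`, `max_wait`, and the status strings `"completed"` and `"failed"` are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_done(
    check_status: Callable[[], str],
    poll_interval: float = 2.0,
    max_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call check_status every poll_interval seconds until a final state.

    poll_interval is the pause between two checks; max_wait is the real
    timeout. Both names and the status strings are hypothetical.
    """
    waited = 0.0
    while True:
        status = check_status()
        if status in ("completed", "failed"):
            return status
        # Give up if the next pause would exceed the overall limit.
        if waited + poll_interval > max_wait:
            raise TimeoutError(f"job still '{status}' after {waited:.1f}s")
        sleep(poll_interval)
        waited += poll_interval
```

With the two values split like this, a name such as `poll_interval` describes only the pause between checks, and "timeout" stays free for the overall limit.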

[BUG] Search is returning "no page found" on self-host

Can confirm, this happens to me too, with python-sdk. However, it only happens when I provide scrape options.

This works: `scraped_data = app.scrape_url(url)`

This fails:

```
params = {...
```

[BUG] Search is returning "no page found" on self-host

Ok, so this does not work.

```
payload = {
    "url": "https://www.humblebundle.com/",
    "pageOptions": {
        "includeHtml": True,
        "waitFor": 123,
        "onlyMainContent": True,
    }
}
```

but this does.

```
payload = {...
```

[BUG] Search is returning "no page found" on self-host

@rafaelsideguide, I see a lot of these, on every scrape actually.

> api-1 | Error fetching w/ playwright server -> URL: https://www.humblebundle.com/ with status: 404 and 'Page Cannot Be Found'...

[BUG] Search is returning "no page found" on self-host

@rafaelsideguide yes, pretty sure that is fixed by that patch.

```
app = FirecrawlApp(api_url="http://localhost:3002")
url = "https://www.humblebundle.com"
params = {
    "pageOptions": {
        "includeHtml": False,
        "waitFor": 123,
        "onlyMainContent": True,
    }
}
scraped_data = app.scrape_url(url, params=params)
```

Works...

[Feat] Scrape markdown response ideally is splitted in seperate chunks / parts

@ChrisMeye , the idea has merit, but the problem is there are a lot of different chunking strategies, and the right one is going to be dependent on the specific...

[Feat] Scrape markdown response ideally is splitted in seperate chunks / parts

Interesting reading:

- [5_Levels_Of_Text_Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
- [Embedding short and long content](https://www.pinecone.io/learn/chunking-strategies/)
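To illustrate why there is no single right answer, here is a rough sketch of two of the simplest strategies applied client-side to scraped markdown: fixed-size character windows with overlap, and splitting on markdown headings. The function names and default sizes are made up for this example, and neither is something the API provides.

```python
import re
from typing import List


def chunk_fixed(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Cut text into windows of `size` characters, each overlapping the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def chunk_by_heading(markdown: str) -> List[str]:
    """Start a new chunk at every ATX heading line ('#' to '######' plus a space)."""
    # Zero-width split: the lookahead keeps the heading with its own section.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]
```

The same page gives very different chunks under each function, which is the argument for leaving the choice to the caller rather than baking one strategy into the scrape response.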

[Feat] Abstract the LLM Extraction to allow for other models and a model providers

Potentially cleave this part off into a microservice and provide a way to configure an alternate endpoint.
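A very rough sketch of what a configurable endpoint could look like from the caller's side. Everything here is hypothetical: the `LLM_EXTRACTION_URL` and `LLM_EXTRACTION_MODEL` variable names, the default URL, and the payload shape are assumptions for illustration, not existing settings in the project.

```python
import os
from dataclasses import dataclass
from typing import Any, Dict, Mapping


@dataclass(frozen=True)
class ExtractionBackend:
    """Where extraction requests go and which model to ask for (hypothetical)."""
    endpoint: str
    model: str


def backend_from_env(env: Mapping[str, str] = os.environ) -> ExtractionBackend:
    # Hypothetical variable names; fall back to a local placeholder service.
    return ExtractionBackend(
        endpoint=env.get("LLM_EXTRACTION_URL", "http://localhost:8080/extract"),
        model=env.get("LLM_EXTRACTION_MODEL", "default"),
    )


def build_request(backend: ExtractionBackend, markdown: str,
                  schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build the request without sending it, so any HTTP client can be used."""
    return {
        "url": backend.endpoint,
        "json": {"model": backend.model, "content": markdown, "schema": schema},
    }
```

Keeping the backend as plain data means swapping providers is a configuration change, not a code change.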

[Feat] Abstract the LLM Extraction to allow for other models and a model providers

Interesting, relevant and very active project: https://github.com/jxnl/instructor