
[BUG] crawlerOptions.limit is not respected when returnOnlyUrls is set to false

nickscamara opened this issue on Jul 13 '24 · 2 comments

@nickscamara I'm not able to reproduce this.

I tested with https://www.lsu.edu/majors for these parameters:

| limit | returnOnlyUrls | Retrieved URLs |
|-------|----------------|----------------|
| 0     | True           | 0              |
| 10    | True           | 10             |
| 50    | True           | 50             |
| 100   | True           | 100            |
| 200   | True           | 142            |
| None  | True           | 142            |
| 0     | False          | 0              |
| 10    | False          | 10             |
| 50    | False          | 50             |
| 100   | False          | 100            |
| 200   | False          | 142            |
| None  | False          | 142            |

Code used for testing:

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"

limits = [0, 10, 50, 100, 200, None]
returnOnlyUrls = [True, False]
urls_crawled = []

for j in range(len(returnOnlyUrls)):
    for i in range(len(limits)):
        params = {
            "crawlerOptions": {},
        }

        if limits[i] is not None:
            params['crawlerOptions']['limit'] = limits[i]

        if returnOnlyUrls[j]:
            params['crawlerOptions']['returnOnlyUrls'] = True

        # start the crawl without waiting, then poll the job status
        crawl_request = app.crawl_url(crawl_url, params=params, wait_until_done=False)

        job_id = crawl_request['jobId']
        status = app.check_crawl_status(job_id)

        while status['status'] == 'active':
            status = app.check_crawl_status(job_id)
            time.sleep(2)

        time.sleep(5)  # wait for the data to be saved to the db
        status = app.check_crawl_status(job_id)
        urls_crawled.append({
            "limit": limits[i],
            "returnOnlyUrls": returnOnlyUrls[j],
            "num_urls": len(status['data'])
        })

for result in urls_crawled:
    print(result)

rafaelsideguide · Aug 02 '24

Hello,

I just built the Docker image today and am running it locally. With the above code (omitting the None option) I get different results:

| Limit | Return Only URLs | Number of URLs |
|-------|------------------|----------------|
| 0     | True             | 1              |
| 10    | True             | 1              |
| 50    | True             | 1              |
| 100   | True             | 1              |
| 200   | True             | 1              |
| 0     | False            | 548            |
| 10    | False            | 279            |
| 50    | False            | 279            |
| 100   | False            | 548            |
| 200   | False            | 279            |

For what it's worth, I also tried the test from #435 regarding depth and limit, with similarly confusing results (see the sketch after the table):

| Limit | MaxDepth 0 | MaxDepth 2 | MaxDepth 5 |
|-------|------------|------------|------------|
| 0     | 1          | 5          | 549        |
| 10    | 1          | 5          | 279        |
| 100   | 1          | 5          | 279        |
| 200   | 1          | 5          | 549        |
| 500   | 1          | 5          | 279        |
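The depth runs above presumably just add a maxDepth entry to the same crawlerOptions dict used in the test script earlier in the thread; a minimal sketch of that variation (assuming a v0 crawlerOptions.maxDepth field, which is not shown in the original script):

```python
# Sketch only: same v0 SDK calls as the test script above, with maxDepth added.
for depth in [0, 2, 5]:
    for limit in [0, 10, 100, 200, 500]:
        params = {"crawlerOptions": {"limit": limit, "maxDepth": depth}}
        crawl_request = app.crawl_url(crawl_url, params=params, wait_until_done=False)
        # ...then poll app.check_crawl_status(crawl_request['jobId']) as above
        # and record len(status['data']) for each (limit, depth) pair.
```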

Could there be an issue when deploying locally?
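If it helps narrow down the local-deployment question, a request made straight to the local API (bypassing the Python SDK) should show whether the limit is applied server-side. This is only a sketch: it assumes the default self-hosted port 3002, the v0 /v0/crawl and /v0/crawl/status endpoints that the SDK above wraps, and a dummy API key like the fc-123 used later in this thread.

```python
import time

import requests

BASE = "http://localhost:3002"  # default self-hosted API port (assumption)
HEADERS = {"Authorization": "Bearer fc-123"}  # dummy key for a local deployment

payload = {
    "url": "https://www.lsu.edu/majors",
    "crawlerOptions": {"limit": 10, "returnOnlyUrls": False},
}

# Kick off the crawl and grab the job id from the response.
job_id = requests.post(f"{BASE}/v0/crawl", json=payload, headers=HEADERS).json()["jobId"]

# Poll until the job is no longer active, then count the returned documents.
status = requests.get(f"{BASE}/v0/crawl/status/{job_id}", headers=HEADERS).json()
while status["status"] == "active":
    time.sleep(2)
    status = requests.get(f"{BASE}/v0/crawl/status/{job_id}", headers=HEADERS).json()

print(status["status"], len(status.get("data") or []))  # should not exceed the limit
```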

lawtj · Aug 27 '24

Found the bug:

Previously, the env.local file had the variables set like this:

SCRAPING_BEE_API_KEY=# set if you'd like to use scraping Be to handle JS blocking

The issue is that with Docker, the comment (everything after the =) is interpreted as the variable's value. This causes it to pass the conditions in the scraper strategies (e.g., https://github.com/mendableai/firecrawl/blob/main/apps/api/src/scraper/WebScraper/single_url.ts#L268).
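As a quick illustration of why the comment slips through (written in Python rather than the TypeScript linked above, purely as a sketch of the same truthiness check): once Docker passes the comment text through as the value, any "is this key set?" guard sees a non-empty string and takes the ScrapingBee code path.

```python
import os

# What the container effectively sees with the old env.local (hypothetical reproduction):
os.environ["SCRAPING_BEE_API_KEY"] = "# set if you'd like to use scraping Be to handle JS blocking"

# A guard equivalent to the one in the scraper strategies now passes,
# because any non-empty string is truthy:
if os.environ.get("SCRAPING_BEE_API_KEY"):
    print("ScrapingBee path selected even though no real key was configured")

# With the corrected file the value is an empty string, which is falsy,
# so the guard is skipped as intended:
os.environ["SCRAPING_BEE_API_KEY"] = ""
print(bool(os.environ.get("SCRAPING_BEE_API_KEY")))  # False
```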

To fix this, update the environment variable as follows:

# set if you'd like to use scraping Be to handle JS blocking
SCRAPING_BEE_API_KEY=

This resolves the issue.

Code used for testing:

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-123", api_url="http://localhost:3002")
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"

limits = [0, 10, 50, 100, 200, None]
withFormatLinks = [True, False]
maxDepth = [0, 2, 5, None]
urls_crawled = []

for k in range(len(maxDepth)):
    for j in range(len(withFormatLinks)):
        for i in range(len(limits)):
            params = {
                "scrapeOptions": {"formats": ["markdown"]}
            }

            if limits[i] is not None:
                params['limit'] = limits[i]

            if withFormatLinks[j]:
                params['scrapeOptions']['formats'] = ["links", "markdown"]

            if maxDepth[k] is not None:
                params['maxDepth'] = maxDepth[k]

            # start the crawl without waiting, then poll the job status
            crawl_request = app.async_crawl_url(crawl_url, params=params)

            job_id = crawl_request['id']
            status = app.check_crawl_status(job_id)

            while status['status'] != 'completed':
                status = app.check_crawl_status(job_id)
                time.sleep(2)

            time.sleep(5)  # wait for the data to be saved to the db
            status = app.check_crawl_status(job_id)
            urls_crawled.append({
                "limit": limits[i],
                "maxDepth": maxDepth[k],
                "with_format_links": withFormatLinks[j],
                "num_urls": len(status['data'])
            })

for result in urls_crawled:
    print(result)

The results:

| limit | maxDepth | with_format_links | num_urls |
|-------|----------|-------------------|----------|
| 0     | 0        | True              | 1        |
| 10    | 0        | True              | 1        |
| 50    | 0        | True              | 1        |
| 100   | 0        | True              | 1        |
| 200   | 0        | True              | 1        |
| None  | 0        | True              | 1        |
| 0     | 0        | False             | 1        |
| 10    | 0        | False             | 1        |
| 50    | 0        | False             | 1        |
| 100   | 0        | False             | 1        |
| 200   | 0        | False             | 1        |
| None  | 0        | False             | 1        |
| 0     | 2        | True              | 1        |
| 10    | 2        | True              | 5        |
| 50    | 2        | True              | 5        |
| 100   | 2        | True              | 5        |
| 200   | 2        | True              | 5        |
| None  | 2        | True              | 5        |
| 0     | 2        | False             | 1        |
| 10    | 2        | False             | 5        |
| 50    | 2        | False             | 5        |
| 100   | 2        | False             | 5        |
| 200   | 2        | False             | 5        |
| None  | 2        | False             | 5        |
| 0     | 5        | True              | 1        |
| 10    | 5        | True              | 10       |
| 50    | 5        | True              | 50       |
| 100   | 5        | True              | 100      |
| 200   | 5        | True              | 147      |
| None  | 5        | True              | 147      |
| 0     | 5        | False             | 1        |
| 10    | 5        | False             | 10       |
| 50    | 5        | False             | 52       |
| 100   | 5        | False             | 100      |
| 200   | 5        | False             | 147      |
| None  | 5        | False             | 147      |
| 0     | None     | True              | 1        |
| 10    | None     | True              | 10       |
| 50    | None     | True              | 50       |
| 100   | None     | True              | 100      |
| 200   | None     | True              | 147      |
| None  | None     | True              | 147      |
| 0     | None     | False             | 1        |
| 10    | None     | False             | 10       |
| 50    | None     | False             | 50       |
| 100   | None     | False             | 100      |
| 200   | None     | False             | 147      |
| None  | None     | False             | 147      |

rafaelsideguide · Oct 24 '24