
[BUG] crawlerOptions.limit is not respected when returnOnlyUrls is set to false

nickscamara opened this issue on Jul 13 '24 · 2 comments

@nickscamara I'm not able to reproduce this.

I tested with https://www.lsu.edu/majors for these parameters:

| limit | returnOnlyUrls | Retrieved URLs |
|-------|----------------|----------------|
| 0     | True           | 0              |
| 10    | True           | 10             |
| 50    | True           | 50             |
| 100   | True           | 100            |
| 200   | True           | 142            |
| None  | True           | 142            |
| 0     | False          | 0              |
| 10    | False          | 10             |
| 50    | False          | 50             |
| 100   | False          | 100            |
| 200   | False          | 142            |
| None  | False          | 142            |

Code used for testing:

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"

limits = [0, 10, 50, 100, 200, None]
returnOnlyUrls = [True, False]
urls_crawled = []

for j in range(len(returnOnlyUrls)):
    for i in range(len(limits)):
        params = {
            "crawlerOptions": {},
        }

        if limits[i] is not None:
            params['crawlerOptions']['limit'] = limits[i]

        if returnOnlyUrls[j]:
            params['crawlerOptions']['returnOnlyUrls'] = True

        # start the crawl without waiting, then poll the job status
        crawl_request = app.crawl_url(crawl_url, params=params, wait_until_done=False)

        job_id = crawl_request['jobId']
        status = app.check_crawl_status(job_id)

        while status['status'] == 'active':
            status = app.check_crawl_status(job_id)
            time.sleep(2)

        time.sleep(5)  # wait for the data to be saved to the db
        status = app.check_crawl_status(job_id)
        urls_crawled.append({
            "limit": limits[i],
            "returnOnlyUrls": returnOnlyUrls[j],
            "num_urls": len(status['data'])
        })

for result in urls_crawled:
    print(result)

rafaelsideguide · Aug 02 '24

Hello,

I just built the Docker image today and am running it locally. With the above code (omitting the None option) I get different results:

| Limit | Return Only URLs | Number of URLs |
|-------|------------------|----------------|
| 0     | True             | 1              |
| 10    | True             | 1              |
| 50    | True             | 1              |
| 100   | True             | 1              |
| 200   | True             | 1              |
| 0     | False            | 548            |
| 10    | False            | 279            |
| 50    | False            | 279            |
| 100   | False            | 548            |
| 200   | False            | 279            |

For what it's worth, I also tried the test from #435 regarding depth and limit, with similarly confusing results (see the sketch after the table):

| Limit | MaxDepth 0 | MaxDepth 2 | MaxDepth 5 |
|-------|------------|------------|------------|
| 0     | 1          | 5          | 549        |
| 10    | 1          | 5          | 279        |
| 100   | 1          | 5          | 279        |
| 200   | 1          | 5          | 549        |
| 500   | 1          | 5          | 279        |
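The depth runs above presumably just add a maxDepth entry to the same crawlerOptions dict used in the test script earlier in the thread; a minimal sketch of that variation (assuming a v0 crawlerOptions.maxDepth field, which is not shown in the original script):

```python
# Sketch only: same v0 SDK calls as the test script above, with maxDepth added.
for depth in [0, 2, 5]:
    for limit in [0, 10, 100, 200, 500]:
        params = {"crawlerOptions": {"limit": limit, "maxDepth": depth}}
        crawl_request = app.crawl_url(crawl_url, params=params, wait_until_done=False)
        # ...then poll app.check_crawl_status(crawl_request['jobId']) as above
        # and record len(status['data']) for each (limit, depth) pair.
```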

Could there be an issue when deploying locally?
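If it helps narrow down the local-deployment question, a request made straight to the local API (bypassing the Python SDK) should show whether the limit is applied server-side. This is only a sketch: it assumes the default self-hosted port 3002, the v0 /v0/crawl and /v0/crawl/status endpoints that the SDK above wraps, and a dummy API key like the fc-123 used later in this thread.

```python
import time

import requests

BASE = "http://localhost:3002"  # default self-hosted API port (assumption)
HEADERS = {"Authorization": "Bearer fc-123"}  # dummy key for a local deployment

payload = {
    "url": "https://www.lsu.edu/majors",
    "crawlerOptions": {"limit": 10, "returnOnlyUrls": False},
}

# Kick off the crawl and grab the job id from the response.
job_id = requests.post(f"{BASE}/v0/crawl", json=payload, headers=HEADERS).json()["jobId"]

# Poll until the job is no longer active, then count the returned documents.
status = requests.get(f"{BASE}/v0/crawl/status/{job_id}", headers=HEADERS).json()
while status["status"] == "active":
    time.sleep(2)
    status = requests.get(f"{BASE}/v0/crawl/status/{job_id}", headers=HEADERS).json()

print(status["status"], len(status.get("data") or []))  # should not exceed the limit
```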

lawtj · Aug 27 '24

Found the bug:

Previously, the env.local file had the variables set like this:

SCRAPING_BEE_API_KEY=# set if you'd like to use scraping Be to handle JS blocking

The issue is that with Docker, the comment (everything after the =) is interpreted as the variable's value. This causes it to pass the conditions in the scraper strategies (e.g., https://github.com/mendableai/firecrawl/blob/main/apps/api/src/scraper/WebScraper/single_url.ts#L268).
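As a quick illustration of why the comment slips through (written in Python rather than the TypeScript linked above, purely as a sketch of the same truthiness check): once Docker passes the comment text through as the value, any "is this key set?" guard sees a non-empty string and takes the ScrapingBee code path.

```python
import os

# What the container effectively sees with the old env.local (hypothetical reproduction):
os.environ["SCRAPING_BEE_API_KEY"] = "# set if you'd like to use scraping Be to handle JS blocking"

# A guard equivalent to the one in the scraper strategies now passes,
# because any non-empty string is truthy:
if os.environ.get("SCRAPING_BEE_API_KEY"):
    print("ScrapingBee path selected even though no real key was configured")

# With the corrected file the value is an empty string, which is falsy,
# so the guard is skipped as intended:
os.environ["SCRAPING_BEE_API_KEY"] = ""
print(bool(os.environ.get("SCRAPING_BEE_API_KEY")))  # False
```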

To fix this, update the environment variable as follows:

# set if you'd like to use scraping Be to handle JS blocking
SCRAPING_BEE_API_KEY=

This resolves the issue.

Code used for testing:

import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-123", api_url="http://localhost:3002")
start_time = time.time()
crawl_url = "https://www.lsu.edu/majors"

limits = [0, 10, 50, 100, 200, None]
withFormatLinks = [True, False]
maxDepth = [0, 2, 5, None]
urls_crawled = []

for k in range(len(maxDepth)):
    for j in range(len(withFormatLinks)):
        for i in range(len(limits)):
            params = {
                "scrapeOptions": {"formats": ["markdown"]}
            }

            if limits[i] is not None:
                params['limit'] = limits[i]

            if withFormatLinks[j]:
                params['scrapeOptions']['formats'] = ["links", "markdown"]

            if maxDepth[k] is not None:
                params['maxDepth'] = maxDepth[k]

            # start the crawl without waiting, then poll the job status
            crawl_request = app.async_crawl_url(crawl_url, params=params)

            job_id = crawl_request['id']
            status = app.check_crawl_status(job_id)

            while status['status'] != 'completed':
                status = app.check_crawl_status(job_id)
                time.sleep(2)

            time.sleep(5)  # wait for the data to be saved to the db
            status = app.check_crawl_status(job_id)
            urls_crawled.append({
                "limit": limits[i],
                "maxDepth": maxDepth[k],
                "with_format_links": withFormatLinks[j],
                "num_urls": len(status['data'])
            })

for result in urls_crawled:
    print(result)

The results:

| limit | maxDepth | with_format_links | num_urls |
|-------|----------|-------------------|----------|
| 0     | 0        | True              | 1        |
| 10    | 0        | True              | 1        |
| 50    | 0        | True              | 1        |
| 100   | 0        | True              | 1        |
| 200   | 0        | True              | 1        |
| None  | 0        | True              | 1        |
| 0     | 0        | False             | 1        |
| 10    | 0        | False             | 1        |
| 50    | 0        | False             | 1        |
| 100   | 0        | False             | 1        |
| 200   | 0        | False             | 1        |
| None  | 0        | False             | 1        |
| 0     | 2        | True              | 1        |
| 10    | 2        | True              | 5        |
| 50    | 2        | True              | 5        |
| 100   | 2        | True              | 5        |
| 200   | 2        | True              | 5        |
| None  | 2        | True              | 5        |
| 0     | 2        | False             | 1        |
| 10    | 2        | False             | 5        |
| 50    | 2        | False             | 5        |
| 100   | 2        | False             | 5        |
| 200   | 2        | False             | 5        |
| None  | 2        | False             | 5        |
| 0     | 5        | True              | 1        |
| 10    | 5        | True              | 10       |
| 50    | 5        | True              | 50       |
| 100   | 5        | True              | 100      |
| 200   | 5        | True              | 147      |
| None  | 5        | True              | 147      |
| 0     | 5        | False             | 1        |
| 10    | 5        | False             | 10       |
| 50    | 5        | False             | 52       |
| 100   | 5        | False             | 100      |
| 200   | 5        | False             | 147      |
| None  | 5        | False             | 147      |
| 0     | None     | True              | 1        |
| 10    | None     | True              | 10       |
| 50    | None     | True              | 50       |
| 100   | None     | True              | 100      |
| 200   | None     | True              | 147      |
| None  | None     | True              | 147      |
| 0     | None     | False             | 1        |
| 10    | None     | False             | 10       |
| 50    | None     | False             | 50       |
| 100   | None     | False             | 100      |
| 200   | None     | False             | 147      |
| None  | None     | False             | 147      |

rafaelsideguide · Oct 24 '24