[BUG] crawlerOptions.limit is not respected when returnOnlyUrls is set to false
@nickscamara I'm not able to reproduce this.
I tested https://www.lsu.edu/majors with the following parameter combinations:
| limit | returnOnlyUrls | Retrieved URLs |
|---|---|---|
| 0 | True | 0 |
| 10 | True | 10 |
| 50 | True | 50 |
| 100 | True | 100 |
| 200 | True | 142 |
| None | True | 142 |
| 0 | False | 0 |
| 10 | False | 10 |
| 50 | False | 50 |
| 100 | False | 100 |
| 200 | False | 142 |
| None | False | 142 |
Code used for testing:
```python
import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)
crawl_url = "https://www.lsu.edu/majors"
limits = [0, 10, 50, 100, 200, None]
return_only_urls_options = [True, False]
urls_crawled = []
for return_only_urls in return_only_urls_options:
    for limit in limits:
        params = {"crawlerOptions": {}}
        if limit is not None:
            params["crawlerOptions"]["limit"] = limit
        if return_only_urls:
            params["crawlerOptions"]["returnOnlyUrls"] = True
        crawl_request = app.crawl_url(crawl_url, params=params, wait_until_done=False)
        job_id = crawl_request["jobId"]
        status = app.check_crawl_status(job_id)
        while status["status"] == "active":
            status = app.check_crawl_status(job_id)
            time.sleep(2)
        time.sleep(5)  # wait for the data to be saved in the db
        status = app.check_crawl_status(job_id)
        urls_crawled.append({
            "limit": limit,
            "returnOnlyUrls": return_only_urls,
            "num_urls": len(status["data"]),
        })
for result in urls_crawled:
    print(result)
```
Hello,
I just built the Docker image today and am running it locally. With the above code (omitting the `None` option) I get different results:
| Limit | Return Only URLs | Number of URLs |
|---|---|---|
| 0 | True | 1 |
| 10 | True | 1 |
| 50 | True | 1 |
| 100 | True | 1 |
| 200 | True | 1 |
| 0 | False | 548 |
| 10 | False | 279 |
| 50 | False | 279 |
| 100 | False | 548 |
| 200 | False | 279 |
For what it's worth, I also tried the test in #435, regarding `maxDepth` and `limit`, with similarly confusing results:
| Limit | MaxDepth 0 | MaxDepth 2 | MaxDepth 5 |
|---|---|---|---|
| 0 | 1 | 5 | 549 |
| 10 | 1 | 5 | 279 |
| 100 | 1 | 5 | 279 |
| 200 | 1 | 5 | 549 |
| 500 | 1 | 5 | 279 |
Could there be an issue when deploying locally?
Found the bug:
Previously, the `env.local` file had the variable set like this:

```
SCRAPING_BEE_API_KEY=# set if you'd like to use scraping Be to handle JS blocking
```

The issue is that with Docker, the comment (everything after the `=`) is interpreted as the variable's value. The non-empty value then passes the conditions in the scraper strategies (e.g., https://github.com/mendableai/firecrawl/blob/main/apps/api/src/scraper/WebScraper/single_url.ts#L268).

To fix this, move the comment onto its own line so the variable is left empty:

```
# set if you'd like to use scraping Be to handle JS blocking
SCRAPING_BEE_API_KEY=
```

This resolves the issue.
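The failure mode can be reproduced without Docker: an env-file parser in this style skips full-line comments but does not strip inline ones, so everything after the first `=` becomes the value, and a truthiness check on the variable then passes. A minimal sketch (the parser below is an illustration of this parsing style, not Docker's actual code):

```python
def parse_env_line(line: str):
    """Naive KEY=VALUE parsing in the style of an env file:
    full-line comments are skipped, but inline comments are NOT stripped."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None  # blank line or full-line comment
    key, _, value = stripped.partition("=")
    return key, value  # everything after '=' is the value, comment included

# The problematic line: the inline comment becomes the value
key, value = parse_env_line(
    "SCRAPING_BEE_API_KEY=# set if you'd like to use scraping Be to handle JS blocking"
)
print(bool(value))  # True -> a check like `if (process.env.SCRAPING_BEE_API_KEY)` passes

# The fixed layout: comment on its own line, variable left empty
assert parse_env_line("# set if you'd like to use scraping Be to handle JS blocking") is None
key, value = parse_env_line("SCRAPING_BEE_API_KEY=")
print(bool(value))  # False -> the ScrapingBee branch is skipped
```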
Code used for testing:
```python
import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-123", api_url="http://localhost:3002")
crawl_url = "https://www.lsu.edu/majors"
limits = [0, 10, 50, 100, 200, None]
with_format_links_options = [True, False]
max_depths = [0, 2, 5, None]
urls_crawled = []
for max_depth in max_depths:
    for with_format_links in with_format_links_options:
        for limit in limits:
            params = {"scrapeOptions": {"formats": ["markdown"]}}
            if limit is not None:
                params["limit"] = limit
            if with_format_links:
                params["scrapeOptions"]["formats"] = ["links", "markdown"]
            if max_depth is not None:
                params["maxDepth"] = max_depth
            crawl_request = app.async_crawl_url(crawl_url, params=params)
            job_id = crawl_request["id"]
            status = app.check_crawl_status(job_id)
            while status["status"] != "completed":
                status = app.check_crawl_status(job_id)
                time.sleep(2)
            time.sleep(5)  # wait for the data to be saved in the db
            status = app.check_crawl_status(job_id)
            urls_crawled.append({
                "limit": limit,
                "maxDepth": max_depth,
                "with_format_links": with_format_links,
                "num_urls": len(status["data"]),
            })
for result in urls_crawled:
    print(result)
```
The results:
| limit | maxDepth | with_format_links | num_urls |
|---|---|---|---|
| 0 | 0 | True | 1 |
| 10 | 0 | True | 1 |
| 50 | 0 | True | 1 |
| 100 | 0 | True | 1 |
| 200 | 0 | True | 1 |
| None | 0 | True | 1 |
| 0 | 0 | False | 1 |
| 10 | 0 | False | 1 |
| 50 | 0 | False | 1 |
| 100 | 0 | False | 1 |
| 200 | 0 | False | 1 |
| None | 0 | False | 1 |
| 0 | 2 | True | 1 |
| 10 | 2 | True | 5 |
| 50 | 2 | True | 5 |
| 100 | 2 | True | 5 |
| 200 | 2 | True | 5 |
| None | 2 | True | 5 |
| 0 | 2 | False | 1 |
| 10 | 2 | False | 5 |
| 50 | 2 | False | 5 |
| 100 | 2 | False | 5 |
| 200 | 2 | False | 5 |
| None | 2 | False | 5 |
| 0 | 5 | True | 1 |
| 10 | 5 | True | 10 |
| 50 | 5 | True | 50 |
| 100 | 5 | True | 100 |
| 200 | 5 | True | 147 |
| None | 5 | True | 147 |
| 0 | 5 | False | 1 |
| 10 | 5 | False | 10 |
| 50 | 5 | False | 52 |
| 100 | 5 | False | 100 |
| 200 | 5 | False | 147 |
| None | 5 | False | 147 |
| 0 | None | True | 1 |
| 10 | None | True | 10 |
| 50 | None | True | 50 |
| 100 | None | True | 100 |
| 200 | None | True | 147 |
| None | None | True | 147 |
| 0 | None | False | 1 |
| 10 | None | False | 10 |
| 50 | None | False | 50 |
| 100 | None | False | 100 |
| 200 | None | False | 147 |
| None | None | False | 147 |