[Feature]: Support more than 100 (e.g. 10K or more) seeds in a "list of pages"
What change would you like to see?
We would like the option to increase the maximum number of seeds or URLs in a list of pages. Today the limit is hardcoded to 100 (see screenshot).
Context
see above
Yes, this is something we'd like to support, including an unlimited number of page URLs. We can raise this limit a bit, but we need to update how we store large lists on the backend. Perhaps we'd add support for uploading a text file instead of entering URLs in the textbox here, and just store the file in the S3 bucket.
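Roughly, the storage side could look something like this (a minimal sketch only; the bucket name, key layout, and function name are placeholders, not the actual backend code):

# Minimal sketch only: bucket name, key layout, and function name are
# placeholders, not the actual Browsertrix backend implementation.
import boto3

def store_seed_list(org_id: str, upload_id: str, urls: list[str]) -> str:
    """Store an uploaded list of seed URLs as a plain text object in S3."""
    s3 = boto3.client("s3")
    key = f"seed-lists/{org_id}/{upload_id}.txt"
    body = "\n".join(urls).encode("utf-8")
    s3.put_object(Bucket="btrix-uploads", Key=key, Body=body)  # placeholder bucket name
    return key

The crawl config would then only need to reference the stored object key rather than carry thousands of seed entries inline.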
Yup, +1 for this: I routinely have to run crawls with 10k+ URLs. My current workaround is to split the list of URLs into batches of 25 and create one crawl per batch via the API, using the script below. It works well for the actual crawling, but unfortunately viewing a collection with that many crawls in it seems to crash the frontend container and shows "no pages found" when trying to replay.
# Expects a file urls.txt in the current working directory containing all URLs.
import requests

# --------------------------------------------------------
# Configuration
# --------------------------------------------------------
BROWSERTRIX_API_BASE = "http://browsertrix.example.com"
ORG_ID = "your-browsertrix-org-id"                # Replace with your actual org UUID
AUTH_TOKEN = "your-browsertrix-auth-token"        # Replace with your actual auth token
PROFILE_ID = "your-browsertrix-profile-id"        # Replace with your actual browser profile UUID
COLLECTION_ID = "your-browsertrix-collection-id"  # Replace with your actual collection UUID

# The endpoint to create a new crawl config is:
# POST /api/orgs/{oid}/crawlconfigs/
# --------------------------------------------------------


def chunker(seq, size):
    """Yield successive chunks of the given size from the list seq."""
    for pos in range(0, len(seq), size):
        yield seq[pos : pos + size]


def main():
    # 1) Read in all URLs from urls.txt (one URL per line, blank lines skipped)
    with open("urls.txt", "r", encoding="utf-8") as f:
        all_urls = [line.strip() for line in f if line.strip()]

    # 2) Split URLs into chunks of 25
    chunk_size = 25
    chunks = list(chunker(all_urls, chunk_size))

    # 3) For each chunk, create a new crawl config with "runNow": True,
    #    which instructs Browsertrix to start the crawl immediately.
    headers = {
        "Authorization": f"Bearer {AUTH_TOKEN}",
        "Content-Type": "application/json",
    }

    for i, urls_subset in enumerate(chunks, start=1):
        # Prepare the seeds block
        seeds = [{"url": u} for u in urls_subset]

        # Body matches the CrawlConfigIn schema
        payload = {
            "name": f"BulkCrawl #{i}",
            "runNow": True,
            "jobType": "url-list",
            "profileid": PROFILE_ID,
            "tags": ["labelstudio"],
            "autoAddCollections": [COLLECTION_ID],
            "config": {
                "seeds": seeds,
                "scopeType": "page",
                "workers": 4,
                "postLoadDelay": 5,
                # You can set other config fields here if needed:
                # "blockAds": True,
                # "useSitemap": True,
                # etc.
            },
        }

        url = f"{BROWSERTRIX_API_BASE}/api/orgs/{ORG_ID}/crawlconfigs/"
        resp = requests.post(url, json=payload, headers=headers)
        if resp.status_code == 200:
            data = resp.json()
            print(f"Created crawl config {i} successfully. ID={data.get('id')}")
        else:
            print(
                f"Error creating crawl config {i}. "
                f"HTTP {resp.status_code}. Response: {resp.text}"
            )


if __name__ == "__main__":
    main()
@tw4l To create a feature document as a first step. The likely implementation involves uploading the list as a file that the crawler can download.
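A rough illustration of that crawler-side step (sketch only, in Python rather than the crawler's actual code, with a placeholder file URL): fetch the hosted list and expand it into seed entries.

# Illustrative sketch only (not the crawler's actual implementation):
# fetch a hosted URL-list file and expand it into seed entries.
import requests

def seeds_from_hosted_file(file_url: str) -> list[dict]:
    resp = requests.get(file_url, timeout=30)
    resp.raise_for_status()
    return [{"url": line.strip()} for line in resp.text.splitlines() if line.strip()]

# e.g. seeds_from_hosted_file("https://storage.example.com/seed-lists/abc.txt")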
Supported in 1.18, which will be released shortly!
Documentation on how to specify more than 100 seeds by using a file: https://docs.browsertrix.com/user-guide/workflow-setup/#list-of-pages
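Assuming the page list file is plain text with one URL per line (the format the workaround script above already reads from urls.txt), a small cleanup pass before uploading can catch blank lines, duplicates, and non-http(s) entries. A sketch; the file names are examples only:

# Sketch: normalize a urls.txt (one URL per line) before uploading it as the
# page list for a workflow. Input and output file names are examples only.
from urllib.parse import urlparse

def clean_url_list(in_path: str = "urls.txt", out_path: str = "urls.clean.txt") -> int:
    seen = set()
    kept = []
    with open(in_path, "r", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if not url or url in seen:
                continue  # drop blanks and duplicates
            if urlparse(url).scheme not in ("http", "https"):
                continue  # drop anything that isn't an http(s) URL
            seen.add(url)
            kept.append(url)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(kept) + "\n")
    return len(kept)

if __name__ == "__main__":
    print(f"{clean_url_list()} URLs kept")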