
[Feature]: How to have more than 100 (e.g. 10K or more) seeds in a "list of pages".

Open tuehlarsen opened this issue 11 months ago • 2 comments

What change would you like to see?

We would like the option to increase the maximum number of seeds or URLs in a list of pages. Today the limit is hard-coded to 100; see the attached screenshot.

Context

see above

tuehlarsen avatar Jan 15 '25 11:01 tuehlarsen

Yes, this is something we'd like to support, including an unlimited number of page URLs. We can raise this limit a bit, but we need to update how we store the large list on the backend. Perhaps we'd add support for uploading a text file instead of entering URLs in the textbox here, and just store the file in the S3 bucket.

ikreymer avatar Jan 17 '25 21:01 ikreymer
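
To make the storage idea concrete, here is a minimal sketch (not Browsertrix's actual implementation) of writing a large seed list as a single plain-text object to an S3-compatible bucket with boto3. The endpoint, credentials, bucket, and key below are placeholder assumptions.

# Illustrative sketch only -- not the Browsertrix backend implementation.
# Stores a large seed list as one text object (one URL per line) in an
# S3-compatible bucket; endpoint, credentials, bucket, and key are placeholders.
import boto3


def upload_seed_list(urls, bucket="example-crawl-data", key="seed-lists/bulk-crawl.txt"):
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.com",       # assumed S3-compatible endpoint
        aws_access_key_id="EXAMPLE_ACCESS_KEY",
        aws_secret_access_key="EXAMPLE_SECRET_KEY",
    )
    body = "\n".join(urls).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body, ContentType="text/plain")
    return f"s3://{bucket}/{key}"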

Yup, +1 for this. I routinely have to run crawls with 10k+ URLs. My current workaround is to split the list of URLs into batches of 25 and create one crawl per batch via the API (script below). It works well for the actual crawling, but unfortunately viewing a collection with too many crawls in it seems to crash the frontend container and shows "no pages found" when trying to replay.

# expects a file urls.txt in the current working directory containing all URLs

import requests

# --------------------------------------------------------
# Configuration
# --------------------------------------------------------
BROWSERTRIX_API_BASE = "http://browsertrix.example.com"
ORG_ID = "your-browsertrix-org-id"  # Replace with your actual org UUID
AUTH_TOKEN = "your-browsertrix-auth-token"  # Replace with your actual auth token
PROFILE_ID = "your-browsertrix-profile-id"  # Replace with your actual browser profile UUID
COLLECTION_ID = "your-browsertrix-collection-id"  # Replace with your actual collection UUID

# The endpoint to create a new crawl config is:
# POST /api/orgs/{oid}/crawlconfigs/
# --------------------------------------------------------


def chunker(seq, size):
    """
    Generator to yield successive chunks of a given list (seq) of the given size.
    """
    for pos in range(0, len(seq), size):
        yield seq[pos : pos + size]


def main():
    # 1) Read in all URLs from urls.txt (one URL per line; blank lines are skipped)
    with open("urls.txt", "r", encoding="utf-8") as f:
        all_urls = [line.strip() for line in f if line.strip()]

    # 2) Split URLs into chunks of 25
    chunk_size = 25
    chunks = list(chunker(all_urls, chunk_size))

    # 3) For each chunk, create a new crawl config and set "runNow": True
    #    This will instruct Browsertrix to start the crawl immediately.
    headers = {
        "Authorization": f"Bearer {AUTH_TOKEN}",
        "Content-Type": "application/json",
    }

    for i, urls_subset in enumerate(chunks, start=1):
        # Prepare the seeds block
        seeds = [{"url": u} for u in urls_subset]

        # Body matches the CrawlConfigIn schema
        payload = {
            "name": f"BulkCrawl #{i}",
            "runNow": True,
            "jobType": "url-list",
            "profileid": PROFILE_ID,
            "tags": [ "labelstudio" ],
            "autoAddCollections": [ COLLECTION_ID ],
            "config": {
                "seeds": seeds,
                "scopeType": "page",
                "workers": 4,
                "postLoadDelay": 5,
                # You can set other config fields here if needed:
                # "blockAds": True,
                # "useSitemap": True,
                # etc.
            },
        }

        url = f"{BROWSERTIX_API_BASE}/api/orgs/{ORG_ID}/crawlconfigs/"
        resp = requests.post(url, json=payload, headers=headers)

        if resp.status_code == 200:
            data = resp.json()
            print(f"Created crawl config {i} successfully. ID={data.get('id')}")
        else:
            print(
                f"Error creating crawl config {i}. "
                f"HTTP {resp.status_code}. Response: {resp.text}"
            )


if __name__ == "__main__":
    main()

pirate avatar Jan 27 '25 20:01 pirate
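
As a follow-up check on the batch created by the script above, something like the sketch below can count the resulting workflows. It assumes the same crawlconfigs path also supports GET for listing workflows and that the response contains "items" entries with a "name" field; the exact shape may differ between Browsertrix versions.

# Sketch: count the workflows created by the batch script above.
# Assumes GET on the same crawlconfigs path lists workflows; response
# fields ("items", "name") are assumptions and may differ by version.
import requests

BROWSERTRIX_API_BASE = "http://browsertrix.example.com"  # same values as in the script above
ORG_ID = "your-browsertrix-org-id"
AUTH_TOKEN = "your-browsertrix-auth-token"

headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
resp = requests.get(f"{BROWSERTRIX_API_BASE}/api/orgs/{ORG_ID}/crawlconfigs/", headers=headers)
resp.raise_for_status()
workflows = resp.json().get("items", [])
bulk = [w for w in workflows if str(w.get("name", "")).startswith("BulkCrawl")]
print(f"Found {len(bulk)} BulkCrawl workflows out of {len(workflows)} total")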

@tw4l to create a feature document as a first step; the likely implementation involves uploading the list as a file that the crawler can download.

tw4l avatar May 14 '25 20:05 tw4l
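
To illustrate that approach (purely a sketch, not browsertrix-crawler's actual code): the crawler would download the hosted seed-list file and expand it into seed entries, one URL per line. The file URL and the per-seed fields below are assumptions.

# Illustrative sketch only -- not browsertrix-crawler's actual implementation.
# Downloads a hosted seed-list file (one URL per line) and turns it into
# seed entries; the file URL and seed fields are placeholder assumptions.
import requests


def fetch_seeds(seed_file_url="https://storage.example.com/seed-lists/bulk-crawl.txt"):
    resp = requests.get(seed_file_url, timeout=30)
    resp.raise_for_status()
    urls = [line.strip() for line in resp.text.splitlines() if line.strip()]
    return [{"url": u, "scopeType": "page"} for u in urls]


if __name__ == "__main__":
    seeds = fetch_seeds()
    print(f"Loaded {len(seeds)} seeds")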

Supported in 1.18, which will be released shortly!

tw4l avatar Jul 23 '25 20:07 tw4l

Documentation on how to specify more than 100 seeds by using a file: https://docs.browsertrix.com/user-guide/workflow-setup/#list-of-pages

SuaYoo avatar Aug 18 '25 18:08 SuaYoo
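
The linked docs describe uploading a file of page URLs instead of pasting them into the textbox. As a small companion sketch, the snippet below cleans and deduplicates a URL list and writes it out one URL per line, which is the shape such a seed file takes; check the docs above for the exact accepted format and any size limits.

# Sketch: prepare a plain-text seed file (one URL per line) for upload via
# the "List of Pages" file option described in the docs above. The exact
# accepted format and size limits are documented there; this only cleans,
# deduplicates, and writes the list.
def write_seed_file(urls, path="seed-list.txt"):
    seen = set()
    cleaned = []
    for u in (u.strip() for u in urls):
        if u and u not in seen:
            seen.add(u)
            cleaned.append(u)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(cleaned) + "\n")
    print(f"Wrote {len(cleaned)} unique URLs to {path}")


if __name__ == "__main__":
    with open("urls.txt", "r", encoding="utf-8") as f:
        write_seed_file(f.readlines())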