Crawl JS and CSS
Is there a way we can store JS, CSS, and/or any other assets besides HTML in the pages.jsonl file?
I am trying to crawl everything and build a list of all endpoints, so I can perform a health check on every single endpoint in the web app.
I was looking through the flags and wasn't able to find anything related to the asset types to be crawled.
Hi @Dooriin, the pages.jsonl file is meant to be an index of the HTML pages only, but you should be able to find everything that was crawled in the CDXJ indices. These will be within the WACZ file if you're using the --generateWACZ flag, or can be generated separately with --generateCDX.
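As a rough sketch of how those CDXJ indices can be consumed for a health check: each line typically starts with a SURT-style key and a timestamp, followed by a JSON block carrying (among other fields) the captured URL and HTTP status. The sample line and field names below are illustrative only, not taken from a real crawl:

```python
import json

def parse_cdxj_line(line):
    # Assumed CDXJ layout: SURT key, 14-digit timestamp, then a JSON blob.
    surt, timestamp, json_blob = line.split(" ", 2)
    return json.loads(json_blob)

# Hypothetical record; real field names/values may differ slightly.
sample = ('com,example)/static/app.js 20240101120000 '
          '{"url": "https://example.com/static/app.js", '
          '"status": "404", "mime": "text/javascript"}')

record = parse_cdxj_line(sample)
if record.get("status") != "200":
    # Flag any non-200 capture for follow-up.
    print(record["url"], record["status"])
```

Iterating this over every line of the index would give the full list of captured resources and their status codes.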
Let us know if that helps!
@tw4l thank you for your reply!
Is there a possibility to request this as a feature? Just as we have the pages.jsonl file, could we also get a resources.jsonl file for assets such as JS, CSS, PDF, or any others, perhaps filterable with flags?
I am using this for testing purposes, so if one asset is a 404, it has to be addressed immediately.
I will be happy to support and contribute towards the project by sponsoring :) if this is something that can be achieved.
Looking forward to hearing back!
Hi @Dooriin, sorry for the delayed response! This is something we're actually looking into now as we develop features around assisted crawl QA in Browsertrix Cloud.
We have a PR merged in the dev-1.0.0 branch that lists out the page resources and their status codes as records inside the WARC files. You can see an example here: https://github.com/webrecorder/browsertrix-crawler/issues/457.
I'm wondering if that would help with your use case. It's possible that we could add an argument to the crawler to add these URLs to pages.jsonl (or a similar resources.jsonl, as you suggest) if it'd be helpful to have them exposed at a higher level in the WACZ rather than inside the WARC files. Storing them inside the WARCs is just convenient for us in how we plan to handle QA runs.
@Dooriin Can you explain more what you're looking for? We will soon have the CDXJ index generated while the crawl is running, so you can also peek in the tmp-cdx directory to get a list of all the resources captured. We could also add more extended logging if you want to parse the container stdout, for each URL that is being retrieved, that is also doable.
Hi @ikreymer, I am using your product as a testing tool: crawling through the application, getting all the URLs, and then making sure they are all functioning and returning 200.
What I was wondering is whether there's a way to add a few features, such as getting all assets (e.g. JS and CSS) for each page and making sure they also return 200. I was curious whether there's a way to add something like an array of the assets each page has.
I am mostly referring to the crawls/collections/xxx/pages/pages.jsonl file. It would be great to have the parent URL next to the crawled URL.
> Hi @ikreymer, I am using your product as a testing tool: crawling through the application, getting all the URLs, and then making sure they are all functioning and returning 200.
This tool is really designed for archiving, not testing, and we have special formats intended for storing and replaying archived data at a later time. If your goal is just testing and ensuring correct status codes, I'd suggest using something like Playwright, which is designed specifically for that use case; see: Playwright Response Interception. You might find it easier to use for what you're trying to do.
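To illustrate the Playwright suggestion, here is a minimal sketch of intercepting responses and flagging failures. The function and variable names are mine, this has not been run against any real site, and the wait strategy you need may differ:

```python
def collect_failures(start_url):
    """Visit start_url and record any responses with status >= 400.

    Requires Playwright (pip install playwright && playwright install chromium);
    imported lazily so the pure helper below stays usable without it.
    """
    from playwright.sync_api import sync_playwright

    failures = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Fires once per network response: the page itself plus JS, CSS,
        # images, fonts, and any other subresources.
        page.on(
            "response",
            lambda resp: failures.append((resp.url, resp.status))
            if resp.status >= 400
            else None,
        )
        page.goto(start_url, wait_until="networkidle")
        browser.close()
    return failures


def report(failures):
    """Format (url, status) pairs into human-readable lines."""
    return [f"{status} {url}" for url, status in failures]
```

Looping `collect_failures` over a sitemap or URL list would approximate the per-page asset check described above.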
> What I was wondering is whether there's a way to add a few features, such as getting all assets (e.g. JS and CSS) for each page and making sure they also return 200. I was curious whether there's a way to add something like an array of the assets each page has.
We do generate a urn:pageinfo:<url> record in the WARC file that contains all the resources on the page, but again, this is designed to be used as an archival format, not for testing.
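For anyone who still wants to consume those records for checks, here is a sketch of filtering such a payload once it has been extracted from the WARC. The JSON shape below is a guess for illustration only; consult the actual record format in the WARC output for the real field names:

```python
import json

# Hypothetical urn:pageinfo payload: a page URL plus a map of the
# resources it loaded. The real record layout may differ.
payload = json.loads("""
{
  "url": "https://example.com/",
  "urls": {
    "https://example.com/app.js": {"status": 200, "mime": "text/javascript"},
    "https://example.com/style.css": {"status": 404, "mime": "text/css"}
  }
}
""")

# Collect every resource on the page that did not come back 200.
broken = [u for u, info in payload["urls"].items() if info["status"] != 200]
```

Again, this is an archival format first; a testing tool like Playwright is likely the better fit for ongoing health checks.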
Closing as this has mostly been answered.
@ikreymer Thanks for your reply! The main reason I use this tool is to crawl through the application, as we have over 1,000 pages.
I will keep that in mind!