browsertrix-crawler

Crawl button with JavaScript navigation

Open hamzamac opened this issue 1 year ago • 5 comments

Hi, we are trying to crawl a site that uses

How can we crawl such a website with Browsertrix-crawler?

hamzamac avatar Aug 06 '24 12:08 hamzamac

Hi @hamzamac, would you be able to share the URL of the site you're trying to capture so I can take a look?

tw4l avatar Aug 06 '24 13:08 tw4l

Hi @tw4l, thank you for responding. The site is actually a SharePoint site with MFA. We managed to crawl it by creating a profile, but the links to folders appear to be spans. [image]

When clicking the button on replayweb.page, it shows the error below: [image] (The URL it is pointing to is a public CDN URL, which is accessible.) Do we need to include all the URIs for the JavaScript files in the seeds?

hamzamac avatar Aug 06 '24 14:08 hamzamac
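
For readers following the thread, here is a rough illustration of why span-based folder links are awkward for a crawler. It assumes link discovery works by collecting `href` values from `<a>` elements, which is a common default but is not confirmed anywhere in this thread. This is not Browsertrix code: `discover_links`, `AnchorHrefCollector`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Collect href values from <a> elements and ignore every other tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchors carry a URL that can be queued for crawling.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def discover_links(html):
    """Return the href of every <a> element found in the given HTML."""
    collector = AnchorHrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical markup: one ordinary link and one span acting as a folder link.
SAMPLE = """
<a href="/sites/demo/Shared%20Documents">Documents</a>
<span role="button" data-folder="Reports">Reports</span>
"""

print(discover_links(SAMPLE))
```

Under that assumption only the anchor's URL is found. The span exposes no URL to queue, so the folder behind it would only be reached if something clicks it in the browser and the resulting navigation is captured.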

Hm, you shouldn't need to include the URIs for scripts - if the script is on the page, the crawler will discover it. This looks to me like it's more likely to be a bug in our replay engine than a missing script. It's hard to tell further without being able to reproduce it ourselves - would you be able to share a copy of the WACZ by email?

tw4l avatar Aug 06 '24 22:08 tw4l

Hi @tw4l, sure I will send the WACZ to the email on your profile.

hamzamac avatar Aug 07 '24 09:08 hamzamac

Hi @tw4l, can you please confirm whether you have received the WACZ file? Thanks.

hamzamac avatar Aug 22 '24 19:08 hamzamac