How to crawl a site behind basic authentication (CredentialStore/HttpAuthenticationCredential ends up with 401)
Dear Heritrix3 Community,
Thank you for this great tool! Please help me with this issue: I am using version 3.10.0.
I need to crawl the previous version of a site that has undergone a major upgrade. The old site was moved to a domain that the developers placed behind a basic login. (Every request sent includes an Authorization header, which supplies the basic authentication credentials as the base64-encoded username and password granted by the site administrators.)
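For reference, here is a minimal sketch of how a Basic Authorization header value is built. The function name `basic_auth_header` is hypothetical, and the username and password are the well-known sample pair from RFC 7617, not the real credentials for the old site.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header.

    Hypothetical helper for illustration only; it is not part of Heritrix.
    """
    # Basic auth joins the user name and password with a colon,
    # then base64-encodes the resulting bytes.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Sample credentials taken from RFC 7617, not from the site in question.
print(basic_auth_header("Aladdin", "open sesame"))
```

Comparing such a value with the header that a working client sends can help confirm that the login and password themselves are correct.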
I configured the job as described in the docs, so the crawl has these two beans for basic authentication:
<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="OLDSiteLoginCredential" value-ref="OLDSiteLoginCredential"/>
    </map>
  </property>
</bean>
<bean id="OLDSiteLoginCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain" value="https://old.site.edu:443"/>
  <property name="realm" value="oldsiterealm"/>
  <property name="login" value="myloginname"/>
  <property name="password" value="passwordformyloginname"/>
</bean>
But every time I build and launch the job, it finishes right after the DNS lookup, with two 401 responses: one for the main page URL and one for robots.txt:
401 381 https://old.site.edu/ - - text/html #001
401 381 https://old.site.edu/robots.txt P https://old.site.edu/ text/html #001
1 51 dns:old.site.edu P https://old.site.edu/ text/dns #001
Could you please help me identify what I am doing wrong here? Or would you happen to know how I should do this? Thanks a lot!
I haven't used this feature myself, but the documentation says:
The configured domain should be of the form “hostname:port” unless the port is 80 in which case it must be omitted. For HTTPS URLs without an explicit port use port 443.
<property name="domain" value="www.example.org:443"/>
So I would suggest removing "https://" from the domain:
<property name="domain" value="old.site.edu:443"/>
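The documented rule quoted above can be sketched as a small helper that derives the domain string from a seed URL. This is only an illustration of the rule as quoted; the name `credential_domain` is hypothetical and the code is not taken from Heritrix.

```python
from urllib.parse import urlsplit

# Default ports used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def credential_domain(url: str) -> str:
    """Return 'hostname:port', or just 'hostname' when the port is 80.

    Hypothetical helper that mirrors the quoted documentation rule.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"no hostname in URL: {url!r}")
    # Use the explicit port if present, else the scheme default.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is None:
        raise ValueError(f"cannot determine port for URL: {url!r}")
    # Per the quoted rule, port 80 is omitted; every other port is kept.
    return host if port == 80 else f"{host}:{port}"


print(credential_domain("https://old.site.edu/"))
print(credential_domain("http://www.example.org/"))
```

Note that the scheme never appears in the result, which is why the "https://" prefix is dropped in the suggestion above.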
Hi and thanks @ato!
I've tried your suggestion, but it doesn't work; I still get 401 and can't crawl the site.
To check whether the credentials themselves are the problem, I tried wget with its --user and --password options, and the request goes through with 200 OK. So I assume I am still configuring the HttpAuthenticationCredential bean incorrectly?
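Since wget succeeds, one further thing that may be worth comparing is the realm announced in the server's 401 response against the realm property of the bean. The sketch below assumes that the crawler only offers a credential whose realm matches the challenge, which this thread has not confirmed. The function name `extract_realm` and the sample header values are hypothetical.

```python
import re
from typing import Optional

# Matches realm="..." inside a WWW-Authenticate header value.
_REALM_RE = re.compile(r'realm="([^"]*)"', re.IGNORECASE)


def extract_realm(www_authenticate: str) -> Optional[str]:
    """Return the quoted realm from a WWW-Authenticate value, or None.

    Hypothetical helper; handles only the common quoted-string form.
    """
    match = _REALM_RE.search(www_authenticate)
    return match.group(1) if match else None


# Hypothetical header value, as a server might send with a 401.
challenge = 'Basic realm="oldsiterealm"'
configured_realm = "oldsiterealm"  # value of the bean's realm property

# A mismatch here would be one possible reason for a persistent 401.
print(extract_realm(challenge) == configured_realm)
```

The actual header can be seen in the 401 response headers, for example in the output of a verbose HTTP client run without credentials.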