How to crawl a site behind basic authentication (CredentialStore/HttpAuthenticationCredential ends up with 401)

Open danijanos opened this issue 5 months ago • 2 comments

Dear Heritrix3 Community,

Thank you for this great tool! Please help me with this issue: I am using version 3.10.0.

I need to crawl the previous version of a site that has undergone a major upgrade. The old site was moved to a domain that the developers placed behind a basic login. (Every request sent out includes the Authorization header, which supplies the basic authentication credentials as the base64-encoded value of "username:password", using the login granted by the site administrators.)
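For reference, the Authorization header value described above can be reproduced by hand. The following is a minimal Python sketch, not Heritrix code; the function name `basic_auth_header` is hypothetical, and the credentials are the placeholders used in this post.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header (RFC 7617)."""
    # The credentials are joined with a colon, then base64-encoded as bytes.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Placeholder credentials from the post, not real ones.
    print("Authorization:", basic_auth_header("myloginname", "passwordformyloginname"))
```

Comparing this value with the header a browser sends is one way to confirm that the login and password themselves are correct before looking at the crawler configuration.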

I configured the job as I learned from the docs. So the crawl has these two beans for the basic authentication:

<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
   <property name="credentials">
     <map>
       <entry key="OLDSiteLoginCredential" value-ref="OLDSiteLoginCredential"/>
     </map>
   </property>
</bean>

<bean id="OLDSiteLoginCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
   <property name="domain" value="https://old.site.edu:443"/>
   <property name="realm" value="oldsiterealm"/>
   <property name="login" value="myloginname"/>
   <property name="password" value="passwordformyloginname"/>
</bean>

But every time I build and launch the job, it finishes right after the DNS lookup, with two 401 responses: one for the main page URL and one for robots.txt.

401        381 https://old.site.edu/ - - text/html #001
401        381 https://old.site.edu/robots.txt P https://old.site.edu/ text/html #001
1          51  dns:old.site.edu P https://old.site.edu/ text/dns #001

Could you please help me identify what I am doing wrong here? Or would you happen to know how I should do this? Thanks a lot!

danijanos, Jul 22 '25 11:07

I haven't used this feature myself, but the documentation says:

The configured domain should be of the form “hostname:port” unless the port is 80 in which case it must be omitted. For HTTPS URLs without an explicit port use port 443.

<property name="domain" value="www.example.org:443"/>

So I would suggest removing "https://" from the domain:

  <property name="domain" value="old.site.edu:443"/>
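The "hostname:port" rule quoted from the documentation can be written out as a small helper. This is only a sketch of that rule in Python, not part of Heritrix; the function name `credential_domain` is hypothetical, and treating any non-HTTPS scheme as defaulting to port 80 is an assumption.

```python
from urllib.parse import urlsplit


def credential_domain(url: str) -> str:
    """Derive a "hostname:port" domain value from a URL, per the quoted rule."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"no hostname in URL: {url!r}")
    # Use the explicit port if present, otherwise the scheme default
    # (443 for https; 80 is assumed for everything else).
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Port 80 must be omitted; every other port is kept.
    return host if port == 80 else f"{host}:{port}"


if __name__ == "__main__":
    print(credential_domain("https://old.site.edu/"))
    print(credential_domain("http://www.example.org/"))
```

For the URL in this issue, the helper yields the same value as the suggested property, with the scheme stripped and the port kept.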

ato, Jul 23 '25 03:07

Hi and thanks @ato! I've tried your suggestion, but it doesn't work; I still get 401 and can't crawl the site. To rule out wrong credentials, I tested with wget's --user and --password options, and that request succeeds with 200 OK. So I assume I am still configuring the HttpAuthenticationCredential bean incorrectly?

danijanos, Jul 28 '25 12:07
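A successful wget request shows that the login and password are valid, but it does not show which realm the server announces. Assuming the credential is matched against the realm in the server's 401 challenge (an assumption worth verifying against the Heritrix documentation, not a confirmed diagnosis), one possible check is to compare the configured realm with the one in the WWW-Authenticate header. The sketch below is plain Python, not Heritrix code; `basic_realm` is a hypothetical helper, the sample header is made up rather than captured from the real server, and only a single Basic challenge per header value is handled.

```python
import re
from typing import Optional

_REALM_RE = re.compile(r'realm\s*=\s*"([^"]*)"', re.IGNORECASE)


def basic_realm(www_authenticate: str) -> Optional[str]:
    """Return the realm from a Basic challenge header value, or None."""
    value = www_authenticate.strip()
    # Only handle a header value that starts with the Basic scheme.
    if not value.lower().startswith("basic"):
        return None
    match = _REALM_RE.search(value)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Sample value for illustration; paste the real header from the 401 response.
    sample = 'Basic realm="oldsiterealm", charset="UTF-8"'
    configured = "oldsiterealm"  # realm from the HttpAuthenticationCredential bean
    print("realm matches:", basic_realm(sample) == configured)
```

The header value can be copied from the 401 response, for example from the output of `curl -I` against the site, and compared character by character with the `realm` property of the bean.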