Authentication on servers using OAuth2

Open AndreSchmutz opened this issue 2 years ago • 4 comments

Dear development team,

We would like to crawl our intranet and have installed Heritrix 3.4 on a Linux server. The crawl starts but stops immediately at the authentication phase. Our intranet uses OAuth2 authentication.

We have seen that Heritrix uses org.restlet for its https server, and that org.restlet.ext.oauth implements OAuth authentication.

As I understand it, Heritrix's use of org.restlet is on the server side and not the client side.

Is there a way to get OAuth2 client authentication working with Heritrix?

Thanks in advance for your help.

AndreSchmutz, Nov 23 '21 15:11

Heritrix doesn't have any special support for OAuth2 client authentication. You are correct that Restlet is not used in the crawling process and it's only used for Heritrix's user interface.

I haven't tried crawling a site authenticated by OAuth2 myself but these are two ideas I'd try if I needed to do this:

Option 1. Form login

Usually, OAuth2 authentication works like this: the application redirects the browser to an authentication server, the user fills in a login form, and the authentication server redirects back to the application, which sets a session cookie that authenticates every subsequent request. The browser itself has no special awareness of OAuth2; it just needs to support forms, redirects and cookies.

Heritrix does have some code for form logins. I've never used it myself, though, so I can't provide any guidance on how to configure it or whether it's likely to work in your case. It looks a bit complex to set up. There's some documentation in ExtractorHTMLForms and some more explanation in this JIRA ticket.

Option 2. Supply session cookie externally

If I needed to do this myself, the first thing I'd try is to log in with the browser, use the browser's devtools to copy the resulting session cookie into a file, and configure the Heritrix cookie store to load it with:

<bean id="cookieStore" class="org.archive.modules.fetcher.BdbCookieStore">
    <property name="cookiesLoadFile">intranet-cookies.txt</property>
</bean>

See AbstractCookieStore.readCookies(Reader) for details of the cookie file format.
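As a rough illustration of preparing such a file, here is a minimal Python sketch. It assumes the loader expects the Netscape cookies.txt layout, with seven tab-separated fields per line (domain, subdomain flag, path, secure flag, expiry in epoch seconds, name, value). Check readCookies for the authoritative format. The function name, the example host and the one-day default lifetime are all hypothetical choices, not anything Heritrix prescribes.

```python
import time


def netscape_cookie_line(domain, name, value, path="/", secure=False,
                         include_subdomains=False, expires=None):
    """Format one cookie as a tab-separated Netscape cookies.txt line."""
    if expires is None:
        # Assumption: session cookies copied from devtools have no expiry,
        # so give them a one-day lifetime rather than leaving the field empty.
        expires = int(time.time()) + 24 * 60 * 60
    fields = [
        domain,
        "TRUE" if include_subdomains else "FALSE",
        path,
        "TRUE" if secure else "FALSE",
        str(int(expires)),
        name,
        value,
    ]
    for field in fields:
        # Tabs and newlines would break the column layout of the file.
        if "\t" in field or "\n" in field or "\r" in field:
            raise ValueError("cookie fields must not contain tabs or newlines")
    return "\t".join(fields)
```

To use it, call the function with the values shown in the devtools cookie table, for example `netscape_cookie_line("intranet.example.org", "SESSIONID", "<copied value>", secure=True)`, and write the returned line plus a newline to intranet-cookies.txt. A common pitfall when creating the file by hand is an editor replacing the tabs with spaces.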

If that worked and I needed to automate the process then I'd write a separate script to do the OAuth2 login process.
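A minimal sketch of what such a script might look like, using only the Python standard library. It assumes the login can be completed with a single form POST followed by redirects, which will not hold for every OAuth2 setup. The function names, URL and form field names are hypothetical, and the output layout assumes the Netscape cookies.txt format.

```python
import http.cookiejar
import time
import urllib.parse
import urllib.request


def login(login_url, form_fields):
    """POST a login form, follow redirects, and return the cookies that were set."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    with opener.open(login_url, data=body, timeout=30) as response:
        response.read()
    return jar


def write_cookie_file(jar, path, default_lifetime=24 * 60 * 60):
    """Write every cookie in the jar as one tab-separated line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for cookie in jar:
            expires = cookie.expires
            if expires is None:
                # Assumption: give session cookies an explicit expiry so the
                # field is never left empty in the output file.
                expires = int(time.time()) + default_lifetime
            fields = [
                cookie.domain,
                "TRUE" if cookie.domain.startswith(".") else "FALSE",
                cookie.path,
                "TRUE" if cookie.secure else "FALSE",
                str(int(expires)),
                cookie.name,
                cookie.value or "",
            ]
            out.write("\t".join(fields) + "\n")
            count += 1
    return count
```

Usage would be something like `write_cookie_file(login("https://intranet.example.org/login", {"username": "...", "password": "..."}), "intranet-cookies.txt")`, run shortly before the crawl starts so the session has not yet expired.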

ato, Nov 23 '21 18:11

@ato: When I use Option 2, I get the error "Element 'property' cannot have character [children], because the type's content type is element-only."

Am I missing something?

TheTechRobo, Nov 26 '21 23:11

@ato Thank you so much, and sorry for only coming back to this now.

Our intranet expert said that the first proposal would not work, as our authentication server does not support form login.

Our Oauth 2 is not using Cookies storage, but Local Storage.

We found https://www.javadoc.io/static/org.archive.heritrix/heritrix-modules/3.4.0-20210923/org/archive/modules/recrawl/PersistOnlineProcessor.html but are stuck there, because it does not seem to be able to read Local Storage information.

Would you have a tip on how to make progress?

AndreSchmutz, Jan 18 '22 16:01

@TheTechRobo ah, my mistake. It apparently needs to be wrapped in an instance of ConfigFile:

    <bean id="cookieStore" class="org.archive.modules.fetcher.BdbCookieStore">
      <property name="cookiesLoadFile">
         <bean class="org.archive.spring.ConfigFile">
           <property name="path" value="cookies.txt" />
         </bean>
      </property>
    </bean>

> Our Oauth 2 is not using Cookies storage, but Local Storage.

@AndreSchmutz That seems a little surprising to me, as local storage can only be accessed by JavaScript and isn't directly accessible to the server. If accessing the site truly depends on local storage, and therefore on JavaScript, then I don't think it will be possible with Heritrix, which cannot execute JavaScript, at least not without a deep understanding of how the site works and some custom code.

You may have more luck with a browser-based crawler like Browsertrix. I would suggest trying the method described in the "Interactive Profile Creation" section of the Browsertrix README.

ato, Jan 19 '22 01:01