
Add support for processing JavaScript code when obtaining HTML pages

Open marekhorst opened this issue 7 years ago • 1 comments

Originally reported on redmine: #1756#note-181.

Apparently the web crawler module, which is responsible for providing HTML pages describing software, needs to be supplemented with JavaScript code execution.

After inspecting the HTML pages retrieved from Google Code, it turned out that 2193 pages out of 2674 entries in total contained the following phrase:

The Google Code Archive requires JavaScript to be enabled in your browser.

The other cases redirected elsewhere, e.g. to GitHub, and this was the only reason we were able to extract a title for some of the Google Code URLs.
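A check like the one behind these numbers could be sketched as follows. This is only an illustration: the class and method names are hypothetical and not taken from the crawler module, and it assumes the phrase appears verbatim in the retrieved HTML (no extra whitespace or entity encoding).

```java
import java.util.List;

/** Hypothetical helper; names are illustrative, not part of the existing crawler code. */
final class JsPlaceholderDetector {

    // The phrase quoted in this issue; assumed to appear verbatim in the page body.
    static final String PLACEHOLDER =
            "The Google Code Archive requires JavaScript to be enabled in your browser.";

    /** Returns true when the page is only the "JavaScript required" placeholder. */
    static boolean requiresJavaScript(String html) {
        return html != null && html.contains(PLACEHOLDER);
    }

    /** Counts how many of the retrieved pages contain the placeholder phrase. */
    static long countPlaceholderPages(List<String> pages) {
        return pages.stream()
                .filter(JsPlaceholderDetector::requiresJavaScript)
                .count();
    }
}
```

Running `countPlaceholderPages` over the retrieved pages would give the kind of figure reported above, and the same predicate could flag pages for which no metadata can be extracted.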

We would need to replace simple streaming from an HTTP connection with a more advanced solution capable of executing JavaScript.
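One possible shape for such a replacement is sketched below, assuming the JavaScript-capable fetcher is expensive and should only be used when the plain response is not usable. `PageFetcher` and `FallbackPageFetcher` are hypothetical names, and the rendering fetcher is left abstract since no concrete engine has been chosen yet.

```java
import java.io.IOException;
import java.util.function.Predicate;

/** Hypothetical abstraction over "give me the HTML for this URL". */
interface PageFetcher {
    String fetch(String url) throws IOException;
}

/** Tries the cheap plain-HTTP fetcher first and falls back to a JavaScript-capable one. */
final class FallbackPageFetcher implements PageFetcher {

    private final PageFetcher plain;                 // e.g. simple streaming from an HTTP connection
    private final PageFetcher rendering;             // some JavaScript-executing implementation (not chosen yet)
    private final Predicate<String> needsRendering;  // decides whether the plain HTML is unusable

    FallbackPageFetcher(PageFetcher plain, PageFetcher rendering, Predicate<String> needsRendering) {
        this.plain = plain;
        this.rendering = rendering;
        this.needsRendering = needsRendering;
    }

    @Override
    public String fetch(String url) throws IOException {
        String html = plain.fetch(url);
        // Only pay the cost of JavaScript execution when the plain response is a placeholder.
        return needsRendering.test(html) ? rendering.fetch(url) : html;
    }
}
```

The predicate could simply test for the "requires JavaScript" phrase quoted above; keeping it pluggable lets other sites with different placeholders be handled later.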

marekhorst, Oct 23 '18 08:10

The solution tested so far was based on the jBrowserDriver project. Even though it works properly in the development infrastructure, it does not work (it hangs) on the IIS cluster, probably due to the following issue:

https://github.com/MachinePublishers/jBrowserDriver/issues/87#issuecomment-275955217
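Whatever engine ends up being used, a hang like this should not block a cluster task indefinitely. The following is a minimal sketch of a timeout guard around a rendering call; it does not fix the underlying problem, and `TimedCall` is a hypothetical helper built only on `java.util.concurrent`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical guard: runs a possibly hanging call with an upper time limit. */
final class TimedCall {

    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit) throws Exception {
        // Daemon thread, so a call that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "render-worker");
            thread.setDaemon(true);
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; code stuck in a native call may ignore the interrupt.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap the rendering step, for example `TimedCall.callWithTimeout(() -> fetcher.fetch(url), 30, TimeUnit.SECONDS)`, and treat a `TimeoutException` as a failed page rather than a stuck job. The 30-second limit is an arbitrary placeholder.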

The proposed solution requires installing the openjfx library on each datanode, which is pretty cumbersome. We need to look for other options.

marekhorst, Oct 24 '18 12:10