staticSearch Handle tagsoup HTML

Open martindholmes opened this issue 5 months ago • 0 comments

We've always had the constraint that we can only index websites whose pages are well-formed XML in the XHTML namespace. That is a considerable limitation: it doesn't affect our own projects, but it severely limits the number of external users we could serve. The rescueTagSoup project (https://github.com/UVicHCMC/rescueTagSoup) uses the parsehtml-1.4.jar to turn arbitrary HTML into well-formed XHTML, and then goes on to do other remediations. That initial step could be integrated into staticSearch to provide support for all websites.

We would need to add a preliminary step to the build which would pre-process the source HTML into a temporary folder, retaining the directory structure (the rescueTagSoup project does exactly this), and then index that temporary folder instead of the original.
We would need to avoid doing this when it's not necessary (for our own projects, for example); otherwise the build would take longer for no benefit. We could do this with a configuration file item called (say) "fixHtml", which you would turn on if you know you need it. If you don't turn it on and tokenization fails due to ill-formed HTML, the error message would suggest enabling it.
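
To make the proposed control flow concrete, here is a rough Python sketch. It is not staticSearch code (the real build is driven differently), and every name in it (`prepare_source`, `IllFormedHtmlError`, the `repair` callback) is hypothetical. The `repair` callback stands in for whatever the parsehtml-based step would do, and the up-front well-formedness check stands in for the tokenization failure mentioned above; in the real build the hint would more likely be attached to the existing failure rather than to a separate pass.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable


class IllFormedHtmlError(Exception):
    """Hypothetical error: a page is not well-formed and fixHtml is off."""


def prepare_source(source_dir: Path, fix_html: bool,
                   repair: Callable[[str], str]) -> Path:
    """Return the folder that should be indexed.

    fix_html mirrors the proposed "fixHtml" configuration item.
    repair is a placeholder for the tag-soup-to-XHTML conversion.
    """
    pages = sorted(source_dir.rglob("*.html"))

    if not fix_html:
        # No extra pre-processing: index the source folder directly, but
        # turn a parse failure into a message that points at fixHtml.
        for page in pages:
            try:
                ET.parse(page)
            except ET.ParseError as err:
                raise IllFormedHtmlError(
                    f"{page} is not well-formed XML ({err}); "
                    "consider enabling fixHtml in the configuration file."
                ) from err
        return source_dir

    # fixHtml is on: write repaired copies into a temporary folder,
    # keeping each page at the same relative path as in the source.
    temp_dir = Path(tempfile.mkdtemp(prefix="fixed_html_"))
    for page in pages:
        target = temp_dir / page.relative_to(source_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        fixed = repair(page.read_text(encoding="utf-8"))
        target.write_text(fixed, encoding="utf-8")
    return temp_dir
```

The sketch only mirrors `.html` files; whether other resources would need to be copied into the temporary folder is left open here.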

This is obviously not for the 2.0 release, but it isn't too complicated, so it could be integrated into 2.1.

Sep 30 '24 16:09 martindholmes
