staticSearch
Handle tagsoup HTML
We've always had the constraint that we can only index websites that are well-formed XML in the XHTML namespace. That is a considerable limitation: it doesn't affect our own projects, but it severely limits the number of external users we could serve. The rescueTagSoup project (https://github.com/UVicHCMC/rescueTagSoup) uses the parsehtml-1.4.jar to turn any old HTML into well-formed XHTML, and then goes on to do other remediations. That initial step could be integrated into staticSearch to provide support for all websites.
- We would need to add a preliminary step to the build which would pre-process the source HTML into a temporary folder, retaining the directory structure (the rescueTagSoup project does exactly this), and then index that temporary folder.
- We would need to avoid doing this when it's not necessary (for our own projects, for example), otherwise the build process will be pointlessly extended. We could do this with a configuration file item called (say) "fixHtml", which you could turn on if you know you need it; and if you don't turn it on, and tokenization fails due to ill-formed HTML, an error message would suggest it.
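The two points above could be sketched roughly as follows. This is only an illustration in Python, not a proposal for how the staticSearch build itself would implement it: `prepare_source`, `is_well_formed`, and the `fixer` callable are hypothetical names, `fixer` stands in for the tag-soup parsing step that rescueTagSoup performs, and the XML parse check stands in for tokenization failing on ill-formed input.

```python
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Optional

HTML_SUFFIXES = {".html", ".htm"}  # assumption: which files count as HTML


def is_well_formed(path: Path) -> bool:
    """Stand-in for the tokenizer: True if the file parses as XML."""
    try:
        ET.parse(str(path))
        return True
    except ET.ParseError:
        return False


def prepare_source(src: Path, fix_html: bool = False,
                   fixer: Optional[Callable[[str], str]] = None) -> Path:
    """Return the folder that should be indexed (hypothetical helper)."""
    if not fix_html:
        # Normal path: no extra work, but a helpful hint on failure.
        bad = [p for p in sorted(src.rglob("*"))
               if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
               and not is_well_formed(p)]
        if bad:
            raise ValueError(
                f"{len(bad)} ill-formed HTML file(s), e.g. {bad[0]}; "
                "consider turning on fixHtml in the configuration file."
            )
        return src

    if fixer is None:
        raise ValueError("fix_html is on but no fixer was supplied")

    # Pre-process into a temporary folder, retaining the directory structure.
    tmp = Path(tempfile.mkdtemp(prefix="fixed_html_"))
    for p in sorted(src.rglob("*")):
        target = tmp / p.relative_to(src)
        if p.is_dir():
            target.mkdir(parents=True, exist_ok=True)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        if p.suffix.lower() in HTML_SUFFIXES:
            fixed = fixer(p.read_text(encoding="utf-8"))
            target.write_text(fixed, encoding="utf-8")
        else:
            shutil.copy2(p, target)  # non-HTML files are copied unchanged
    return tmp
```

With the option off, the source folder is returned untouched, so projects that are already well-formed XHTML pay nothing beyond the check; with it on, the indexer would be pointed at the returned temporary folder instead.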
This is obviously not for the 2.0 release, but it isn't too complicated, so it could be integrated into 2.1.