Thomas Rowlands
Re-using this PR since it is based on the same branch of my fork. This now includes a merge of all existing branches within my fork (see changes above). Main...
Pretty much all of the OCR code is unused right now. We used Tesseract years ago, but as far as I know it was experimental and never fully implemented as...
Reverted to draft, as more changes are needed to handle XML files that actually contain HTML content (or vice versa).
Following the sprint discussion this week, I thought I'd summarise the longer-term plan for file type processing in AC. So far, we aim to get HTML, XML, PDF, Word...
The XML/HTML version testing is definitely needed here, e.g. attempting to parse each file to confirm that its contents match the file extension, but we could just add it to the...
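For illustration, the content-versus-extension check discussed above could look roughly like the sketch below. It is not taken from the AC codebase: the function names `sniff_markup` and `extension_matches_content` are hypothetical, it uses only the Python standard library, and it treats anything with an HTML doctype or an `<html` tag near the top as HTML (so XHTML counts as HTML here, which may not be the behaviour we want).

```python
import xml.etree.ElementTree as ET


def sniff_markup(text: str) -> str:
    """Guess 'html', 'xml', or 'unknown' from the content alone.

    Hypothetical helper: a heuristic, not a validator.
    """
    # Look only at the start of the document for HTML markers.
    head = text.lstrip()[:1024].lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"
    # Otherwise, treat it as XML only if it is well-formed XML.
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return "unknown"
    return "xml"


def extension_matches_content(filename: str, text: str) -> bool:
    """Return True if the file extension agrees with the sniffed content."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    # Assumed mapping of extensions to expected markup types.
    expected = {"html": "html", "htm": "html", "xml": "xml"}.get(ext)
    return expected is not None and expected == sniff_markup(text)
```

A file named `paper.xml` whose body starts with `<html>` would fail this check, which is the mismatch case that prompted reverting the PR to draft.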
Implemented the updates suggested. Further refactors will be made to the codebase with other PRs once the supplementary material features are in.