apertium-html-tools
Bandwidth-efficient document translation
Hi! I just wanted to share the following idea:
Basics:
- ODT and DOCX files are ZIP archives
- Only the textual content is needed for translation
- Files within ZIP archives are compressed individually
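To illustrate the last point, here is a rough sketch of how a script could list the entries of a document file together with their compressed sizes by walking the ZIP central directory. The function names `findEndOfCentralDirectory` and `listZipEntries` are made up for this example, and the sketch assumes a single-disk archive without ZIP64 extensions.

```javascript
// Sketch: list the entries of a ZIP archive (e.g. an .odt or .docx file)
// by walking its central directory. Assumes no ZIP64 and a single-disk archive.
function findEndOfCentralDirectory(view) {
  // The end-of-central-directory (EOCD) record is at least 22 bytes long and
  // sits at the end of the file, possibly followed by an archive comment of
  // up to 65535 bytes, so scan backwards for its signature.
  const min = Math.max(0, view.byteLength - 22 - 0xffff);
  for (let pos = view.byteLength - 22; pos >= min; pos--) {
    if (view.getUint32(pos, true) === 0x06054b50) return pos;
  }
  throw new Error("not a ZIP archive: end of central directory not found");
}

function listZipEntries(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const eocd = findEndOfCentralDirectory(view);
  const count = view.getUint16(eocd + 10, true); // total number of entries
  let pos = view.getUint32(eocd + 16, true);     // start of central directory
  const decoder = new TextDecoder();
  const entries = [];
  for (let i = 0; i < count; i++) {
    if (view.getUint32(pos, true) !== 0x02014b50) {
      throw new Error("corrupt central directory");
    }
    const nameLength = view.getUint16(pos + 28, true);
    const extraLength = view.getUint16(pos + 30, true);
    const commentLength = view.getUint16(pos + 32, true);
    entries.push({
      name: decoder.decode(bytes.subarray(pos + 46, pos + 46 + nameLength)),
      method: view.getUint16(pos + 10, true), // 0 = stored, 8 = deflate
      compressedSize: view.getUint32(pos + 20, true),
      uncompressedSize: view.getUint32(pos + 24, true),
      localHeaderOffset: view.getUint32(pos + 42, true),
    });
    pos += 46 + nameLength + extraLength + commentLength;
  }
  return entries;
}
```

Summing `compressedSize` over the picture entries versus the XML entries gives a quick estimate of how much of an upload is not text.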
The following process could therefore (perhaps dramatically) reduce the network bandwidth required for document translation:
- Read the entire document file into memory using JavaScript
- Copy the (compressed) entries corresponding to the textual content into a new data structure and write a new ZIP central directory for them
- Submit the stripped-down (but still valid) ZIP file for translation
- Reintegrate the translated entries from the response into the original ZIP and update its central directory
- Put everything in a data URL and let the user save it to disk
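The stripping step could look roughly like the sketch below. It copies the selected entries byte for byte, so nothing is decompressed or recompressed, and then writes a fresh central directory with corrected offsets. The name `stripZip` and the `keep` predicate are hypothetical. The sketch assumes a single-disk archive without ZIP64 or encryption, and it guesses the size of a trailing data descriptor from its optional signature.

```javascript
// Sketch: build a smaller ZIP that keeps only selected entries, copying their
// compressed bytes verbatim. Assumes single-disk, no ZIP64, no encryption.
function stripZip(bytes, keep) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Locate the end-of-central-directory record by scanning backwards.
  let eocd = bytes.length - 22;
  while (eocd >= 0 && view.getUint32(eocd, true) !== 0x06054b50) eocd--;
  if (eocd < 0) throw new Error("not a ZIP archive");
  const count = view.getUint16(eocd + 10, true);
  let pos = view.getUint32(eocd + 16, true);

  const decoder = new TextDecoder();
  const localParts = [];   // raw local header + compressed data per kept entry
  const centralParts = []; // central directory records with patched offsets
  let offset = 0;          // write position in the new archive
  for (let i = 0; i < count; i++) {
    const nameLen = view.getUint16(pos + 28, true);
    const recordLen = 46 + nameLen +
      view.getUint16(pos + 30, true) + view.getUint16(pos + 32, true);
    const name = decoder.decode(bytes.subarray(pos + 46, pos + 46 + nameLen));
    if (keep(name)) {
      const flags = view.getUint16(pos + 8, true);
      const compressedSize = view.getUint32(pos + 20, true);
      const start = view.getUint32(pos + 42, true);
      // The local header carries its own name/extra lengths, which may
      // differ from those in the central directory.
      let end = start + 30 + view.getUint16(start + 26, true) +
        view.getUint16(start + 28, true) + compressedSize;
      if (flags & 0x08) {
        // A data descriptor follows: 12 bytes, or 16 with its signature.
        end += view.getUint32(end, true) === 0x08074b50 ? 16 : 12;
      }
      localParts.push(bytes.subarray(start, end));
      // Copy the central record so the offset can be patched safely.
      const record = new Uint8Array(bytes.subarray(pos, pos + recordLen));
      new DataView(record.buffer).setUint32(42, offset, true);
      centralParts.push(record);
      offset += end - start;
    }
    pos += recordLen;
  }

  // Write a new end-of-central-directory record for the kept entries.
  const centralSize = centralParts.reduce((sum, part) => sum + part.length, 0);
  const tail = new Uint8Array(22);
  const tailView = new DataView(tail.buffer);
  tailView.setUint32(0, 0x06054b50, true);
  tailView.setUint16(8, centralParts.length, true);
  tailView.setUint16(10, centralParts.length, true);
  tailView.setUint32(12, centralSize, true);
  tailView.setUint32(16, offset, true);

  const out = new Uint8Array(offset + centralSize + 22);
  let cursor = 0;
  for (const part of [...localParts, ...centralParts, tail]) {
    out.set(part, cursor);
    cursor += part.length;
  }
  return out;
}
```

In the browser this might be called as `stripZip(new Uint8Array(await file.arrayBuffer()), name => !name.startsWith("Pictures/"))`. That predicate is only an example for ODT files; which entries the translator actually needs would have to be checked for each format. Entry order is preserved, so an ODT `mimetype` entry stays first as long as it is kept.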
No API changes would be required, because the client-side script just strips unneeded content (e.g. pictures) from the ZIP file (i.e. the ODT or DOCX document), and Apertium does not care about that content anyway.