datashare
Wrong language detection of PDFs if metadata language is wrong
Describe the bug
If a PDF's creator metadata declares English (en_US etc.) but the document text is actually in another language or script, the main language is still set to English, even when most of the text is in another language.
To Reproduce
Steps to reproduce the behavior:
- Upload a Burmese PDF whose original creator metadata language is set to en_US
- Analyse Document -> Extract Text
- Document is tagged as English, despite almost all of the text being in another language
Expected behavior
The actual text of the PDF is analysed and the main language of that text is tagged. If possible, set multiple language tags (see #309).
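To illustrate the expected behavior, here is a minimal Python sketch of a sanity check that compares the language declared in the metadata with the dominant script of the extracted text. It is not Datashare's code and it only detects scripts, not languages. The function names and the `EXPECTED_SCRIPT` table are hypothetical, and the table covers just the two languages from this report.

```python
import unicodedata
from collections import Counter

# Hypothetical table: script expected for a given metadata language code.
# Only the two languages from this report are listed.
EXPECTED_SCRIPT = {"en": "LATIN", "my": "MYANMAR"}


def dominant_script(text):
    """Return the most frequent script among letters and marks, or None."""
    counts = Counter()
    for ch in text:
        # Count letters (L*) and combining marks (M*); skip digits, spaces, punctuation.
        if unicodedata.category(ch)[0] in ("L", "M"):
            # Unicode names begin with the script, e.g. "MYANMAR LETTER MA".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def metadata_matches_text(metadata_language, text):
    """Return False when the text's dominant script contradicts the metadata."""
    # Normalise tags such as "en_US", "en_us" or "en-US" to "en".
    code = metadata_language.replace("-", "_").split("_")[0].lower()
    expected = EXPECTED_SCRIPT.get(code)
    script = dominant_script(text)
    if expected is None or script is None:
        # Unknown language or no usable text: nothing to contradict.
        return True
    return script == expected
```

For a document like the one described here, `metadata_matches_text("en_US", extracted_text)` would return `False`, which could be used as a signal to ignore the metadata and run language detection on the text itself.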
Screenshots
Desktop (please complete the following information):
- OS: Linux
- Browser: Chrome
- Version: 89.0.4389.90 (Official Build)
Additional context
File attached: