Unexpected UTF-8 problems
This was on a Linux system, and the "Ã–" was an "Ö".
- [x] Fix `Ã.` problem above
- [x] Fix `LookupError: unknown encoding: EUC-TW` problem

  For plain text files it would be best to
  - [x] Review CLI
    - [x] `cli.py` (esp. `process_dir`)
    - [x] `ocrd_cli.py` - any plain text files supported here?
    - [x] `cli_line_dirs.py`
    - [x] `cli_summarize.py`?
  - [x] add `--plain-encoding` option so users have the chance to give it manually
  - [x] Fall back to detecting
    - [x] while warning about the auto detecting
- [x] What about the BOM now?
  - [x] Do we have a test that checks if files with BOM are read correctly?
Later
- [ ] Autodetect over all files
- [ ] falling back to UTF-8 if the detected charset is way out there/unknown like `EUC-TW`
Happens with our merged test directory.
Another one, this time with test (current dataset):

    (dinglehopper) mike.gerber@lx0246:~$ sh /data-ssd/mike.gerber/dta-gt-data/test-eval.sh
    Traceback (most recent call last):
      File "/home/mike.gerber/.pyenv/versions/dinglehopper/bin/dinglehopper-line-dirs", line 8, in <module>
        sys.exit(main())
      File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
        return self.main(*args, **kwargs)
      File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 1078, in main
        rv = self.invoke(ctx)
      File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 783, in invoke
        return __callback(*args, **kwargs)
      File "/home/mike.gerber/devel/dinglehopper/src/dinglehopper/cli_line_dirs.py", line 232, in main
        process(
      File "/home/mike.gerber/devel/dinglehopper/src/dinglehopper/cli_line_dirs.py", line 128, in process
        gt_text = plain_extract(gt_fn, include_filename_in_id=True)
      File "/home/mike.gerber/devel/dinglehopper/src/dinglehopper/ocr_files.py", line 167, in plain_extract
        with open(filename, "r", encoding=fileencoding) as f:
    LookupError: unknown encoding: EUC-TW

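The `LookupError` shows that the detected charset name went to `open()` unchecked, and Python only accepts names it has a codec for. A minimal sketch of a guard, assuming a hypothetical helper `usable_encoding` (not part of dinglehopper) that checks the name with `codecs.lookup` before use:

```python
import codecs
import warnings


def usable_encoding(detected, fallback="utf-8"):
    """Return `detected` if Python has a codec for it, else `fallback`.

    Hypothetical helper for illustration, not dinglehopper's actual code.
    """
    if detected is None:
        return fallback
    try:
        # codecs.lookup raises LookupError for names Python does not know.
        codecs.lookup(detected)
    except LookupError:
        warnings.warn(
            f"Detected encoding {detected!r} is unknown to Python, "
            f"falling back to {fallback!r}"
        )
        return fallback
    return detected
```

With this, `usable_encoding("EUC-TW")` would return the fallback instead of letting `open()` fail later.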
`chardet` seems to be bad at dealing with these short texts:

    In [2]: print(chardet.detect("Nur zum Prüfen von 'chardet'.".encode("utf-8")))
    {'encoding': 'ISO-8859-9', 'confidence': 0.6587004243912733, 'language': 'Turkish'}

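What makes such a wrong guess nasty is that it does not raise an error: the UTF-8 bytes are also decodable as ISO-8859-9, so reading succeeds and silently produces mojibake. A small illustration using only the standard library (no `chardet` involved):

```python
text = "Nur zum Prüfen von 'chardet'."
data = text.encode("utf-8")

# Python's ISO-8859-9 codec maps every byte value to some character, so
# decoding with the wrong charset does not fail, it just yields wrong text.
wrong = data.decode("iso-8859-9")
right = data.decode("utf-8")

print(wrong)
print(right)
```

The two-byte UTF-8 sequence for "ü" comes out as two separate characters in `wrong`, which is the same kind of damage as in the original report.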
For plain text files it would be best to
- add `--plain-encoding` option so users have the chance to give it manually
- Fall back to detecting, but over all files
  - while warning about the auto detecting
  - falling back to UTF-8 if the detected charset is way out there/unknown like `EUC-TW`
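The points above could be combined roughly as follows. This is only a sketch under assumptions: `choose_encoding` and its `detect` parameter are hypothetical names, and detecting over the concatenated bytes is just one possible reading of "over all files".

```python
import codecs
import warnings
from pathlib import Path


def choose_encoding(filenames, plain_encoding=None, detect=None, fallback="utf-8"):
    """Pick one encoding for a set of plain text files.

    Hypothetical sketch. `detect` is any callable that takes bytes and
    returns a charset name or None, e.g. a thin wrapper around chardet.
    """
    # 1. An explicitly given encoding always wins.
    if plain_encoding is not None:
        return plain_encoding

    # 2. Otherwise detect over the bytes of all files together, so short
    #    files do not each get their own unreliable guess.
    warnings.warn("No encoding given, auto-detecting it; the guess may be wrong")
    sample = b"\n".join(Path(fn).read_bytes() for fn in filenames)
    detected = detect(sample) if detect is not None else None

    # 3. Fall back if nothing was detected or Python has no such codec.
    if detected is None:
        return fallback
    try:
        codecs.lookup(detected)
    except LookupError:
        warnings.warn(f"Unknown detected encoding {detected!r}, using {fallback!r}")
        return fallback
    return detected
```

With chardet installed, the detector could be passed as `detect=lambda b: chardet.detect(b)["encoding"]`.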
Branch now has `--plain-encoding` and warns about auto-detecting (for `dinglehopper-line-dirs`).
This probably gives us problems with the UTF-8 BOM again, need to check.
We also need to review the CLIs again, I don't even remember we had an option to process directories (!= directories of lines)...
Note: working in the feat/flex-line-dirs branch on this, because 1. it came up there 2. the line dirs are especially affected because short texts are the input format there.
`cli.py` now has a `--plain-encoding` option. `cli_summarize.py` only works with reports and is not affected. `ocrd_cli.py` also supports plain text files, so the OCR-D processor has a `plain_encoding` parameter now.
Hi Mike, could you release this feature? We're currently having this problem and waiting for the option to specify the encoding in the Python code :)
> Hi Mike, could you release this feature? We're currently having this problem and waiting for the option to specify the encoding in the Python code :)

Working on it, it's unfortunately a bit intertwined with the line-dir updates I have in the queue (and requires a bit of work to test all the different CLIs we have now.)
> Hi Mike, could you release this feature?

It's a bug, not a feature 🤓
I've updated the task list above and put two items into the "later" category. There's only one thing I want to take a look at before merging the fixes (currently living in the related feat/flex-line-dirs branch): Checking if the BOM is handled correctly. We had a user with a BOM-related problem and I don't want a regression here.
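A BOM regression check could look something like this. It is a sketch, not the project's actual test: `read_text_bom_aware` and `check_bom_roundtrip` are hypothetical names, and it relies on Python's `utf-8-sig` codec, which reads UTF-8 files with or without a BOM.

```python
import codecs
import tempfile
from pathlib import Path


def read_text_bom_aware(filename, encoding="utf-8"):
    """Read a text file, dropping a leading UTF-8 BOM if there is one.

    Hypothetical helper: for UTF-8 it switches to the "utf-8-sig" codec,
    which accepts files both with and without a BOM.
    """
    if codecs.lookup(encoding).name == "utf-8":
        encoding = "utf-8-sig"
    with open(filename, "r", encoding=encoding) as f:
        return f.read()


def check_bom_roundtrip():
    """Write the same text with and without a BOM and read both back."""
    text = "Prüfung\n"
    with tempfile.TemporaryDirectory() as tmp:
        with_bom = Path(tmp) / "with_bom.txt"
        without_bom = Path(tmp) / "without_bom.txt"
        with_bom.write_bytes(codecs.BOM_UTF8 + text.encode("utf-8"))
        without_bom.write_bytes(text.encode("utf-8"))
        return read_text_bom_aware(with_bom), read_text_bom_aware(without_bom)
```

Both returned strings should be identical and should not start with `"\ufeff"`. An explicitly given `--plain-encoding utf-8` would go through the same path in this sketch.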
Fixes for this are now mostly[^1] merged into master and ~will be in the next release~ are released in 0.11.0.
[^1]: except for the tasks under "Later"