dinglehopper

Unexpected UTF-8 problems

Open mikegerber opened this issue 1 year ago • 13 comments

image

This was on a Linux system, and the "Ã" was an "Ö".
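For readers unfamiliar with this symptom, here is a minimal sketch of how an "Ö" can turn into "Ã" plus an invisible character. It assumes the file was UTF-8 and was decoded with a Latin-1-like single-byte codec; the issue does not state which codec was actually picked, so `latin-1` is only an illustration.

```python
# Reproduce the symptom: UTF-8 bytes decoded with a single-byte codec.
original = "Ö"
utf8_bytes = original.encode("utf-8")  # two bytes: 0xC3 0x96

# Assumption: the wrong codec was Latin-1-like. Each byte becomes one character,
# so the result is "Ã" followed by the control character U+0096.
garbled = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were dropped along the way.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The second character is a non-printing control character, which is probably why only the "Ã" stands out in a rendered report.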

  • [x] Fix Ã. problem above
  • [x] Fix LookupError: unknown encoding: EUC-TW problem

For plain text files it would be best to

  • [x] Review CLI

    • [x] cli.py (esp. process_dir)
    • [x] ocrd_cli.py - any plain text files supported here?
    • [x] cli_line_dirs.py
    • [x] cli_summarize.py?
  • [x] add --plain-encoding option so users have the chance to give it manually

  • [x] Fall back to detecting

  • [x] while warning about the auto detecting

  • [x] What about the BOM now?

    • [x] Do we have a test that checks if files with BOM are read correctly?

Later

  • [ ] Autodetect over all files
  • [ ] falling back to UTF-8 if the detected charset is way out there/unknown like EUC-TW

mikegerber avatar Dec 11 '24 13:12 mikegerber

Happens with our merged test directory.

mikegerber avatar Dec 12 '24 19:12 mikegerber

Another one, this time with test (current dataset):

(dinglehopper) mike.gerber@lx0246:~$ sh /data-ssd/mike.gerber/dta-gt-data/test-eval.sh
Traceback (most recent call last):
  File "/home/mike.gerber/.pyenv/versions/dinglehopper/bin/dinglehopper-line-dirs", line 8, in <module>
    sys.exit(main())
  File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mike.gerber/.pyenv/versions/3.9.20/envs/dinglehopper/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mike.gerber/devel/dinglehopper/src/dinglehopper/cli_line_dirs.py", line 232, in main
    process(
  File "/home/mike.gerber/devel/dinglehopper/src/dinglehopper/cli_line_dirs.py", line 128, in process
    gt_text = plain_extract(gt_fn, include_filename_in_id=True)
  File "/home/mike.gerber/devel/dinglehopper/src/dinglehopper/ocr_files.py", line 167, in plain_extract
    with open(filename, "r", encoding=fileencoding) as f:
LookupError: unknown encoding: EUC-TW

mikegerber avatar Dec 16 '24 10:12 mikegerber

chardet seems to be bad at dealing with these short texts:

In [2]: print(chardet.detect("Nur zum Prüfen von 'chardet'.".encode("utf-8")))
{'encoding': 'ISO-8859-9', 'confidence': 0.6587004243912733, 'language': 'Turkish'}

For plain text files it would be best to

  • add --plain-encoding option so users have the chance to give it manually
  • Fall back to detecting, but over all files
  • while warning about the auto detecting
  • falling back to UTF-8 if the detected charset is way out there/unknown like EUC-TW
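The order of precedence proposed in this list could be sketched roughly as follows. This is not the code that ended up in dinglehopper; `resolve_encoding` and its parameters are hypothetical names, and the detector result is passed in as a plain string so the sketch does not depend on chardet.

```python
import codecs
import warnings
from typing import Optional


def resolve_encoding(explicit: Optional[str], detected: Optional[str]) -> str:
    """Pick the encoding for a plain text file (hypothetical helper)."""
    if explicit:
        # A user-supplied value wins; codecs.lookup raises LookupError on typos.
        return codecs.lookup(explicit).name

    warnings.warn("No encoding given, falling back to auto-detection")
    if detected:
        try:
            # Only accept detector output that Python can actually decode.
            return codecs.lookup(detected).name
        except LookupError:
            # e.g. "EUC-TW": reported by the detector but unknown to Python
            warnings.warn(f"Detected encoding {detected!r} is unknown, using UTF-8")

    # Last resort when nothing usable was detected.
    return "utf-8"
```

Validating the detected name with `codecs.lookup` before calling `open()` is what avoids the `LookupError` from the traceback above. It does not help with plausible but wrong guesses such as ISO-8859-9, which is why the manual option matters.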

mikegerber avatar Dec 18 '24 12:12 mikegerber

Branch now has --plain-encoding and warns about auto-detecting (for dinglehopper-line-dirs)

image

mikegerber avatar Dec 19 '24 10:12 mikegerber

This probably gives us problems with the UTF-8 BOM again, need to check.

mikegerber avatar Dec 19 '24 10:12 mikegerber

We also need to review the CLIs again, I didn't even remember that we had an option to process directories (!= directories of lines)...

mikegerber avatar Dec 19 '24 13:12 mikegerber

Note: working in the feat/flex-line-dirs branch on this, because 1. it came up there and 2. the line dirs are especially affected because short texts are the input format there.

mikegerber avatar Dec 19 '24 13:12 mikegerber

  • cli.py now has a --plain-encoding option.
  • cli_summarize.py only works with reports and is not affected.
  • ocrd_cli.py also supports plain text files, so the OCR-D processor has a plain_encoding parameter now.

mikegerber avatar Feb 13 '25 15:02 mikegerber

Hi Mike, could you release this feature? We're currently having this problem and waiting for the option to specify the encoding in the Python code :)

tdoan2010 avatar Apr 17 '25 12:04 tdoan2010

> Hi Mike, could you release this feature? We're currently having this problem and waiting for the option to specify the encoding in the Python code :)

Working on it, it's unfortunately a bit intertwined with the line-dir updates I have in the queue (and requires a bit of work to test all the different CLIs we have now.)

mikegerber avatar Apr 22 '25 10:04 mikegerber

> Hi Mike, could you release this feature?

It's a bug, not a feature 🤓

mikegerber avatar Apr 22 '25 10:04 mikegerber

I've updated the task list above and put two items into the "later" category. There's only one thing I want to take a look at before merging the fixes (currently living in the related feat/flex-line-dirs branch): Checking if the BOM is handled correctly. We had a user with a BOM-related problem and I don't want a regression here.
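A BOM check of this kind could look roughly like the sketch below. It is not dinglehopper's actual test; it only relies on Python's standard `utf-8-sig` codec, which strips a leading BOM, and on a temporary file created for the purpose.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with a byte order mark.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(codecs.BOM_UTF8 + "Prüfung".encode("utf-8"))
    path = f.name

try:
    with open(path, "r", encoding="utf-8") as f:
        with_plain_utf8 = f.read()  # the BOM survives as a leading U+FEFF
    with open(path, "r", encoding="utf-8-sig") as f:
        with_utf8_sig = f.read()  # the BOM is stripped by the codec
finally:
    os.remove(path)
```

A stray U+FEFF at the start of the text would count as an extra character in a character-level comparison, which is the kind of regression worth guarding against.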

mikegerber avatar Apr 24 '25 14:04 mikegerber

Fixes for this are now mostly[^1] merged into master and ~will be in the next release~ are released in 0.11.0.

[^1]: except for the tasks under "Later"

mikegerber avatar Apr 24 '25 14:04 mikegerber