hocr-to-epub: require hocr_xml_file_path to end with _hocr.html

Open milahu opened this issue 2 months ago • 0 comments

Quick fix to avoid broken file paths: require hocr_xml_file_path to end with _hocr.html.

Users can bypass this requirement by setting all file paths explicitly.

Before this patch, ImageStack.parse_stack tried to open the hOCR file as if it were the jp2.zip image archive:

$ hocr-to-epub -f 001.hocr -o 001.epub
Traceback (most recent call last):
  File "/nix/store/km6ybsgiig9bnrw2n5csw8ivasamsn90-archive-hocr-tools-1.1.67/bin/.hocr-to-epub-wrapped", line 750, in <module>
    EpubGenerator(
    ~~~~~~~~~~~~~^
        args.infile,
        ^^^^^^^^^^^^
    ...<4 lines>...
        use_kakadu=args.kakadu,
        ^^^^^^^^^^^^^^^^^^^^^^^
        ignore_broken_images=args.ignore_broken_images)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/km6ybsgiig9bnrw2n5csw8ivasamsn90-archive-hocr-tools-1.1.67/bin/.hocr-to-epub-wrapped", line 327, in __init__
    self.img_stack = ImageStack(
                     ~~~~~~~~~~^
            self.image_stack_zip_file_path,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            os.path.join(WORKING_DIR,"epub_img"),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            use_kakadu=use_kakadu,
            ^^^^^^^^^^^^^^^^^^^^^^
            ignore_broken_images=ignore_broken_images)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/km6ybsgiig9bnrw2n5csw8ivasamsn90-archive-hocr-tools-1.1.67/bin/.hocr-to-epub-wrapped", line 82, in __init__
    self.parse_stack()
    ~~~~~~~~~~~~~~~~^^
  File "/nix/store/km6ybsgiig9bnrw2n5csw8ivasamsn90-archive-hocr-tools-1.1.67/bin/.hocr-to-epub-wrapped", line 99, in parse_stack
    self.zf = tarfile.open(self.image_archive_file_path)
              ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/sd81bvmch7njdpwx3lkjslixcbj5mivz-python3-3.13.4/lib/python3.13/tarfile.py", line 1882, in open
    raise ReadError(f"file could not be opened successfully:\n{error_msgs_summary}")
tarfile.ReadError: file could not be opened successfully:
- method gz: ReadError('not a gzip file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('invalid header')
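
A minimal sketch of how such a failure could arise, assuming the image archive path is derived from the hOCR path by a suffix replacement. The helper name `derive_image_stack_path` and the exact derivation are hypothetical and are not taken from the hocr-to-epub source. When the suffix is missing, `str.replace` returns the input unchanged, so the hOCR file itself ends up being passed on as the archive:

```python
def derive_image_stack_path(hocr_xml_file_path: str) -> str:
    # Hypothetical derivation (an assumption, not the tool's actual code):
    # swap the hOCR suffix for the image archive suffix.
    return hocr_xml_file_path.replace("_hocr.html", "_jp2.zip")


# With the expected suffix, a distinct archive path is produced.
print(derive_image_stack_path("001_hocr.html"))

# Without the suffix, str.replace finds nothing to replace and the
# hOCR path comes back unchanged, so it would be opened as an archive.
print(derive_image_stack_path("001.hocr"))
```

Under this assumption, the error only surfaces later, when the archive is opened, which is why the traceback points at tarfile rather than at the file name.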

After this patch, it fails early:

$ hocr-to-epub -f 001.hocr -o 001.epub
Traceback (most recent call last):
  File "/nix/store/km6ybsgiig9bnrw2n5csw8ivasamsn90-archive-hocr-tools-1.1.67/bin/.hocr-to-epub-wrapped", line 750, in <module>
    EpubGenerator(
    ~~~~~~~~~~~~~^
        args.infile,
        ^^^^^^^^^^^^
    ...<4 lines>...
        use_kakadu=args.kakadu,
        ^^^^^^^^^^^^^^^^^^^^^^^
        ignore_broken_images=args.ignore_broken_images)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/km6ybsgiig9bnrw2n5csw8ivasamsn90-archive-hocr-tools-1.1.67/bin/.hocr-to-epub-wrapped", line 307, in __init__
    assert self.hocr_xml_file_path.endswith('_hocr.html')
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
AssertionError
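
The bare AssertionError above carries no message, and assert statements are skipped when Python runs with -O. As a possible follow-up, not what this patch does, the same check could raise an exception that names the offending path. The function name `require_hocr_suffix` is hypothetical:

```python
def require_hocr_suffix(hocr_xml_file_path: str, suffix: str = "_hocr.html") -> str:
    # Hypothetical helper: same condition as the assert in the patch,
    # but with an explicit exception and a descriptive message.
    if not hocr_xml_file_path.endswith(suffix):
        raise ValueError(
            f"expected hOCR file path ending with {suffix!r}, "
            f"got {hocr_xml_file_path!r}"
        )
    return hocr_xml_file_path


# Accepted: the path is returned unchanged.
require_hocr_suffix("001_hocr.html")

# Rejected: raises ValueError naming the offending path.
try:
    require_hocr_suffix("001.hocr")
except ValueError as error:
    print(error)
```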

milahu, Oct 23 '25 15:10