
Wrong first_page values

Open ChefAndy opened this issue 4 years ago • 2 comments

A number of cases have incorrect values for first_page.

Some of them seem to be off-by-c-cases, and some seem to be off-by-n-pages.

Jack made these observations:

      <file ID="tiff_00242_1" MIMETYPE="image/tiff" CHECKSUM="f41fbffea9879f004c9e173157fc4a3c" CHECKSUMTYPE="MD5" SIZE="52268">
        <FLocat LOCTYPE="URL" xlink:href="images/32044078598745_00242_1.tif"/>
      </file>
      <file ID="tiff_00246_0" MIMETYPE="image/tiff" CHECKSUM="782c8036ae374018d2c88f9db152267a" CHECKSUMTYPE="MD5" SIZE="27176">
        <FLocat LOCTYPE="URL" xlink:href="images/32044078598745_00246_0.tif"/>
      </file>

<-- where the page numbers go wrong

then later in the file
      <div ORDER="484" ORDERLABEL="476" TYPE="page">
        <fptr FILEID="tiff_00242_1"/>...
      <div ORDER="485" ORDERLABEL="477" TYPE="page">
        <fptr FILEID="tiff_00246_0"/>
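
A rough sketch of how gaps like this one could be found automatically: walk the page divs in order and flag neighbours whose ORDERLABEL goes up by exactly 1 while the image number in the FILEID jumps by more than 1. It assumes the middle number in IDs like `tiff_00242_1` is an image sequence number, and that each page div's first `fptr` carries such an ID. `page_divs`, `suspicious_gaps` and `SAMPLE` are made-up names for illustration, not anything in the codebase.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: FILEIDs end in "_<image sequence>_<suffix>", e.g. "tiff_00242_1".
FILEID_RE = re.compile(r"_(\d+)_\d+$")


def local(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def page_divs(xml_text):
    """Yield (orderlabel, image_seq) for each page div, in document order."""
    root = ET.fromstring(xml_text)
    for div in root.iter():
        if local(div.tag) != "div" or div.get("TYPE") != "page":
            continue
        # Use the first fptr child; skip the div if it has none.
        fptr = next((c for c in div if local(c.tag) == "fptr"), None)
        if fptr is None:
            continue
        match = FILEID_RE.search(fptr.get("FILEID", ""))
        label = div.get("ORDERLABEL", "")
        # Non-numeric labels (e.g. roman numerals) are skipped in this sketch.
        if match and label.isdigit():
            yield int(label), int(match.group(1))


def suspicious_gaps(pages):
    """Return adjacent pairs with consecutive labels but skipped image numbers."""
    pages = list(pages)
    return [
        (a, b)
        for a, b in zip(pages, pages[1:])
        if b[0] - a[0] == 1 and b[1] - a[1] > 1
    ]


# Hypothetical, trimmed-down structMap built from the two divs quoted above.
SAMPLE = """
<structMap>
  <div TYPE="volume">
    <div ORDER="484" ORDERLABEL="476" TYPE="page"><fptr FILEID="tiff_00242_1"/></div>
    <div ORDER="485" ORDERLABEL="477" TYPE="page"><fptr FILEID="tiff_00246_0"/></div>
  </div>
</structMap>
"""

print(suspicious_gaps(page_divs(SAMPLE)))
# -> [((476, 242), (477, 246))]
```

This only catches the spot where the labels go wrong, not how many labels after it are shifted, so each hit would still need a manual check.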

ChefAndy avatar Feb 26 '20 19:02 ChefAndy

status update: this is next on my docket.

ChefAndy avatar Apr 02 '20 18:04 ChefAndy

Apparently it wasn't.

ChefAndy avatar Sep 15 '21 20:09 ChefAndy

Slack context:

not sure where the error entered, but i'm showing 32044078598745_redacted_ALTO_00242_1.xml.gz as page label 476 from innodata, and alto/32044078598745_redacted_ALTO_00246_0.xml.gz as page label 477

like they dropped a bunch of page images and kept labeling consecutively

so one angle on fixing this is finding missing page ranges like that -- i wonder if the other vol has the same issue

another angle is in the alto for 00246_0 there's a block labeled pagelabel with the correct text 483, so we could crosscheck from there

though that would be noisy

but like a report of alto pages where pagelabel doesn't match the volmets ORDERLABEL might reveal a lot of stretches that could be quickly confirmed in a spreadsheet

if you have a whole run of consecutive pagelabels with offset orderlabels
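
A minimal sketch of that report, assuming the per-page ALTO pagelabel text and the volmets ORDERLABEL have already been pulled out into `(page_id, orderlabel, pagelabel)` tuples in volume order (the extraction itself is not shown). `offset_runs` is a hypothetical helper name. It groups consecutive pages that share the same nonzero offset, so each run becomes one spreadsheet row to confirm.

```python
from itertools import groupby


def offset_runs(pages):
    """Group consecutive pages by (ALTO pagelabel - volmets ORDERLABEL).

    `pages` is an iterable of (page_id, orderlabel, pagelabel) in volume order.
    Pages with a missing or non-numeric label get offset None; they end the
    current run and are left out of the report, as are pages with offset 0.
    Returns a list of (offset, first_page_id, last_page_id, page_count).
    """
    def offset(record):
        _, orderlabel, pagelabel = record
        try:
            return int(pagelabel) - int(orderlabel)
        except (TypeError, ValueError):
            return None

    runs = []
    for off, group in groupby(pages, key=offset):
        group = list(group)
        if off not in (0, None):
            runs.append((off, group[0][0], group[-1][0], len(group)))
    return runs


# Only the 00246_0 row (ORDERLABEL 477, pagelabel 483) comes from the thread;
# the pagelabel for 00242_1 and the whole 00246_1 row are invented here.
rows = [
    ("00242_1", "476", "476"),
    ("00246_0", "477", "483"),
    ("00246_1", "478", "484"),
]

print(offset_runs(rows))
# -> [(6, '00246_0', '00246_1', 2)]
```

Long runs with a steady offset are probably real labeling shifts. One-page runs are more likely OCR noise in the pagelabel block and could be filtered by `page_count`.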

jcushman avatar Jan 12 '23 19:01 jcushman