capstone
Wrong first_page values
A number of cases have incorrect values for first_page.
Some of them seem to be off-by-c-cases, and some seem to be off-by-n-pages.
Jack made these observations:
<file ID="tiff_00242_1" MIMETYPE="image/tiff" CHECKSUM="f41fbffea9879f004c9e173157fc4a3c" CHECKSUMTYPE="MD5" SIZE="52268">
<FLocat LOCTYPE="URL" xlink:href="images/32044078598745_00242_1.tif"/>
</file>
<file ID="tiff_00246_0" MIMETYPE="image/tiff" CHECKSUM="782c8036ae374018d2c88f9db152267a" CHECKSUMTYPE="MD5" SIZE="27176">
<FLocat LOCTYPE="URL" xlink:href="images/32044078598745_00246_0.tif"/>
</file>
<-- where the page numbers go wrong
then later in the file
<div ORDER="484" ORDERLABEL="476" TYPE="page">
<fptr FILEID="tiff_00242_1"/>...
<div ORDER="485" ORDERLABEL="477" TYPE="page">
<fptr FILEID="tiff_00246_0"/>
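One way to look for spots like this across a volume is to compare how far the image file IDs advance with how far ORDERLABEL advances between neighbouring pages. This is a minimal sketch, not a tested fix. It assumes file IDs look like tiff_<sequence>_<side> with sides 0 and 1. That is only a guess about the naming, though it agrees with the 483 pagelabel mentioned in the Slack context, since 00242_1 to 00246_0 would then skip six images. `position` and `find_label_gaps` are hypothetical helper names, and the input is a plain list of (FILEID, ORDERLABEL) pairs already pulled out of the volume METS.

```python
import re

# Assumed FILEID shape: tiff_<sequence>_<side>, e.g. tiff_00242_1
FILE_ID = re.compile(r"tiff_(\d+)_(\d+)$")


def position(file_id):
    """Linear image position, assuming two sides (_0 and _1) per sequence number."""
    match = FILE_ID.match(file_id)
    if match is None:
        raise ValueError(f"unexpected FILEID: {file_id!r}")
    sequence, side = int(match.group(1)), int(match.group(2))
    return sequence * 2 + side


def find_label_gaps(pages):
    """Return (previous FILEID, current FILEID, missing label count) tuples.

    pages is a list of (FILEID, ORDERLABEL) strings in ORDER sequence.
    A pair is reported when the images advance by a different amount
    than the labels do. Non-numeric labels are skipped.
    """
    gaps = []
    for (prev_id, prev_label), (cur_id, cur_label) in zip(pages, pages[1:]):
        if not (prev_label.isdigit() and cur_label.isdigit()):
            continue
        image_step = position(cur_id) - position(prev_id)
        label_step = int(cur_label) - int(prev_label)
        if image_step != label_step:
            gaps.append((prev_id, cur_id, image_step - label_step))
    return gaps


# The two pages quoted above
pages = [("tiff_00242_1", "476"), ("tiff_00246_0", "477")]
print(find_label_gaps(pages))
```

For the two pages quoted above, the reported difference is 6, meaning the labels are six short of where the image numbering suggests they should be.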
status update: this is next on my docket.
Apparently it wasn't.
Slack context:
not sure where the error entered, but i'm showing 32044078598745_redacted_ALTO_00242_1.xml.gz as page label 476 from innodata, and alto/32044078598745_redacted_ALTO_00246_0.xml.gz as page label 477
like they dropped a bunch of page images and kept labeling consecutively
so one angle on fixing this is finding missing page ranges like that -- i wonder if the other vol has the same issue
another angle is in the alto for 00246_0 there's a block labeled pagelabel with the correct text 483, so we could crosscheck from there
though that would be noisy
but like a report of alto pages where pagelabel doesn't match the volmets ORDERLABEL might reveal a lot of stretches that could be quickly confirmed in a spreadsheet if you have a whole run of consecutive pagelabels with offset orderlabels
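The report idea could be sketched roughly like this. It assumes the ALTO pagelabel text and the volmets ORDERLABEL have already been extracted into rows of (ORDER, ORDERLABEL, ALTO pagelabel). The extraction itself is not shown because the exact ALTO block layout is not given here. `offset_runs` is a hypothetical name, and apart from the 477 vs 483 pair the sample rows are made up for illustration.

```python
from itertools import groupby


def offset_runs(rows):
    """Group consecutive pages whose ALTO pagelabel differs from ORDERLABEL
    by the same nonzero amount.

    rows: iterable of (order, order_label, alto_label) in ORDER sequence.
    Rows with a missing or non-numeric label are dropped first, so a run
    may span a page whose label could not be read.
    Returns a list of (first_order, last_order, offset) tuples.
    """
    keyed = []
    for order, order_label, alto_label in rows:
        try:
            keyed.append((order, int(alto_label) - int(order_label)))
        except (TypeError, ValueError):
            continue  # noisy or missing label, skip it

    runs = []
    for offset, group in groupby(keyed, key=lambda item: item[1]):
        group = list(group)
        if offset != 0:
            runs.append((group[0][0], group[-1][0], offset))
    return runs


# Illustrative rows only: ORDER 485 reflects the 477 vs 483 case above,
# the others are invented to show what a run looks like.
rows = [
    (484, "476", "476"),
    (485, "477", "483"),
    (486, "478", "484"),
    (487, "479", None),
    (488, "480", "486"),
]
for first, last, offset in offset_runs(rows):
    print(f"ORDER {first}-{last}: pagelabel is ORDERLABEL {offset:+d}")
```

Each output line is one stretch to confirm in a spreadsheet, which is much less to check than page-by-page mismatches.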