mik CsvBooks doesn't like periods in filenames

I got a bunch of files with names like this: 1987.0019.0039.0001.TIF

I set page_sequence_separator = .

Result:

[2019-02-13 18:44:34] input validator.ERROR: Input validation failed {"record ID":"1","book object directory":"/Volumes/Barkerville/RG136/books/1","error":"Some files in the book object directory have invalid extensions"} []

If I change the filename: 1987.0019.0039_0001.tif and set page_sequence_separator = _, it works.

Hypothesis: using a period as a page sequence separator doesn't work because MIK reads ".tif" as the page rather than the extension.

Alternate hypothesis: MIK is reading .0019.0039.0001.tif as a file extension rather than as part of a filename followed by a page number followed by extension.

Is it possible to fix this?

Feb 13 '19 18:02 bondjimbond

I think this is more to do with using TIF as an extension that using a period. The validation error is coming from https://github.com/MarcusBarnes/mik/blob/master/src/inputvalidators/CsvBooks.php#L140. Coincidentally I opened a PR (#496) two weeks ago that will let you indicate TIF as a valid file extension.

Just to be sure, can you change the filenames back to using periods as separators but leave the (currently invalid until we merge #496) TIF extension as is and see what happens?

Feb 16 '19 17:02 mjordan

Sorry, my logic is wrong. Change the extension to tif and try using periods as separators.

Feb 16 '19 18:02 mjordan

OK, a couple of tests:

Separator _, extension tif = success Separator _, extension TIF = fail Separator ., extension TIF = fail Separator ., extension tif = success

So it looks like the uppercase/lowercase extension is the culprit.

But interestingly, it behaves very unpredictably in my set -- because the first record is not .0001.tif but instead .0000.tif.

The presence of this file results in weird behaviours. Typically directory 0 just isn't generated, but it can also cause MIK to skip, say, directory 2, or more directories.

Feb 19 '19 14:02 bondjimbond

@bondjimbond when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?

Feb 19 '19 15:02 mjordan

when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?

Yes... I was running with a test directory of just four images (0000.tif through 0003.tif).

The first few runs, it produced directories 1, 2, and 3.

Later runs, just 1 and 3.

Another later run, just 3.

After removing 0000.tif, it consistently produced 1, 2, and 3.

Feb 19 '19 15:02 bondjimbond

Could you zip up all your data and config files and send them to me so I can try to replicate that?

Feb 19 '19 15:02 mjordan

Sure, here's my test directory and ini file: https://vault.sfu.ca/index.php/s/ChGaez7NLOygY3w

Feb 19 '19 15:02 bondjimbond

OK, got it, I'll give it a try this evening.

Feb 19 '19 15:02 mjordan

Can you send me your mappings file?

Feb 19 '19 15:02 mjordan

Ack, of course, sorry barkerville_mapping.txt

Feb 19 '19 15:02 bondjimbond

@bondjimbond Strangely, I can't replicate the behavior you are seeing. I ran MIK about 10 times and always got the same thing: page objects for pages 1-3 and an error indicating a problem with the 0000 file ([...] added by me):

"message":"mkdir(): File exists" [...] "filename_segments":["1987","0019","0039","0000"],"page_number":""

The problem is coming from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvBooks.php#L132-L134: since we trim all left padding 0s, we need something other than a 0000 as the page number. I'm not sure a fix to allow 0000 as a page number would be trivial.

Feb 20 '19 03:02 mjordan

Although a check to see if $page_number is an empty string, and if it is, assign it a value of 0 to create a 0 directory, might be a simple fix. But do you want 0 to be the first page number instead of 1?

Something like:

$page_number = ltrim(end($filename_segments), '0');
if (strlen($page_number) === 0) {
  $page_number = '0';
}
$page_level_output_dir = $book_level_output_dir . DIRECTORY_SEPARATOR . $page_number;
mkdir($page_level_output_dir);

Feb 20 '19 03:02 mjordan

Just tried that, it worked:

/tmp/brandon_books/
└── 1987
    ├── 0
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 1
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 2
    │   ├── MODS.xml
    │   └── OBJ.tif
    ├── 3
    │   ├── MODS.xml
    │   └── OBJ.tif
    └── MODS.xml

MODS.xml for page 0 is:

  <titleInfo>
    <title>This is a title, page 0</title>
  </titleInfo>
</mods>

MODS.xml for page 1 is:

  <titleInfo>
    <title>This is a title, page 1</title>
  </titleInfo>
</mods>

Feb 20 '19 03:02 mjordan

That's exactly what I need! :)

Feb 20 '19 14:02 bondjimbond

OK, I can open a PR for this if you want.

Feb 20 '19 15:02 mjordan

Please do!

Feb 20 '19 15:02 bondjimbond

I've made the same change to the CsvNewspapers writer and pushed up the issue-498 branch. I'll need to assemble some test data later but once I do that I'll open a PR.

Feb 20 '19 15:02 mjordan

mik mik copied to clipboard

CsvBooks doesn't like periods in filenames

mik
mik copied to clipboard