mik
mik copied to clipboard
CsvBooks doesn't like periods in filenames
I got a bunch of files with names like this: 1987.0019.0039.0001.TIF
I set page_sequence_separator = .
Result:
[2019-02-13 18:44:34] input validator.ERROR: Input validation failed {"record ID":"1","book object directory":"/Volumes/Barkerville/RG136/books/1","error":"Some files in the book object directory have invalid extensions"} []
If I change the filename: 1987.0019.0039_0001.tif
and set page_sequence_separator = _
, it works.
Hypothesis: using a period as a page sequence separator doesn't work because MIK reads ".tif" as the page rather than the extension.
Alternate hypothesis: MIK is reading .0019.0039.0001.tif
as a file extension rather than as part of a filename followed by a page number followed by extension.
Is it possible to fix this?
I think this is more to do with using TIF
as an extension that using a period. The validation error is coming from https://github.com/MarcusBarnes/mik/blob/master/src/inputvalidators/CsvBooks.php#L140. Coincidentally I opened a PR (#496) two weeks ago that will let you indicate TIF
as a valid file extension.
Just to be sure, can you change the filenames back to using periods as separators but leave the (currently invalid until we merge #496) TIF
extension as is and see what happens?
Sorry, my logic is wrong. Change the extension to tif
and try using periods as separators.
OK, a couple of tests:
Separator _
, extension tif
= success
Separator _
, extension TIF
= fail
Separator .
, extension TIF
= fail
Separator .
, extension tif
= success
So it looks like the uppercase/lowercase extension is the culprit.
But interestingly, it behaves very unpredictably in my set -- because the first record is not .0001.tif
but instead .0000.tif
.
The presence of this file results in weird behaviours. Typically directory 0 just isn't generated, but it can also cause MIK to skip, say, directory 2, or more directories.
@bondjimbond when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?
when you say "can also", are you saying that it behaves differently across runs of MIK with the same input data?
Yes... I was running with a test directory of just four images (0000.tif through 0003.tif).
The first few runs, it produced directories 1, 2, and 3.
Later runs, just 1 and 3.
Another later run, just 3.
After removing 0000.tif, it consistently produced 1, 2, and 3.
Could you zip up all your data and config files and send them to me so I can try to replicate that?
Sure, here's my test directory and ini file: https://vault.sfu.ca/index.php/s/ChGaez7NLOygY3w
OK, got it, I'll give it a try this evening.
Can you send me your mappings file?
Ack, of course, sorry barkerville_mapping.txt
@bondjimbond Strangely, I can't replicate the behavior you are seeing. I ran MIK about 10 times and always got the same thing: page objects for pages 1-3 and an error indicating a problem with the 0000
file ([...]
added by me):
"message":"mkdir(): File exists" [...] "filename_segments":["1987","0019","0039","0000"],"page_number":""
The problem is coming from https://github.com/MarcusBarnes/mik/blob/master/src/writers/CsvBooks.php#L132-L134: since we trim all left padding 0s, we need something other than a 0000 as the page number. I'm not sure a fix to allow 0000
as a page number would be trivial.
Although a check to see if $page_number
is an empty string, and if it is, assign it a value of 0
to create a 0
directory, might be a simple fix. But do you want 0
to be the first page number instead of 1
?
Something like:
$page_number = ltrim(end($filename_segments), '0');
if (strlen($page_number) === 0) {
$page_number = '0';
}
$page_level_output_dir = $book_level_output_dir . DIRECTORY_SEPARATOR . $page_number;
mkdir($page_level_output_dir);
Just tried that, it worked:
/tmp/brandon_books/
└── 1987
├── 0
│ ├── MODS.xml
│ └── OBJ.tif
├── 1
│ ├── MODS.xml
│ └── OBJ.tif
├── 2
│ ├── MODS.xml
│ └── OBJ.tif
├── 3
│ ├── MODS.xml
│ └── OBJ.tif
└── MODS.xml
MODS.xml for page 0 is:
<titleInfo>
<title>This is a title, page 0</title>
</titleInfo>
</mods>
MODS.xml for page 1 is:
<titleInfo>
<title>This is a title, page 1</title>
</titleInfo>
</mods>
That's exactly what I need! :)
OK, I can open a PR for this if you want.
Please do!
I've made the same change to the CsvNewspapers writer and pushed up the issue-498 branch. I'll need to assemble some test data later but once I do that I'll open a PR.