pdfparser getDetails() returning single element arrays instead of strings

In the sample PDF for #391 (/samples/bugs/Issue391.pdf), getDetails() returns values that are single-element arrays instead of strings:

Expected:

array(6) {
  ["Title"]=>
  string(49) "Microsoft Word - GVTW70SPAH1R0_20101210_REV_A.doc"
  ["Author"]=>
  string(8) "aronzhao"
  ["Producer"]=>
  string(33) "Mac OS X 10.7.2 Quartz PDFContext"
  ["Creator"]=>
  string(26) "PScript5.dll Version 5.2.2"
  ["CreationDate"]=>
  string(25) "2011-12-28T21:12:01+00:00"
  ["ModDate"]=>
  string(25) "2011-12-28T21:12:01+00:00"
}

Actual:

array(6) {
  ["Title"]=>
  array(1) {
    [0]=>
    string(49) "Microsoft Word - GVTW70SPAH1R0_20101210_REV_A.doc"
  }
  ["Author"]=>
  array(1) {
    [0]=>
    string(8) "aronzhao"
  }
  ["Producer"]=>
  array(1) {
    [0]=>
    string(33) "Mac OS X 10.7.2 Quartz PDFContext"
  }
  ["Creator"]=>
  array(1) {
    [0]=>
    string(26) "PScript5.dll Version 5.2.2"
  }
  ["CreationDate"]=>
  array(1) {
    [0]=>
    string(25) "2011-12-28T21:12:01+00:00"
  }
  ["ModDate"]=>
  array(1) {
    [0]=>
    string(25) "2011-12-28T21:12:01+00:00"
  }
}

I knew XMP details values could be arrays, but I didn't expect that for the regular details. I'll have to look at the PDF reference again, but they are probably allowed and PdfParser is parsing them properly.

My PR #606 automatically converts any XMP property that's a single element array into the value of that single element. Should we do that in this case for the regular details?

Jul 21 '23 15:07 GreyWyvern

Please have a look first to confirm how it is supposed to look. I'd prefer one consistent format, so in this case keep the arrays and don't convert to a single value when the array contains only one element. That makes it easier for data consumers to check the data, because they can rely on an array structure rather than checking for string, array, and so on. What do you think?
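For illustration, the "always arrays" option can also be approximated on the consumer side. This is only a minimal sketch: normalizeDetailsToArrays() is a hypothetical helper, not part of PdfParser, and it assumes the details are a flat key/value array like the dumps above.

```php
<?php
// Hypothetical helper (not part of PdfParser): make every details value an array,
// so consuming code never has to branch on string vs. array.
function normalizeDetailsToArrays(array $details): array
{
    // array_map() with a single input array preserves the string keys.
    return array_map(
        static function ($value) {
            return is_array($value) ? array_values($value) : [$value];
        },
        $details
    );
}

// Mixed input: one plain string, one single-element array.
$mixed = [
    'Title'  => 'Microsoft Word - GVTW70SPAH1R0_20101210_REV_A.doc',
    'Author' => ['aronzhao'],
];
$normalized = normalizeDetailsToArrays($mixed);
```

With this in place, a consumer can always read the first value as $normalized['Title'][0], whichever form the parser produced.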

Jul 23 '23 15:07 k00ni

Well, I can only say that I have personally run about 100 PDFs from the PdfParser issues list through my test environment. All but this one have strings as values in the results of getDetails(). So on the one hand, you could argue that string values for these properties are somewhat expected. It is likely that many people using this library only check for a string value here and ignore the value (or get an error) when it's an array.

On the other hand, by leaving it as arrays there is some feeling of fidelity to the original document info even if it may not be what the user expects. If this is changed, I would also want to change the XMP output to match. The XMP output would get a lot more complex of course, but at the same time closer to the actual structure of the XML.

My personal choice is to reduce single element arrays to strings, and leave them as arrays if they have more than one element. This (probably) catches >99% of all cases; I've yet to see a PDF with two Title values, etc. I'm not sure how Adobe Acrobat would even handle that.
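The reduction described above could look roughly like the following. This is a sketch, not the code from PR #606: collapseSingleElementArrays() is a hypothetical name, and the 'Keywords' entry is made-up sample data to show the multi-element case.

```php
<?php
// Hypothetical helper (not the actual PdfParser implementation): replace each
// single-element array with its only element; leave everything else untouched.
function collapseSingleElementArrays(array $details): array
{
    foreach ($details as $key => $value) {
        if (is_array($value) && count($value) === 1) {
            // reset() returns the first (and here only) element of the array.
            $details[$key] = reset($value);
        }
    }
    return $details;
}

$details = [
    'Title'    => ['Microsoft Word - GVTW70SPAH1R0_20101210_REV_A.doc'],
    'Author'   => ['aronzhao'],
    'Keywords' => ['first', 'second'], // made-up multi-valued entry, stays an array
];
$collapsed = collapseSingleElementArrays($details);
```

The trade-off is the one raised earlier in the thread: consumers get strings in the common case, but still have to handle an array when a property has more than one value.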

Whatever is decided, it seems that regular getDetails() values can be arrays, so this should definitely be noted in the docs.

Jul 31 '23 14:07 GreyWyvern
