PdfPig
Document creation time is not extracted when available
I've identified several real cases where the creation date is stored in a document but is not extracted by the library. I have noticed the following causes:
- The date is stored not as a literal but as an indirect reference to another object. Possible fix (UglyToad.PdfPig.Parser.DocumentInformationFactory.cs, method Create):
    // Replace every indirect reference in the info dictionary with the token it points to
    foreach (KeyValuePair<string, IToken> pair in infoParsed.Data)
    {
        IToken value = pair.Value;
        if (!(value is IndirectReferenceToken reference))
        {
            continue;
        }

        NameToken key = NameToken.Create(pair.Key);
        infoParsed = infoParsed.Without(key).With(key, DereferenceEntry(reference, pdfTokenScanner));
    }

    private static IToken DereferenceEntry(IToken value, IPdfTokenScanner pdfTokenScanner)
    {
        return value is IndirectReferenceToken reference ? pdfTokenScanner.Get(reference.Data).Data : value;
    }
Unfortunately I see no way to iterate over the token collection in a less ugly way.
- The timestamp contains a space between date and time.
Possible fix (UglyToad.PdfPig.Util.DateFormatHelper, method TryParseDateTimeOffset):
    // Supporting formats like "YYYYMMDD HHmmSS"
    s = s.Replace(" ", string.Empty);
- The year in the timestamp occupies 5 digits rather than 4, e.g. 19101, which should be read as year 101 counted from 1900, i.e. 2001 (most probably a Y2K bug in the producing software). Presumably such corrupted timestamps are typical only of the first few years of the 21st century. Possible fix (UglyToad.PdfPig.Util.DateFormatHelper, method TryParseDateTimeOffset):
    // Gets the year, checking for a possible Y2K issue. An incorrect year has the following format:
    //
    // 19YYY
    //
    // where "YYY" is the number of years since 1900 and is greater than 99 (hence requiring an extra digit).
    // YYY would hardly be greater than, say, 105.
    bool GetYear(ref int pos, out int year)
    {
        // Read a standard 4-digit year first; it is returned whenever a Y2K-affected value is not identified
        if (!int.TryParse(s.Substring(pos, 4), out year))
        {
            // Invalid value
            pos += 4;
            return false;
        }

        // A regular timestamp has 14 digits (fractions of a second are not expected),
        // so a 5-digit year needs at least 15 characters
        if (!HasRemainingCharacters(pos, 15))
        {
            pos += 4;
            return true;
        }

        string centuryStr = s.Substring(pos, 2);
        int century;
        if (!int.TryParse(centuryStr, out century) || century != 19)
        {
            pos += 4;
            return true;
        }

        string centuryYearStr = s.Substring(pos + 2, 3);
        int centuryYear;
        if (!int.TryParse(centuryYearStr, out centuryYear) || centuryYear < 100 || centuryYear > 105)
        {
            pos += 4;
            return true;
        }

        // Y2K-affected value: consume 5 digits and convert, e.g. 19101 -> 1900 + 101 = 2001
        pos += 5;
        year = century * 100 + centuryYear;
        return true;
    }
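To show the last two fixes in isolation, here is a minimal standalone sketch that cleans up the raw date string before it reaches the parser. It is not PdfPig code: the class PdfDateNormalizer and its Normalize method are hypothetical names. It assumes a full timestamp with seconds, and the 100..105 range is the same guess as above. Unlike the snippet above, it counts the leading digits instead of the remaining characters, so a 14-digit value followed by a time zone is left alone.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of PdfPig: normalizes a raw PDF date string
// so that a regular "YYYYMMDDHHmmSS" parser can handle it.
public static class PdfDateNormalizer
{
    public static string Normalize(string s)
    {
        if (string.IsNullOrEmpty(s))
        {
            return s;
        }

        // Fix 1: drop spaces, so "YYYYMMDD HHmmSS" becomes "YYYYMMDDHHmmSS".
        s = s.Replace(" ", string.Empty);

        // Skip the optional "D:" prefix of PDF date strings.
        int start = s.StartsWith("D:", StringComparison.Ordinal) ? 2 : 0;

        // Count the leading ASCII digits of the timestamp.
        int digits = 0;
        while (start + digits < s.Length && s[start + digits] >= '0' && s[start + digits] <= '9')
        {
            digits++;
        }

        // Fix 2: only a 15-digit value starting with "19" can be a Y2K-affected "19YYY" year.
        // A 14-digit value such as "19101231120000" is a valid date in 1910 and is kept as is.
        if (digits != 15 || s[start] != '1' || s[start + 1] != '9')
        {
            return s;
        }

        int yearsSince1900 = int.Parse(s.Substring(start + 2, 3), CultureInfo.InvariantCulture);
        if (yearsSince1900 < 100 || yearsSince1900 > 105) // assumed plausible range
        {
            return s;
        }

        // Replace the 5-digit year with the real 4-digit one, e.g. 19101 -> 2001.
        int year = 1900 + yearsSince1900;
        return s.Substring(0, start) + year.ToString(CultureInfo.InvariantCulture) + s.Substring(start + 5);
    }
}
```

With this sketch, Normalize("D:191010315120000") gives "D:20010315120000" and Normalize("D:20050102 030405") gives "D:20050102030405", while "D:19101231120000" is returned unchanged.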
Not sure how to share sample documents.
@aagubanov thanks for raising this issue and the thorough explanation.
You should be able to share sample documents by dragging and dropping them into the comment you write.
Thank you, it works.
@aagubanov thanks for providing the documents. I've created a fix PR to handle indirect references in doc info factory.
I think the date format issue is out of scope for PdfPig, though.