Improve performance of FileSystemFontProvider.scanFonts()
On one machine with 792 fonts (~800MB), scanning takes around 5.5s; after these changes it is down to 0.7s (without checksumming) or 2.2s (with checksums).
In this PR, TTF parsers have an "only headers" mode where each table reads as little information as possible:
- in this mode, the whole file is not read into a `byte[]`; the parser uses `RandomAccessRead` directly, because most of the file is skipped - only the 5 tables needed for `FSFontInfo` (`name`, `head`, `OS/2`, `CFF`, `gcid`) are read
- table parsers finish as soon as they have all needed data
- checksumming is skipped because it is now faster to simply re-parse the file (gated with the `pdfbox.fontcache.skipchecksums` system property, for backward compatibility): checksumming 800MB takes 1.5s and parsing headers takes only 0.7s
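As an aside, this kind of gate can be read with `Boolean.getBoolean()`, which returns true only when the system property exists and equals "true". A minimal, self-contained sketch (the class name is illustrative, not actual PDFBox code):

```java
public class SkipChecksumsGate
{
    public static void main(String[] args)
    {
        // Property unset: Boolean.getBoolean() returns false, checksums stay on
        System.out.println(Boolean.getBoolean("pdfbox.fontcache.skipchecksums"));

        // Simulate passing -Dpdfbox.fontcache.skipchecksums=true on the command line
        System.setProperty("pdfbox.fontcache.skipchecksums", "true");
        System.out.println(Boolean.getBoolean("pdfbox.fontcache.skipchecksums"));
    }
}
```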
Additional fixes:
- fixed a memory leak: the `RandomAccessRead` passed to the `TrueTypeCollection` constructor was never freed
- ~~`NamingTable`: use a sorted list instead of a multilevel `HashMap`, delay-load Strings (for non-"only headers" mode)~~
- ~~`TTFSubsetter`: avoid bytes->string->bytes conversion~~
- ~~streamline I/O: replace `readByte()` with `read(array)`~~
- ~~consolidate `read(buf, offset, len)` loops into `readNBytes()` (allows underflow) and `readExact()` (throwing)~~
Breaking changes:
- if the `pdfbox.fontcache.skipchecksums` property is set, the `.pdfbox.cache` file has dashes instead of checksums
- the `.pdfbox.cache` file differs for corrupted fonts (like `LastResort.otf`, which throws `IOException: Invalid character code 0xD800`):
  - the original cache lists it as `*skipexception*|OTF||0|0|0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000`
  - after this change it's listed as `LastResort|OTF||||0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000` because we simply do not parse it extensively enough to recognize that it's corrupted
- ~~`NameRecord.getString()` is now package-private and lazy, renamed to `getStringLazy()`~~
- ~~new abstract method `TTFDataStream.createSubView()`~~ (used a default implementation instead)
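To illustrate the cache-line difference above, the pipe-separated entries can be compared directly. A sketch; the field layout is inferred only from the two example lines quoted here, not from the actual FileSystemFontProvider code:

```java
public class CacheLineDemo
{
    public static void main(String[] args)
    {
        String before = "*skipexception*|OTF||0|0|0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000";
        String after = "LastResort|OTF||||0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000";

        // limit = -1 keeps trailing empty fields, so both lines split the same way
        String[] a = before.split("\\|", -1);
        String[] b = after.split("\\|", -1);

        System.out.println(a[0]); // first field: the *skipexception* marker
        System.out.println(b[0]); // after the change: the real font name survives
        System.out.println(a.length == b.length); // same field count either way
    }
}
```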
Only tested with the 3.0 branch; all tests pass, and the resulting file is the same (except for the checksums field, of course, and some `*skipexception*`s).
Should I break it into smaller commits?
Thanks for the contribution, looks promising at least at a first sight.
And yes, please split up the patch in smaller parts so that it is easier to check the changes. We won't add breaking changes to 3.x; those are limited to the trunk. But maybe we find a way to backport at least some of them without breaking the API.
Hi, any update on this? Looks like a worthwhile enhancement (as far as I can tell).
I'm sorry, I didn't have much free time, but refactoring is in progress. I am shrinking changes to only apply to LoadOnlyHeaders case. As for those changes that affect both LoadOnlyHeaders and non-LoadOnlyHeaders parsers - I will push them separately, for easier testing. When done, I will post performance numbers for this subset of changes.
Force-pushed the new implementation, based on 3.0 branch (rev 5c4a09a5c0574c1701151202ede2be1857610a36). Also I edited the 1st post to reflect changes
Performance numbers: the first number is a re-run (font files are cached by the OS), the second is a cold run.

Windows (296 fonts, Java 8 x32):
- 2.9s..4.0s - `3.0` branch
- 0.8s..1.9s - `performance-scanfonts` branch
- 0.3s..0.6s - `performance-scanfonts` + `"pdfbox.fontcache.skipchecksums=true"`

macOS (792 fonts, Java 22 arm64):
- 2.2s..3.6s - `3.0` branch
- 0.4s..0.5s - `performance-scanfonts` branch
- 0.16s..0.2s - `performance-scanfonts` + `"pdfbox.fontcache.skipchecksums=true"`
Reasoning behind using a separate class `LoadOnlyHeaders` instead of extracting information directly from `TrueTypeFont`:
- easier to debug - a single place to put breakpoints
- easier to refactor:
  - for example, a simple "Find All Usages" can reveal that a `setSomething()` is not used
  - the alternative is having a `boolean isLoadOnlyHeaders` in `TTFParser`, `CFFParser` and `TrueTypeFont`: it would be harder to collect usages of all 3
- in the future, parsing some tables can be skipped completely, only setting the respective fields of `LoadOnlyHeaders`
Could you please rename "LoadOnlyHeaders" to something that isn't a verb? E.g. "FontHeaders" or whatever.
Logging should be done like it's done in the code, not with "Logger.getLogger".
I'd also like to separate refactorings and bug fixes from the main PR, I'll do some small fixes. Please rebase your PR when you notice it.
Re "fixed a memory leak", this is tricky: your change makes sense for the `TrueTypeCollection(File file)` constructor only, not for the one with `InputStream`. That one (which isn't used by PDFBox itself) also reads the font into memory twice as it is now.
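The ownership rule being discussed can be sketched as: whichever constructor creates the resource should also close it, e.g. via try-with-resources. This is an illustrative stand-in, not the actual TrueTypeCollection code; all names here are hypothetical:

```java
import java.io.Closeable;
import java.io.IOException;

public class OwnershipDemo
{
    // Stand-in for RandomAccessRead: only records whether close() was called.
    static class FakeRead implements Closeable
    {
        boolean closed;
        @Override
        public void close() { closed = true; }
    }

    // Stand-in for the File-based path: the collection opened the resource
    // itself, so it is responsible for closing it when parsing finishes.
    static FakeRead openAndClose() throws IOException
    {
        FakeRead read = new FakeRead();
        try (FakeRead r = read)
        {
            // ... parse headers from r ...
        }
        return read;
    }

    public static void main(String[] args) throws IOException
    {
        System.out.println(openAndClose().closed); // closed on exit: no leak
    }
}
```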
I've done your refactorings of NamingTable. I've kept the same sequence so I hope you don't have too much work. Ideally you could just keep your own code and the diff to the existing would now be much smaller.
Why is readExact() needed under the hood? Assuming that your changes in this PR read a selection of the data that is read in the production code, the error handling on truncated / corrupted files can stay the same. Unless you think that the error handling is poor, then it should be changed everywhere. The reason I ask this is because your read() has a different behavior than other read() methods, and because I'd like to keep the code as small as possible.
Thank you, I will fix the issues today.
Re `readExact()`/`readUpTo()`: I will remove them for simplicity; they remained from my initial PR.
I initially saw that `RandomAccessRead.read()` was called in a loop and assumed that it behaves like `InputStream.read()` (i.e. it can return <N bytes and still have more coming), so it needs to be wrapped into `readUpTo()`.
But now I see that all implementations of `RandomAccessRead.read()` have loops of their own, so they already behave like `InputStream.readNBytes()` (they can only read less than N bytes in case of EOF)? But this is not specified in the Javadoc.
For future refactoring, here are the places where `read()` is called in a loop (either shrink them to 1 call or add a `readUpTo()` that guarantees `==N bytes || EOF`):
- `RandomAccessReadDataStream.cctor()`
- `COSParser.getStartxrefOffset()` (but it also throws `if (readBytes < 1)`)
- `PDFXrefStreamParser.readNextValue()`
- `PDCIDFontType2.getParser()`
- `PDTrueTypeFont.getParser()`
- `PDFontFactory.getFontHeader()`
- `CFFParser::parse`

No-loop calls (may be wrong if implementations are not required to behave like `InputStream.readNBytes()`):
- `BaseParser.checkForEndOfString()`
- `PdfStreamParser.hasNoFollowingBinData()`
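For reference, a `readUpTo()`-style wrapper that provides the `==N bytes || EOF` guarantee on top of `InputStream.read()` semantics could look like this. This is a sketch of the helper discussed above, not the PDFBox implementation (since Java 9, `InputStream.readNBytes()` does the same thing):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadUpTo
{
    // Reads until the requested range is full or EOF is hit; returns bytes read.
    // A short count is therefore only possible at end of stream.
    static int readUpTo(InputStream in, byte[] buf, int off, int len) throws IOException
    {
        int total = 0;
        while (total < len)
        {
            int n = in.read(buf, off + total, len - total);
            if (n < 0)
            {
                break; // EOF: the only case where the caller sees < len bytes
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException
    {
        byte[] data = { 1, 2, 3, 4, 5 };
        byte[] buf = new byte[8];
        int n = readUpTo(new ByteArrayInputStream(data), buf, 0, buf.length);
        System.out.println(n); // 5: stopped at EOF, never mid-stream
    }
}
```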
Thanks; The memory leak part has been fixed in https://issues.apache.org/jira/browse/PDFBOX-5845 . I've removed one of the constructors because it's no longer needed, please update your PR.
I applied the patch for the first time, and I do get differences in the `pdfbox.cache` file:
ARBONNIE becomes ARBON
ComicSansMS-BoldItalic becomes Com
GeorgiaPro-CondSemiboldItalic becomes Georgi
NotoNaskhArabicUI-Bold becomes NotoNaskhArabicUI-
KodchiangUPCBoldItalic becomes Kodchi
Here's the comicz file: comicz.zip
However it might also be my fault due to my own changes, so I'll investigate.
Yes, I have the same `Com` and will investigate. Thanks for noticing.
I compared caches only on macOS initially.
Found it: `RandomAccessReadUnbufferedDataStream.read()` in my initial code was using `readExact()` and thus returned `length`. After reverting to `read()`, I forgot to return the correct read amount.
Thanks, that works now.
Why the setException() / getException() thing? It feels weird to me because parse() might still throw an exception so why swallow and store it if it happens later? Same weird feeling that an exception for "font.getNaming()" is completely ignored, I don't understand the comment. (Sorry if you did answer this before, I did reread the conversation, I hope I didn't miss something)
Re exception: 2 calls should've been `setError(String)` to avoid the slow `fillInStackTrace()` in the exception constructor. The third call was removed because existing exceptions should indeed be forwarded.
Re "font.getNaming()": I don't remember; it looks weird, so I removed it.
Found a potential bug in `TrueTypeCollection.processAllFontHeaders()` and `processAllFonts()`: if some font in a TTC file is corrupted, an exception is thrown and all remaining fonts in the TTC are skipped. I didn't fix it because it's debatable (maybe the whole TTC is corrupted, and there's no need to waste time parsing it).
Also, I simplified `addTrueTypeCollection()` because all 3 methods it calls do not effectively throw.
Thanks; `IOUtils.closeQuietly(parser.parse(new RandomAccessReadBufferedFile(ttfFile)));` caught my attention because `parser.parse()` returns null in "header" mode, so the `closeQuietly()` isn't needed. And returning sometimes a font and sometimes null kind of contradicts the "do one thing" ideal. How about making a different method for the headers instead, which returns a `FontHeader` object, and refactoring the common code of the two methods into a third one? Same for `getFontAtIndex`: IMHO this should be split into `getFontAtIndex` and `getFontHeaderAtIndex`, with the common part refactored into a method used by both; this would be clearer.
While I'm at it, I think it's better to replace "calling a getter just for side effects"

```java
font.getNaming(); // calls NamingTable.readTable();
font.getHeader(); // calls HeaderTable.readTable();
((OpenTypeFont) font).getCFF(); // calls CFFTable.readTable();
```

with explicit calls

```java
font.readTableHeaders(*.TAG, outHeaders); // -> TTFTable.readHeaders(..., FontHeaders)
```

Then the `TTFParser.loadOnlyHeaders` field can be removed.
The changeset will become a bit larger (like a duplicated `TrueTypeFont.readTable()`), but the codepaths will be clearer.
Or did I over-do it with `CFFParser.parseFirstSubFontROS()`?
I'm wondering if the `onlyHeaders` boolean parameter in `NamingTable.read()` is needed. At first look, the only thing it does is decide whether `nameRecords` is filled. On second look, I think it will result in more computing at the end of the method, but not more file reading, won't it? (And we could move the 3 lines into `read(TrueTypeFont ttf, TTFDataStream data)`.)
When `onlyHeaders==true`, it discards useless entries, making `nameRecords` smaller, and thus `readString()` is called fewer times in the following loop, reducing both computing and file reading.
So it's slightly more computing in the 1st loop (`isUsefulForOnlyHeaders()` is just 5 comparisons), but much less computing in the second (seek > allocate+read > detect charset > new String) and in `fillLookupTable()`, which traverses `nameRecords`.
Of course, to please the branch predictor, we can extract the `if` from the loop, but it looks much more verbose, and it should be easy for the CPU to predict anyway:
```java
if (onlyHeaders)
{
    for (int i = 0; i < numberOfNameRecords; i++)
    {
        NameRecord nr = new NameRecord();
        nr.initData(ttf, data);
        if (isUsefulForOnlyHeaders(nr))
        {
            nameRecords.add(nr);
        }
    }
}
else
{
    for (int i = 0; i < numberOfNameRecords; i++)
    {
        NameRecord nr = new NameRecord();
        nr.initData(ttf, data);
        nameRecords.add(nr);
    }
}
```
Which 3 lines do you mean?
As for `lookupTable`, I am going to remove it in a future PR (the change was in my initial request).
By sorting `nameRecords`, we can use a simple binary search in `getName()`, which is fast enough and doesn't need additional memory, unlike the current 4-level HashMap cache. Especially considering that each name is searched only once.
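The sorted-list idea can be sketched with `Collections.binarySearch` over records kept in key order. The record shape below is simplified for illustration; the real `NameRecord` key involves name ID, platform, encoding and language:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortedLookupDemo
{
    // Simplified stand-in for a name record: a composite key plus its string.
    record Rec(int nameId, int platformId, String value) {}

    static final Comparator<Rec> ORDER =
            Comparator.comparingInt(Rec::nameId).thenComparingInt(Rec::platformId);

    // Binary search replaces the multilevel HashMap: O(log n), no extra memory.
    static String getName(List<Rec> sorted, int nameId, int platformId)
    {
        int i = Collections.binarySearch(sorted, new Rec(nameId, platformId, null), ORDER);
        return i >= 0 ? sorted.get(i).value() : null;
    }

    public static void main(String[] args)
    {
        List<Rec> recs = new ArrayList<>(List.of(
                new Rec(6, 3, "PostScriptName"),
                new Rec(1, 3, "FamilyName"),
                new Rec(4, 3, "FullName")));
        recs.sort(ORDER); // sort once after parsing the table

        System.out.println(getName(recs, 1, 3)); // found
        System.out.println(getName(recs, 9, 9)); // absent -> null
    }
}
```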
Please ignore all I wrote today for now, I overlooked the `readString()` part. Don't change anything.
Thank you for your work and your patience. I'm planning to commit all next week, as soon as time permits. Ideally I'll commit to the trunk and all branches. From what I see you didn't break the API.
However there's one more thing I have to investigate, I'm getting sporadic parsing errors with a specific file, when the cache is getting rebuilt. Most likely this isn't because of the changes but I'll check that first.
Update: not because of the changes, problem happens in 3.0.2 release too.
trunk and 3.0 have been committed. 2.0 is still todo. Builds will freeze; this is a problem with the OWASP plugin, search for "owasp" in the top pom.xml. See https://issues.apache.org/jira/browse/PDFBOX-5847
update: the OWASP update has been released, and the change has been committed to the pom.xml
I worked on 2.0 for a few hours and was only able to do 4 files in that time, and not the big ones. In 2.0 we read the fonts into memory using the `MemoryTTFDataStream` class. `RandomAccessRead` isn't available in fontbox. So unless we add duplicate code that inflates the jar file, our advantage wouldn't be as much as in 3.0 and higher; we'd only save some reading from memory. Even if we did have some extra I/O classes, the advantage wouldn't be as much as in 3.0 because in 2.0 we don't read so many GSUB tables.
Opinions?
Thank you. I will look at 2.0 on the weekend or maybe next week