Improve performance of FileSystemFontProvider.scanFonts()
On one machine with 792 fonts (~800MB), scanning takes around 5.5s; after these changes it is down to 0.7s (without checksumming) or 2.2s (with checksums).
In this PR, TTF parsers have an "only headers" mode where each table reads as little information as possible:
- in this mode, the whole file is not read into a `byte[]`; the parser uses `RandomAccessRead` directly, because most of the file is skipped - only the 5 tables needed for `FSFontInfo` (`name`, `head`, `OS/2`, `CFF`, `gcid`) are read
- table parsers finish as soon as they have all needed data
- checksumming is skipped because it is now faster to simply re-parse the file (gated with the `pdfbox.fontcache.skipchecksums` system property, for backward compatibility): checksumming 800MB takes 1.5s and parsing headers takes only 0.7s
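As an aside, this kind of gate can be read with `Boolean.getBoolean()`, which returns true only when the system property exists and equals "true". A minimal, self-contained sketch (the class name is illustrative, not actual PDFBox code):

```java
public class SkipChecksumsGate
{
    public static void main(String[] args)
    {
        // Property unset: Boolean.getBoolean() returns false, checksums stay on
        System.out.println(Boolean.getBoolean("pdfbox.fontcache.skipchecksums"));

        // Simulate passing -Dpdfbox.fontcache.skipchecksums=true on the command line
        System.setProperty("pdfbox.fontcache.skipchecksums", "true");
        System.out.println(Boolean.getBoolean("pdfbox.fontcache.skipchecksums"));
    }
}
```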
Additional fixes:
- fixed a memory leak: the `RandomAccessRead` passed to the `TrueTypeCollection` constructor was never freed
- ~~`NamingTable`: use a sorted list instead of a multilevel `HashMap`, delay-load Strings (for non-"only headers" mode)~~
- ~~`TTFSubsetter`: avoid bytes->string->bytes conversion~~
- ~~streamline I/O: replace `readByte()` with `read(array)`~~
- ~~consolidate `read(buf, offset, len)` loops into `readNBytes()` (allows underflow) and `readExact()` (throwing)~~
Breaking changes:
- if the `pdfbox.fontcache.skipchecksums` property is set, the `.pdfbox.cache` file has dashes instead of checksums
- the `.pdfbox.cache` file differs for corrupted fonts (like `LastResort.otf`, which throws `IOException: Invalid character code 0xD800`):
  - the original cache lists it as `*skipexception*|OTF||0|0|0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000`
  - after this change it's listed as `LastResort|OTF||||0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000` because we simply do not parse it extensively enough to recognize that it's corrupted
- ~~`NameRecord.getString()` is now package-private and lazy, renamed to `getStringLazy()`~~
- ~~new abstract method `TTFDataStream.createSubView()`~~ (used a default implementation instead)
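To illustrate the cache-line difference above, the pipe-separated entries can be compared directly. A sketch; the field layout is inferred only from the two example lines quoted here, not from the actual FileSystemFontProvider code:

```java
public class CacheLineDemo
{
    public static void main(String[] args)
    {
        String before = "*skipexception*|OTF||0|0|0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000";
        String after = "LastResort|OTF||||0|0|0||/System/Library/Fonts/LastResort.otf|6c14c91|1715065304000";

        // limit = -1 keeps trailing empty fields, so both lines split the same way
        String[] a = before.split("\\|", -1);
        String[] b = after.split("\\|", -1);

        System.out.println(a[0]); // first field: the *skipexception* marker
        System.out.println(b[0]); // after the change: the real font name survives
        System.out.println(a.length == b.length); // same field count either way
    }
}
```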
Only tested with the 3.0 branch; all tests pass, and the resulting file is the same (except for the checksums field, of course, and some `*skipexception*`s).
Should I break it into smaller commits?
Thanks for the contribution, looks promising at least at a first sight.
And yes, please split up the patch in smaller parts so that it is easier to check the changes. We won't add breaking changes to 3.x; those are limited to the trunk. But maybe we find a way to backport at least some of them without breaking the API.
Hi, any update on this? Looks like a worthwhile enhancement (as far as I can tell).
I'm sorry, I didn't have much free time, but refactoring is in progress. I am shrinking changes to only apply to LoadOnlyHeaders case. As for those changes that affect both LoadOnlyHeaders and non-LoadOnlyHeaders parsers - I will push them separately, for easier testing. When done, I will post performance numbers for this subset of changes.
Force-pushed the new implementation, based on 3.0 branch (rev 5c4a09a5c0574c1701151202ede2be1857610a36). Also I edited the 1st post to reflect changes
Performance numbers: the first number is a re-run (font files are cached by the OS), the second is a cold run.

Windows (296 fonts, Java 8 x32):
- 2.9s..4.0s - `3.0` branch
- 0.8s..1.9s - `performance-scanfonts` branch
- 0.3s..0.6s - `performance-scanfonts` + `"pdfbox.fontcache.skipchecksums=true"`

macOS (792 fonts, Java 22 arm64):
- 2.2s..3.6s - `3.0` branch
- 0.4s..0.5s - `performance-scanfonts` branch
- 0.16s..0.2s - `performance-scanfonts` + `"pdfbox.fontcache.skipchecksums=true"`
Reasoning behind using a separate class `LoadOnlyHeaders` instead of extracting information directly from `TrueTypeFont`:
- easier to debug - a single place to put breakpoints
- easier to refactor:
  - for example, a simple "Find All Usages" can reveal that a `setSomething()` is not used
  - the alternative is having a `boolean isLoadOnlyHeaders` in `TTFParser`, `CFFParser` and `TrueTypeFont`: it would be harder to collect usages of all 3
- in the future, parsing some tables can be skipped completely, only setting the respective fields of `LoadOnlyHeaders`
Could you please rename "LoadOnlyHeaders" to something that isn't a verb? E.g. "FontHeaders" or whatever.
Logging should be done like it's done in the code, not with "Logger.getLogger".
I'd also like to separate refactorings and bug fixes from the main PR, I'll do some small fixes. Please rebase your PR when you notice it.
Re "fixed a memory leak", this is tricky: your change makes sense for the `TrueTypeCollection(File file)` constructor only, not for the one with `InputStream`. That one (which isn't used by PDFBox itself) also reads the font into memory twice as it is now.
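The ownership rule being discussed can be sketched as: whichever constructor creates the resource should also close it, e.g. via try-with-resources. This is an illustrative stand-in, not the actual TrueTypeCollection code; all names here are hypothetical:

```java
import java.io.Closeable;
import java.io.IOException;

public class OwnershipDemo
{
    // Stand-in for RandomAccessRead: only records whether close() was called.
    static class FakeRead implements Closeable
    {
        boolean closed;
        @Override
        public void close() { closed = true; }
    }

    // Stand-in for the File-based path: the collection opened the resource
    // itself, so it is responsible for closing it when parsing finishes.
    static FakeRead openAndClose() throws IOException
    {
        FakeRead read = new FakeRead();
        try (FakeRead r = read)
        {
            // ... parse headers from r ...
        }
        return read;
    }

    public static void main(String[] args) throws IOException
    {
        System.out.println(openAndClose().closed); // closed on exit: no leak
    }
}
```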
I've done your refactorings of NamingTable. I've kept the same sequence so I hope you don't have too much work. Ideally you could just keep your own code and the diff to the existing would now be much smaller.
Why is readExact() needed under the hood? Assuming that your changes in this PR read a selection of the data that is read in the production code, the error handling on truncated / corrupted files can stay the same. Unless you think that the error handling is poor, then it should be changed everywhere. The reason I ask this is because your read() has a different behavior than other read() methods, and because I'd like to keep the code as small as possible.
Thank you, I will fix the issues today.
Re `readExact()`/`readUpTo()`: I will remove them for simplicity; they remained from my initial PR.
I initially saw that `RandomAccessRead.read()` was called in a loop and assumed that it behaves like `InputStream.read()` (i.e. it can return <N bytes and still have more coming), so it needs to be wrapped into `readUpTo()`.
But now I see that all implementations of `RandomAccessRead.read()` have loops of their own, so they already behave like `InputStream.readNBytes()` (they can only read less than N bytes in case of EOF)? But this is not specified in the Javadoc.
For future refactoring, here are the places where `read()` is called in a loop (either shrink them to 1 call or add a `readUpTo()` that guarantees `==N bytes || EOF`):
- `RandomAccessReadDataStream.cctor()`
- `COSParser.getStartxrefOffset()` (but it also throws `if (readBytes < 1)`)
- `PDFXrefStreamParser.readNextValue()`
- `PDCIDFontType2.getParser()`
- `PDTrueTypeFont.getParser()`
- `PDFontFactory.getFontHeader()`
- `CFFParser::parse`

No-loop calls (may be wrong if implementations are not required to behave like `InputStream.readNBytes()`):
- `BaseParser.checkForEndOfString()`
- `PdfStreamParser.hasNoFollowingBinData()`
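For reference, a `readUpTo()`-style wrapper that provides the `==N bytes || EOF` guarantee on top of `InputStream.read()` semantics could look like this. This is a sketch of the helper discussed above, not the PDFBox implementation (since Java 9, `InputStream.readNBytes()` does the same thing):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadUpTo
{
    // Reads until the requested range is full or EOF is hit; returns bytes read.
    // A short count is therefore only possible at end of stream.
    static int readUpTo(InputStream in, byte[] buf, int off, int len) throws IOException
    {
        int total = 0;
        while (total < len)
        {
            int n = in.read(buf, off + total, len - total);
            if (n < 0)
            {
                break; // EOF: the only case where the caller sees < len bytes
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException
    {
        byte[] data = { 1, 2, 3, 4, 5 };
        byte[] buf = new byte[8];
        int n = readUpTo(new ByteArrayInputStream(data), buf, 0, buf.length);
        System.out.println(n); // 5: stopped at EOF, never mid-stream
    }
}
```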
Thanks; The memory leak part has been fixed in https://issues.apache.org/jira/browse/PDFBOX-5845 . I've removed one of the constructors because it's no longer needed, please update your PR.
I applied the patch for the first time, and I do get differences in the `pdfbox.cache` file:
ARBONNIE becomes ARBON
ComicSansMS-BoldItalic becomes Com
GeorgiaPro-CondSemiboldItalic becomes Georgi
NotoNaskhArabicUI-Bold becomes NotoNaskhArabicUI-
KodchiangUPCBoldItalic becomes Kodchi
Here's the comicz file: comicz.zip
However it might also be my fault due to my own changes, so I'll investigate.
Yes, I have the same `Com` and will investigate. Thanks for noticing.
I compared caches only on macOS initially.
Found it: `RandomAccessReadUnbufferedDataStream.read()` in my initial code was using `readExact()` and thus returned `length`. After reverting to `read()`, I forgot to return the correct read amount.
Thanks, that works now.
Why the setException() / getException() thing? It feels weird to me because parse() might still throw an exception so why swallow and store it if it happens later? Same weird feeling that an exception for "font.getNaming()" is completely ignored, I don't understand the comment. (Sorry if you did answer this before, I did reread the conversation, I hope I didn't miss something)
Re exception: 2 calls should've been `setError(String)` to avoid the slow `fillInStackTrace()` in the exception constructor. The third call was removed because existing exceptions should indeed be forwarded.
Re "font.getNaming()": I don't remember; it looks weird, so I removed it.
Found a potential bug in `TrueTypeCollection.processAllFontHeaders()` and `processAllFonts()`: if some font in a TTC file is corrupted, an exception is thrown and all remaining fonts in the TTC are skipped. I didn't fix it because it's debatable (maybe the whole TTC is corrupted, and there's no need to waste time parsing it).
Also, I simplified `addTrueTypeCollection()` because all 3 methods it calls do not effectively throw.
Thanks; `IOUtils.closeQuietly(parser.parse(new RandomAccessReadBufferedFile(ttfFile)));` caught my attention because `parser.parse()` returns null in "header" mode, so the `closeQuietly()` isn't needed. And returning sometimes a font and sometimes null kind of contradicts the "do one thing" ideal. How about making a different method for the headers instead, which returns a `FontHeader` object, and refactoring the common code of the two methods into a third one? Same for `getFontAtIndex`: IMHO this should be split into `getFontAtIndex` and `getFontHeaderAtIndex`, with the common part refactored into a method used by both; this would be clearer.
While I'm at it, I think it's better to replace "calling a getter just for side effects"

```java
font.getNaming(); // calls NamingTable.readTable();
font.getHeader(); // calls HeaderTable.readTable();
((OpenTypeFont) font).getCFF(); // calls CFFTable.readTable();
```

with explicit calls

```java
font.readTableHeaders(*.TAG, outHeaders); // -> TTFTable.readHeaders(..., FontHeaders)
```

Then the `TTFParser.loadOnlyHeaders` field can be removed.
The changeset will become a bit larger (like a duplicated `TrueTypeFont.readTable()`), but the codepaths will be clearer.
Or did I over-do it with `CFFParser.parseFirstSubFontROS()`?
I'm wondering if the `onlyHeaders` boolean parameter in `NamingTable.read()` is needed. At first look, the only thing it does is decide whether `nameRecords` is filled. On second look, I think it will result in more computing at the end of the method, but not more file reading, won't it? (And we could move the 3 lines into `read(TrueTypeFont ttf, TTFDataStream data)`.)
When `onlyHeaders==true`, it discards useless entries, making `nameRecords` smaller, and thus `readString()` is called fewer times in the following loop, reducing both computing and file reading.
So it's slightly more computing in the 1st loop (`isUsefulForOnlyHeaders()` is just 5 comparisons), but much less computing in the second (seek > allocate+read > detect charset > new String) and in `fillLookupTable()`, which traverses `nameRecords`.
Of course, to please the branch predictor, we can extract the `if` from the loop, but it looks much more verbose, and it should be easy for the CPU to predict anyway:
```java
if (onlyHeaders)
{
    for (int i = 0; i < numberOfNameRecords; i++)
    {
        NameRecord nr = new NameRecord();
        nr.initData(ttf, data);
        if (isUsefulForOnlyHeaders(nr))
        {
            nameRecords.add(nr);
        }
    }
}
else
{
    for (int i = 0; i < numberOfNameRecords; i++)
    {
        NameRecord nr = new NameRecord();
        nr.initData(ttf, data);
        nameRecords.add(nr);
    }
}
```
Which 3 lines do you mean?
As for `lookupTable`, I am going to remove it in a future PR (the change was in my initial request).
By sorting `nameRecords`, we can use a simple binary search in `getName()`, which is fast enough and doesn't need additional memory, unlike the current 4-level HashMap cache. Especially considering that each name is searched only once.
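The sorted-list idea can be sketched with `Collections.binarySearch` over records kept in key order. The record shape below is simplified for illustration; the real `NameRecord` key involves name ID, platform, encoding and language:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortedLookupDemo
{
    // Simplified stand-in for a name record: a composite key plus its string.
    record Rec(int nameId, int platformId, String value) {}

    static final Comparator<Rec> ORDER =
            Comparator.comparingInt(Rec::nameId).thenComparingInt(Rec::platformId);

    // Binary search replaces the multilevel HashMap: O(log n), no extra memory.
    static String getName(List<Rec> sorted, int nameId, int platformId)
    {
        int i = Collections.binarySearch(sorted, new Rec(nameId, platformId, null), ORDER);
        return i >= 0 ? sorted.get(i).value() : null;
    }

    public static void main(String[] args)
    {
        List<Rec> recs = new ArrayList<>(List.of(
                new Rec(6, 3, "PostScriptName"),
                new Rec(1, 3, "FamilyName"),
                new Rec(4, 3, "FullName")));
        recs.sort(ORDER); // sort once after parsing the table

        System.out.println(getName(recs, 1, 3)); // found
        System.out.println(getName(recs, 9, 9)); // absent -> null
    }
}
```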
Please ignore all I wrote today for now, I overlooked the `readString()` part. Don't change anything.
Thank you for your work and your patience. I'm planning to commit all next week, as soon as time permits. Ideally I'll commit to the trunk and all branches. From what I see you didn't break the API.
However there's one more thing I have to investigate, I'm getting sporadic parsing errors with a specific file, when the cache is getting rebuilt. Most likely this isn't because of the changes but I'll check that first.
Update: not because of the changes, problem happens in 3.0.2 release too.
trunk and 3.0 have been committed. 2.0 is still todo. Builds will freeze; this is a problem with the OWASP plugin, search for "owasp" in the top pom.xml. See https://issues.apache.org/jira/browse/PDFBOX-5847
update: the OWASP update has been released, and the change has been committed to the pom.xml
I worked on 2.0 for a few hours and was only able to do 4 files in that time, and not the big ones. In 2.0 we read the fonts into memory using the `MemoryTTFDataStream` class. `RandomAccessRead` isn't available in fontbox. So unless we add duplicate code that inflates the jar file, our advantage wouldn't be as much as in 3.0 and higher; we'd only save some reading from memory. Even if we did have some extra I/O classes, the advantage wouldn't be as much as in 3.0 because in 2.0 we don't read so many GSUB tables.
Opinions?
Thank you. I will look at 2.0 on the weekend or maybe next week