Bad CMap can cause extremely slow load

Open ccreutzi opened this issue 1 year ago • 1 comments

Loading longCMap.pdf causes the loop in PdfCMapEncoding.cpp line 347 to run for 230891731 steps. I do not know how long that takes in total; I aborted the run after 20 minutes on my machine. That could enable denial-of-service attacks.

It seems like switching from using CodeUnitMap = std::map<PdfCharCode, std::vector<codepoint>> to using CodeUnitMap = std::unordered_map<PdfCharCode, std::vector<codepoint>> in PdfCharCodeMap might help improve the speed a little bit, but it's still not great.
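As a rough illustration of the container swap described above, here is a minimal sketch. `CharCode`, `CharCodeHash` and `expandRange` are hypothetical stand-ins written for this example, not PoDoFo's actual types; the point is only that std::unordered_map needs an equality operator and a hash for the key type, where std::map needs an ordering.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

using codepoint = char32_t;

// Hypothetical stand-in for PdfCharCode: a code value plus its byte width.
struct CharCode {
    std::uint32_t code;
    std::uint8_t codeSpaceSize;

    bool operator==(const CharCode& other) const {
        return code == other.code && codeSpaceSize == other.codeSpaceSize;
    }
};

// Hash functor required by std::unordered_map for a user-defined key.
struct CharCodeHash {
    std::size_t operator()(const CharCode& c) const noexcept {
        // Pack both fields into one 64-bit value and hash that.
        std::uint64_t packed =
            (static_cast<std::uint64_t>(c.codeSpaceSize) << 32) | c.code;
        return std::hash<std::uint64_t>{}(packed);
    }
};

using CodeUnitMap =
    std::unordered_map<CharCode, std::vector<codepoint>, CharCodeHash>;

// Expands the range [first, first + count) into one map entry per code.
CodeUnitMap expandRange(std::uint32_t first, std::uint32_t count, codepoint dst) {
    CodeUnitMap map;
    map.reserve(count); // avoid repeated rehashing while filling
    for (std::uint32_t i = 0; i < count; ++i) {
        map.emplace(CharCode{ first + i, 2 },
                    std::vector<codepoint>{ static_cast<codepoint>(dst + i) });
    }
    return map;
}
```

Note that this only changes the per-insert cost. The number of entries created for a huge range stays the same, which fits the observation that the result is still not great.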

Alternatively, we could exploit what we know about the input order: within a range, the second, third, … entries always sort right behind the previous entry in the std::map, so we could pass a placement hint iterator to map.insert. That would require relatively large changes to the code architecture.
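A minimal sketch of the hinted insertion idea, assuming a simplified map keyed by a plain integer instead of PdfCharCode. `appendRange` is a hypothetical helper for illustration, not an existing PoDoFo function. std::map::emplace_hint is amortized constant time when the new element lands just before the hint, so each entry after the first in a range avoids the usual logarithmic search.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

using codepoint = char32_t;
// Hypothetical simplification: integer key instead of PdfCharCode.
using CodeUnitMap = std::map<std::uint32_t, std::vector<codepoint>>;

// Inserts one entry per code in [first, first + count), in ascending order.
void appendRange(CodeUnitMap& map, std::uint32_t first, std::uint32_t count,
                 codepoint dst) {
    // One logarithmic search to find where the range starts.
    auto hint = map.lower_bound(first);
    for (std::uint32_t i = 0; i < count; ++i) {
        // If the key already exists, emplace_hint returns the existing
        // element and leaves it unchanged.
        auto it = map.emplace_hint(
            hint, first + i,
            std::vector<codepoint>{ static_cast<codepoint>(dst + i) });
        // The next key is larger, so it belongs right after this element.
        hint = std::next(it);
    }
}
```

This still creates one node per code, so it lowers the cost per step without reducing the 230891731 steps themselves.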

Or there may be a much better way of dealing with such inputs, such as deferring the creation of the cmap entries until they are requested. I do not know whether that sort of thing is already done anywhere in PoDoFo, or whether it is worth doing for this type of input.
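One way the deferred approach could look is to keep each range as a single record and compute the mapping on lookup. This is a sketch under assumptions: `CodeRange` and `LazyRangeMap` are hypothetical names, ranges are assumed not to overlap, and each code is assumed to map to a single code point.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

using codepoint = char32_t;

// Hypothetical compact record: codes first..last map to dst, dst + 1, ...
struct CodeRange {
    std::uint32_t first;
    std::uint32_t last;
    codepoint dst;
};

class LazyRangeMap {
public:
    // Constant-time regardless of how many codes the range covers.
    void addRange(std::uint32_t first, std::uint32_t last, codepoint dst) {
        m_ranges.push_back({ first, last, dst });
        m_sorted = false;
    }

    // Resolves a single code on demand; assumes non-overlapping ranges.
    std::optional<codepoint> lookup(std::uint32_t code) {
        if (!m_sorted) {
            std::sort(m_ranges.begin(), m_ranges.end(),
                      [](const CodeRange& a, const CodeRange& b) {
                          return a.first < b.first;
                      });
            m_sorted = true;
        }
        // Find the first range starting after `code`; the candidate
        // range, if any, is the one just before it.
        auto it = std::upper_bound(
            m_ranges.begin(), m_ranges.end(), code,
            [](std::uint32_t c, const CodeRange& r) { return c < r.first; });
        if (it == m_ranges.begin())
            return std::nullopt;
        --it;
        if (code > it->last)
            return std::nullopt;
        return static_cast<codepoint>(it->dst + (code - it->first));
    }

private:
    std::vector<CodeRange> m_ranges;
    bool m_sorted = true;
};
```

With this layout a range spanning hundreds of millions of codes costs one vector element, and each lookup is a binary search over the number of ranges rather than the number of codes.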

Clearly, such cmap ranges could appear in legal PDF files, but I do not expect that in practice.

ccreutzi, Jul 05 '24 07:07

Provided we can sort the ranges as a first step, insertion with hints may be an idea worth evaluating, and we could have a private API for that. The public API of PdfCharCodeMap can be kept the same, as it's used elsewhere where the ordering is not (or is less) predictable. But first I would also investigate what other libraries do in this case, with pdfium and pdf.js being the more accessible ones. We could also throw for maps that are too big, as these won't be able to map that many CIDs to different glyphs anyway. The limit for glyphs in a font is always 2^16, even for modern OpenType fonts, and even if different CIDs may map to the same glyph, I would not allow the map to be too big anyway. I'm too busy to do this scouting, as I'm working very hard to finish the API review for 1.0. If you can invest some of your time in this, that would be very welcome, and it could be a post-1.0 fix.
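The "throw for maps that are too big" option could be sketched as a budget check done before a range is expanded. `MappingBudget` and `MaxMappings` are hypothetical names, and the cap of 2^16 is an assumption taken from the glyph limit mentioned above; since several CIDs may map to the same glyph, a real limit might need to be larger.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical cap, chosen here to match the 2^16 glyph limit.
constexpr std::size_t MaxMappings = std::size_t{ 1 } << 16;

class MappingBudget {
public:
    // Call before expanding the range [first, last].
    // Throws instead of letting a huge range be materialized.
    void reserve(std::uint32_t first, std::uint32_t last) {
        if (last < first)
            throw std::invalid_argument("inverted CMap range");
        // 64-bit arithmetic so that a full 32-bit range cannot overflow.
        std::uint64_t count = std::uint64_t{ last } - first + 1;
        if (count > MaxMappings - m_used)
            throw std::length_error("CMap defines too many mappings");
        m_used += static_cast<std::size_t>(count);
    }

    std::size_t used() const { return m_used; }

private:
    std::size_t m_used = 0; // never exceeds MaxMappings
};
```

Because the check happens before any entry is created, a range like the one in longCMap.pdf would be rejected immediately rather than after a long loop.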

ceztko, Jul 17 '24 21:07