Extracting text from a PDF file produces garbled output

Open tayei1997 opened this issue 1 year ago • 8 comments

Hello, I am using the PDF text extraction feature of the podofo library and ran into a garbled-output problem. When I extract text from sample1.pdf, the console prints:

Updating version from 1.7 to 1.7
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F2
WARNING: Unable to provide a space size, setting default font size
WARNING: Unable to find font object F1
......

Attachments: sample1.pdf, "podofo unable find object F1"

I checked sample1.pdf and it doesn't seem to have a ToUnicode map. Could this be the cause? [Image: 1111] The image below shows the extracted garbled text. [Image: WeCom screenshot 17103151107008]

I noticed that the extracted garbled text closely resembles the operands that precede the Tj operator in the PDF content stream. When podofo cannot find the font, does it output the PDF text string as is?

tayei1997 avatar Mar 13 '24 07:03 tayei1997

Hello, there may be multiple problems, but I think the most relevant one is the missing support for predefined CMap encodings, for which I just added an entry in the TODO (it was a known missing feature). Embedding all the known predefined CMaps, which I believe should be in this repository, is quite a big task. My main idea for this task is to parse all the CMaps into PdfCharCodeMap (the code that does the CMap parsing is here, but it may need refactoring so it's callable from elsewhere) and define some kind of binary serialization so they can be efficiently embedded in PoDoFo. Then, implement the predefined CMap resolution algorithm as described in the specification, and I believe this problem should be addressed. I would be very glad to see some contributions on this matter, as I definitely don't have time to put into the task, I'm sorry.

ceztko avatar Mar 16 '24 11:03 ceztko

Thanks for your answer.

So the problem is that podofo can extract the binary-encoded data for the text shown in the image below but, lacking the corresponding CMap, it cannot decode the text correctly. [Image: WeCom screenshot 17107287983628] If I want to decode it myself, I first need to get the operand data that precedes the Tj/TJ operators. Can I use podofo to get that data?

tayei1997 avatar Mar 18 '24 02:03 tayei1997

You can have a look at the use of PdfContentStreamReader here. But this project would benefit if you try to implement the system I suggested and do it within PoDoFo source (at least a prototype of it in a fork). I recently received some very good contributions from a couple of Chinese users that had issues trying to draw text: I enjoyed the level of competence and their PRs have been already merged.
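For the manual decoding route asked about above, here is a minimal sketch of what could be done with a text operand once it has been read from the content stream. It assumes the operand is a PDF hex string (the text between `<` and `>`) and that the font uses 2-byte character codes, as the UCS2 predefined CMaps discussed in this thread do. `DecodeHexString` and `ToTwoByteCodes` are hypothetical helpers written for illustration, not PoDoFo APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a PoDoFo API): value of one hex digit, or -1.
static int HexValue(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// Decode the body of a PDF hex string (without the enclosing '<' and '>')
// into raw bytes. Characters that are not hex digits, such as white space,
// are skipped. A trailing single digit is treated as if followed by '0'.
std::vector<uint8_t> DecodeHexString(const std::string& body)
{
    std::vector<uint8_t> bytes;
    int high = -1; // pending high nibble, or -1 if none
    for (char ch : body)
    {
        int value = HexValue(ch);
        if (value < 0)
            continue;
        if (high < 0)
        {
            high = value;
        }
        else
        {
            bytes.push_back(static_cast<uint8_t>(high * 16 + value));
            high = -1;
        }
    }
    if (high >= 0)
        bytes.push_back(static_cast<uint8_t>(high * 16));
    return bytes;
}

// Group raw bytes into big-endian 2-byte character codes.
// Assumption: every code in the string is exactly 2 bytes long.
std::vector<uint16_t> ToTwoByteCodes(const std::vector<uint8_t>& bytes)
{
    assert(bytes.size() % 2 == 0 && "expected a whole number of 2-byte codes");
    std::vector<uint16_t> codes;
    for (size_t i = 0; i + 1 < bytes.size(); i += 2)
        codes.push_back(static_cast<uint16_t>((bytes[i] << 8) | bytes[i + 1]));
    return codes;
}
```

For example, `ToTwoByteCodes(DecodeHexString("4F60 597D"))` yields the two codes `0x4F60` and `0x597D`. These codes still have to be mapped through the font's CMap before they become readable text.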

ceztko avatar Mar 18 '24 08:03 ceztko

I am willing to take on the development, so I would like to ask a few questions:

  1. If all the CMap mappings are embedded in podofo, will that increase podofo's memory usage?
  2. I lack the CMap-related development experience you mentioned, so it is difficult for me to estimate the time needed. How many man-days do you think this work will take?

tayei1997 avatar Mar 19 '24 01:03 tayei1997

Hello. Sorry for the delay in answering, it took me some more time to do further analysis. First, let me confirm that the issue here really is the missing embedding of the predefined CMap encodings. I'll try to answer your questions below.

  1. if all the CMap mappings are embedded in podofo, will it cause the memory usage of podofo to become higher?

Yes, but there are options to reduce the memory consumption by embedding pre-parsed maps. See below.

  2. I lack the CMap related development experience you mentioned, so it is difficult to estimate the time needed for this work, how many man-days do you think it will take to complete this work?

It's hard, but let's try to create some tasks and (possibly over-)estimate them:

  1. [4 Hours] Refactor the CMap parsing code so it can be used to build a tool that bulk-parses many CMaps;

  2. [4 Hours] Make PdfCharCodeMap initializable from a CodeUnitMap. This may remove the need to define a binary serialization of the map, as I was suggesting before. Basically, you can add a PdfCharCodeMap constructor like the following:

PdfCharCodeMap(CodeUnitMap&& codeUnitMap);

Which you can use to define many singletons like the following:

static const PdfCharCodeMap& GetInstance_UniGB_UCS2_H()
{
    static PdfCharCodeMap UniGB_UCS2_H(CodeUnitMap({
        { PdfCharCode(32, 2), { 1 } },
        { PdfCharCode(33, 2), { 2 } },
        { PdfCharCode(34, 2), { 3 } }
        // ..
        }));

    return UniGB_UCS2_H;
}
  3. [8 Hours] Make a tool that parses the CMaps and generates the singletons above in many .cpp files.
  4. [8 Hours] Create a script to run the tool above on the existing CMaps from the cmap-resources and mapping-resources-pdf repositories (both should be needed, in 2 steps).
  5. [32 Hours] Implement the algorithm described in "9.10.2 Mapping character codes to Unicode values" below:

If the font is a composite font that uses one of the predefined CMaps listed in "Table 116 - Predefined CJK CMap names" (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection:

  a. Map the character code to a character identifier (CID) according to the font's CMap.
  b. Obtain the registry and ordering of the character collection used by the font's CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
  c. Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe-Japan1-UCS2).
  d. Obtain the CMap with the name constructed in step (c) (available from a variety of online sources, e.g. https://github.com/adobe-type-tools/mapping-resources-pdf).
  e. Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the PDF processor.

Translated into the PoDoFo architecture, I believe one PdfEncoding instance has to be constructed from the embedded maps once the /Encoding entry is recognized as one of the predefined names. I believe the code-to-CID CMap that must be used in step a. is to be found in cmap-resources, while the "toUnicode" CMap needed in step d. is to be found in mapping-resources-pdf. You then construct an instance like PdfEncoding(cidMap, toUnicode) (the name detection and instance construction should probably be inserted at this location in the source) and text extraction should start to work.
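A minimal sketch of the two-stage lookup in steps a. to e., assuming both CMaps have already been loaded into plain maps. None of the names below are PoDoFo types: `CodeToCidMap`, `CidToUnicodeMap`, `CidSystemInfo`, `MakeUcs2CMapName` and `TryMapCodeToUnicode` are hypothetical stand-ins used only to show how the steps chain together.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for the two maps (not PoDoFo types).
using CodeToCidMap = std::map<uint32_t, uint32_t>;    // character code -> CID (step a)
using CidToUnicodeMap = std::map<uint32_t, char32_t>; // CID -> Unicode (step e)

// Registry and ordering as read from the CIDSystemInfo dictionary (step b).
struct CidSystemInfo
{
    std::string Registry;
    std::string Ordering;
};

// Step c: build the name of the second CMap, e.g. "Adobe-Japan1-UCS2".
std::string MakeUcs2CMapName(const CidSystemInfo& info)
{
    return info.Registry + "-" + info.Ordering + "-UCS2";
}

// Steps a and e: code -> CID -> Unicode. Returns false if either
// lookup fails, in which case `unicode` is left untouched.
bool TryMapCodeToUnicode(uint32_t code,
    const CodeToCidMap& codeToCid,
    const CidToUnicodeMap& cidToUnicode,
    char32_t& unicode)
{
    auto cidIt = codeToCid.find(code);
    if (cidIt == codeToCid.end())
        return false;

    auto unicodeIt = cidToUnicode.find(cidIt->second);
    if (unicodeIt == cidToUnicode.end())
        return false;

    unicode = unicodeIt->second;
    return true;
}
```

Step d. (obtaining the CMap with the name built in step c.) is left out, since it depends on how the predefined CMaps end up being embedded.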

Summarizing, I believe 7-8 man-days is a decent estimate of the work needed to accomplish the task. Following the approach above would make me more willing to fast-track the review and merge of a prototype that solves the problem. The more the approach differs, the less comfortable I may be reviewing your work.

ceztko avatar Mar 20 '24 17:03 ceztko

Have you considered whether you are willing to implement the above activities? 7-8 days may be an overestimate, and if you are quick enough it could take less (but remember I would like to see a few unit tests for this work as well).

ceztko avatar Mar 21 '24 09:03 ceztko

Hi, I have the intention of completing the above functionality, but I must state that as I can only develop the relevant code outside of my official working hours, and due to my lack of experience in this development, I cannot offer a guarantee as to the time of completion.

tayei1997 avatar Mar 21 '24 11:03 tayei1997

Ok. I'm sorry for the unsolicited advice: I don't know what your job is, but if a company is paying you to work on PDF-related topics, I still recommend that you not work outside official hours when the work ultimately benefits them. That way, companies using open source software "for free" become more responsible, and the software itself improves in a more professional way.

ceztko avatar Mar 21 '24 12:03 ceztko

No news from you, so I'm here just to say that I began working on this https://github.com/podofo/podofo/commit/4c16a52fe94024e52e844a1bd7cd7e4f0fccc06b .

ceztko avatar Sep 12 '24 17:09 ceztko

I got this done. The estimates were basically correct. The major issue was that the PdfCharCodeMap class lacked support for range mappings, so all predefined CMaps would have had to be unrolled into flat hash table mappings, and those would have been too large to serialize effectively because most of them are very big 2-byte encodings. Adding range mappings is also a general improvement: they are faster when constructing the maps internally, and the actual char code lookup time is very similar. The change enlarges the final binary size of PoDoFo a bit, but the aim of the library is to be compatible with the ISO specification, and that includes external resources such as predefined CMaps and standard fonts.
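To illustrate why range mappings are so much more compact than unrolled per-code entries, here is a minimal sketch of a range-based lookup. It is not the actual PdfCharCodeMap implementation: `CodeRange` and `TryMapRange` are hypothetical names, and the sketch assumes the ranges are sorted by their first code and do not overlap.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical range entry (not PoDoFo's data structure): codes in
// [First, Last] map to consecutive CIDs starting at FirstCid.
struct CodeRange
{
    uint32_t First;    // first character code in the range (inclusive)
    uint32_t Last;     // last character code in the range (inclusive)
    uint32_t FirstCid; // CID assigned to First
};

// Look up `code` in `ranges`, which must be sorted by First and
// non-overlapping. Returns false if no range contains the code.
bool TryMapRange(const std::vector<CodeRange>& ranges, uint32_t code, uint32_t& cid)
{
    // Binary search for the first range starting after `code`,
    // then step back to the only range that could contain it.
    auto it = std::upper_bound(ranges.begin(), ranges.end(), code,
        [](uint32_t value, const CodeRange& range) { return value < range.First; });
    if (it == ranges.begin())
        return false;

    --it;
    if (code > it->Last)
        return false;

    cid = it->FirstCid + (code - it->First);
    return true;
}
```

With this layout, a range covering thousands of 2-byte codes costs one entry instead of thousands, while a lookup remains a single binary search.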

ceztko avatar Sep 25 '24 23:09 ceztko

Great work, I tested the files from both sources and they can be extracted normally.

By the way, I noticed that this change was made after version 0.10.4. Does that mean it will be released in 0.10.5?

tayei1997 avatar Oct 11 '24 03:10 tayei1997

Hello. No, it will be released in 1.0.0, which is close but not imminent (1-2 months). The 0.10.x series is reserved for critical bug fixes only.

ceztko avatar Oct 11 '24 07:10 ceztko