Add `normalize_unicode=False/True` parameter to text extraction methods
Per @petermr's suggestion in https://github.com/jsvine/pdfplumber/discussions/904#discussioncomment-6149469, I think it's a good idea to add such a parameter/option, using `unicodedata.normalize(...)`, in a similar vein to the `expand_ligatures` parameter added in v0.9.0. I'll look into this.
Some useful reference links, as a note-to-self:
- https://docs.python.org/3/library/unicodedata.html
- https://stackoverflow.com/questions/9175073/convert-hexadecimal-character-ligature-to-utf-8-character
- https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
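For context, the standard-library call such an option would presumably wrap looks roughly like this (the strings are just illustrative):

```python
import unicodedata

# Characters that often survive PDF extraction in "compatibility" form:
# U+FB01 (LATIN SMALL LIGATURE FI) and U+00B5 (MICRO SIGN).
raw = "\ufb01eld strength in \u00b5m"

# NFC only recomposes canonical sequences, so both characters survive ...
print(unicodedata.normalize("NFC", raw))   # ﬁeld strength in µm

# ... while NFKC also applies compatibility mappings: the ligature
# becomes "fi" and the micro sign becomes GREEK SMALL LETTER MU.
print(unicodedata.normalize("NFKC", raw))  # field strength in μm
```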
Hi @jsvine, is there a workaround for this in the meantime?
Can I manually apply a normalize function to all text in the PDF?
Hi @agusluques, and thanks for checking. There have not been any updates on this, but there may still be a solution for certain use-cases. What's your particular use-case?
@jsvine thanks for the answer. Basically, I am trying to split text on ; (U+003B), but the PDF seems to contain a different character, ; (U+037E). I am doing some manual replacement, but it would be great to have this handled at the moment of reading the PDF, so there is no risk of forgetting to include the cleaning logic.
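In the meantime, a post-extraction workaround along these lines should cover that case (a rough sketch only; `sample.pdf` is a placeholder path). Note that NFC alone already maps U+037E to U+003B, since the Greek question mark decomposes canonically to the ordinary semicolon:

```python
import unicodedata

import pdfplumber

# Manual workaround: normalize the text after pdfplumber extracts it.
with pdfplumber.open("sample.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# NFC maps U+037E (GREEK QUESTION MARK) to U+003B (SEMICOLON), so the
# split then behaves as expected.
normalized = unicodedata.normalize("NFC", text)
fields = [field.strip() for field in normalized.split(";")]
print(fields)
```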
The definitive rules are defined in the Unicode spec (https://unicode.org/reports/tr15/). It needs careful reading ("Taken step-by-step, the Unicode Normalization Algorithm is fairly complex"). It specifically discusses the Greek question mark. There are different formal approaches.
The four Unicode Normalization Forms are summarized in Table 1.
Table 1. Normalization Forms (https://unicode.org/reports/tr15/#Normalization_Forms_Table)

| Form | Description |
| --- | --- |
| Normalization Form D (NFD) | Canonical Decomposition |
| Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
| Normalization Form KD (NFKD) | Compatibility Decomposition |
| Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |
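To make Table 1 concrete in Python terms (an illustration, not part of TR15):

```python
import unicodedata

# A decomposed "é" (e + combining acute) plus the "fi" ligature.
s = "caf\u0065\u0301 \ufb01le"

for form in ("NFD", "NFC", "NFKD", "NFKC"):
    print(form, ascii(unicodedata.normalize(form, s)))

# NFD  : 'cafe\u0301 \ufb01le'  -- fully decomposed, ligature untouched
# NFC  : 'caf\xe9 \ufb01le'     -- é recomposed, ligature untouched
# NFKD : 'cafe\u0301 file'      -- compatibility mapping splits the ligature
# NFKC : 'caf\xe9 file'         -- recomposed, and the ligature is split
```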
===== 10 Respecting Canonical Equivalence https://unicode.org/reports/tr15/#Canonical_Equivalence
This section describes the relationship of normalization to respecting (or preserving) canonical equivalence. A process (or function) respects canonical equivalence when canonical-equivalent inputs always produce canonical-equivalent outputs. For a function that transforms one string into another, this may also be called preserving canonical equivalence. There are a number of important aspects to this concept:
- The outputs are not required to be identical, only canonically equivalent.
- Not all processes are required to respect canonical equivalence.
For example:
- A function that collects a set of the General_Category values present in a string will and should produce a different value for <angstrom sign, semicolon> than for <A, combining ring above, greek question mark>, even though they are canonically equivalent.
- A function that does a binary comparison of strings will also find these two sequences different.
- Higher-level processes that transform or compare strings, or that perform other higher-level functions, must respect canonical equivalence or problems will result.
<<< It's important that we adhere precisely to Unicode terminology and philosophy.
For me (a crystallographer) it's the equivalence between Aring and Angstrom (which are frequently misused). Note that Aring is further complicated and may itself have to be normalised: U+0041 (A) + U+030A (combining ring) => U+00C5 (Aring).
The problems frequently arise when authors pick symbols from menus without realising what character results.
There are a lot of further illiteracies which probably can't be dealt with, e.g. em-dash for minus
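A quick check of that Aring / Angstrom case in Python (illustrative only):

```python
import unicodedata

a_plus_ring = "\u0041\u030A"   # A + COMBINING RING ABOVE
angstrom    = "\u212B"         # ANGSTROM SIGN

# NFC maps both spellings to U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE),
# so the frequently confused forms compare equal after normalization.
print(ascii(unicodedata.normalize("NFC", a_plus_ring)))  # '\xc5'
print(ascii(unicodedata.normalize("NFC", angstrom)))     # '\xc5'
```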
Feature now added in https://github.com/jsvine/pdfplumber/commit/03a477f7f0b3016dd38d00f9e24d0cc5925d5a04
On the develop branch, you should be able to run `pdfplumber.open(..., unicode_norm="NFC")`, where that latter argument can be any of the abbreviations for the four normalization forms.
Give it a whirl and let me know if it suits your needs / meets your expectations?
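For anyone landing here later, a minimal usage sketch against the develop branch (the file name is just a placeholder):

```python
import pdfplumber

# unicode_norm accepts "NFC", "NFD", "NFKC", or "NFKD". NFC is enough to
# fold e.g. U+037E (GREEK QUESTION MARK) into U+003B (";").
with pdfplumber.open("example.pdf", unicode_norm="NFC") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```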