Improve ligature handling by switching to Unicode normalization

Open cascremers opened this issue 12 years ago • 0 comments

Currently, pdfdiff.py handles ligatures by considering a hard-coded set of ligature encodings. This set is incomplete and depends on the encoding of the input.

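Ideally, we detect the encoding of text input files (this is hard, but one can consider, e.g., using the chardet library), decode the input with Python's codecs module, and then normalize the decoded text, in particular for ligatures. Note that the normalization step itself lives in the standard unicodedata module rather than in codecs: the compatibility forms (NFKC/NFKD) map ligature code points such as U+FB01 to their plain letters.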
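
A minimal sketch of the idea, not taken from pdfdiff.py: the function names `guess_encoding` and `normalize_ligatures` are hypothetical, the UTF-8 fallback is an assumption, and chardet is treated as an optional third-party dependency. It decodes the raw bytes and then applies NFKC normalization from the standard library, which expands ligature code points into their component letters.

```python
import unicodedata

try:
    import chardet  # third-party and optional; only used to guess the encoding
except ImportError:
    chardet = None


def guess_encoding(raw, default="utf-8"):
    """Guess the encoding of raw bytes (hypothetical helper).

    Falls back to `default` when chardet is missing or gives no answer.
    """
    if chardet is not None:
        guessed = chardet.detect(raw).get("encoding")
        if guessed:
            return guessed
    return default


def normalize_ligatures(raw, encoding=None):
    """Decode raw bytes and expand ligatures via NFKC (hypothetical helper)."""
    text = raw.decode(encoding or guess_encoding(raw), errors="replace")
    # NFKC applies compatibility mappings, so U+FB01 becomes "fi", etc.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    sample = "\ufb01nal o\ufb03ce".encode("utf-8")
    print(normalize_ligatures(sample, encoding="utf-8"))
```

One caveat with this approach: NFKC also rewrites other compatibility characters (superscripts, full-width forms, and so on), so it may change more than just ligatures. Whether that is acceptable for diffing would need to be checked against real inputs.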

Jun 19 '13 10:06 cascremers