pdfdiff
Improve ligature handling by switching to Unicode normalization
Currently, pdfdiff.py handles ligatures by checking a hard-coded set of ligature encodings. This set is incomplete, and whether it matches depends on the encoding of the input.
Ideally, we would detect the encoding of text input files (this is hard, but one could use, e.g., the chardet library), decode them to Unicode (for instance with Python's codecs module), and then apply Unicode normalization, which in Python lives in the unicodedata module rather than in codecs, so that ligatures are expanded into their component letters.
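A minimal sketch of what this could look like, not taken from pdfdiff.py itself. The function names `decode_text` and `normalize_ligatures` are hypothetical, chardet is treated as an optional third-party dependency, and NFKC is assumed to be an acceptable normalization form for diffing purposes:

```python
import unicodedata

try:
    import chardet  # third-party and optional; used only to guess the encoding
except ImportError:
    chardet = None


def decode_text(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw bytes, guessing the encoding with chardet when available.

    Hypothetical helper: falls back to `fallback` if chardet is missing
    or cannot make a guess. Undecodable bytes are replaced, not raised.
    """
    encoding = None
    if chardet is not None:
        encoding = chardet.detect(raw).get("encoding")
    return raw.decode(encoding or fallback, errors="replace")


def normalize_ligatures(text: str) -> str:
    """Expand compatibility ligatures (e.g. U+FB01) via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    sample = "o\ufb03ce \ufb01le"  # contains the 'ffi' and 'fi' ligatures
    print(normalize_ligatures(sample))
```

Two caveats with this approach: NFKC also rewrites other compatibility characters (full-width forms, superscripts, and so on), which may or may not be desirable when comparing PDFs, and letters such as "æ" and "œ" have no Unicode decomposition, so they would still need special handling if they should be treated as ligatures.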