Reversed Hebrew text in PDFs

Open eranroz opened this issue 8 years ago • 0 comments

Using Earwig's Copyvio Detector on Labs with a Hebrew-text PDF results in the characters within each word coming out in reversed order, i.e. the correct text is more or less [word[::-1] for word in words] :) For ease of debugging: even Latin script (URLs / emails) may appear reversed within this PDF.
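
To make the symptom concrete, here is a minimal sketch of the per-word reversal described above. The helper name `unreverse_words` is hypothetical and is not part of earwigbot or pdfminer; it only illustrates what the extracted text looks like relative to the correct text, and it assumes that words are separated by whitespace and that word order itself is intact.

```python
import re


def unreverse_words(text):
    """Reverse the characters inside every whitespace-delimited word.

    Word order and the original whitespace are left untouched; only the
    characters within each run of non-whitespace are flipped.
    (Hypothetical helper, for illustration only.)
    """
    return re.sub(r"\S+", lambda match: match.group()[::-1], text)


if __name__ == "__main__":
    # A Latin-script sample, mimicking a reversed email address in the PDF output.
    print(unreverse_words("moc.elpmaxe@resu olleh"))
```

This is a way to eyeball the problem rather than a proposed fix: blindly reversing every word would corrupt any text that was extracted correctly.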

Input PDF: http://img2.tapuz.co.il/CommunaFiles/53173603.pdf (query: http://tools.wmflabs.org/copyvios/?lang=he&project=wikipedia&title=%D7%A0%D7%99%D7%AA%D7%95%D7%97+%D7%91%D7%A8%D7%99%D7%90%D7%98%D7%A8%D7%99&oldid=&use_engine=0&use_links=0&turnitin=0&action=compare&url=http%3A%2F%2Fimg2.tapuz.co.il%2FCommunaFiles%2F53173603.pdf )

Relevant code (PDF parser, using pdfminer): earwigbot/wiki/copyvios/parsers.py. This may be an upstream issue in pdfminer, or something may be wrong with the decoding.

eranroz, Jun 17 '16 05:06