textract: Arabic characters not extracting from PDF

Open Elzoka opened this issue 6 years ago • 7 comments

It would be great if newlines were at least preserved as a newline escape character. Also, is there any support for non-Latin characters (Arabic) or for text that mixes scripts?

Jun 05 '18 11:06 Elzoka

Did you use preserveLineBreaks?

Jun 05 '18 12:06 dbashford

Also, yes, there is support for groups of characters. If something is missing, raise another issue.

Jun 05 '18 12:06 dbashford
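
For readers landing here for the line-break question, a minimal sketch of passing `preserveLineBreaks` through textract's `fromFileWithPath(filePath, config, callback)` call. The wrapper name `extractWithLineBreaks` is hypothetical, and the textract module is passed in as an argument so the snippet does not assume anything about how it is installed.

```javascript
// Hypothetical helper (not part of textract): forwards to
// textract.fromFileWithPath with preserveLineBreaks switched on.
// `textract` is whatever require('textract') returned in the caller.
function extractWithLineBreaks(textract, filePath, callback) {
  // Assumption: no other config options are needed for this file.
  const config = { preserveLineBreaks: true };
  textract.fromFileWithPath(filePath, config, function (error, text) {
    if (error) {
      // Pass extraction errors straight back to the caller.
      return callback(error);
    }
    // With preserveLineBreaks set, line breaks in the source
    // should show up as "\n" in the extracted text.
    callback(null, text);
  });
}
```

A call would look something like `extractWithLineBreaks(require('textract'), 'arabic.pdf', (err, text) => console.log(err || text))`, where the file path is only a placeholder.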

@dbashford line breaks work fine after adding preserveLineBreaks, thank you. But when converting Arabic docs, the function gives weird results:

"١\np\n\nÍ\nm\n¿\nÀ\nÖ\nm\n¿\n\n\nÌ\nÑ\nm\n¿\nm\nu 1\nThis sample shows a mixture of Arabic and English. Arabic is the\nmain script. There are running heads, too.\n\bodydir apparently can be used only in the preamble of the doc-\nument. \textdir should reverse \hboxes with TLT, but for some\nreason in Aleph does not work (Omega reverses hboxes correctly).\nThis is important in lists and sections. Currently, the \bodydir is\nthat of the main language (the last one in the package options), which\nmeans the document has a fixed TLT layout.\np\n\nÍ\nm\n¿\nÀ\nÖ\nm\n¿\n\n\nÌ\nÑ\nm\n¿\nm\nu\nHello\nÈ\nÚ\nÌ\n¼\nÑ\n³\nÛ\nÖ\nm\nu\nm\n×\nm\nÈ\n\nØ\nm\n¾\np\nÌ\nn\nÚ\n\nm\n\n\nq\n¹\nm\n×\nm\nÏ\nw\n¸\nn\n¾\nÈ\n\n\nÏ\n´\n\nm\nÈ\nË\n\n×\nÈ\n\n\nm\n¿\n´\n\n\n×\nm\nÈ\nË\n¼\nn\nÎ\n×\nm\nÈ\n«\nw\nq\nn\n\n×\nÈ\n\n\nm\n¿\n¬\nq\nn\n\ns\n×\nm\nÈ\n\nn\n\ns\np\nÁ\nÓ\nØ\np\n¼\nÁ\nm\n«\nw\nq\nn\n\nm\n\n\nÝ\nm\nß\n¬\nÐ\nÞ\nË\nq\n\nّ\n\n«\nÑ\n»\nÁ\nّ\nË\nn\n\nØ\nÝ\n\nm\nv\nÖ\nË\n¤\nÀ\n¸\nn\n×\nÓ\nØ\nm\n¿\nÀ\nّ\nÖ\n\nq\n \nn\nÏ\nÖ\n×\nv\n¬\nn\n°\n\n×\n¿\nÛ\n\n\nÞ\n\np\n \n¸\nÛ\n¸\nt\n\nm\nu\nm\n×\n(\nm\n×\nm\nÈ\n\nØ\nm\n¾\np\nÌ\nn\nÚ\n\nm\n\n\nq\n¹\nm\n×\nm\nÏ\nw\n¸\nn\n¾\nÈ\n\n\nÏ\n´\n\nm\nÈ\nË\n\n×\nÈ\n\n\nm\n¿\n´\n\n\n×\nm\nÈ\nË\n¼\nn\nÎ\n×\nm\nÈ\n«\nw\nq\nn\n\n×\nÈ\n\n\nm\n¿\n¬\nq\nn\n\ns\n×\nm\nÈ\n\nn\n\ns\np\nÁ\nÓ\nØ\np\n¼\nÁ\nm\n«\nw\nq\nn\n\nm\n\n\nÝ\nThird\n(\nm\nß\n¬\nÐ\nÞ\nË\nq\n\nّ\n\n«\nÑ\n»\nÁ\nّ\nË\nn\n\nØ\nÝ\n\nm\nv\nÖ\nË\n¤\nÀ\n¸\nn\n×\nÓ\nØ\nm\n¿\nÀ\nّ\nÖ\n\nq\n \nn\nÏ\nÖ\n×\nv\n¬\nn\n°\n\n×\n¿\nÛ\n\n\nÞ\n\np\n \n¸\nÛ\n¸\nt\n\nm\nu\nm\n×\nm\nÈ\n\nØ\nm\n¾\np\nÌ\nn\nÚ\n\nm\n\n\nq\n¹\nm\n×\nm\nÏ\nw\n¸\nn\n¾\nÈ\n\n\nÏ\n´\n\nm\nÈ\nË\n\n×\nÈ\n\n\nm\n¿\n´\n\n\n×\nm\nÈ\nË\n¼\nn\nÎ\n×\nm\nÈ\n«\nw\nq\nn\n\n×\nÈ\n\n\nm\n¿\n¬\nq\nn\n\ns\n×\nm\nÈ\n\nn\n\ns\np\nÁ\nÓ\nØ\np\n¼\nÁ\nm\n«\nw\nq\nn\n\nm\n\n\nÝ\nm\nß\n¬\nÐ\nÞ\nË\nq\n\nّ\n\n«\nÑ\n»\nÁ\nّ\nË\nn\n\nØ\nÝ\n\nm\nv\nÖ\nË\n¤\nÀ\n¸\nn\n×\nÓ\nØ\nm\n¿\nÀ\nّ\nÖ\n\nq\n \nn\nÏ\nÖ\n×\nv\n¬\nn\n°\n\n×\n¿\nÛ\n\n\nÞ\n\np\n \n¸\nÛ\n¸\nt\nm\n¿\n_

Jun 05 '18 13:06 Elzoka

Can you give me an example document?

Jun 05 '18 13:06 dbashford

this one: ftp://ftp.dante.de/tex-archive/macros/latex/exptl/mem/arabic.pdf

Jun 05 '18 13:06 Elzoka

Have you tried adding the Arabic support that pdftotext provides? I wasn't able to get it working locally. I'm going to take a look soon at the PR that introduces use of pdf.js, to see if that can handle Arabic.

Aug 03 '18 15:08 dbashford

For what it's worth, I have confirmed that Arabic works fine in general (I can extract it from .docx and have included a test to confirm). The characters are just not coming out of pdftotext unless you include support for Arabic.

Aug 03 '18 17:08 dbashford
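
Until the pdftotext side is sorted out, one way to notice the failure programmatically is to check which Unicode ranges the extracted text actually falls in. The sketch below is only a heuristic and is not part of textract: the function names and the "more Latin-1 symbols than Arabic letters" rule are assumptions made for illustration, based on the garbled output posted above (characters such as Í, ¿, À, Ö where Arabic was expected).

```javascript
// Arabic block, Arabic Supplement, and Arabic Presentation Forms A and B.
const ARABIC_CHARS = /[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFC]/g;
// Latin-1 Supplement symbols and accented letters.
const LATIN1_CHARS = /[\u00A1-\u00FF]/g;

// Count how many characters of each kind appear in the text.
function scriptStats(text) {
  const count = (re) => (text.match(re) || []).length;
  return { arabic: count(ARABIC_CHARS), latin1: count(LATIN1_CHARS) };
}

// Heuristic (an assumption, not a guarantee): if the text contains more
// Latin-1 supplement characters than Arabic ones, the Arabic glyphs were
// probably not mapped back to Unicode during extraction.
function looksLikeFailedArabicExtraction(text) {
  const stats = scriptStats(text);
  return stats.latin1 > stats.arabic;
}
```

This only makes sense for documents that are expected to contain Arabic. Text in a language that legitimately uses accented Latin letters would trip the same check.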
