textract
Arabic characters not extracting from PDF
It would be great if newlines were preserved, at least as a newline escape character. Also, is there any support for non-Latin characters (Arabic) or text that mixes character types?
Did you use preserveLineBreaks?
Also, yes, there is support for groups of characters. If something is missing, raise another issue.
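For anyone landing here later, a minimal sketch of passing preserveLineBreaks through textract's config object. The wrapper name extractWithLineBreaks and the sample path are made up for illustration; textract is required lazily so the snippet loads even where the package is not installed.

```javascript
// Config object passed to textract; preserveLineBreaks keeps newlines in the output.
const textractOptions = { preserveLineBreaks: true };

// Hypothetical helper (not part of textract): extracts text from a file on disk.
function extractWithLineBreaks(filePath, callback) {
  // Loaded lazily so the options above can be inspected without textract installed.
  const textract = require('textract');
  textract.fromFileWithPath(filePath, textractOptions, callback);
}

// Usage (assumes a local file named sample.pdf exists):
// extractWithLineBreaks('./sample.pdf', (error, text) => {
//   if (error) throw error;
//   console.log(JSON.stringify(text)); // newlines show up as \n
// });
```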
@dbashford line breaks work fine after adding preserveLineBreaks, thank you. But when converting Arabic docs, the function gives weird results:
"١\np\n\nÍ\nm\n¿\nÀ\nÖ\nm\n¿\n\n\nÌ\nÑ\nm\n¿\nm\nu 1\nThis sample shows a mixture of Arabic and English. Arabic is the\nmain script. There are running heads, too.\n\bodydir apparently can be used only in the preamble of the doc-\nument. \textdir should reverse \hboxes with TLT, but for some\nreason in Aleph does not work (Omega reverses hboxes correctly).\nThis is important in lists and sections. Currently, the \bodydir is\nthat of the main language (the last one in the package options), which\nmeans the document has a fixed TLT layout.\np\n\nÍ\nm\n¿\nÀ\nÖ\nm\n¿\n\n\nÌ\nÑ\nm\n¿\nm\nu\nHello\nÈ\nÚ\nÌ\n¼\nÑ\n³\nÛ\nÖ\nm\nu\nm\n×\nm\nÈ\n\nØ\nm\n¾\np\nÌ\nn\nÚ\n\nm\n\n\nq\n¹\nm\n×\nm\nÏ\nw\n¸\nn\n¾\nÈ\n\n\nÏ\n´\n\nm\nÈ\nË\n\n×\nÈ\n\n\nm\n¿\n´\n\n\n×\nm\nÈ\nË\n¼\nn\nÎ\n×\nm\nÈ\n«\nw\nq\nn\n\n×\nÈ\n\n\nm\n¿\n¬\nq\nn\n\ns\n×\nm\nÈ\n\nn\n\ns\np\nÁ\nÓ\nØ\np\n¼\nÁ\nm\n«\nw\nq\nn\n\nm\n\n\nÝ\nm\nß\n¬\nÐ\nÞ\nË\nq\n\nّ\n\n«\nÑ\n»\nÁ\nّ\nË\nn\n\nØ\nÝ\n\nm\nv\nÖ\nË\n¤\nÀ\n¸\nn\n×\nÓ\nØ\nm\n¿\nÀ\nّ\nÖ\n\nq\n \nn\nÏ\nÖ\n×\nv\n¬\nn\n°\n\n×\n¿\nÛ\n\n\nÞ\n
\np\n
\n¸\nÛ\n¸\nt\n\nm\nu\nm\n×\n(\nm\n×\nm\nÈ\n\nØ\nm\n¾\np\nÌ\nn\nÚ\n\nm\n\n\nq\n¹\nm\n×\nm\nÏ\nw\n¸\nn\n¾\nÈ\n\n\nÏ\n´\n\nm\nÈ\nË\n\n×\nÈ\n\n\nm\n¿\n´\n\n\n×\nm\nÈ\nË\n¼\nn\nÎ\n×\nm\nÈ\n«\nw\nq\nn\n\n×\nÈ\n\n\nm\n¿\n¬\nq\nn\n\ns\n×\nm\nÈ\n\nn\n\ns\np\nÁ\nÓ\nØ\np\n¼\nÁ\nm\n«\nw\nq\nn\n\nm\n\n\nÝ\nThird\n(\nm\nß\n¬\nÐ\nÞ\nË\nq\n\nّ\n\n«\nÑ\n»\nÁ\nّ\nË\nn\n\nØ\nÝ\n\nm\nv\nÖ\nË\n¤\nÀ\n¸\nn\n×\nÓ\nØ\nm\n¿\nÀ\nّ\nÖ\n\nq\n \nn\nÏ\nÖ\n×\nv\n¬\nn\n°\n\n×\n¿\nÛ\n\n\nÞ\n
\np\n
\n¸\nÛ\n¸\nt\n\nm\nu\nm\n×\nm\nÈ\n\nØ\nm\n¾\np\nÌ\nn\nÚ\n\nm\n\n\nq\n¹\nm\n×\nm\nÏ\nw\n¸\nn\n¾\nÈ\n\n\nÏ\n´\n\nm\nÈ\nË\n\n×\nÈ\n\n\nm\n¿\n´\n\n\n×\nm\nÈ\nË\n¼\nn\nÎ\n×\nm\nÈ\n«\nw\nq\nn\n\n×\nÈ\n\n\nm\n¿\n¬\nq\nn\n\ns\n×\nm\nÈ\n\nn\n\ns\np\nÁ\nÓ\nØ\np\n¼\nÁ\nm\n«\nw\nq\nn\n\nm\n\n\nÝ\nm\nß\n¬\nÐ\nÞ\nË\nq\n\nّ\n\n«\nÑ\n»\nÁ\nّ\nË\nn\n\nØ\nÝ\n\nm\nv\nÖ\nË\n¤\nÀ\n¸\nn\n×\nÓ\nØ\nm\n¿\nÀ\nّ\nÖ\n\nq\n \nn\nÏ\nÖ\n×\nv\n¬\nn\n°\n\n×\n¿\nÛ\n\n\nÞ\n
\np\n
\n¸\nÛ\n¸\nt\nm\n¿\n_
Can you give me an example document?
this one: ftp://ftp.dante.de/tex-archive/macros/latex/exptl/mem/arabic.pdf
Have you tried adding the Arabic support that pdftotext provides? I wasn't able to get it working locally. I'm going to take a look soon at the PR that introduces use of pdf.js, to see if that can handle Arabic.
For what it's worth, I have confirmed that Arabic works fine in general (I can extract it from .docx and have included a test to confirm). The characters are just not coming out of pdftotext unless you include support for Arabic.
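A rough sketch of how one might check locally whether pdftotext is actually emitting Arabic. The helper names arabicRatio and pdfToText are hypothetical. The sketch assumes pdftotext is on the PATH, and it is only an assumption that forcing UTF-8 output with -enc helps for a given PDF, since the result also depends on the fonts embedded in the file and on the language data installed alongside pdftotext.

```javascript
const { execFile } = require('child_process');

// Fraction of non-whitespace characters that fall in the basic Arabic block (U+0600 to U+06FF).
function arabicRatio(text) {
  const chars = Array.from(text.replace(/\s/g, ''));
  if (chars.length === 0) return 0;
  const arabic = chars.filter((ch) => /[\u0600-\u06FF]/.test(ch));
  return arabic.length / chars.length;
}

// Hypothetical helper: runs pdftotext with UTF-8 output and writes the text to stdout ("-").
function pdfToText(pdfPath, callback) {
  execFile('pdftotext', ['-enc', 'UTF-8', pdfPath, '-'], (error, stdout) => {
    callback(error, stdout);
  });
}

// Usage (assumes arabic.pdf has been downloaded locally):
// pdfToText('./arabic.pdf', (error, text) => {
//   if (error) throw error;
//   console.log('Arabic share of extracted text:', arabicRatio(text));
// });
```

A ratio near zero on a document whose main script is Arabic would suggest the characters are being lost at the pdftotext step rather than later in textract.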