office-text-extractor
Problems with reading Arabic
Description
When the library reads a PDF that contains Arabic text, it cannot extract the text correctly. It outputs garbled text such as ̜̺͙ͯ̀ ͳ̮ /
Library version
3.0.2
Node version
v18.17.1
Typescript version (if you are using it)
No response
I've noticed that the problem also depends on the PDF itself. When I copy Arabic text into a PDF created in Acrobat, the library extracts the text, but the order is mixed up.
Hi,
Sorry for the late reply.
This library uses pdf-parse to parse PDF text content. You could open an issue on its repo, or try a different PDF parsing library (maybe pdfreader?) with a custom extractor:
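import { type Buffer } from 'node:buffer'
import { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'
import { PdfReader } from 'pdfreader'

const parser = new PdfReader()

class PdfExtractor implements TextExtractionMethod {
  mimes = ['application/pdf']

  apply = async (input: Buffer): Promise<string> => {
    const text = await new Promise<string>((resolve, reject) => {
      const chunks: string[] = []
      // pdfreader calls the callback once per item, then once more
      // with no item when it reaches the end of the file.
      parser.parseBuffer(input, (error, item) => {
        if (error) return reject(error)
        // End of file: join the collected text items (joining with a space is a guess).
        if (!item) return resolve(chunks.join(' ') || 'blank pdf')
        // Page and metadata items carry no text, so skip them.
        if (item.text) chunks.push(item.text)
      })
    })
    return text
  }
}

const extractor = new TextExtractor()
extractor.addMethod(new PdfExtractor())
const text = await extractor.extractText({ input: '...', type: '...' })
console.log(text)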
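When comparing parsers, it may help to check automatically whether the extracted text is Arabic at all or just the kind of garbled marks shown in the description. The following is only a rough sketch: arabicRatio is a hypothetical helper, not part of office-text-extractor, pdf-parse or pdfreader, and it says nothing about whether the text is in the right order.

```typescript
// Hypothetical helper (an assumption for illustration, not a library API).
// Returns the share of letters and combining marks in `text` whose Unicode
// script is Arabic, as a number between 0 and 1.
// Uses Unicode property escapes, which need an ES2018+ target.
export function arabicRatio(text: string): number {
  // All letters (\p{L}) and combining marks (\p{M}), whatever their script.
  const lettersAndMarks = text.match(/[\p{L}\p{M}]/gu) ?? []
  if (lettersAndMarks.length === 0) return 0

  // Keep only code points assigned to the Arabic script. Arabic vowel marks
  // (harakat) have the script "Inherited", so fully vocalized text scores
  // somewhat below 1.
  const arabic = lettersAndMarks.filter((ch) => /\p{Script=Arabic}/u.test(ch))
  return arabic.length / lettersAndMarks.length
}
```

A value near 0 for a PDF known to contain Arabic would suggest the parser failed in the way described above. Any cut-off (for example 0.5) is a guess that would need tuning against real documents.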