Problems with reading Arabic

Open muratulashozturk opened this issue 2 years ago • 2 comments

Description

When reading a PDF that contains Arabic text, the library can't read it. It outputs garbled text such as ͯ̀௛̜̺͙ ͳ̮   /

Library version

3.0.2

Node version

v18.17.1

Typescript version (if you are using it)

No response

muratulashozturk, Oct 14 '23 05:10

I've noticed that there are problems with the PDF itself too. When I copy Arabic text into a PDF created in Acrobat, the library extracts the text, but the order is mixed up.

muratulashozturk, Oct 16 '23 07:10

Hi,

Sorry for the late reply.

This library uses pdf-parse to parse pdf text content. You could open an issue on its repo, or try using a different pdf parsing library (maybe pdfreader?) with a custom extractor:

import { type Buffer } from 'node:buffer'
import { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'
import { PdfReader } from 'pdfreader'

class PdfExtractor implements TextExtractionMethod {
  mimes = ['application/pdf']
  apply = async (input: Buffer): Promise<string> => {
    const chunks: string[] = []
    return new Promise<string>((resolve, reject) => {
      // pdfreader invokes the callback once per item, then once more
      // with no item to signal the end of the file.
      new PdfReader().parseBuffer(input, (error, item) => {
        if (error) return reject(error)
        if (!item) return resolve(chunks.join(' ') || 'blank pdf')
        if (item.text) chunks.push(item.text)
      })
    })
  }
}

const extractor = new TextExtractor()
extractor.addMethod(new PdfExtractor())

const text = await extractor.extractText({ input: '...', type: '...' })
console.log(text)
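
To check whether a different parser actually fixes the garbled output, a rough sanity check on the extracted string may help. This is only a sketch: `arabicRatio` and `looksLikeArabic` are hypothetical helper names (not part of office-text-extractor or pdfreader), the code point ranges cover the main Arabic Unicode blocks only, and the 0.5 threshold is an arbitrary choice. It does not detect the mixed-order problem, only whether the output is mostly Arabic script at all.

```typescript
// Hypothetical helpers for illustration; not part of any library mentioned here.
// Main Arabic Unicode blocks (all within the Basic Multilingual Plane).
const ARABIC_RANGES: ReadonlyArray<[number, number]> = [
  [0x0600, 0x06ff], // Arabic
  [0x0750, 0x077f], // Arabic Supplement
  [0x08a0, 0x08ff], // Arabic Extended-A
  [0xfb50, 0xfdff], // Arabic Presentation Forms-A
  [0xfe70, 0xfeff], // Arabic Presentation Forms-B
]

function isArabicCodeUnit(code: number): boolean {
  return ARABIC_RANGES.some(([lo, hi]) => code >= lo && code <= hi)
}

// Fraction of non-whitespace UTF-16 code units that fall in an Arabic block.
function arabicRatio(text: string): number {
  let arabic = 0
  let total = 0
  for (let i = 0; i < text.length; i++) {
    if (text.charAt(i).trim() === '') continue // skip whitespace
    total++
    if (isArabicCodeUnit(text.charCodeAt(i))) arabic++
  }
  return total === 0 ? 0 : arabic / total
}

// Assumption: 0.5 is an arbitrary threshold chosen for this sketch.
function looksLikeArabic(text: string, threshold = 0.5): boolean {
  return arabicRatio(text) >= threshold
}
```

Running `looksLikeArabic(text)` on the output of each candidate extractor gives a quick signal of whether the parser decoded the Arabic glyphs or produced unrelated characters.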

gamemaker1, Oct 18 '23 06:10