pdfminer.six

Text Extraction: first character of LTTextLine totally disappears

Open • NicoLivesey opened this issue 4 years ago • 1 comment

Hi,

I am trying to extract several text blocks using pdfquery (https://github.com/jcushman/pdfquery), which relies mostly on the pdfminer backend. Most extractions work well, but sometimes the first character of a line (often a capital letter) simply disappears. I have explored the tree structure, and the character really does not exist in it.

I tried to solve this myself by resizing the extraction box and by tweaking the LAParams, but without success.

Here's an example result, one entry per LTTextLine:


[ "éplacer des produits vers", "la zone de stockage", "Accueillir une clientèle", "écharger des", "marchandises, des produits", "ncaisser le montant d'une", "vente", "rocédures d'encaissement", "roposer un service, produit", "adapté à la demande client", "éaliser la mise en rayon", "epérer et signaler les", "produits détériorés ou", "manquants", "rier et répartir les colis,", "marchandises selon les", "indications (codification,", "format, poids, nombre, ...)", "" ]

As you can see, after the first block, each first character has disappeared. Is this a problem you have already encountered?

Thank you in advance for your help!

NicoLivesey • Jan 12 '21 15:01

Can you share the PDF for us to investigate?

pietermarsman • Mar 21 '22 22:03
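
For anyone trying to reproduce this without pdfquery, here is a minimal sketch that lists every LTTextLine with pdfminer.six directly, together with the font name and size of the first LTChar in each line. It does not explain the missing characters; it only helps check whether they are absent from pdfminer's own layout tree and whether changing LAParams (for example `all_texts=True`, which also runs layout analysis on text inside figures) makes a difference. The helper names `walk_lines` and `dump_text_lines` and the path `sample.pdf` are hypothetical, not part of pdfminer.six or of the original report.

```python
from typing import Any, Iterator, List, Optional, Tuple


def walk_lines(node: Any, line_type: type) -> Iterator[Any]:
    """Recursively yield every instance of line_type at or below node."""
    if isinstance(node, line_type):
        yield node
        return
    try:
        children = iter(node)
    except TypeError:
        # Leaf objects (e.g. LTChar, LTRect) are not iterable.
        return
    for child in children:
        yield from walk_lines(child, line_type)


def dump_text_lines(
    pdf_path: str, **laparams_kwargs: Any
) -> List[Tuple[str, Optional[str], Optional[float]]]:
    """Return (text, first-char font name, first-char size) per LTTextLine."""
    # Imported here so that walk_lines can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTChar, LTTextLine

    laparams = LAParams(**laparams_kwargs)
    rows = []
    for page in extract_pages(pdf_path, laparams=laparams):
        for line in walk_lines(page, LTTextLine):
            chars = [obj for obj in line if isinstance(obj, LTChar)]
            first = chars[0] if chars else None
            rows.append((
                line.get_text().rstrip("\n"),
                first.fontname if first else None,
                round(first.size, 1) if first else None,
            ))
    return rows


if __name__ == "__main__":
    # "sample.pdf" is a placeholder path; substitute the affected document.
    for text, fontname, size in dump_text_lines("sample.pdf", all_texts=True):
        print(repr(text), fontname, size)
```

Comparing the output for the default `LAParams()` against `all_texts=True` is one way to see whether the layout settings affect which characters end up in each line.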